CN111223570A - Pathological data analysis method, device, equipment and storage medium - Google Patents

Info

Publication number
CN111223570A
CN111223570A CN202010005182.7A CN202010005182A CN111223570A CN 111223570 A CN111223570 A CN 111223570A CN 202010005182 A CN202010005182 A CN 202010005182A CN 111223570 A CN111223570 A CN 111223570A
Authority
CN
China
Prior art keywords
pathological
clustering result
sample
clustering
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010005182.7A
Other languages
Chinese (zh)
Inventor
蔡金成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010005182.7A priority Critical patent/CN111223570A/en
Priority to PCT/CN2020/093328 priority patent/WO2021135063A1/en
Publication of CN111223570A publication Critical patent/CN111223570A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Public Health (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of machine learning and discloses a pathological data analysis method, device, equipment and storage medium. The method comprises the following steps: acquiring a clustering result of a pathological data sample set; calculating an adjusted contour coefficient from the clustering result; determining the quality of the clustering result according to its adjusted contour coefficient; when the clustering result is excellent, acquiring a pathological data sample to be processed; and classifying the pathological data sample to be processed according to the clustering result and generating the corresponding pathological analysis data. The method addresses the excessive time complexity of clustering result evaluation, greatly reduces the amount of computation involved, greatly improves evaluation efficiency, and speeds up the judgment of pathological data clustering results so that the optimal clustering result can be determined quickly.

Description

Pathological data analysis method, device, equipment and storage medium
Technical Field
The invention relates to the field of machine learning, in particular to a pathological data analysis method, a pathological data analysis device, pathological data analysis equipment and a storage medium.
Background
In the medical field, as technology develops, hospital management systems collect pathological data from large numbers of patients. A clustering algorithm can divide this pathological data into several sets, each corresponding to one disease condition, which helps doctors diagnose patients with difficult and complicated diseases accurately.
Clustering is an algorithm for the unsupervised grouping of data. Also called cluster analysis, it is a statistical analysis method for studying data classification problems and an important tool in data mining.
After a clustering algorithm has divided a given data set into groups, the clustering result must be evaluated to assess its quality. The contour coefficient (silhouette coefficient) is an evaluation method for unsupervised clustering algorithms and is used to determine the number of clusters (i.e., groups) during clustering. It evaluates the clustering effect by combining the cohesion and the separation of the clusters. Its value range is [-1, 1], and the larger the value, the better the clustering effect.
However, the time complexity of the contour coefficient is very high: it is the square of n, i.e., O(n^2), where n is the number of samples. For a large-scale data set, computing the contour coefficient of a clustering result is very expensive, and the result is difficult to obtain in a short time. In particular, when the contour coefficient is used to determine the number of clusters, the coefficients of several clustering results must be calculated, and the whole process takes even longer.
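To make the quadratic cost concrete, the following is a minimal sketch of the standard silhouette coefficient, written with the Python standard library only. The function name `silhouette_coefficient` and the list-of-tuples data layout are assumptions made for illustration and do not come from the patent. The nested loop over all pairs of points is what produces the O(n^2) behaviour described above.

```python
import math

def silhouette_coefficient(points, labels):
    """Mean silhouette coefficient using all pairwise distances (O(n^2)).

    Illustrative sketch: assumes at least two clusters are present.
    """
    n = len(points)
    scores = []
    for i in range(n):
        # Accumulate distances from point i to every other point, per cluster.
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:
            # Point is alone in its cluster: its coefficient is taken as 0.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # cohesion: mean distance within own cluster
        b = min(sums[c] / counts[c] for c in sums if c != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n
```

For n samples the inner loop runs n - 1 times per point, so doubling the sample count roughly quadruples the running time.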
After pathological data is clustered, several different clustering results are usually produced. Because the volume of pathological data is huge and the detection indexes are numerous, evaluating pathological data clustering results with the conventional contour coefficient often produces unpredictable errors, or takes so long that the required evaluation results cannot be obtained in time.
Disclosure of Invention
Therefore, it is necessary to provide a pathological data analysis method that addresses the excessive time complexity of clustering result evaluation, increases the speed of that evaluation, and quickly determines the quality of a clustering result, so that pathological data samples can be classified according to the clustering result and the required pathological analysis data can be obtained.
A method of pathological data analysis, comprising:
acquiring a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
calculating the central point of each cluster according to the clustering result;
calculating the distance between a pathological sample point i and the central point of each cluster;
calculating an adjustment contour coefficient of the pathological sample point i according to the distance between the pathological sample point i and the central point of each cluster, wherein the calculation formula is as follows:
Figure BDA0002354998660000021
in the above formula, $s_c(i)$ denotes the adjusted contour coefficient of pathological sample point i; $a_c(i)$ denotes the distance between pathological sample point i and the center point of the cluster in which it is located; $b_c(i)$ denotes the distance between pathological sample point i and the nearest center point among the other clusters;
calculating the average of the adjusted contour coefficients of all the pathological sample points i to obtain the adjusted contour coefficient of the clustering result;
determining the quality of the clustering result according to the adjustment contour coefficient of the clustering result;
when the clustering result is excellent, acquiring a pathological data sample to be processed;
classifying the pathological data samples to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data samples to be processed.
A pathological data analysis device comprising:
a result acquisition module, configured to acquire a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
the central point calculation module is used for calculating the central point of each cluster according to the clustering result;
the distance calculation module is used for calculating the distance between a pathological sample point i and the center point of each cluster;
a sample point coefficient calculating module, configured to calculate an adjustment contour coefficient of the pathological sample point i according to a distance between the pathological sample point i and a center point of each cluster, where the calculation formula is as follows:
Figure BDA0002354998660000031
in the above formula, $s_c(i)$ denotes the adjusted contour coefficient of pathological sample point i; $a_c(i)$ denotes the distance between pathological sample point i and the center point of the cluster in which it is located; $b_c(i)$ denotes the distance between pathological sample point i and the nearest center point among the other clusters;
a result coefficient calculating module, configured to calculate the average of the adjusted contour coefficients of all the pathological sample points i to obtain the adjusted contour coefficient of the clustering result;
a result evaluation module, configured to determine the quality of the clustering result according to the adjusted contour coefficient of the clustering result;
the sample obtaining module is used for obtaining a pathological data sample to be processed when the clustering result is excellent;
and the sample analysis module is used for classifying the pathological data samples to be processed according to the clustering result and generating pathological analysis data corresponding to the pathological data samples to be processed.
A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the above-mentioned pathology data analysis method when executing said computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, implements the above-mentioned pathology data analysis method.
According to the above pathological data analysis method, device, computer equipment and storage medium, a clustering result of a pathological data sample set is acquired, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of pathological sample points i in the set is greater than a preset number threshold; this provides the result of the cluster analysis. The center point of each cluster is calculated from the clustering result to determine the position of each cluster's center. The distance between a pathological sample point i and the center point of each cluster is then calculated; because only the distances to the cluster center points are calculated, rather than the distances to all other pathological sample points, the amount of computation is greatly reduced. The adjusted contour coefficient of each individual pathological sample point i is calculated from these distances, with less computation than the method before improvement. The average of the adjusted contour coefficients of all pathological sample points i gives the adjusted contour coefficient of the clustering result; being a simple mean, it is fast to compute.
The quality of the clustering result is determined from its adjusted contour coefficient. Because this coefficient can be calculated quickly, the quality can be judged quickly; the higher the adjusted contour coefficient, the more accurate the clustering result. When the clustering result is excellent, the pathological data samples to be processed are acquired so that they can be classified using the clustering result. The samples are classified according to the clustering result and the corresponding pathological analysis data is generated, yielding valuable data that indicates the patient's pathological risk. The method addresses the excessive time complexity of clustering result evaluation, greatly reduces the amount of computation involved, greatly improves evaluation efficiency, and speeds up the judgment of pathological data clustering results so that the optimal clustering result can be determined quickly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of the method for analyzing pathological data according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for analyzing pathological data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram comparing the calculation paths before and after the improvement;
FIG. 4 is a flow chart of a method for analyzing pathological data according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method for analyzing pathological data according to an embodiment of the present invention;
FIG. 6 is a flow chart of a method for analyzing pathological data according to an embodiment of the present invention;
FIG. 7 is a flow chart of a method for analyzing pathological data according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a pathological data analysis device according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The pathological data analysis method provided by the embodiment can be applied to the application environment shown in fig. 1, in which the client communicates with the server through a network. The client includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server can be implemented by an independent server or a server cluster composed of a plurality of servers.
In an embodiment, as shown in fig. 2, a pathological data analysis method is provided, which is described by taking the application of the method to the server side in fig. 1 as an example, and includes the following steps:
S10, acquiring a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
S20, calculating the center point of each cluster according to the clustering result;
S30, calculating the distance between a pathological sample point i and the center point of each cluster;
S40, calculating the adjusted contour coefficient of the pathological sample point i according to the distance between the pathological sample point i and the center point of each cluster, wherein the calculation formula is as follows:
Figure BDA0002354998660000061
in the above formula, $s_c(i)$ denotes the adjusted contour coefficient of pathological sample point i; $a_c(i)$ denotes the distance between pathological sample point i and the center point of the cluster in which it is located; $b_c(i)$ denotes the distance between pathological sample point i and the nearest center point among the other clusters;
S50, calculating the average of the adjusted contour coefficients of all the pathological sample points i to obtain the adjusted contour coefficient of the clustering result;
S60, determining the quality of the clustering result according to the adjusted contour coefficient of the clustering result;
S70, when the clustering result is excellent, acquiring a pathological data sample to be processed;
S80, classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
In this embodiment, the clustering result may be the result obtained after a clustering task has been performed on the pathological data sample set. It can be obtained with partition-based clustering or agglomerative hierarchical clustering methods, such as K-means or agglomerative clustering. The preset number threshold can be set according to actual needs, for example to 5 thousand, 10, or another value. Each pathological sample point i in the pathological data sample set includes a plurality of detection indexes, such as a first detection index, a second detection index, and so on, so a pathological sample point i can be regarded as a point in a multidimensional space. Specifically, every pathological sample point i has the same spatial dimension before clustering; that is, the pathological sample points i in the set contain the same number of detection indexes. The clustering result divides the pathological data sample set into a plurality of clusters, each containing one or more pathological sample points i. A cluster here can be understood as a group or subset. Usually, the samples in the same cluster correspond to the same disease type.
Since the value of each pathological sample point i is known, it can be expressed as coordinates, such as $(x_i, y_i)$, and the cluster center point c can therefore be solved. The coordinate values of the center point c of a cluster are equal to the averages of the coordinate values of all sample points of the cluster. For example, if cluster N is denoted $\{i_1, i_2, \ldots, i_n\}$ and each sample is represented as $(x_i, y_i)$, the coordinates of the center point c of the cluster may then be:
$$c = \left( \frac{1}{n} \sum_{j=1}^{n} x_{i_j},\ \frac{1}{n} \sum_{j=1}^{n} y_{i_j} \right)$$
After the cluster center points are solved, the distance between each pathological sample point i and each cluster center point c can be calculated. If the number of clusters is k, k distances can be calculated for each pathological sample point i: one intra-cluster distance (the distance between pathological sample point i and the center point of its own cluster) and k-1 extra-cluster distances (the distances between pathological sample point i and the center points of the other clusters).
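The center point and distance calculations of steps S20 and S30 can be sketched as follows. This is an illustration only: the function names `cluster_centers` and `center_distances`, the use of Euclidean distance, and the representation of samples as tuples of detection index values are assumptions, since the text does not fix a distance metric or a data layout.

```python
import math

def cluster_centers(points, labels):
    """Return {cluster label: center point}, the per-dimension mean of its members."""
    sums, counts = {}, {}
    for p, c in zip(points, labels):
        acc = sums.setdefault(c, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: tuple(v / counts[c] for v in acc) for c, acc in sums.items()}

def center_distances(point, centers):
    """Return the k distances from one sample point to each cluster center."""
    # Euclidean distance is an assumption; the source does not name a metric.
    return {c: math.dist(point, ctr) for c, ctr in centers.items()}
```

With k clusters, `center_distances` yields exactly k values per sample: one for the sample's own cluster and k-1 for the others.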
Then, the adjustment contour coefficient of the sample point can be calculated according to the distance between the sample point and the center point of the cluster. The adjustment contour coefficient of the pathological sample point i is calculated by the following formula:
Figure BDA0002354998660000082
In the above formula, $s_c(i)$ denotes the adjusted contour coefficient of pathological sample point i; $a_c(i)$ denotes the distance between pathological sample point i and the center point of the cluster in which it is located; $b_c(i)$ denotes the distance between pathological sample point i and the nearest center point among the other clusters.
In the solving process, $b_c(i)$ is the minimum of the k-1 extra-cluster distances. The adjusted contour coefficient of each sample point can thus be solved. The calculated adjusted contour coefficient of pathological sample point i is a numerical value in the range [-1, 1].
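The formula image itself is not reproduced in this text. Judging from the definitions of the intra-cluster and nearest extra-cluster center distances, the stated value range [-1, 1], and the form of the standard silhouette coefficient, it presumably has the following shape. This is a reconstruction under those assumptions, not a transcription of the original figure:

```latex
s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\, b_c(i)\}}
```

Since both distances are non-negative, the numerator can never exceed the denominator in absolute value, which is consistent with the range [-1, 1] given in the text.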
The adjusted contour coefficients of all the sample points can be calculated according to the formula in the previous step, and then the average of the adjusted contour coefficients of all the sample points is calculated, so that the adjusted contour coefficients of the clustering result can be obtained. Similarly, the adjustment contour coefficient of the clustering result is a numerical value with a value range of [ -1,1 ].
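The averaging step can be sketched as below. The sketch assumes the per-point coefficient takes the simplified-silhouette form (b - a) / max(a, b), which is an assumption consistent with the definitions in the text rather than a confirmed transcription of the formula image. The function names and the Euclidean metric are likewise illustrative.

```python
import math

def cluster_centers(points, labels):
    """Return {cluster label: center point}, the per-dimension mean of its members."""
    sums, counts = {}, {}
    for p, c in zip(points, labels):
        acc = sums.setdefault(c, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: tuple(v / counts[c] for v in acc) for c, acc in sums.items()}

def adjusted_contour_coefficient(points, labels):
    """Mean of per-point coefficients computed from center distances only."""
    centers = cluster_centers(points, labels)
    if len(centers) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for p, own in zip(points, labels):
        a = math.dist(p, centers[own])  # distance to own cluster center
        # Minimum of the k-1 distances to the other cluster centers.
        b = min(math.dist(p, ctr) for c, ctr in centers.items() if c != own)
        denom = max(a, b)
        # Assumed form (b - a) / max(a, b); 0 when both distances are 0.
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Each sample needs only k distance evaluations, so the whole computation is one pass to build the centers and one pass to score the points.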
After the adjusted contour coefficient is calculated, the quality of the clustering result can be determined from it: the larger the value, the better the clustering effect. Clustering results can be graded according to specified numerical ranges, for example (0.5, 1) is good, (0, 0.5) is average, and [-1, 0] is poor.
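The grading rule can be expressed as a small lookup. The function name `grade_clustering` is hypothetical, and the handling of the boundary values is an assumption: the ranges quoted in the text leave 0.5 and 1 unassigned, so this sketch places 0.5 in the middle grade and 1 in the top grade.

```python
def grade_clustering(score):
    """Map an adjusted contour coefficient to a quality grade."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("coefficient must lie in [-1, 1]")
    if score > 0.5:
        return "good"
    if score > 0.0:
        # Assumption: the boundary value 0.5 is treated as "average".
        return "average"
    return "poor"
```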
Compared with the original contour coefficient, the time complexity of the adjusted contour coefficient is reduced from $O(n^2)$ to $O(n)$ for a fixed number of clusters, which greatly reduces the amount of computation required to evaluate a clustering result. When processing a large-scale data set, several clustering results can be evaluated rapidly to determine their quality.
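A rough count of distance evaluations illustrates the difference. The helper `distance_counts` is hypothetical, and it assumes that the original method computes each pairwise distance once (using symmetry) while the adjusted method computes one distance per sample per cluster center.

```python
def distance_counts(n, k):
    """Return (pairwise distances, sample-to-center distances) for n samples, k clusters."""
    pairwise = n * (n - 1) // 2  # original contour coefficient
    to_centers = n * k           # adjusted contour coefficient
    return pairwise, to_centers
```

For example, with 100,000 samples and 10 clusters, the pairwise count is 4,999,950,000 while the center-based count is 1,000,000.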
By using the pathological data analysis method provided by the embodiment, the judgment on the pathological data clustering result can be accelerated, so that the optimal pathological data clustering result can be quickly determined.
After the optimal clustering result of the pathological data is determined, the pathological data samples to be processed can be obtained, then the pathological data samples to be processed are classified according to the clustering result, and the pathological analysis data corresponding to the pathological data samples to be processed are generated. In some cases, the pathology analysis data may be a pathology risk cue report for the patient.
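The text does not spell out how a new sample is classified against the clustering result. One plausible reading, shown here purely as an assumption, is to assign the sample to the cluster whose center point is nearest. The function name `classify_sample`, the cluster labels, and the shape of the returned record are all hypothetical.

```python
import math

def classify_sample(sample, centers):
    """Assign a sample to the nearest cluster center (assumed classification rule)."""
    distances = {c: math.dist(sample, ctr) for c, ctr in centers.items()}
    cluster = min(distances, key=distances.get)
    # Minimal stand-in for the "pathological analysis data" record.
    return {"cluster": cluster, "distance_to_center": distances[cluster]}
```

A usage example with two made-up cluster centers: `classify_sample((1.0, 1.0), {"A": (0.0, 1.0), "B": (10.0, 1.0)})` assigns the sample to cluster `"A"`.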
To compare the original contour coefficient with the adjusted contour coefficient of this embodiment, a schematic diagram of the calculation paths is provided in fig. 3. FIG. 3-a shows the paths used to calculate the cohesion (the distances from pathological sample point i to the sample points within its cluster) before the improvement; FIG. 3-b shows the paths used to calculate the separation (the distances from pathological sample point i to the sample points outside its cluster) before the improvement; FIG. 3-c shows the path used to calculate the cohesion (the distance from pathological sample point i to the center point of its own cluster) after the improvement; FIG. 3-d shows the paths used to calculate the separation (the distances from pathological sample point i to the center points of the other clusters) after the improvement.
In an application example, the original contour coefficient calculation method and the contour coefficient adjustment method are respectively used for evaluating the clustering result of the same pathological data sample set, and the results are shown in table 1.
TABLE 1 calculation time consumption of different evaluation methods for processing clustering results of the same pathology data sample set
Figure BDA0002354998660000091
Figure BDA0002354998660000101
The configuration of the server for calculating the test results of table 1 is: 20-core CPU, maximum speed 2.39 GHz; 256G memory, speed: 2400 MHz.
From the precision analysis, compared with the original contour coefficient, the adjusted contour coefficient measures intra-cluster compactness by the average distance from the pathological sample points i of each cluster to the center point of that cluster, rather than by the average distance between every pair of samples in the cluster. This greatly reduces the time and space cost of computing the sample distance matrix, saves a large amount of computing resources, and improves running speed. Taking sample set P as an example, the calculation time is reduced from the original 22871.75766 seconds to 2.728480302 seconds after the improvement, an 8382.6-fold improvement in computational efficiency. But also in the accuracy of the distance calculation.
The method provided by the embodiment is also suitable for other sample sets with large data processing amount and high dimensionality, such as the financial data processing field, the medicine data analysis field, the image data identification field and the like.
In steps S10-S80, a clustering result of a pathological data sample set is acquired, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of pathological sample points i in the set is greater than a preset number threshold; this provides the result of the cluster analysis. The center point of each cluster is calculated from the clustering result to determine the position of each cluster's center. The distance between a pathological sample point i and the center point of each cluster is then calculated; because only the distances to the cluster center points are calculated, rather than the distances to all other pathological sample points, the amount of computation is greatly reduced. The adjusted contour coefficient of each individual pathological sample point i is calculated from these distances, with less computation than the method before improvement. The average of the adjusted contour coefficients of all pathological sample points i gives the adjusted contour coefficient of the clustering result; being a simple mean, it is fast to compute. The quality of the clustering result is determined from its adjusted contour coefficient. Because this coefficient can be calculated quickly, the quality can be judged quickly; the higher the adjusted contour coefficient, the more accurate the clustering result.
When the clustering result is excellent, the pathological data samples to be processed are acquired so that they can be classified using the clustering result. The samples are classified according to the clustering result and the corresponding pathological analysis data is generated, yielding valuable data that indicates the patient's pathological risk.
Optionally, as shown in fig. 4, after step S50, the method further includes:
S51, calculating the adjusted contour coefficients of a plurality of clustering results;
S52, determining the clustering result with the highest adjusted contour coefficient as the optimal clustering result of the pathological data sample set.
In this embodiment, since the amount of calculation for the adjusted contour coefficient is greatly reduced, the computer can calculate the adjusted contour coefficients of a plurality of clustering results in a short time. The optimal clustering result is then determined according to the magnitude of the adjusted contour coefficient. The larger the value of the adjusted contour coefficient, the better the clustering effect of the clustering result, so the clustering result with the highest adjusted contour coefficient can be determined as the optimal clustering result of the pathological data sample set.
In steps S51-S52, the adjusted contour coefficients of the plurality of clustering results are calculated, which can be done quickly. The clustering result with the highest adjusted contour coefficient is determined as the optimal clustering result of the pathological data sample set; because the adjusted contour coefficient of each clustering result is fast to calculate, the optimal clustering result can be determined quickly.
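Steps S51-S52 may be illustrated by the following minimal Python sketch, which scores two candidate clustering results of the same sample set and keeps the one with the highest coefficient. As elsewhere, the per-point form (b - a) / max(a, b) is an assumption, since the formula appears only as an image; the function name `coefficient`, the candidate names and the data are hypothetical, and every candidate is assumed to contain at least two clusters.

```python
import math


def coefficient(points, labels):
    """Adjusted contour coefficient of one clustering result (assumed form)."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    centers = {
        label: [sum(axis) / len(members) for axis in zip(*members)]
        for label, members in groups.items()
    }
    total = 0.0
    for point, label in zip(points, labels):
        a = math.dist(point, centers[label])
        b = min(math.dist(point, c) for key, c in centers.items() if key != label)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
# Two hypothetical clustering results for the same sample set.
candidates = {
    "split_by_x": [0, 0, 1, 1],
    "split_by_y": [0, 1, 0, 1],
}
scores = {name: coefficient(points, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)  # S52: highest coefficient wins
```

In this example the split along the x axis separates the two well-spaced groups and therefore receives the higher score.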
Optionally, as shown in FIG. 5, after step S50, the method further includes:
S53, judging whether the adjusted contour coefficient of the clustering result is greater than a preset coefficient threshold;
S54, if the adjusted contour coefficient of the clustering result is greater than the preset coefficient threshold, determining the clustering result as the optimal clustering result of the pathological data sample set.
In some cases, an expected value, i.e. a preset coefficient threshold, may be set, and when the adjusted contour coefficient is greater than the preset coefficient threshold, the clustering result may be determined as a preferred clustering result of the pathological data sample set. For example, the preset coefficient threshold may be set to 0.5.
In steps S53-S54, the calculated adjusted contour coefficient of the clustering result is compared with the preset coefficient threshold. If the adjusted contour coefficient of the clustering result is greater than the preset coefficient threshold, the clustering result is determined as the optimal clustering result of the pathological data sample set; that is, a clustering result whose adjusted contour coefficient exceeds the preset coefficient threshold is selected.
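The comparison in steps S53-S54 is a strict "greater than" test, which the following short Python sketch makes explicit. The function name `is_preferred` is hypothetical, and the default of 0.5 simply mirrors the example threshold given above.

```python
def is_preferred(coefficient, threshold=0.5):
    """S53/S54: accept a clustering result only if its coefficient exceeds the threshold."""
    return coefficient > threshold


# A coefficient equal to the threshold is not accepted under a strict comparison.
checks = [is_preferred(0.62), is_preferred(0.5), is_preferred(0.31)]
```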
Optionally, as shown in FIG. 6, before step S10, the method further includes:
S11, acquiring the pathological data sample set;
S12, calculating the clustering result of the pathological data sample set based on a K-Means clustering algorithm.
In this embodiment, K-Means is an iteratively solved cluster analysis algorithm. The calculation process is as follows. First, the number of clusters is determined and the center point of each cluster is randomly initialized. To choose the number of clusters, it helps to look over the data first and try to identify any distinct groupings. Each center point is a vector of the same length as each data point vector. Next, each data point is classified by calculating the distance between that point and the center of each group and assigning the point to the group whose center is closest. Then, the average of all points in each group is calculated and taken as the new cluster center. These steps are repeated for a set number of iterations, or until the group centers change very little between iterations (by less than a set threshold). Alternatively, the group centers may be randomly initialized several times, and the initialization that gives the best result is then selected.
The advantage of K-Means is that it is very fast: only the distances between the points and the group centers need to be calculated, so the amount of calculation is small and the time complexity is O(n).
In steps S11-S12, the pathological data sample set is acquired as the sample set to be processed. The clustering result of the pathological data sample set is then calculated based on a K-Means clustering algorithm, giving the clustering result to be evaluated.
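The K-Means procedure described above (random initialization, assignment to the nearest center, recomputation of centers, and stopping on a small center shift) may be sketched in plain Python as follows. This is a minimal illustration rather than the implementation of the embodiment; the function name `k_means`, its parameters `max_iter`, `tol` and `seed`, the choice of initial centers from among the data points, and the handling of empty clusters are all assumptions.

```python
import math
import random


def k_means(points, k, max_iter=100, tol=1e-6, seed=0):
    """Return (labels, centers) for a basic K-Means run with Euclidean distance."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the group with the nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append([sum(axis) / len(members) for axis in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty group
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers barely moved between iterations
            break
    return labels, centers


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = k_means(points, k=2)
```

Each iteration computes n times k point-to-center distances, which is consistent with the linear cost in n noted above when k is fixed.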
Optionally, as shown in FIG. 7, before step S10, the method further includes:
S11, acquiring the pathological data sample set;
S13, calculating the clustering result of the pathological data sample set based on an agglomerative hierarchical clustering algorithm.
The agglomerative hierarchical clustering algorithm calculates the similarity between every two data points, merges the two most similar data points, and repeats this process until the set number of clusters is reached. The smaller the distance, the higher the similarity. The distance may be a Euclidean distance or the like.
The specific steps of agglomerative clustering include: first, each sample is taken as its own category, and the distance between every two categories is calculated; the two categories with the minimum distance (the most similar) are merged into a new category; the distances between the categories are recalculated; and the two preceding steps are iterated until a single cluster is formed. Agglomerative hierarchical clustering thus builds a tree; a threshold, namely the number of clusters to be formed, can be set as required, and the iteration can be terminated when the number of categories equals the threshold.
In steps S11 and S13, the pathological data sample set is acquired as the sample set to be processed. The clustering result of the pathological data sample set is then calculated based on an agglomerative hierarchical clustering algorithm, giving the clustering result to be evaluated.
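The merge loop described above may be sketched in plain Python as follows. The text does not specify how the distance between two categories containing several samples is measured; this sketch assumes the Euclidean distance between category centroids, which is only one possible choice. The function name `agglomerative` and the sample data are hypothetical.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the two closest categories until n_clusters remain; return labels.

    Assumption: the distance between two categories is the Euclidean distance
    between their centroids (the source does not name a linkage rule).
    """
    clusters = [[i] for i in range(len(points))]  # each sample starts alone

    def center(cluster):
        members = [points[i] for i in cluster]
        return [sum(axis) / len(members) for axis in zip(*members)]

    while len(clusters) > n_clusters:
        best = None
        # Recompute distances between every pair of current categories.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(center(clusters[a]), center(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]

    labels = [0] * len(points)
    for label, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = label
    return labels


# Hypothetical data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (8.0, 8.0), (8.0, 9.0), (20.0, 0.0)]
labels = agglomerative(points, n_clusters=3)
```

Stopping when the number of categories equals `n_clusters` corresponds to the threshold on the number of clusters mentioned above.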
In this embodiment, the center point of each cluster is acquired after clustering, and the distance between a sample point and the center point of each cluster is calculated; the adjusted contour coefficient of the sample point is calculated according to the distance between the sample point and the cluster center points; the average of the adjusted contour coefficients of all the pathological sample points i is calculated to obtain the adjusted contour coefficient of the clustering result; and the quality of the clustering result is determined according to that coefficient. The embodiment addresses the excessive time complexity of evaluating clustering results: it greatly reduces the amount of data calculation in the evaluation, greatly improves the efficiency of clustering result evaluation, and speeds up the judgment of pathological data clustering results so that the optimal pathological data clustering result can be determined quickly.
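The reduction in calculation can be made concrete by counting distance evaluations. The following Python sketch compares the number of distinct point-to-point distances needed when every sample point is compared with every other sample point against the number of point-to-center distances needed here. The function names and the values of n and k are hypothetical and chosen only for illustration.

```python
def pairwise_distance_count(n):
    """Distinct point-to-point distances among n sample points: n(n-1)/2."""
    return n * (n - 1) // 2


def center_distance_count(n, k):
    """Point-to-center distances for n sample points and k cluster centers."""
    return n * k


n, k = 100_000, 5  # hypothetical sample count and cluster count
pairwise = pairwise_distance_count(n)    # 4,999,950,000
to_centers = center_distance_count(n, k)  # 500,000
```

For these values the center-based evaluation needs roughly ten thousand times fewer distance calculations, and the gap widens as n grows because one count is quadratic in n and the other is linear.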
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, a pathological data analysis device is provided, which corresponds to the pathological data analysis method in the above embodiments one to one. As shown in fig. 8, the pathological data analysis apparatus includes an acquisition result module 10, a central point calculation module 20, a distance calculation module 30, a sample point coefficient calculation module 40, a result coefficient calculation module 50, a result evaluation module 60, an acquisition sample module 70, and a sample analysis module 80. The functional modules are explained in detail as follows:
an acquisition result module 10, configured to obtain a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into a plurality of clusters, each cluster is composed of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
a central point calculating module 20, configured to calculate a central point of each cluster according to the clustering result;
a distance calculating module 30, configured to calculate a distance between a pathological sample point i and a center point of each cluster;
a sample point coefficient calculating module 40, configured to calculate an adjustment contour coefficient of the pathological sample point i according to a distance between the pathological sample point i and a center point of each cluster, where the calculation formula is as follows:
Figure BDA0002354998660000151
in the above formula, s_c(i) represents the adjusted contour coefficient of the pathological sample point i; a_c(i) represents the distance between the pathological sample point i and the center point of the cluster in which the pathological sample point i is located; b_c(i) represents the distance between the pathological sample point i and the center point of the cluster closest to the pathological sample point i;
a result coefficient calculating module 50, configured to calculate an average of the adjusted contour coefficients of all the pathological sample points i, so as to obtain the adjusted contour coefficient of the clustering result;
a result evaluation module 60, configured to determine the quality of the clustering result according to the adjusted contour coefficient of the clustering result;
an acquisition sample module 70, configured to obtain a pathological data sample to be processed when the clustering result is excellent;
and the sample analysis module 80 is configured to classify the pathological data samples to be processed according to the clustering result, and generate pathological analysis data corresponding to the pathological data samples to be processed.
Optionally, the pathological data analysis device further includes:
the multi-result calculating module is used for calculating the adjustment contour coefficients of the clustering results;
and the optimal result determining module is used for determining the clustering result with the highest adjustment contour coefficient as the optimal clustering result of the pathological data sample set.
Optionally, the pathological data analysis device further includes:
the coefficient judgment module is used for judging whether the adjustment contour coefficient of the clustering result is greater than a preset coefficient threshold value or not;
and the optimal result determining module is used for determining the clustering result as the optimal clustering result of the pathological data sample set if the adjustment contour coefficient of the clustering result is greater than a preset coefficient threshold value.
Optionally, the pathological data analysis device further includes:
a sample set acquisition module for acquiring the pathological data sample set;
the first clustering calculation module is used for calculating the clustering result of the pathology data sample set based on a K-Means clustering algorithm.
Optionally, the pathological data analysis device further includes:
a sample set acquisition module for acquiring the pathological data sample set;
and the second clustering calculation module is used for calculating the clustering result of the pathological data sample set based on an agglomerative hierarchical clustering algorithm.
For specific limitations of the pathological data analysis device, reference may be made to the above limitations of the pathological data analysis method, which are not described herein again. The modules in the pathological data analysis device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data related to pathological data clustering result evaluation. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a pathology data analysis method.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
calculating the central point of each cluster according to the clustering result;
calculating the distance between a pathological sample point i and the central point of each cluster;
calculating an adjustment contour coefficient of the pathological sample point i according to the distance between the pathological sample point i and the central point of each cluster, wherein the calculation formula is as follows:
Figure BDA0002354998660000171
in the above formula, s_c(i) represents the adjusted contour coefficient of the pathological sample point i; a_c(i) represents the distance between the pathological sample point i and the center point of the cluster in which the pathological sample point i is located; b_c(i) represents the distance between the pathological sample point i and the center point of the cluster closest to the pathological sample point i;
calculating the average number of the adjustment contour coefficients of all the pathological sample points i to obtain the adjustment contour coefficients of the clustering result;
determining the quality of the clustering result according to the adjustment contour coefficient of the clustering result;
when the clustering result is excellent, acquiring a pathological data sample to be processed;
classifying the pathological data samples to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data samples to be processed.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
calculating the central point of each cluster according to the clustering result;
calculating the distance between a pathological sample point i and the central point of each cluster;
calculating an adjustment contour coefficient of the pathological sample point i according to the distance between the pathological sample point i and the central point of each cluster, wherein the calculation formula is as follows:
Figure BDA0002354998660000181
in the above formula, s_c(i) represents the adjusted contour coefficient of the pathological sample point i; a_c(i) represents the distance between the pathological sample point i and the center point of the cluster in which the pathological sample point i is located; b_c(i) represents the distance between the pathological sample point i and the center point of the cluster closest to the pathological sample point i;
calculating the average number of the adjustment contour coefficients of all the pathological sample points i to obtain the adjustment contour coefficients of the clustering result;
determining the quality of the clustering result according to the adjustment contour coefficient of the clustering result;
when the clustering result is excellent, acquiring a pathological data sample to be processed;
classifying the pathological data samples to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data samples to be processed.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of pathological data analysis, comprising:
acquiring a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
calculating the central point of each cluster according to the clustering result;
calculating the distance between a pathological sample point i and the central point of each cluster;
calculating an adjustment contour coefficient of the pathological sample point i according to the distance between the pathological sample point i and the central point of each cluster, wherein the calculation formula is as follows:
Figure FDA0002354998650000011
in the above formula, s_c(i) represents the adjusted contour coefficient of the pathological sample point i; a_c(i) represents the distance between the pathological sample point i and the center point of the cluster in which the pathological sample point i is located; b_c(i) represents the distance between the pathological sample point i and the center point of the cluster closest to the pathological sample point i;
calculating the average number of the adjustment contour coefficients of all the pathological sample points i to obtain the adjustment contour coefficients of the clustering result;
determining the quality of the clustering result according to the adjustment contour coefficient of the clustering result;
when the clustering result is excellent, acquiring a pathological data sample to be processed;
classifying the pathological data samples to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data samples to be processed.
2. The pathological data analysis method according to claim 1, wherein after the calculating of an average of the adjusted contour coefficients of all the pathological sample points i to obtain the adjusted contour coefficient of the clustering result, the method further comprises:
calculating the adjustment contour coefficients of a plurality of clustering results;
and determining the clustering result with the highest adjustment contour coefficient as the optimal clustering result of the pathological data sample set.
3. The pathological data analysis method according to claim 1, wherein after the calculating of an average of the adjusted contour coefficients of all the pathological sample points i to obtain the adjusted contour coefficient of the clustering result, the method further comprises:
judging whether the adjustment contour coefficient of the clustering result is larger than a preset coefficient threshold value or not;
and if the adjustment contour coefficient of the clustering result is greater than a preset coefficient threshold value, determining the clustering result as the optimal clustering result of the pathological data sample set.
4. The pathological data analysis method of claim 1, wherein before obtaining the clustering result that divides the pathological data sample set into a number of clusters, the method comprises:
acquiring the pathological data sample set;
calculating the clustering result of the pathology data sample set based on a K-Means clustering algorithm.
5. The pathological data analysis method of claim 1, wherein before obtaining the clustering result that divides the pathological data sample set into a number of clusters, the method comprises:
acquiring the pathological data sample set;
calculating the clustering result of the pathological data sample set based on an agglomerative hierarchical clustering algorithm.
6. A pathological data analysis device, comprising:
an acquisition result module, configured to acquire a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into a plurality of clusters, each cluster consists of a plurality of pathological sample points i, and the number of the pathological sample points i in the pathological data sample set is greater than a preset number threshold;
the central point calculation module is used for calculating the central point of each cluster according to the clustering result;
the distance calculation module is used for calculating the distance between a pathological sample point i and the center point of each cluster;
a sample point coefficient calculating module, configured to calculate an adjustment contour coefficient of the pathological sample point i according to a distance between the pathological sample point i and a center point of each cluster, where the calculation formula is as follows:
Figure FDA0002354998650000031
in the above formula, s_c(i) represents the adjusted contour coefficient of the pathological sample point i; a_c(i) represents the distance between the pathological sample point i and the center point of the cluster in which the pathological sample point i is located; b_c(i) represents the distance between the pathological sample point i and the center point of the cluster closest to the pathological sample point i;
a result coefficient calculating module, configured to calculate an average of the adjusted contour coefficients of all the pathological sample points i, and obtain the adjusted contour coefficient of the clustering result;
the result evaluation module is used for determining the quality of the clustering result according to the adjusted contour coefficient of the clustering result;
the sample obtaining module is used for obtaining a pathological data sample to be processed when the clustering result is excellent;
and the sample analysis module is used for classifying the pathological data samples to be processed according to the clustering result and generating pathological analysis data corresponding to the pathological data samples to be processed.
7. The pathological data analysis device of claim 6, further comprising:
the multi-result calculating module is used for calculating the adjustment contour coefficients of the clustering results;
and the optimal result determining module is used for determining the clustering result with the highest adjustment contour coefficient as the optimal clustering result of the pathological data sample set.
8. The pathological data analysis device of claim 6, further comprising:
the coefficient judgment module is used for judging whether the adjustment contour coefficient of the clustering result is greater than a preset coefficient threshold value or not;
and the optimal result determining module is used for determining the clustering result as the optimal clustering result of the pathological data sample set if the adjustment contour coefficient of the clustering result is greater than a preset coefficient threshold value.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the pathology data analysis method according to any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the pathology data analysis method according to any one of claims 1 to 5.
CN202010005182.7A 2020-01-03 2020-01-03 Pathological data analysis method, device, equipment and storage medium Pending CN111223570A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010005182.7A CN111223570A (en) 2020-01-03 2020-01-03 Pathological data analysis method, device, equipment and storage medium
PCT/CN2020/093328 WO2021135063A1 (en) 2020-01-03 2020-05-29 Pathological data analysis method and apparatus, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010005182.7A CN111223570A (en) 2020-01-03 2020-01-03 Pathological data analysis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111223570A true CN111223570A (en) 2020-06-02

Family

ID=70830971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010005182.7A Pending CN111223570A (en) 2020-01-03 2020-01-03 Pathological data analysis method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111223570A (en)
WO (1) WO2021135063A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738319A (en) * 2020-06-11 2020-10-02 佳都新太科技股份有限公司 Clustering result evaluation method and device based on large-scale samples

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373564B (en) * 2023-12-08 2024-03-01 北京百奥纳芯生物科技有限公司 Method and device for generating binding ligand of protein target and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136836A (en) * 2019-03-27 2019-08-16 周凡 A kind of disease forecasting method based on physical examination report clustering

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016022438A1 (en) * 2014-08-08 2016-02-11 Icahn School Of Medicine At Mount Sinai Automatic disease diagnoses using longitudinal medical record data
CN107609588B (en) * 2017-09-12 2020-08-18 大连大学 Parkinson patient UPDRS score prediction method based on voice signals

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG FEI 等: "An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity", MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION, vol. 10358, 2 July 2017 (2017-07-02), pages 291 - 305, XP047419482, DOI: 10.1007/978-3-319-62416-7_21 *

Also Published As

Publication number Publication date
WO2021135063A1 (en) 2021-07-08

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40020264; Country of ref document: HK)
SE01 Entry into force of request for substantive examination