WO2021135063A1 - Pathological data analysis method and apparatus, and device and storage medium - Google Patents

Pathological data analysis method and apparatus, and device and storage medium

Info

Publication number
WO2021135063A1
Authority
WO
WIPO (PCT)
Prior art keywords
pathological
clustering result
sample
clustering
pathological data
Prior art date
Application number
PCT/CN2020/093328
Other languages
French (fr)
Chinese (zh)
Inventor
蔡金成
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021135063A1 publication Critical patent/WO2021135063A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Definitions

  • This application relates to the field of machine learning in the field of artificial intelligence, and in particular to a pathological data analysis method, device, equipment, and storage medium.
  • the hospital's management system collects a large number of patients' pathological data. These pathological data can be combined with a clustering algorithm to divide the pathological data into multiple sets, and each set corresponds to a condition. This can help doctors realize the diagnosis of patients with intractable diseases.
  • the clustering algorithm is an algorithm that involves unsupervised grouping of data.
  • A clustering algorithm, also known as cluster analysis, is a statistical analysis method for studying data classification problems, and it is also an important means of data mining.
  • Silhouette Coefficient is a clustering result evaluation method, used to evaluate the effect of unsupervised clustering algorithm, so as to determine the number of clusters (ie, grouping) in the clustering process.
  • the silhouette coefficient, also called the contour coefficient, combines the cohesion and separation of the clusters to evaluate the clustering effect.
  • the value range of the contour coefficient is [-1, 1]. The larger the value, the better the clustering effect.
  • the time complexity of the contour coefficient is very high: it grows with the square of the number of samples, that is, $O(n^2)$, where n is the number of samples.
  • the calculation amount of the contour coefficients of the clustering results is very large, and it is difficult to calculate the results in a short time.
  • when the contour coefficient is used to determine the number of clusters, the contour coefficients of multiple clustering results need to be calculated, so the whole process takes even longer.
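To make the quadratic cost concrete, here is a minimal sketch of the standard (unadjusted) silhouette coefficient in plain Python. The function name `silhouette_score`, the tuple-based point representation, and the choice to score a point that is alone in its cluster as 0 are assumptions made for this illustration; they are not taken from the application.

```python
import math


def silhouette_score(points, labels):
    """Mean standard silhouette coefficient; needs at least two clusters.

    Every point is compared with every other point, so the cost is O(n^2).
    """
    n = len(points)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        # Accumulate distances from point i to the members of each cluster.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j in range(n):
            if j != i:
                sums[labels[j]] += math.dist(points[i], points[j])
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sums[own] / counts[own]  # cohesion: mean distance inside the cluster
        b = min(sums[c] / counts[c] for c in clusters if c != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n
```

The nested loop over `i` and `j` is the part that becomes impractical once the sample set reaches tens of thousands of points.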
  • a pathological data analysis method including:
  • the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than the preset number threshold;
  • the adjusted contour coefficient of the pathological sample point i is calculated according to the distance between the pathological sample point i and the center point of each cluster, and the calculation formula is as follows:
  • $$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\, b_c(i)\}}$$
  • $s_c(i)$ represents the adjusted contour coefficient of pathological sample point i
  • $a_c(i)$ represents the distance between pathological sample point i and the center point of its own cluster
  • $b_c(i)$ represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • a pathological data analysis device including:
  • the obtaining result module is used to obtain the clustering result of the pathological data sample set.
  • the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than the preset number threshold;
  • a central point calculation module configured to calculate the central point of each of the clusters according to the clustering result
  • the distance calculation module is used to calculate the distance between the pathological sample point i and the center point of each cluster;
  • the sample point coefficient calculation module is used to calculate the adjusted contour coefficient of the pathological sample point i according to the distance between the pathological sample point i and the center point of each cluster, and the calculation formula is as follows:
  • $$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\, b_c(i)\}}$$
  • $s_c(i)$ represents the adjusted contour coefficient of pathological sample point i
  • $a_c(i)$ represents the distance between pathological sample point i and the center point of its own cluster
  • $b_c(i)$ represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • the result coefficient calculation module is used to calculate the average of the adjusted contour coefficients of all the pathological sample points i, and obtain the adjusted contour coefficient of the clustering result;
  • the result evaluation module is used to determine the pros and cons of the clustering result according to the adjusted contour coefficient of the clustering result
  • the sample obtaining module is used to obtain a sample of pathological data to be processed when the clustering result is excellent;
  • the sample analysis module is configured to classify the pathological data sample to be processed according to the clustering result, and generate pathological analysis data corresponding to the pathological data sample to be processed.
  • a computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, and the processor implements the following steps when the processor executes the computer-readable instructions:
  • the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than the preset number threshold;
  • the adjusted contour coefficient of the pathological sample point i is calculated according to the distance between the pathological sample point i and the center point of each cluster, and the calculation formula is as follows:
  • $$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\, b_c(i)\}}$$
  • $s_c(i)$ represents the adjusted contour coefficient of pathological sample point i
  • $a_c(i)$ represents the distance between pathological sample point i and the center point of its own cluster
  • $b_c(i)$ represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • One or more readable storage media storing computer readable instructions, when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than the preset number threshold;
  • the adjusted contour coefficient of the pathological sample point i is calculated according to the distance between the pathological sample point i and the center point of each cluster, and the calculation formula is as follows:
  • $$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\, b_c(i)\}}$$
  • $s_c(i)$ represents the adjusted contour coefficient of pathological sample point i
  • $a_c(i)$ represents the distance between pathological sample point i and the center point of its own cluster
  • $b_c(i)$ represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • the invention solves the problem of high time complexity in the evaluation and calculation process of clustering results, greatly reduces the amount of data calculation in the evaluation and calculation process, greatly improves the efficiency of the evaluation of the clustering results, and can accelerate the judgment of the pathological data clustering results, so as to quickly determine the best pathological data clustering result.
  • FIG. 1 is a schematic diagram of an application environment of a pathological data analysis method in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a pathological data analysis method in an embodiment of the present application
  • FIG. 3 is a schematic diagram of the calculation paths used to compare the method before and after the improvement
  • FIG. 4 is a schematic flowchart of a pathological data analysis method in an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a pathological data analysis method in an embodiment of the present application.
  • Fig. 6 is a schematic flowchart of a pathological data analysis method in an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a pathological data analysis method in an embodiment of the present application.
  • Fig. 8 is a schematic structural diagram of a pathological data analysis device in an embodiment of the present application.
  • Fig. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • the pathological data analysis method provided in this embodiment can be applied in an application environment as shown in FIG. 1, where the client communicates with the server through the network.
  • the client includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented with an independent server or a server cluster composed of multiple servers.
  • a pathological data analysis method is provided.
  • the method is applied to the server in FIG. 1 as an example for description, including the following steps:
  • $s_c(i)$ represents the adjusted contour coefficient of pathological sample point i
  • $a_c(i)$ represents the distance between pathological sample point i and the center point of its own cluster
  • $b_c(i)$ represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • S60 Determine the quality of the clustering result according to the adjusted contour coefficient of the clustering result
  • the clustering result may be the result obtained after performing the clustering task on the pathological data sample set.
  • the clustering results of the pathological data sample set can be obtained through methods based on partition and agglomerative hierarchical clustering, such as K-means, Agglomerative, etc.
  • the preset number threshold can be set according to actual needs, for example, it can be set to 50,000, 10 or other values.
  • each pathological sample i in the pathological data sample set includes multiple detection indexes, such as a first detection index, a second detection index,...
  • the pathological sample i can be regarded as a point in a multi-dimensional space. In particular, before the pathological data sample set is clustered, the spatial dimension of each pathological sample point i is the same.
  • the pathological sample point i in the pathological data sample set contains the same number of detection indicators.
  • the clustering result divides the pathological data sample set into several clusters, and each cluster has one or more pathological sample points i.
  • cluster can mean grouping or subset. Normally, the disease types corresponding to the same cluster are the same.
  • the center point c of the cluster can be solved.
  • the coordinate value of the center point c of the cluster is equal to the average value of the coordinate values of all sample points of the cluster.
  • if the cluster N is represented as $\{i_1, i_2, \ldots, i_n\}$ and each sample is represented as $(x_i, y_i)$, the coordinates of the center point c of the cluster can be: $$c = \left(\frac{1}{n}\sum_{i \in N} x_i,\ \frac{1}{n}\sum_{i \in N} y_i\right)$$
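The averaging rule for the center point can be sketched in a few lines of Python. The function name `cluster_center` and the representation of samples as equal-length tuples are assumptions for this illustration; the sketch works for any number of detection indexes, not only two.

```python
def cluster_center(cluster_points):
    """Return the coordinate-wise mean of the sample points in one cluster."""
    n = len(cluster_points)
    dims = len(cluster_points[0])
    return tuple(sum(p[d] for p in cluster_points) / n for d in range(dims))


# Three hypothetical two-dimensional samples belonging to one cluster.
center = cluster_center([(0, 0), (2, 4), (4, 2)])
```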
  • the distance between each pathological sample point i and the center point c of each cluster can be calculated. If the number of clusters is k, then k distances can be calculated for each pathological sample point i: one intra-cluster distance (the distance between the pathological sample point i and the center point of its own cluster) and k-1 distances outside the cluster (the distances between the pathological sample point i and the center points of the other clusters).
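A small sketch of this step, assuming Euclidean distance and a dictionary that maps each cluster label to its center point. The function name `center_distances` and the data layout are hypothetical and chosen only for this example.

```python
import math


def center_distances(point, own_label, centers):
    """Split the k point-to-center distances into one inner and k-1 outer ones.

    `centers` maps a cluster label to the coordinates of that cluster's center.
    """
    inner = math.dist(point, centers[own_label])
    outer = [
        math.dist(point, center)
        for label, center in centers.items()
        if label != own_label
    ]
    return inner, outer


# Hypothetical centers of k = 3 clusters.
centers = {0: (0, 0), 1: (3, 4), 2: (6, 8)}
inner, outer = center_distances((0, 0), 0, centers)
```

For k clusters the list of outer distances always has k-1 entries, so the work per sample point does not depend on the number of samples.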
  • the adjusted contour coefficient of the sample point can be calculated according to the distance between the sample point and the center point of the cluster.
  • the adjusted contour coefficient of pathological sample point i is calculated by the following formula:
  • $$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\, b_c(i)\}}$$
  • $s_c(i)$ represents the adjusted contour coefficient of pathological sample point i
  • $a_c(i)$ represents the distance between pathological sample point i and the center point of its own cluster
  • $b_c(i)$ represents the distance between pathological sample point i and the center point of the nearest other cluster.
  • $b_c(i)$ is the smallest value among the k-1 distances outside the cluster.
  • the adjusted contour coefficient of the sample points can be solved.
  • the calculated adjusted contour coefficient of pathological sample point i is a value, and its value range is [-1,1].
  • the adjusted contour coefficient of all sample points can be calculated according to the formula in the previous step, and then the average of the adjusted contour coefficients of all sample points can be calculated to obtain the adjusted contour coefficient of the clustering result.
  • the adjusted contour coefficient of the clustering result is a value, and its value range is [-1,1].
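The per-point coefficient and its average over all sample points can be sketched as follows. This is an illustration only: it assumes Euclidean distance, the ratio $(b_c(i) - a_c(i)) / \max\{a_c(i), b_c(i)\}$ implied by the symbol definitions and the stated range [-1, 1], and at least two clusters. The function name `adjusted_silhouette` is hypothetical.

```python
import math


def adjusted_silhouette(points, labels):
    """Mean center-based silhouette coefficient of a clustering result."""
    # Group the points by label and compute each cluster's center point.
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    centers = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }
    scores = []
    for point, label in zip(points, labels):
        a = math.dist(point, centers[label])  # distance to own center
        b = min(  # smallest of the k-1 distances to the other centers
            math.dist(point, center)
            for other, center in centers.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
score = adjusted_silhouette(points, [0, 0, 1, 1])
```

Each point is compared only with the k cluster centers, so the loop performs n·k distance calculations instead of comparing every pair of points.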
  • the pros and cons of the clustering result can be determined according to the adjusted contour coefficient.
  • the time complexity of calculating the adjusted contour coefficient is reduced from $O(n^2)$ to $O(n)$ (for a fixed number of clusters), which greatly reduces the amount of calculation required to evaluate the clustering results.
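As a rough illustration of the gap, the distance calculations can be counted directly. The values n = 50,000 samples (the example threshold mentioned above) and k = 10 clusters are assumptions chosen only for this count.

```latex
\begin{aligned}
\text{pairwise distances (original coefficient):}\quad
  & \frac{n(n-1)}{2} = \frac{50000 \times 49999}{2} = 1{,}249{,}975{,}000 \\
\text{point-to-center distances (adjusted coefficient):}\quad
  & n \cdot k = 50000 \times 10 = 500{,}000
\end{aligned}
```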
  • multiple clustering results can be quickly evaluated to determine the pros and cons of the clustering results.
  • Using the pathological data analysis method provided in this embodiment can speed up the determination of the pathological data clustering result, so as to quickly determine the best pathological data clustering result.
  • the pathological data samples to be processed can be obtained, and then the pathological data samples to be processed are classified according to the above-mentioned clustering results, and pathological analysis data corresponding to the pathological data samples to be processed is generated.
  • the pathological analysis data may be the patient's pathological risk prompt report.
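The text does not spell out how a new sample is classified against the clustering result. One plausible reading, shown here purely as an assumption, is to assign the sample to the cluster whose center point is nearest. The function name `classify_sample` and the labels "A" and "B" are hypothetical.

```python
import math


def classify_sample(sample, centers):
    """Return the label of the cluster whose center is closest to `sample`.

    Nearest-center assignment is an assumed rule, not one stated in the text.
    """
    return min(centers, key=lambda label: math.dist(sample, centers[label]))


# Hypothetical cluster centers taken from an accepted clustering result.
centers = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
label = classify_sample((1.0, 2.0), centers)
```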
  • a schematic diagram of the calculation paths is provided in FIG. 3.
  • Figure 3-a shows the path used to calculate the degree of aggregation (the distance between pathological sample point i and the sample points in the cluster) before the improvement
  • Figure 3-b shows the path used to calculate the degree of separation (the distance between pathological sample point i and the sample points outside the cluster) before the improvement
  • Figure 3-c shows the path used to calculate the degree of cohesion (the distance between pathological sample point i and the sample point in the cluster) after the improvement
  • Figure 3-d shows the path used to calculate the degree of separation (the distance between the pathological sample point i and the sample point outside the cluster) after the improvement.
  • the original contour coefficient calculation method and the adjusted contour coefficient method were used to evaluate the clustering results of the same pathological data sample set. The results are shown in Table 1.
  • Table 1 Calculation time consumption of different evaluation methods for processing clustering results of the same pathological data sample set
  • the configuration of the server used to calculate the test results in Table 1 is: 20-core CPU, maximum speed 2.39 GHz; 256 GB memory, speed 2400 MHz.
  • the average distance from the pathological sample points i of each cluster to the center point of the cluster is used instead of the average distance between every two samples in the cluster, which greatly reduces the time consumption and space overhead of calculating the sample distance matrix, saves a lot of computing resources, and improves the running speed.
  • the calculation time is reduced from 22871.75766 seconds before the improvement to 2.728480302 seconds after the improvement, so the calculation is about 8382.6 times faster.
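The quoted speedup follows directly from the two timings; a quick check of the arithmetic, using only the figures given in the text:

```python
# Timings quoted in the text, in seconds.
before_seconds = 22871.75766
after_seconds = 2.728480302

# Ratio of the original calculation time to the improved calculation time.
speedup = before_seconds / after_seconds
print(round(speedup, 1))
```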
  • the trade-off is that the accuracy of the distance calculation also decreases.
  • the method provided in this embodiment is also applicable to other sample sets with a large amount of processed data and high dimensionality, such as the financial data processing field, the drug data analysis field, and the image data recognition field.
  • a clustering result of the pathological data sample set is obtained, where the clustering result divides the pathological data sample set into several clusters, each cluster is composed of a plurality of pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than the preset number threshold, so as to obtain the result of the cluster analysis. The center point of each cluster is calculated according to the clustering result to determine the center point position of each cluster.
  • the distance between pathological sample point i and the center point of each cluster is calculated. Since only the distance between pathological sample point i and each cluster center point is calculated, instead of the distance between pathological sample point i and all other pathological sample points, the amount of calculation is greatly reduced.
  • the adjusted contour coefficient of the pathological sample point i is calculated according to the distance between the pathological sample point i and the center point of each cluster to obtain the adjusted contour coefficient of a single pathological sample point i, with less calculation than the method before the improvement. The average of the adjusted contour coefficients of all pathological sample points i is calculated to obtain the adjusted contour coefficient of the clustering result. Because this is an averaging operation, the calculation is relatively fast.
  • the pros and cons of the clustering result are determined according to the adjusted contour coefficient of the clustering result. Since the adjusted contour coefficient of the clustering result can be calculated quickly, the pros and cons of the clustering result can be determined quickly; the higher the adjusted contour coefficient of the clustering result, the more accurate the clustering result.
  • a pathological data sample to be processed is obtained, so as to use the clustering result to classify the pathological data sample.
  • the pathological data samples to be processed are classified, and pathological analysis data corresponding to the pathological data samples to be processed are generated, so as to generate valuable data to indicate the pathological risk of the patient.
  • the method further includes:
  • S52 Determine the clustering result with the highest adjusted contour coefficient as the optimal clustering result of the pathological data sample set.
  • the computer can calculate the adjusted contour coefficients of multiple clustering results in a relatively short time. The optimal clustering result is then determined according to the size of the adjusted contour coefficient. Since a larger adjusted contour coefficient indicates a better clustering effect, the clustering result with the highest adjusted contour coefficient can be determined as the optimal clustering result of the pathological data sample set.
  • the adjusted contour coefficients of the multiple clustering results are calculated to quickly calculate the adjusted contour coefficients of the multiple clustering results.
  • the clustering result with the highest adjusted contour coefficient is determined as the optimal clustering result of the pathological data sample set. Since the adjusted contour coefficient of the clustering result has a fast calculation speed, the optimal clustering result can be quickly determined.
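Selecting the optimal result among several candidate clusterings can be sketched as below. The helper `adjusted_silhouette` repeats the center-based coefficient under the same assumptions as the method description (Euclidean distance, at least two clusters), and the name `select_best` and the candidate label lists are hypothetical.

```python
import math


def adjusted_silhouette(points, labels):
    """Mean center-based silhouette coefficient of one clustering result."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    centers = {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }
    total = 0.0
    for point, label in zip(points, labels):
        a = math.dist(point, centers[label])
        b = min(math.dist(point, c) for o, c in centers.items() if o != label)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def select_best(points, candidate_labelings):
    """Return (labels, score) of the candidate with the highest coefficient."""
    scored = [(adjusted_silhouette(points, lab), lab) for lab in candidate_labelings]
    best_score, best_labels = max(scored, key=lambda pair: pair[0])
    return best_labels, best_score


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good, bad = [0, 0, 1, 1], [0, 1, 0, 1]
best_labels, best_score = select_best(points, [bad, good])
```

In this toy data the labeling that groups nearby points together receives the higher coefficient and is returned as the optimal result.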
  • the method further includes:
  • S53 Determine whether the adjusted contour coefficient of the clustering result is greater than a preset coefficient threshold
  • the clustering result is determined as a preferred clustering result of the pathological data sample set.
  • an expected value that is, a preset coefficient threshold
  • the preset coefficient threshold may be set to 0.5.
  • In steps S53-S54, the calculated adjusted contour coefficient of the clustering result is compared with the preset coefficient threshold to determine whether it is greater than the threshold. If the adjusted contour coefficient of the clustering result is greater than the preset coefficient threshold, the clustering result is determined as a preferred clustering result of the pathological data sample set, so that clustering results whose adjusted contour coefficient exceeds the preset coefficient threshold are selected as preferred clustering results.
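The threshold comparison is a single strict inequality. A minimal sketch, assuming the example threshold of 0.5 given in the text; the function name `is_preferred_result` is hypothetical.

```python
def is_preferred_result(adjusted_coefficient, coefficient_threshold=0.5):
    """Return True when the coefficient is strictly greater than the threshold."""
    return adjusted_coefficient > coefficient_threshold


# Hypothetical coefficients of three clustering results.
flags = [is_preferred_result(value) for value in (0.62, 0.5, 0.31)]
```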
  • the method further includes:
  • K-Means is an iterative solution clustering analysis algorithm.
  • the calculation process is as follows: First, determine the number of clusters to be clustered, and initialize their respective center points randomly. In order to determine the number of clusters, it is best to quickly look at the data and try to identify any different groupings.
  • the center point is a vector with the same length as the vector of each data point. Each data point is classified by calculating the distance between that point and the center of each group and assigning the point to the group with the closest center. Based on the result of this assignment, the average of all points in each group is calculated as the new cluster center. These steps are repeated for a set number of iterations, or until the group centers change very little between iterations (less than a set threshold). In addition, the group centers can be initialized randomly several times, and the initialization with the best result is then selected.
  • The advantage of K-Means is that it is very fast: it only needs to calculate the distance between each point and the group centers, so the amount of calculation is small and its time complexity is $O(n)$.
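The K-Means procedure described above can be sketched in plain Python. This is an illustrative sketch rather than the implementation used in the application: the function name `k_means`, the parameters `max_iter`, `tol` and `seed`, and the choice to draw the initial centers from the data points are all assumptions.

```python
import math
import random


def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers taken from the data
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the group with the closest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center becomes the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty group
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # centers barely moved: stop iterating
            break
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, 2)
```

Each iteration computes n·k point-to-center distances, which is where the linear cost in the number of samples comes from.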
  • the pathological data sample set is obtained to obtain the pathological data sample set to be processed.
  • the clustering result of the pathological data sample set is calculated based on the K-Means clustering algorithm to obtain the clustering result that needs to be evaluated.
  • the method further includes:
  • the agglomerative hierarchical clustering algorithm is to combine the two most similar data points by calculating the similarity between the two data points, and iterate this process repeatedly until the set number of clusters is met.
  • the distance can be a measurement method such as Euclidean distance.
  • the specific steps of Agglomerative include: first, treat each sample as its own category and calculate the distance between every two categories; merge the two categories with the smallest distance (the most similar) into a new category; recalculate the distances between the categories; repeat the last two steps until a single cluster is formed. The process of agglomerative hierarchical clustering builds a tree, and a threshold can be set as needed, namely the number of clusters to form; when the number of categories is equal to this threshold, the iteration can be terminated.
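The merging loop can be sketched in plain Python. The text does not say how the distance between two categories is measured, so this sketch assumes single linkage (the smallest Euclidean distance between any two members); the function name `agglomerative` and the parameter `n_clusters` are hypothetical. The naive pair search is written for clarity, not speed.

```python
import math


def agglomerative(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every sample starts alone

    def linkage(a, b):
        # Single linkage (an assumption): closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        ]
        x, y = min(pairs, key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]))
        clusters[x] = clusters[x] + clusters[y]  # merge the most similar pair
        del clusters[y]  # y > x, so index x is unaffected

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = agglomerative(points, 2)
```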
  • the pathological data sample set is obtained to obtain the pathological data sample set to be processed.
  • the clustering result of the pathological data sample set is calculated based on the agglomerative hierarchical clustering algorithm to obtain the clustering result that needs to be evaluated.
  • the center point of each cluster after clustering is obtained, and the distance between each sample point and the center point of each cluster is calculated; the adjusted contour coefficient of each sample point is calculated according to the distance between the sample point and the center points of the clusters; the average of the adjusted contour coefficients of all the pathological sample points i is calculated to obtain the adjusted contour coefficient of the clustering result; and the pros and cons of the clustering result are determined according to the adjusted contour coefficient of the clustering result.
  • This embodiment solves the problem of high time complexity in the evaluation and calculation process of the clustering results, greatly reduces the amount of data calculation in the evaluation and calculation process, greatly improves the efficiency of the evaluation of the clustering results, and can accelerate the judgment of the pathological data clustering results, so as to quickly determine the best pathological data clustering result.
  • a pathological data analysis device corresponds to the pathological data analysis method in the above-mentioned embodiment in a one-to-one correspondence.
  • the pathological data analysis device includes a result acquisition module 10, a center point calculation module 20, a distance calculation module 30, a sample point coefficient calculation module 40, a result coefficient calculation module 50, a result evaluation module 60, a sample acquisition module 70, and a sample analysis module 80.
  • the detailed description of each functional module is as follows:
  • the obtaining result module 10 is used to obtain a clustering result of a pathological data sample set. The clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than the preset number threshold;
  • the central point calculation module 20 is configured to calculate the central point of each of the clusters according to the clustering result
  • the distance calculation module 30 is used to calculate the distance between the pathological sample point i and the center point of each cluster;
  • the sample point coefficient calculation module 40 is configured to calculate the adjusted contour coefficient of the pathological sample point i according to the distance between the pathological sample point i and the center point of each cluster, and the calculation formula is as follows:
  • $$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\, b_c(i)\}}$$
  • $s_c(i)$ represents the adjusted contour coefficient of pathological sample point i
  • $a_c(i)$ represents the distance between pathological sample point i and the center point of its own cluster
  • $b_c(i)$ represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • the result coefficient calculation module 50 is configured to calculate the average of the adjusted contour coefficients of all the pathological sample points i, and obtain the adjusted contour coefficient of the clustering result;
  • the result evaluation module 60 is configured to determine the quality of the clustering result according to the adjusted contour coefficient of the clustering result
  • the sample obtaining module 70 is configured to obtain a pathological data sample to be processed when the clustering result is excellent;
  • the sample analysis module 80 is configured to classify the pathological data sample to be processed according to the clustering result, and generate pathological analysis data corresponding to the pathological data sample to be processed.
  • the pathological data analysis device further includes:
  • Multi-result calculation module used to calculate the adjusted contour coefficient of multiple clustering results
  • the optimal result determining module is used to determine the clustering result with the highest adjusted contour coefficient as the optimal clustering result of the pathological data sample set.
  • the pathological data analysis device further includes:
  • a coefficient judgment module for judging whether the adjusted contour coefficient of the clustering result is greater than a preset coefficient threshold
  • the preferred result determining module is configured to determine the clustering result as the preferred clustering result of the pathological data sample set if the adjusted contour coefficient of the clustering result is greater than the preset coefficient threshold.
  • the pathological data analysis device further includes:
  • the sample set acquisition module is used to acquire the pathological data sample set
  • the first clustering calculation module is configured to calculate the clustering result of the pathological data sample set based on the K-Means clustering algorithm.
  • the pathological data analysis device further includes:
  • the sample set acquisition module is used to acquire the pathological data sample set
  • the second clustering calculation module is configured to calculate the clustering result of the pathological data sample set based on the agglomerative hierarchical clustering algorithm.
  • Each module in the above-mentioned pathological data analysis device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 9.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
  • the database of the computer device is used to store the data involved in the evaluation of the pathological data clustering result.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a pathological data analysis method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer device including a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer-readable instructions:
  • the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
  • the adjusted silhouette coefficient of pathological sample point i is calculated according to the distance between pathological sample point i and the center point of each cluster, and the calculation formula is as follows:
  • s_c(i) represents the adjusted silhouette coefficient of pathological sample point i
  • a_c(i) represents the distance between pathological sample point i and the center point of its own cluster
  • b_c(i) represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • one or more computer-readable storage media storing computer-readable instructions are provided.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
  • the readable storage medium stores computer readable instructions, and when the computer readable instructions are executed by one or more processors, the following steps are implemented:
  • the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
  • the adjusted silhouette coefficient of pathological sample point i is calculated according to the distance between pathological sample point i and the center point of each cluster, and the calculation formula is as follows:
  • s_c(i) represents the adjusted silhouette coefficient of pathological sample point i
  • a_c(i) represents the distance between pathological sample point i and the center point of its own cluster
  • b_c(i) represents the distance between pathological sample point i and the center point of the nearest other cluster;
  • a person of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions.
  • the computer-readable instructions can be stored in a non-volatile readable storage medium or a volatile readable storage medium, and when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments.
  • any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A method and apparatus for analysis of pathological data, and a device and a storage medium, relating to the field of artificial intelligence. The method comprises: acquiring a clustering result of a pathological data sample set (S10); calculating an adjusted silhouette coefficient according to the clustering result, and determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result (S60); when the clustering result is good, acquiring pathological data samples to be processed (S70); and classifying, according to the clustering result, the pathological data samples to be processed, and generating pathological analysis data corresponding to the pathological data samples to be processed (S80). The method addresses the problem of excessive time complexity when evaluating a clustering result: it greatly reduces the amount of computation in the evaluation, greatly improves the efficiency of the evaluation, and speeds up the assessment of pathological data clustering results, so that the optimal pathological data clustering result can be determined quickly.

Description

Pathological data analysis method, apparatus, device and storage medium
This application claims priority to Chinese patent application No. 202010005182.7, entitled "Pathological data analysis method, apparatus, device and storage medium", filed with the Chinese Patent Office on January 3, 2020, the entire content of which is incorporated herein by reference.
Technical field
This application relates to machine learning within the field of artificial intelligence, and in particular to a pathological data analysis method, apparatus, device, and storage medium.
Background
In the medical field, as technology has developed, hospital management systems have collected pathological data from large numbers of patients. A clustering algorithm can divide this pathological data into multiple sets, each set corresponding to one condition. This can help doctors diagnose patients with intractable or complicated diseases.
A clustering algorithm is an algorithm that groups data without supervision. Clustering, also known as cluster analysis, is a statistical analysis method for studying data classification problems, and it is also an important tool in data mining.
After a clustering algorithm has divided a given data set into groups, the clustering result needs to be evaluated to judge its quality. The silhouette coefficient is a clustering evaluation method used to assess the effect of an unsupervised clustering algorithm, so that it can be used to determine the number of clusters (that is, groups) during clustering. The silhouette coefficient evaluates the clustering effect by combining the cohesion and the separation of the clusters. Its value lies in the range [-1,1]; the larger the value, the better the clustering effect.
However, the time complexity of the silhouette coefficient is very high: it is quadratic in n, that is, O(n^2), where n is the number of samples. When processing large-scale data sets, computing the silhouette coefficient of a clustering result is very expensive, and it is difficult to obtain the result in a short time. In particular, when the silhouette coefficient is used to determine the number of clusters, the silhouette coefficients of multiple clustering results must be calculated, and the whole process takes even longer.
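For reference, the conventional silhouette coefficient discussed in this passage is usually defined per sample as sketched below. The symbols C_i (the cluster containing sample i), d (the distance between two samples), and the assumption that C_i holds more than one sample are notation introduced here for illustration, not taken from the application.

```latex
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\ j \neq i} d(i, j), \qquad
b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}
```

Evaluating a(i) and b(i) for every sample requires the distance from each sample to every other sample, which is where the quadratic cost comes from.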
The inventor found that clustering pathological data usually yields multiple different clustering results. Because the volume of pathological data is very large and there are many detection indicators, evaluating pathological data clustering results with the existing silhouette coefficient often runs into unforeseen errors, or the calculation takes so long that the required evaluation result cannot be obtained in time.
Summary of the application
In view of this, it is necessary to provide a pathological data analysis method that addresses the above technical problem: it reduces the excessive time complexity of evaluating clustering results, increases the speed of the evaluation, and allows the quality of a clustering result to be determined quickly, so that pathological data samples can then be classified according to the clustering result to obtain the required pathological analysis data.
A pathological data analysis method, including:
obtaining a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
calculating the center point of each cluster according to the clustering result;
calculating the distance between pathological sample point i and the center point of each cluster;
calculating the adjusted silhouette coefficient of pathological sample point i according to the distance between pathological sample point i and the center point of each cluster, using the following formula:
$$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\ b_c(i)\}}$$
In the above formula, s_c(i) is the adjusted silhouette coefficient of pathological sample point i; a_c(i) is the distance between pathological sample point i and the center point of its own cluster; b_c(i) is the distance between pathological sample point i and the center point of the nearest other cluster;
calculating the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
when the clustering result is rated excellent, obtaining a pathological data sample to be processed;
classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
A pathological data analysis apparatus, including:
a result obtaining module, configured to obtain a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
a center point calculation module, configured to calculate the center point of each cluster according to the clustering result;
a distance calculation module, configured to calculate the distance between pathological sample point i and the center point of each cluster;
a sample point coefficient calculation module, configured to calculate the adjusted silhouette coefficient of pathological sample point i according to the distance between pathological sample point i and the center point of each cluster, using the following formula:
$$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\ b_c(i)\}}$$
In the above formula, s_c(i) is the adjusted silhouette coefficient of pathological sample point i; a_c(i) is the distance between pathological sample point i and the center point of its own cluster; b_c(i) is the distance between pathological sample point i and the center point of the nearest other cluster;
a result coefficient calculation module, configured to calculate the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
a result evaluation module, configured to determine the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
a sample obtaining module, configured to obtain a pathological data sample to be processed when the clustering result is rated excellent;
a sample analysis module, configured to classify the pathological data sample to be processed according to the clustering result, and to generate pathological analysis data corresponding to the pathological data sample to be processed.
A computer device, including a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, where the processor implements the following steps when executing the computer-readable instructions:
obtaining a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
calculating the center point of each cluster according to the clustering result;
calculating the distance between pathological sample point i and the center point of each cluster;
calculating the adjusted silhouette coefficient of pathological sample point i according to the distance between pathological sample point i and the center point of each cluster, using the following formula:
$$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\ b_c(i)\}}$$
In the above formula, s_c(i) is the adjusted silhouette coefficient of pathological sample point i; a_c(i) is the distance between pathological sample point i and the center point of its own cluster; b_c(i) is the distance between pathological sample point i and the center point of the nearest other cluster;
calculating the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
when the clustering result is rated excellent, obtaining a pathological data sample to be processed;
classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
One or more readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
calculating the center point of each cluster according to the clustering result;
calculating the distance between pathological sample point i and the center point of each cluster;
calculating the adjusted silhouette coefficient of pathological sample point i according to the distance between pathological sample point i and the center point of each cluster, using the following formula:
$$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\ b_c(i)\}}$$
In the above formula, s_c(i) is the adjusted silhouette coefficient of pathological sample point i; a_c(i) is the distance between pathological sample point i and the center point of its own cluster; b_c(i) is the distance between pathological sample point i and the center point of the nearest other cluster;
calculating the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
when the clustering result is rated excellent, obtaining a pathological data sample to be processed;
classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
The details of one or more embodiments of this application are set out in the accompanying drawings and the description below; other features and advantages of this application will become apparent from the description, the drawings, and the claims.
Beneficial effects
The invention addresses the problem of excessive time complexity in evaluating clustering results. It greatly reduces the amount of computation in the evaluation, greatly improves the efficiency of evaluating clustering results, and speeds up the assessment of pathological data clustering results, so that the best pathological data clustering result can be determined quickly.
Brief description of the drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the pathological data analysis method in an embodiment of this application;
FIG. 2 is a schematic flowchart of the pathological data analysis method in an embodiment of this application;
FIG. 3 is a schematic diagram comparing the calculation paths before and after the improvement;
FIG. 4 is a schematic flowchart of the pathological data analysis method in an embodiment of this application;
FIG. 5 is a schematic flowchart of the pathological data analysis method in an embodiment of this application;
FIG. 6 is a schematic flowchart of the pathological data analysis method in an embodiment of this application;
FIG. 7 is a schematic flowchart of the pathological data analysis method in an embodiment of this application;
FIG. 8 is a schematic structural diagram of the pathological data analysis apparatus in an embodiment of this application;
FIG. 9 is a schematic diagram of the computer device in an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the scope of protection of this application.
The pathological data analysis method provided in this embodiment can be applied in the application environment shown in FIG. 1, in which a client communicates with a server over a network. The client includes, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server can be implemented as a standalone server or as a cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a pathological data analysis method is provided. Taking its application to the server in FIG. 1 as an example, the method includes the following steps:
S10. Obtain a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster is composed of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
S20. Calculate the center point of each cluster according to the clustering result;
S30. Calculate the distance between pathological sample point i and the center point of each cluster;
S40. Calculate the adjusted silhouette coefficient of pathological sample point i according to the distance between pathological sample point i and the center point of each cluster, using the following formula:
$$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\ b_c(i)\}}$$
In the above formula, s_c(i) is the adjusted silhouette coefficient of pathological sample point i; a_c(i) is the distance between pathological sample point i and the center point of its own cluster; b_c(i) is the distance between pathological sample point i and the center point of the nearest other cluster;
S50. Calculate the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
S60. Determine the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
S70. When the clustering result is rated excellent, obtain a pathological data sample to be processed;
S80. Classify the pathological data sample to be processed according to the clustering result, and generate pathological analysis data corresponding to the pathological data sample to be processed.
In this embodiment, the clustering result may be the result obtained by running a clustering task on the pathological data sample set. It can be obtained with partition-based or agglomerative hierarchical clustering methods, such as K-means or agglomerative clustering. The preset number threshold can be set according to actual needs, for example to 50,000, 10, or another value. Here, each pathological sample point i in the pathological data sample set includes multiple detection indicators, such as a first detection indicator, a second detection indicator, and so on, so pathological sample point i can be regarded as a point in a multi-dimensional space. In particular, before clustering, every pathological sample point i has the same spatial dimension; that is, every pathological sample point i in the set contains the same number of detection indicators. The clustering result divides the pathological data sample set into several clusters, each containing one or more pathological sample points i. A cluster here can be understood as a group or subset. Normally, the sample points in one cluster correspond to the same type of disease.
Since the value of pathological sample point i is known, it can be expressed as coordinates, such as (x_i, y_i), so the center point c of a cluster can be computed. The coordinates of the center point c equal the mean of the coordinates of all sample points in that cluster. For example, if cluster N is {i_1, i_2, ..., i_n} and each sample is written as (x_i, y_i), the coordinates of the center point c of the cluster can be:
$$c = \left(\frac{1}{n}\sum_{j=1}^{n} x_{i_j},\ \frac{1}{n}\sum_{j=1}^{n} y_{i_j}\right)$$
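The center point computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the application's implementation: the function name `cluster_centers`, the representation of samples as tuples, and the example data are all assumptions made here.

```python
from collections import defaultdict


def cluster_centers(points, labels):
    """Return {cluster label: center point}, the per-dimension mean of each cluster.

    `points` is a list of equal-length coordinate tuples and `labels` gives the
    cluster of each point (both names are hypothetical, chosen for this sketch).
    """
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    centers = {}
    for label, members in groups.items():
        n = len(members)
        # zip(*members) iterates over one coordinate axis at a time.
        centers[label] = tuple(sum(axis) / n for axis in zip(*members))
    return centers


# Illustrative two-dimensional data: two clusters of two points each.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
labels = [0, 0, 1, 1]
centers = cluster_centers(points, labels)
```

The same function works for any number of detection indicators, since it averages each coordinate axis independently.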
After the center point c of each cluster has been computed, the distance between each pathological sample point i and each center point can be calculated. If the number of clusters is k, then k distances can be calculated for each pathological sample point i: one intra-cluster distance (between pathological sample point i and the center point of its own cluster) and k-1 extra-cluster distances (between pathological sample point i and the center points of the other clusters).
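As a small sketch of this step, the k distances of one sample point can be split into the single intra-cluster distance and the remaining extra-cluster distances. Euclidean distance is an assumption here (the text does not name a metric), and `split_distances` and the example center points are hypothetical.

```python
import math


def split_distances(point, own_label, centers):
    """Return (intra-cluster distance, {other label: extra-cluster distance}).

    Euclidean distance is assumed; `centers` maps each cluster label to its
    center point.
    """
    intra = math.dist(point, centers[own_label])
    extra = {
        label: math.dist(point, center)
        for label, center in centers.items()
        if label != own_label
    }
    return intra, extra


# Illustrative center points for k = 3 clusters; the sample belongs to cluster 2.
centers = {0: (3.0, 4.0), 1: (6.0, 8.0), 2: (0.0, -1.0)}
intra, extra = split_distances((0.0, 0.0), 2, centers)
```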
Then, the adjusted silhouette coefficient of a sample point can be calculated from its distances to the cluster center points. The adjusted silhouette coefficient of pathological sample point i is calculated with the following formula:
$$s_c(i) = \frac{b_c(i) - a_c(i)}{\max\{a_c(i),\ b_c(i)\}}$$
In the above formula, s_c(i) is the adjusted silhouette coefficient of pathological sample point i; a_c(i) is the distance between pathological sample point i and the center point of its own cluster; b_c(i) is the distance between pathological sample point i and the nearest center point of another cluster.
In the calculation, b_c(i) is the minimum of the k-1 extra-cluster distances. This gives the adjusted silhouette coefficient of the sample point, which is a single number in the range [-1,1].
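The per-point calculation can be illustrated as follows. The sketch assumes Euclidean distance, assumes the usual centroid-based form (b - a) / max(a, b) for the formula, and returns 0 when both distances are zero; the function name and that zero-distance convention are choices made here, not statements from the application.

```python
import math


def adjusted_silhouette_point(point, own_label, centers):
    """Adjusted silhouette coefficient of one sample point, in [-1, 1].

    a: distance to the center point of the point's own cluster.
    b: smallest distance to the center point of any other cluster.
    Assumes at least two clusters and Euclidean distance.
    """
    a = math.dist(point, centers[own_label])
    b = min(
        math.dist(point, center)
        for label, center in centers.items()
        if label != own_label
    )
    denominator = max(a, b)
    # Convention chosen for this sketch: a point coinciding with both centers scores 0.
    return 0.0 if denominator == 0 else (b - a) / denominator


centers = {0: (0.0, 0.0), 1: (10.0, 0.0)}
well_placed = adjusted_silhouette_point((2.0, 0.0), 0, centers)   # a = 2, b = 8
badly_placed = adjusted_silhouette_point((6.0, 0.0), 0, centers)  # a = 6, b = 4
```

A point closer to its own center than to any other center gets a positive value; a point that sits closer to a different center gets a negative one.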
The adjusted silhouette coefficients of all sample points can be calculated with the formula from the previous step; their average is the adjusted silhouette coefficient of the clustering result. Likewise, the adjusted silhouette coefficient of the clustering result is a single number in the range [-1,1].
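Putting the steps together, the coefficient of a whole clustering result might be computed as in the sketch below. It assumes Euclidean distance and the centroid-based form (b - a) / max(a, b); the function names and sample data are hypothetical.

```python
import math
from collections import defaultdict


def cluster_centers(points, labels):
    """Per-dimension mean of the points in each cluster."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }


def adjusted_silhouette(points, labels):
    """Mean adjusted silhouette coefficient over all sample points."""
    centers = cluster_centers(points, labels)
    if len(centers) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for point, own in zip(points, labels):
        a = math.dist(point, centers[own])
        b = min(math.dist(point, c) for lbl, c in centers.items() if lbl != own)
        denominator = max(a, b)
        total += (b - a) / denominator if denominator > 0 else 0.0
    return total / len(points)


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]
score = adjusted_silhouette(points, labels)
```

With this data the center points are (1, 0) and (11, 0), so the four per-point values are 10/11, 8/9, 8/9, and 10/11.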
Once the adjusted silhouette coefficient has been calculated, it can be used to determine the quality of the clustering result: the larger the value, the better the clustering. Specified numerical ranges can be set to grade the clustering result, for example (0.5,1] as excellent, (0,0.5] as average, and [-1,0] as poor.
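The grading by numerical range can be written as a small function. The boundaries follow the example ranges in the text; the function name `grade_clustering` and the English grade labels are choices made for this sketch.

```python
def grade_clustering(score):
    """Map an adjusted silhouette coefficient to a quality grade.

    Ranges follow the example in the text: (0.5, 1] excellent,
    (0, 0.5] average, [-1, 0] poor.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score > 0.5:
        return "excellent"
    if score > 0.0:
        return "average"
    return "poor"


grades = [grade_clustering(s) for s in (0.9, 0.5, 0.0, -0.3)]
```

Note that the boundary values 0.5 and 0 fall into the lower grade, matching the half-open intervals.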
Compared with the original silhouette coefficient, the time complexity of the adjusted silhouette coefficient is reduced from O(n^2) to O(n), which greatly reduces the amount of computation needed to evaluate a clustering result. When processing large-scale data sets, multiple clustering results can be evaluated quickly to determine their quality.
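To make the difference in cost concrete, the following sketch counts the distance evaluations each approach needs. It assumes the original coefficient evaluates every unordered pair of samples once and the adjusted coefficient evaluates each sample against each of the k cluster centers; the function name distance_evaluations is hypothetical.

```python
def distance_evaluations(n_samples, n_clusters):
    """Return (pairwise count, sample-to-center count) of distance evaluations."""
    # Original silhouette: one distance per unordered pair of samples.
    pairwise = n_samples * (n_samples - 1) // 2
    # Adjusted silhouette: one distance per (sample, cluster center) pair.
    to_centers = n_samples * n_clusters
    return pairwise, to_centers
```

With 1000 samples and 5 clusters this gives 499500 pairwise distances against 5000 sample-to-center distances. The first count grows quadratically in n, while the second grows linearly when k is fixed.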
The pathological data analysis method provided in this embodiment speeds up the evaluation of pathological data clustering results, so that the best clustering result can be determined quickly.
After the best clustering result of the pathological data has been determined, a pathological data sample to be processed can be obtained, classified according to that clustering result, and used to generate pathological analysis data corresponding to the sample. In some cases, the pathological analysis data may be a pathological risk report for the patient.
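The description does not specify how a new sample is classified against the clustering result. One possible reading, shown here purely as an assumption, is to assign each new sample to the cluster whose center point is nearest. The function name assign_to_nearest_center and the Euclidean distance are illustrative choices.

```python
import numpy as np


def assign_to_nearest_center(new_samples, centers):
    """Return, for each new sample, the index of the closest cluster center."""
    new_samples = np.asarray(new_samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Distances of shape (m, k): every new sample against every center.
    dists = np.linalg.norm(new_samples[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

The returned cluster index could then be used to look up whatever risk information is associated with that cluster when generating the analysis data.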
To make it easier to compare the original silhouette coefficient with the adjusted silhouette coefficient of this embodiment, FIG. 3 provides a schematic diagram of the calculation paths. FIG. 3-a shows the path used before the improvement to calculate cohesion (the distance between pathological sample point i and the sample points inside its cluster); FIG. 3-b shows the path used before the improvement to calculate separation (the distance between pathological sample point i and the sample points outside its cluster); FIG. 3-c shows the path used after the improvement to calculate cohesion (the distance between pathological sample point i and the sample points inside its cluster); FIG. 3-d shows the path used after the improvement to calculate separation (the distance between pathological sample point i and the sample points outside its cluster).
In one application example, the original silhouette coefficient method and the adjusted silhouette coefficient method were each used to evaluate the clustering result of the same pathological data sample set. The results are shown in Table 1.
Table 1. Computation time of different evaluation methods applied to the clustering result of the same pathological data sample set
Figure PCTCN2020093328-appb-000008
Figure PCTCN2020093328-appb-000009
The server used to produce the test results in Table 1 was configured with a 20-core CPU with a maximum clock speed of 2.39 GHz and 256 GB of memory running at 2400 MHz.
In terms of accuracy: compared with the original silhouette coefficient, when measuring the compactness within a cluster the adjusted silhouette coefficient uses the average distance from the pathological sample points i of each cluster to that cluster's center point, rather than the average distance between every pair of samples in the cluster. This greatly reduces the time and space cost of computing the sample distance matrix, saves a large amount of computing resources, and increases running speed. Taking sample set P as an example, the computation time falls from 22871.75766 seconds to 2.728480302 seconds after the improvement, a speedup of about 8382.6 times. The trade-off is some loss of precision in the distance calculation.
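The reported speedup for sample set P can be checked directly from the two timings quoted above. The variable names are illustrative, and both timings are assumed to be in seconds.

```python
# Timings quoted in the description for sample set P (assumed to be seconds).
original_seconds = 22871.75766
adjusted_seconds = 2.728480302

# Ratio of the original evaluation time to the adjusted evaluation time.
speedup = original_seconds / adjusted_seconds
print(f"speedup: {speedup:.1f}x")
```

The ratio rounds to about 8382.6, which matches the figure given in the text.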
The method provided in this embodiment is also applicable to other sample sets with a large data volume and high dimensionality, for example in financial data processing, drug data analysis, and image data recognition.
In steps S10-S80, a clustering result of the pathological data sample set is obtained. The clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold; this step provides the output of the cluster analysis. The center point of each cluster is calculated according to the clustering result to determine the position of each cluster's center. The distance between pathological sample point i and the center point of each cluster is calculated; because only the distances between pathological sample point i and the cluster center points are computed, rather than the distances between pathological sample point i and all other pathological sample points, the amount of computation is greatly reduced. The adjusted silhouette coefficient of pathological sample point i is calculated from the distances between pathological sample point i and the center points of the clusters, which yields the adjusted silhouette coefficient of a single sample point with less computation than the method before the improvement. The average of the adjusted silhouette coefficients of all pathological sample points i is calculated to obtain the adjusted silhouette coefficient of the clustering result; because this is a simple averaging operation, it is fast. The quality of the clustering result is determined from its adjusted silhouette coefficient; because this coefficient can be calculated quickly, the quality of the clustering result can be judged quickly, and the higher the adjusted silhouette coefficient, the more accurate the clustering result. When the clustering result is excellent, a pathological data sample to be processed is obtained so that it can be classified using the clustering result. The pathological data sample to be processed is classified according to the clustering result, and pathological analysis data corresponding to the sample is generated, producing useful data that indicates the patient's pathological risk.
Optionally, as shown in FIG. 4, after step S50 the method further includes:
S51. Calculating the adjusted silhouette coefficients of multiple clustering results;
S52. Determining the clustering result with the highest adjusted silhouette coefficient as the optimal clustering result of the pathological data sample set.
In this embodiment, because the amount of computation for the adjusted silhouette coefficient is greatly reduced, the computer can calculate the adjusted silhouette coefficients of multiple clustering results in a short time. The optimal clustering result is then determined by comparing these coefficients. Since a larger adjusted silhouette coefficient indicates better clustering, the clustering result with the highest adjusted silhouette coefficient can be determined as the optimal clustering result of the pathological data sample set.
In steps S51-S52, the adjusted silhouette coefficients of multiple clustering results are calculated, which can be done quickly. The clustering result with the highest adjusted silhouette coefficient is determined as the optimal clustering result of the pathological data sample set; because the adjusted silhouette coefficient of each clustering result is fast to compute, the optimal clustering result can be determined quickly.
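Step S52 amounts to taking the maximum over the scores computed in step S51. A minimal sketch, assuming the adjusted silhouette coefficients have already been computed and are held in a dictionary keyed by a label for each candidate clustering result (the function name pick_best and the labels are hypothetical):

```python
def pick_best(scores):
    """Return (label, score) of the candidate with the highest coefficient.

    `scores` maps a label for each candidate clustering result to its
    adjusted silhouette coefficient.
    """
    if not scores:
        raise ValueError("no clustering results to compare")
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label]
```

For instance, with made-up scores {"k=2": 0.41, "k=3": 0.63, "k=4": 0.58}, the helper returns ("k=3", 0.63).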
Optionally, as shown in FIG. 5, after step S50 the method further includes:
S53. Determining whether the adjusted silhouette coefficient of the clustering result is greater than a preset coefficient threshold;
S54. If the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold, determining the clustering result as a preferred clustering result of the pathological data sample set.
In some cases, an expected value, namely the preset coefficient threshold, can be set. When the adjusted silhouette coefficient is greater than the preset coefficient threshold, the clustering result can be determined as a preferred clustering result of the pathological data sample set. For example, in one instance the preset coefficient threshold may be set to 0.5.
In steps S53-S54, whether the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold is determined by comparing the two values. If the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold, the clustering result is determined as a preferred clustering result of the pathological data sample set; in this way, clustering results whose adjusted silhouette coefficient exceeds the threshold are selected as preferred clustering results.
Optionally, as shown in FIG. 6, before step S10 the method further includes:
S11. Obtaining the pathological data sample set;
S12. Calculating the clustering result of the pathological data sample set based on the K-Means clustering algorithm.
In this embodiment, K-Means is an iterative clustering algorithm. It proceeds as follows. First, the number of clusters is chosen and their center points are initialized randomly; to choose the number of clusters, it helps to take a quick look at the data and try to identify any distinct groupings. Each center point is a vector of the same length as a data point vector. Each data point is classified by calculating the distance between it and every group center and assigning it to the group whose center is nearest. Based on the resulting assignment, the mean of all points in each group is calculated and used as the new cluster center. These steps are repeated for a set number of iterations, or until the group centers change little between iterations (by less than a set threshold). In addition, the group centers can be randomly initialized several times, and the initialization that gives the best result can be selected.
The advantage of K-Means is that it is very fast: only the distances between the points and the group centers need to be calculated, so the amount of computation is small and the time complexity is O(n).
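The K-Means procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the application: the function name k_means, the parameters max_iter, tol and seed, and the choice of initializing centers from randomly selected samples are all assumptions.

```python
import numpy as np


def k_means(points, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster `points` into k groups; return (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct samples as the starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iter):
        # Assign every point to the group whose center is nearest.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # New center of each group is the mean of its points; an empty
        # group keeps its previous center (assumed convention).
        new_centers = np.stack([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])

        # Stop once no center moved by more than the tolerance.
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift < tol:
            break

    return labels, centers
```

Each iteration evaluates n x k distances, which is where the linear cost in n comes from when k and the number of iterations are treated as constants. Running the function with several values of seed and keeping the best-scoring result corresponds to the repeated random initialization mentioned above.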
In steps S11-S12, the pathological data sample set is obtained, providing the sample set to be processed. The clustering result of the pathological data sample set is calculated based on the K-Means clustering algorithm, providing the clustering result to be evaluated.
Optionally, as shown in FIG. 7, before step S10 the method further includes:
S11. Obtaining the pathological data sample set;
S13. Calculating the clustering result of the pathological data sample set based on the agglomerative hierarchical clustering algorithm.
The agglomerative hierarchical clustering algorithm calculates the similarity between every pair of data points, merges the two most similar data points, and repeats this process until the required number of clusters is reached. The smaller the distance, the higher the similarity. The distance can be measured with a metric such as the Euclidean distance.
The specific steps of agglomerative clustering are as follows. First, treat each sample as its own category and calculate the distance between every pair of categories. Merge the two categories with the smallest distance (the most similar) into one new category. Recalculate the distances between the categories. Repeat the previous two steps until a single cluster is formed. Agglomerative hierarchical clustering thus builds a tree; a threshold, namely the number of clusters to form, can be set as needed, and the iteration can stop when the number of categories equals this threshold.
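The steps above can be sketched with a deliberately naive implementation. The description does not say how the distance between two multi-sample categories is measured, so this sketch assumes the Euclidean distance between category means (centroid linkage); the function name agglomerative is hypothetical, and the quadratic search per merge is kept for clarity rather than speed.

```python
import numpy as np


def agglomerative(points, n_clusters):
    """Merge the two closest categories until n_clusters remain; return labels."""
    points = np.asarray(points, dtype=float)
    # Start with every sample in its own category (lists of sample indices).
    clusters = [[i] for i in range(len(points))]

    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Assumed linkage: distance between the two category means.
                d = np.linalg.norm(
                    points[clusters[p]].mean(axis=0)
                    - points[clusters[q]].mean(axis=0)
                )
                if d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Merge category q into category p; q > p, so index p stays valid.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = np.empty(len(points), dtype=int)
    for cluster_id, members in enumerate(clusters):
        labels[members] = cluster_id
    return labels
```

Setting n_clusters to 1 runs the merging all the way to a single cluster, while a larger value stops early at the chosen threshold, as described above.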
In steps S11 and S13, the pathological data sample set is obtained, providing the sample set to be processed. The clustering result of the pathological data sample set is calculated based on the agglomerative hierarchical clustering algorithm, providing the clustering result to be evaluated.
This embodiment obtains the center point of each cluster after clustering and calculates the distances between the sample points and the cluster center points; calculates the adjusted silhouette coefficient of each sample point from those distances; calculates the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result; and determines the quality of the clustering result from that coefficient. This embodiment addresses the excessive time complexity of evaluating clustering results, greatly reduces the amount of computation in the evaluation, greatly improves evaluation efficiency, and speeds up the judgment of pathological data clustering results so that the best clustering result can be determined quickly.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and the numbering should not limit the implementation of the embodiments of this application in any way.
In one embodiment, a pathological data analysis apparatus is provided, which corresponds one-to-one to the pathological data analysis method in the foregoing embodiments. As shown in FIG. 8, the pathological data analysis apparatus includes a result acquisition module 10, a center point calculation module 20, a distance calculation module 30, a sample point coefficient calculation module 40, a result coefficient calculation module 50, a result evaluation module 60, a sample acquisition module 70, and a sample analysis module 80. The functional modules are described in detail below:
The result acquisition module 10 is configured to obtain a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
The center point calculation module 20 is configured to calculate the center point of each cluster according to the clustering result;
The distance calculation module 30 is configured to calculate the distance between pathological sample point i and the center point of each cluster;
The sample point coefficient calculation module 40 is configured to calculate the adjusted silhouette coefficient of pathological sample point i from the distances between pathological sample point i and the center points of the clusters, using the following formula:
Figure PCTCN2020093328-appb-000010
In the above formula, s_c(i) denotes the adjusted silhouette coefficient of pathological sample point i; a_c(i) denotes the distance between pathological sample point i and the center point of its own cluster; b_c(i) denotes the distance between pathological sample point i and the center point of the cluster nearest to it;
The result coefficient calculation module 50 is configured to calculate the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
The result evaluation module 60 is configured to determine the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
The sample acquisition module 70 is configured to obtain a pathological data sample to be processed when the clustering result is excellent;
The sample analysis module 80 is configured to classify the pathological data sample to be processed according to the clustering result and to generate pathological analysis data corresponding to the pathological data sample to be processed.
Optionally, the pathological data analysis apparatus further includes:
A multi-result calculation module, configured to calculate the adjusted silhouette coefficients of multiple clustering results;
An optimal result determination module, configured to determine the clustering result with the highest adjusted silhouette coefficient as the optimal clustering result of the pathological data sample set.
Optionally, the pathological data analysis apparatus further includes:
A coefficient judgment module, configured to determine whether the adjusted silhouette coefficient of the clustering result is greater than a preset coefficient threshold;
A preferred result determination module, configured to determine the clustering result as a preferred clustering result of the pathological data sample set if the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold.
Optionally, the pathological data analysis apparatus further includes:
A sample set acquisition module, configured to obtain the pathological data sample set;
A first clustering calculation module, configured to calculate the clustering result of the pathological data sample set based on the K-Means clustering algorithm.
Optionally, the pathological data analysis apparatus further includes:
A sample set acquisition module, configured to obtain the pathological data sample set;
A second clustering calculation module, configured to calculate the clustering result of the pathological data sample set based on the agglomerative hierarchical clustering algorithm.
For the specific limitations of the pathological data analysis apparatus, refer to the limitations of the pathological data analysis method above, which are not repeated here. Each module of the pathological data analysis apparatus may be implemented in whole or in part by software, hardware, or a combination of the two. The modules may be embedded in or independent of a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium. The database of the computer device stores the data involved in evaluating pathological data clustering results. The network interface of the computer device is used to communicate with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a pathological data analysis method. The readable storage medium provided in this embodiment includes non-volatile readable storage media and volatile readable storage media.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps:
Obtaining a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
Calculating the center point of each cluster according to the clustering result;
Calculating the distance between pathological sample point i and the center point of each cluster;
Calculating the adjusted silhouette coefficient of pathological sample point i from the distances between pathological sample point i and the center points of the clusters, using the following formula:
Figure PCTCN2020093328-appb-000011
In the above formula, s_c(i) denotes the adjusted silhouette coefficient of pathological sample point i; a_c(i) denotes the distance between pathological sample point i and the center point of its own cluster; b_c(i) denotes the distance between pathological sample point i and the center point of the cluster nearest to it;
Calculating the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
Determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
When the clustering result is excellent, obtaining a pathological data sample to be processed;
Classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. The readable storage medium stores computer-readable instructions which, when executed by one or more processors, implement the following steps:
Obtaining a clustering result of a pathological data sample set, where the clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
Calculating the center point of each cluster according to the clustering result;
Calculating the distance between pathological sample point i and the center point of each cluster;
Calculating the adjusted silhouette coefficient of pathological sample point i from the distances between pathological sample point i and the center points of the clusters, using the following formula:
Figure PCTCN2020093328-appb-000012
In the above formula, s_c(i) denotes the adjusted silhouette coefficient of pathological sample point i; a_c(i) denotes the distance between pathological sample point i and the center point of its own cluster; b_c(i) denotes the distance between pathological sample point i and the center point of the cluster nearest to it;
Calculating the average of the adjusted silhouette coefficients of all pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
Determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
When the clustering result is excellent, obtaining a pathological data sample to be processed;
Classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the foregoing embodiments can be carried out by computer-readable instructions directing the relevant hardware. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium, and when executed may include the processes of the foregoing method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。Those skilled in the art can clearly understand that for the convenience and conciseness of description, only the division of the above functional units and modules is used as an example. In practical applications, the above functions can be allocated to different functional units and modules as required. Module completion, that is, the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that it can still implement the foregoing The technical solutions recorded in the examples are modified, or some of the technical features are equivalently replaced; these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the application, and should be included in Within the scope of protection of this application.

Claims (20)

  1. A pathological data analysis method, comprising:
    obtaining a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
    calculating the center point of each cluster according to the clustering result;
    calculating the distance between each pathological sample point i and the center point of each cluster;
    calculating the adjusted silhouette coefficient of the pathological sample point i according to the distance between the pathological sample point i and the center point of each cluster, using the following formula:
    Figure PCTCN2020093328-appb-100001
    where s_c(i) denotes the adjusted silhouette coefficient of pathological sample point i; a_c(i) denotes the distance between pathological sample point i and the center point of the cluster to which it belongs; and b_c(i) denotes the distance between pathological sample point i and the center point of the cluster nearest to pathological sample point i;
    calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
    determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
    when the clustering result is determined to be good, obtaining a pathological data sample to be processed; and
    classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
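As an illustration of the scoring steps in claim 1, the sketch below computes a center-based silhouette score for a labelled point set. The formula itself is published only as an image, so the code assumes the standard simplified silhouette, s_c(i) = (b_c(i) - a_c(i)) / max(a_c(i), b_c(i)), and it assumes that b_c(i) is measured to the nearest center other than the point's own. Euclidean distance, the mean as the center point, and the names `centroid` and `adjusted_silhouette` are likewise assumptions made for this example.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def adjusted_silhouette(points, labels):
    """Mean center-based silhouette over all points.

    Assumed formula: s_c(i) = (b_c(i) - a_c(i)) / max(a_c(i), b_c(i)),
    where a_c(i) is the distance to the point's own cluster center and
    b_c(i) is the distance to the nearest other cluster center.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    centers = {label: centroid(members) for label, members in clusters.items()}

    scores = []
    for point, label in zip(points, labels):
        a = math.dist(point, centers[label])
        b = min(math.dist(point, c) for other, c in centers.items() if other != label)
        denom = max(a, b)
        # A point that coincides with both centers carries no information.
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Two tight, well separated groups versus a labelling that mixes them.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = adjusted_silhouette(points, [0, 0, 1, 1])
bad = adjusted_silhouette(points, [0, 1, 0, 1])
```

Because each point is compared only with the cluster centers rather than with every other point, the cost grows with the number of points times the number of clusters, which is consistent with the claim's focus on sample sets above a size threshold.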
  2. The pathological data analysis method according to claim 1, wherein, after calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result, the method further comprises:
    calculating the adjusted silhouette coefficients of multiple clustering results; and
    determining the clustering result with the highest adjusted silhouette coefficient as the optimal clustering result of the pathological data sample set.
  3. The pathological data analysis method according to claim 1, wherein, after calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result, the method further comprises:
    judging whether the adjusted silhouette coefficient of the clustering result is greater than a preset coefficient threshold; and
    if the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold, determining the clustering result as a preferred clustering result of the pathological data sample set.
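The two selection rules in claims 2 and 3 can be sketched as follows: pick the candidate with the highest score, or keep every candidate whose score exceeds a preset threshold. The function names, the candidate names, the score values, and the threshold of 0.5 are all hypothetical and chosen only for illustration.

```python
def best_result(scores):
    """Return the name of the clustering result with the highest score."""
    if not scores:
        raise ValueError("no clustering results to compare")
    return max(scores, key=scores.get)

def preferred_results(scores, threshold):
    """Return the names of results whose score is strictly above the threshold."""
    return [name for name, score in scores.items() if score > threshold]

# Made-up adjusted silhouette coefficients for three candidate clusterings.
candidate_scores = {"k=2": 0.41, "k=3": 0.68, "k=4": 0.55}
best = best_result(candidate_scores)
preferred = preferred_results(candidate_scores, threshold=0.5)
```

The first rule always yields exactly one result, while the second may yield none or several, so the two can be combined when a minimum quality is required.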
  4. The pathological data analysis method according to claim 1, wherein, before obtaining the clustering result that divides the pathological data sample set into several clusters, the method comprises:
    obtaining the pathological data sample set; and
    calculating the clustering result of the pathological data sample set based on the K-Means clustering algorithm.
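For readers unfamiliar with K-Means, the following is a minimal sketch of Lloyd's algorithm, the usual way K-Means is computed. It is not taken from the application. Using the first k points as initial centers is an assumption made here so that the example is reproducible; practical implementations normally use a randomized initialization.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centers).

    Assumption: the first k points serve as initial centers, which keeps
    the example deterministic but is not a recommended initialization.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
labels, centers = kmeans(points, k=2)
```

The returned centers are the same cluster means that a center-based quality score would need, so they can be reused instead of being recomputed.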
  5. The pathological data analysis method according to claim 1, wherein, before obtaining the clustering result that divides the pathological data sample set into several clusters, the method comprises:
    obtaining the pathological data sample set; and
    calculating the clustering result of the pathological data sample set based on an agglomerative hierarchical clustering algorithm.
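Agglomerative hierarchical clustering starts with every point in its own cluster and repeatedly merges the two closest clusters. The claim does not name a linkage criterion or a stopping rule, so the sketch below assumes single linkage (closest pair of members) and stops when a requested number of clusters k remains. The function name and the sample points are hypothetical.

```python
import math

def agglomerative(points, k):
    """Bottom-up clustering with single linkage; returns one label per point.

    Assumptions: single linkage and a fixed target of k clusters. This
    naive version rescans all cluster pairs on every merge, so it is
    only suitable for small inputs.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    clusters = [[i] for i in range(len(points))]  # each point starts alone

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Find the closest pair of clusters and merge the second into the first.
        x, y = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x].extend(clusters[y])
        del clusters[y]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.5, 0.5)]
labels = agglomerative(points, k=2)
```

Unlike K-Means, this procedure does not produce center points as a by-product, so a center-based quality score would compute them from the final labels.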
  6. A pathological data analysis apparatus, comprising:
    a result obtaining module, configured to obtain a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
    a center point calculation module, configured to calculate the center point of each cluster according to the clustering result;
    a distance calculation module, configured to calculate the distance between each pathological sample point i and the center point of each cluster;
    a sample point coefficient calculation module, configured to calculate the adjusted silhouette coefficient of the pathological sample point i according to the distance between the pathological sample point i and the center point of each cluster, using the following formula:
    Figure PCTCN2020093328-appb-100002
    where s_c(i) denotes the adjusted silhouette coefficient of pathological sample point i; a_c(i) denotes the distance between pathological sample point i and the center point of the cluster to which it belongs; and b_c(i) denotes the distance between pathological sample point i and the center point of the cluster nearest to pathological sample point i;
    a result coefficient calculation module, configured to calculate the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
    a result evaluation module, configured to determine the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
    a sample obtaining module, configured to obtain a pathological data sample to be processed when the clustering result is determined to be good; and
    a sample analysis module, configured to classify the pathological data sample to be processed according to the clustering result, and to generate pathological analysis data corresponding to the pathological data sample to be processed.
  7. The pathological data analysis apparatus according to claim 6, further comprising:
    a multi-result calculation module, configured to calculate the adjusted silhouette coefficients of multiple clustering results; and
    an optimal result determining module, configured to determine the clustering result with the highest adjusted silhouette coefficient as the optimal clustering result of the pathological data sample set.
  8. The pathological data analysis apparatus according to claim 6, further comprising:
    a coefficient judgment module, configured to judge whether the adjusted silhouette coefficient of the clustering result is greater than a preset coefficient threshold; and
    a preferred result determining module, configured to determine the clustering result as a preferred clustering result of the pathological data sample set if the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold.
  9. The pathological data analysis apparatus according to claim 6, further comprising:
    a sample set obtaining module, configured to obtain the pathological data sample set; and
    a first clustering calculation module, configured to calculate the clustering result of the pathological data sample set based on the K-Means clustering algorithm.
  10. The pathological data analysis apparatus according to claim 6, further comprising:
    a sample set obtaining module, configured to obtain the pathological data sample set; and
    a second clustering calculation module, configured to calculate the clustering result of the pathological data sample set based on an agglomerative hierarchical clustering algorithm.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    obtaining a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
    calculating the center point of each cluster according to the clustering result;
    calculating the distance between each pathological sample point i and the center point of each cluster;
    calculating the adjusted silhouette coefficient of the pathological sample point i according to the distance between the pathological sample point i and the center point of each cluster, using the following formula:
    Figure PCTCN2020093328-appb-100003
    where s_c(i) denotes the adjusted silhouette coefficient of pathological sample point i; a_c(i) denotes the distance between pathological sample point i and the center point of the cluster to which it belongs; and b_c(i) denotes the distance between pathological sample point i and the center point of the cluster nearest to pathological sample point i;
    calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
    determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
    when the clustering result is determined to be good, obtaining a pathological data sample to be processed; and
    classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
  12. The computer device according to claim 11, wherein, after calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result, the processor, when executing the computer-readable instructions, further implements the following steps:
    calculating the adjusted silhouette coefficients of multiple clustering results; and
    determining the clustering result with the highest adjusted silhouette coefficient as the optimal clustering result of the pathological data sample set.
  13. The computer device according to claim 11, wherein, after calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result, the processor, when executing the computer-readable instructions, further implements the following steps:
    judging whether the adjusted silhouette coefficient of the clustering result is greater than a preset coefficient threshold; and
    if the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold, determining the clustering result as a preferred clustering result of the pathological data sample set.
  14. The computer device according to claim 11, wherein, before obtaining the clustering result that divides the pathological data sample set into several clusters, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining the pathological data sample set; and
    calculating the clustering result of the pathological data sample set based on the K-Means clustering algorithm.
  15. The computer device according to claim 11, wherein, before obtaining the clustering result that divides the pathological data sample set into several clusters, the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining the pathological data sample set; and
    calculating the clustering result of the pathological data sample set based on an agglomerative hierarchical clustering algorithm.
  16. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a clustering result of a pathological data sample set, wherein the clustering result divides the pathological data sample set into several clusters, each cluster consists of multiple pathological sample points i, and the number of pathological sample points i in the pathological data sample set is greater than a preset number threshold;
    calculating the center point of each cluster according to the clustering result;
    calculating the distance between each pathological sample point i and the center point of each cluster;
    calculating the adjusted silhouette coefficient of the pathological sample point i according to the distance between the pathological sample point i and the center point of each cluster, using the following formula:
    Figure PCTCN2020093328-appb-100004
    where s_c(i) denotes the adjusted silhouette coefficient of pathological sample point i; a_c(i) denotes the distance between pathological sample point i and the center point of the cluster to which it belongs; and b_c(i) denotes the distance between pathological sample point i and the center point of the cluster nearest to pathological sample point i;
    calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result;
    determining the quality of the clustering result according to the adjusted silhouette coefficient of the clustering result;
    when the clustering result is determined to be good, obtaining a pathological data sample to be processed; and
    classifying the pathological data sample to be processed according to the clustering result, and generating pathological analysis data corresponding to the pathological data sample to be processed.
  17. The readable storage medium according to claim 16, wherein, after calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    calculating the adjusted silhouette coefficients of multiple clustering results; and
    determining the clustering result with the highest adjusted silhouette coefficient as the optimal clustering result of the pathological data sample set.
  18. The readable storage medium according to claim 16, wherein, after calculating the average of the adjusted silhouette coefficients of all the pathological sample points i to obtain the adjusted silhouette coefficient of the clustering result, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    judging whether the adjusted silhouette coefficient of the clustering result is greater than a preset coefficient threshold; and
    if the adjusted silhouette coefficient of the clustering result is greater than the preset coefficient threshold, determining the clustering result as a preferred clustering result of the pathological data sample set.
  19. The readable storage medium according to claim 16, wherein, before obtaining the clustering result that divides the pathological data sample set into several clusters, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    obtaining the pathological data sample set; and
    calculating the clustering result of the pathological data sample set based on the K-Means clustering algorithm.
  20. The readable storage medium according to claim 16, wherein, before obtaining the clustering result that divides the pathological data sample set into several clusters, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    obtaining the pathological data sample set; and
    calculating the clustering result of the pathological data sample set based on an agglomerative hierarchical clustering algorithm.
PCT/CN2020/093328 2020-01-03 2020-05-29 Pathological data analysis method and apparatus, and device and storage medium WO2021135063A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010005182.7A CN111223570A (en) 2020-01-03 2020-01-03 Pathological data analysis method, device, equipment and storage medium
CN202010005182.7 2020-01-03

Publications (1)

Publication Number Publication Date
WO2021135063A1 true WO2021135063A1 (en) 2021-07-08

Family

ID=70830971

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/093328 WO2021135063A1 (en) 2020-01-03 2020-05-29 Pathological data analysis method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN111223570A (en)
WO (1) WO2021135063A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738319B (en) * 2020-06-11 2021-09-10 佳都科技集团股份有限公司 Clustering result evaluation method and device based on large-scale samples

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170228507A1 (en) * 2014-08-08 2017-08-10 Icahn School Of Medicine At Mount Sinai Automatic disease diagnoses using longitudinal medical record data
CN107609588A (en) * 2017-09-12 2018-01-19 大连大学 A kind of disturbances in patients with Parkinson disease UPDRS score Forecasting Methodologies based on voice signal
CN110136836A (en) * 2019-03-27 2019-08-16 周凡 A kind of disease forecasting method based on physical examination report clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MARTIN ATZMUELLER, ALVIN CHIN, FREDERIK JANSSEN, IMMANUEL SCHWEIZER, CHRISTOPH TRATTNER: "ICIAP: International Conference on Image Analysis and Processing, 17th International Conference, Naples, Italy, September 9-13, 2013. Proceedings", vol. 10358 Chap.21, 2 July 2017, SPRINGER, Berlin, Heidelberg, ISBN: 978-3-642-17318-9, article WANG FEI; FRANCO-PENYA HECTOR-HUGO; KELLEHER JOHN D.; PUGH JOHN; ROSS ROBERT: "An Analysis of the Application of Simplified Silhouette to the Evaluation of k-means Clustering Validity", pages: 291 - 305, XP047419482, 032548, DOI: 10.1007/978-3-319-62416-7_21 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117373564A (en) * 2023-12-08 2024-01-09 北京百奥纳芯生物科技有限公司 Method and device for generating binding ligand of protein target and electronic equipment
CN117373564B (en) * 2023-12-08 2024-03-01 北京百奥纳芯生物科技有限公司 Method and device for generating binding ligand of protein target and electronic equipment

Also Published As

Publication number Publication date
CN111223570A (en) 2020-06-02

Similar Documents

Publication Publication Date Title
WO2021082426A1 (en) Human face clustering method and apparatus, computer device, and storage medium
WO2021151325A1 (en) Method and apparatus for triage model training based on medical knowledge graphs, and device
Gu et al. Structural minimax probability machine
WO2021004112A1 (en) Anomalous face detection method, anomaly identification method, device, apparatus, and medium
WO2022042123A1 (en) Image recognition model generation method and apparatus, computer device and storage medium
WO2022126971A1 (en) Density-based text clustering method and apparatus, device, and storage medium
WO2020015075A1 (en) Facial image comparison method and apparatus, computer device, and storage medium
US9524449B2 (en) Generation of visual pattern classes for visual pattern recognition
WO2021003938A1 (en) Image classification method and apparatus, computer device and storage medium
WO2021135063A1 (en) Pathological data analysis method and apparatus, and device and storage medium
US11775610B2 (en) Flexible imputation of missing data
WO2020215560A1 (en) Auto-encoding neural network processing method and apparatus, and computer device and storage medium
WO2022077863A1 (en) Visual positioning method, and method for training related model, related apparatus, and device
US20170061257A1 (en) Generation of visual pattern classes for visual pattern regonition
Li et al. A general framework for association analysis of heterogeneous data
WO2023108995A1 (en) Vector similarity calculation method and apparatus, device and storage medium
WO2020134819A1 (en) Method for searching face, and related device
Kwedlo A new random approach for initialization of the multiple restart EM algorithm for Gaussian model-based clustering
WO2015001416A1 (en) Multi-dimensional data clustering
US20200082213A1 (en) Sample processing method and device
CN108388869B (en) Handwritten data classification method and system based on multiple manifold
WO2021068524A1 (en) Image matching method and apparatus, computer device, and storage medium
WO2023133055A1 (en) Simplifying convolutional neural networks using aggregated representations of images
Baidari et al. A criterion for deciding the number of clusters in a dataset based on data depth
CN114417095A (en) Data set partitioning method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20909685

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20909685

Country of ref document: EP

Kind code of ref document: A1