CN110633719B

CN110633719B - Micro-droplet data classification method

Info

Publication number: CN110633719B
Application number: CN201810646767.XA
Authority: CN
Inventors: 朱修锐; 郭永; 荆高山; 祝令香; 苏世圣; 付明珠; 王勇斗
Original assignee: Beijing Targeting One Biotechnology Co ltd; Tsinghua University
Current assignee: Beijing Targeting One Biotechnology Co ltd; Tsinghua University
Priority date: 2018-06-21
Filing date: 2018-06-21
Publication date: 2022-05-20
Anticipated expiration: 2038-06-21
Also published as: CN110633719A8; CN110633719A

Abstract

The invention provides a classification method of micro-droplet data, which comprises the following steps: inputting micro-droplet data and micro-droplet classification morphological parameters; dividing the micro-droplet data into grids, wherein all the grids form a grid map; dividing the grid map into at least one region according to the data density difference of the micro-droplets in each grid; setting a reference point combination according to the classification morphological parameters, comparing the reference point combination with the center of each area, and determining the optimal classification morphological parameters; and classifying the micro-droplet data according to the number of the regions. The classification method can automatically, objectively and accurately perform unsupervised and self-adaptive classification on various common micro-droplet data.

Description

Micro-droplet data classification method

Technical Field

The invention relates to the technical field of micro-droplets, in particular to a classification method of micro-droplet data.

Background

Micro-droplet technology can uniformly partition a conventional reaction system into hundreds to millions of micro-reactors (commonly tens of thousands to millions), thereby enabling high-throughput single-molecule analysis. It is widely used in fields such as rare molecule detection and absolute quantification of molecule numbers, with applications including micro-droplet digital polymerase chain reaction, micro-droplet digital enzyme-linked immunosorbent assay, and micro-droplet-based high-throughput single-cell analysis.

Each micro-droplet generates an n-dimensional vector during detection, and each dimension of this vector usually corresponds to a fluorescence channel. After each reaction, hundreds to millions of data points (each of which is referred to as a member of the micro-droplet data) are obtained. The number of target molecules in a sample generally corresponds to the number of data points with higher fluorescence values in certain fluorescence dimensions, and it is therefore desirable to determine automatically, objectively, and accurately which members of the micro-droplet data have higher fluorescence values.

Researchers have devised a number of classification methods in order to achieve automatic, objective, and accurate interpretation of microdroplet data. These classification methods can be classified into either supervised or unsupervised categories. The supervised classification algorithm requires learning using existing microdroplet data. Milbury et al suggested that existing microdroplet data could be used to delineate two-dimensional regions (such as the T790 and L858 sites of EGFR) as classification thresholds for subsequent experiments. Similarly, Jones et al propose that quality control with known template copy numbers can be used to define the location of individual classes, excluding the effect of scatter (rain data) as a threshold for subsequent classes. Since the supervised classification algorithm needs to be adapted to the microdroplet fluorescence data for learning, the trained classification algorithm is generally only adapted to the same or similar reactions as the learned data. In contrast, unsupervised classification algorithms do not require training using data. Common unsupervised classification algorithms include: k-means method, k-center method, and expectation-maximization method based on joint probability density function, etc., which have a common disadvantage: the number (classification determining number) and position of the initial values have a great influence on the classification effect, and in order to avoid the influence, the initial values are usually randomly selected for many times, but the random selection of the initial values can cause the repeatability of the classification result to be reduced, and different interpretation results can be given to the same group of data in practical application (particularly clinical application). 
In order to avoid the influence of using initial values on the interpretation of the micro-droplet data, an unsupervised classification method independent of the initial values can be considered, wherein the commonly used methods include a density search method and a spectral clustering method, the common defects of the two methods are that the consumption of computing resources is overlarge, the former method needs to carry out recursive search, all data must be recorded and stored before each recursive convergence, the latter method needs to calculate the distance between every two data points and calculate the characteristic value of a distance matrix, and for the micro-droplet data, the distance matrix is difficult to be thinned, the storage space is large, and the characteristic value is difficult to be calculated, so that the two methods are not suitable for the classification of large-scale micro-droplet data (such as the micro-droplet data with more than ten thousand members). In addition, research groups have proposed unsupervised classification methods specifically for microdroplet data. Trypsteen et al proposed a method for classifying one-dimensional microdroplet data by estimating the bounds of negative data points using the Robertson-Cryer model to achieve classification of negative and positive data points. Attalii et al propose a classification method for two-dimensional micro-droplet data, which firstly identifies negative data in fluorescence data, then judges and removes scattered points in the data, and finally classifies the remaining data into two classifications of single positive and double positive. Although this method achieves the discrimination of scatter in an unsupervised manner, it can only distinguish data of up to 3 classes. 
Lau et al proposed a method for classifying two-dimensional micro-droplet data, called CALICO, which first divides the data into grids, then determines the number of classifications from the connected domains in the grids, and classifies the fluorescence data by connected domain. This method uses density as the classification criterion, which, among all unsupervised methods, is closest to how a human interprets fluorescence data (the fluorescence data seen by a human is "gridded" by the pixels of a display). However, classifying fluorescence data by connected domain alone is susceptible to scatter, so that several classifications may be misjudged as one. In summary, although unsupervised classification algorithms do not require training data, they have difficulty solving the following problems simultaneously: (1) classifying fluorescence data of up to two dimensions; (2) automatically determining the number of classifications; (3) avoiding the influence of scatter, which can split one classification into several or connect separate classifications to each other.

In summary, there is still a need for a more efficient unsupervised adaptive classification method for automatically, objectively and accurately discriminating microdroplet data, which needs to be adapted to several common situations in microdroplet applications: (1) clearly classified microdroplet data (normal experiments will produce such data), (2) microdroplet data with density fluctuations (non-uniform reaction efficiency will produce such data), (3) microdroplet data with scatter (errors in the detection system will produce such data), (4) microdroplet data with incompletely classified morphology (negative samples or samples with partial negative indicators will produce such data), (5) microdroplet data with rare data classifications (samples containing rare target molecules will produce such data), and other more general microdroplet data.

Disclosure of Invention

The invention provides a classification method of micro-droplet data, which comprises the following steps: step 1: inputting micro-droplet data and classification morphological parameters of the micro-droplets, wherein the micro-droplet data consists of a plurality of data members, the dimensionality of each data member is the same, or is the same after transformation, and the classification morphological parameters are the number parameter and the relative position parameters of the micro-droplet classifications; step 2: dividing the micro-droplet data into grids, wherein all the grids form a grid map; step 3: dividing the grid map into at least one region according to the differences in micro-droplet data density among the grids, wherein the data density in each grid is the number of micro-droplet data members in that grid, or that number after transformation; step 4: setting reference point combinations according to the classification morphological parameters, comparing each reference point combination with the centers of the regions, and determining the optimal classification morphological parameters, wherein a reference point combination satisfies the following conditions: first, the relative position between any two reference points is one of the relative position parameters in the classification morphological parameters; second, the number of reference points in the combination does not exceed the number parameter of the classification morphological parameters; and step 5: classifying the micro-droplet data according to the number of regions: if the number of regions after step 3 does not exceed the number parameter of the optimal classification morphological parameters from step 4, classifying the micro-droplet data directly according to the regions; if the number of regions after step 3 is greater than that number parameter, selecting and/or merging the regions until the number of regions after selection and/or merging does not exceed the number parameter, and then classifying the micro-droplet data according to the selected and/or merged regions.

In one embodiment, the classification morphology parameters in step 1 are given row and column numbers of micro-droplets, and the classification morphology parameters are located on equally spaced checkerboard grid points, wherein the number of the grid points is a number parameter of the classification morphology parameters, and the relative positions of the grid points are relative position parameters of the classification morphology parameters.

In one embodiment, the grid in step 2 is a checkerboard grid, preferably a checkerboard grid satisfying the following conditions: firstly, the side lengths of grids in each dimension of the chessboard grids are the same; second, the respective boundaries of the grid map are determined by the maxima and minima of the microdroplet data in each dimension; thirdly, the number of divisions per dimension of the checkerboard grid is a given positive integer constant or increases with the number of different values of the projection of the microdroplet data on that dimension, preferably the logarithm of the number of different values of the projection of the microdroplet data on that dimension, and rounded up.

In one embodiment, the method of dividing the grid map into at least one region is: if there is a unique peak grid within the grid map, the grid map is treated as a single region; otherwise, at least one boundary grid is found such that the remaining grids, referred to as internal grids, are divided into regions satisfying the following conditions: first, non-adjacent peak grids are internal grids of different regions; second, the sum of the data densities in the boundary grids is minimal. Here a peak grid is a grid whose data density is not less than that of any adjacent grid. Preferably, the data density in a grid is the number of micro-droplet data members in it; the data density in each grid is first negated, and the grids are then divided into regions using a watershed algorithm.
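
As a minimal illustrative sketch of the peak-grid definition (not the watershed division itself), the following Python code lists the peak grids of a small density map. The function name `find_peak_grids`, the toy densities, and the use of 8-neighbour adjacency are assumptions made for illustration; the text does not state which adjacency is used.

```python
from typing import List, Tuple


def find_peak_grids(density: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (row, col) of every grid whose density is >= that of all adjacent grids."""
    rows, cols = len(density), len(density[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            is_peak = True
            # Assumption: a grid is adjacent to its 8 surrounding grids.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and density[nr][nc] > density[r][c]:
                        is_peak = False
            if is_peak:
                peaks.append((r, c))
    return peaks


# Hypothetical 3 x 4 grid map of member counts.
density = [
    [9, 4, 1, 2],
    [4, 3, 1, 5],
    [1, 1, 2, 8],
]
peaks = find_peak_grids(density)
print(peaks)  # [(0, 0), (2, 3)]
```

In this toy map the two peak grids are not adjacent, so under the conditions above they would have to be internal grids of two different regions.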

In one embodiment, the method for determining the optimal classification morphological parameters in step 4 is: examining the possible reference point combinations in turn; for each combination, calculating the distance from the center of each region to its nearest reference point, weighting each distance by the number of data members in the region, and summing; and taking the reference point combination with the smallest weighted sum of nearest distances as the optimal classification morphological parameters, wherein the number parameter is the number of reference points in the selected combination and the relative position parameters are the relative positions among the reference points in the selected combination.
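
The selection rule above can be sketched in Python as follows. The function names, the dictionary of candidate combinations, and the toy region centers and member counts are hypothetical and serve only to illustrate the weighted nearest-distance sum; Euclidean distance is assumed.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def weighted_distance_sum(centers: Sequence[Point], counts: Sequence[int],
                          refs: Sequence[Point]) -> float:
    """Sum over regions of (member count) x (distance from region center to nearest reference point)."""
    total = 0.0
    for center, n in zip(centers, counts):
        nearest = min(math.dist(center, ref) for ref in refs)
        total += n * nearest
    return total


def best_combination(centers: Sequence[Point], counts: Sequence[int],
                     candidates: Dict[str, List[Point]]) -> str:
    """Return the name of the reference point combination with the smallest weighted sum."""
    return min(candidates,
               key=lambda name: weighted_distance_sum(centers, counts, candidates[name]))


# Hypothetical region centers (in grid units) and member counts.
centers = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
counts = [100, 50, 50, 20]
candidates = {
    "1x1": [(5.0, 5.0)],
    "1x2": [(0.0, 5.0), (10.0, 5.0)],
    "2x2": [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)],
}
print(best_combination(centers, counts, candidates))  # 2x2
```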

In one embodiment, the merging in step 5 may be performed by: calculating separability values between adjacent regions, and merging regions whose separability value is lower than the separability threshold; wherein the separability value is calculated by a separability function, which is a function that varies monotonically with both the data density in the adjacent regions and the data density at the boundary between them, preferably a function that increases monotonically with the data density in the adjacent regions and decreases monotonically with the data density at the boundary between them; more preferably the separability function is:

wherein s is the separability value, d₁₂ is the maximum data density on the boundary between the two adjacent regions, and d₁ and d₂ are the maximum data densities of the grids in the two adjacent regions, respectively; the separability threshold is any value within the value range of the separability function, preferably the average of the maximum and minimum values of the separability function.

In one embodiment, the selection method in step 5 is: the first step is bidirectional selection: for each reference point in the optimal classification morphological parameters, find the region whose center is closest to that reference point; if the reference point closest to the center of that region is also that same reference point, the region is taken as the region belonging to the reference point and is called a main region. The second step is unidirectional selection: for each reference point for which no main region has been found, if the region whose center is closest to the reference point has not already been determined to be a main region, that region is taken as the main region belonging to the reference point. Each main region belongs to a different classification.
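
The two-step selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions: Euclidean distance between region centers and reference points, ties broken by lowest index, and hypothetical names (`select_main_regions`, `centers`, `refs`) and toy coordinates.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def select_main_regions(centers: Sequence[Point], refs: Sequence[Point]) -> Dict[int, int]:
    """Map each reference point index to the index of the region chosen as its main region."""
    # Region whose center is closest to each reference point.
    nearest_region = [
        min(range(len(centers)), key=lambda i: math.dist(centers[i], ref))
        for ref in refs
    ]
    # Reference point closest to each region center.
    nearest_ref = [
        min(range(len(refs)), key=lambda j: math.dist(center, refs[j]))
        for center in centers
    ]

    main: Dict[int, int] = {}
    # Step 1, bidirectional selection: the reference point and the region pick each other.
    for j, i in enumerate(nearest_region):
        if nearest_ref[i] == j:
            main[j] = i
    # Step 2, unidirectional selection: take the nearest region if it is not yet a main region.
    for j, i in enumerate(nearest_region):
        if j not in main and i not in main.values():
            main[j] = i
    return main


# Hypothetical reference points (2 x 2 checkerboard) and region centers.
refs = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
centers = [(1.0, 1.0), (1.0, 9.0), (9.0, 1.0), (6.0, 4.0)]
print(select_main_regions(centers, refs))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

In this toy case the first three reference points obtain their main regions in the bidirectional step, while the fourth reference point obtains region 3 only in the unidirectional step, because region 3 is itself closer to a different reference point.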

In one embodiment, the main regions are selected as follows: if the number of regions after step 3 is greater than the number parameter of the optimal classification morphological parameters from step 4, a judgment of scattered regions is optionally added before the main regions are selected. A scattered region is defined as a region in which the data density of every grid is lower than a scatter threshold at a given confidence level, wherein the confidence level is a real number greater than or equal to 0 and less than 1, preferably 0.95. The scatter threshold is the largest positive integer meeting the following condition: the probability that the data density generated in any grid by the determined number of scatter points, according to a given probability distribution model, is not higher than the positive integer is not less than the confidence level. The number of scatter points is determined from the given probability distribution model and the proportion of grids in the grid map whose data density is greater than 0, and the probability distribution model is preferably a uniform distribution.

In one embodiment, the scattered regions are judged, and different main region selection methods are executed according to the result: if the number of regions not judged to be scattered regions is greater than the number parameter of the optimal classification morphological parameters, the main regions are selected from the regions not judged to be scattered regions; otherwise, the main regions are selected from all regions.

In one embodiment, if a main region belonging to each reference point of the optimal classification morphology cannot be selected from the regions not judged to be scattered regions, the main regions are re-selected from all regions.

In one embodiment, one merging method in step 5 is: if there are regions not selected as main regions, calculating the center distance between each pair of adjacent regions and constructing a distance function, which is a function that increases or decreases monotonically with the center distance and is preferably equal to the center distance; then searching all pairs of adjacent regions that contain at least one region not selected as a main region, and merging the pair with the minimum (or, for a decreasing function, maximum) value of the distance function; if the pair contains a main region, the merged region is a main region, otherwise it is not; and repeating the merging process until all regions are main regions.
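
A minimal Python sketch of this merging loop is given below, using the preferred case in which the distance function equals the center distance and the closest pair is merged first. The function name, the adjacency representation, and the recomputation of a merged region's center as the member-weighted mean of the two centers are assumptions for illustration; the text does not specify how the center of a merged region is obtained.

```python
import math
from typing import Dict, Set, Tuple


def merge_non_main(centers: Dict[int, Tuple[float, ...]], counts: Dict[int, int],
                   adjacency: Dict[int, Set[int]], main: Set[int]) -> Dict[int, int]:
    """Return a map from each original region id to the region it finally belongs to."""
    centers, counts = dict(centers), dict(counts)
    adjacency = {r: set(nbrs) for r, nbrs in adjacency.items()}
    owner = {r: r for r in centers}

    while any(r not in main for r in centers):
        # Adjacent pairs that contain at least one region not selected as a main region.
        candidates = [
            (math.dist(centers[a], centers[b]), a, b)
            for a in centers for b in adjacency[a]
            if a < b and not (a in main and b in main)
        ]
        if not candidates:
            break  # a non-main region with no neighbours cannot be merged
        _, keep, drop = min(candidates)
        if drop in main:  # the merged region keeps the main region's id
            keep, drop = drop, keep
        total = counts[keep] + counts[drop]
        # Assumption: merged center is the member-weighted mean of the two centers.
        centers[keep] = tuple(
            (counts[keep] * p + counts[drop] * q) / total
            for p, q in zip(centers[keep], centers[drop])
        )
        counts[keep] = total
        for nbr in adjacency.pop(drop):
            adjacency[nbr].discard(drop)
            if nbr != keep:
                adjacency[nbr].add(keep)
                adjacency[keep].add(nbr)
        del centers[drop], counts[drop]
        for region, current in owner.items():
            if current == drop:
                owner[region] = keep
    return owner


# Hypothetical example: regions 0 and 1 are main regions, region 2 is not.
owner = merge_non_main(
    centers={0: (0.0, 0.0), 1: (10.0, 0.0), 2: (2.0, 0.0)},
    counts={0: 100, 1: 100, 2: 10},
    adjacency={0: {1, 2}, 1: {0, 2}, 2: {0, 1}},
    main={0, 1},
)
print(owner)  # {0: 0, 1: 1, 2: 0}
```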

In one embodiment, the classification in step 5 may be performed as follows: if a micro-droplet data member is located in an internal grid of a region, it is placed into the classification represented by that region; if a micro-droplet data member is located in a boundary grid, the probability that it belongs to each adjacent region associated with that boundary grid is calculated from the probability distribution of the micro-droplet data within each such region, and the data member is placed into the classification represented by the region with the highest probability; preferably, the probability distribution of the micro-droplet data within a region is a normal distribution.
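
The assignment of a boundary-grid member can be sketched in Python as follows. This is a simplified illustration, assuming that each region is modelled by independent normal distributions per dimension fitted to its internal members, and that the region with the highest density at the member's position is chosen; the function names and toy data are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def fit_normal(members: Sequence[Point]) -> List[Tuple[float, float]]:
    """Per-dimension (mean, standard deviation) of a region's internal members."""
    return [(mean(dim), pstdev(dim)) for dim in zip(*members)]


def log_density(point: Point, params: Sequence[Tuple[float, float]]) -> float:
    """Log of the product of independent normal densities (assumed model)."""
    return sum(
        -math.log(s * math.sqrt(2 * math.pi)) - 0.5 * ((x - m) / s) ** 2
        for x, (m, s) in zip(point, params)
    )


def assign_boundary_member(point: Point, regions: Dict[str, Sequence[Point]]) -> str:
    """Return the adjacent region under which the boundary member is most probable."""
    fitted = {name: fit_normal(pts) for name, pts in regions.items()}
    return max(fitted, key=lambda name: log_density(point, fitted[name]))


# Hypothetical internal members of two regions adjacent to one boundary grid.
regions = {
    "negative": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)],
    "positive": [(9.0, 1.0), (10.0, 1.0), (9.0, 2.0), (10.0, 2.0)],
}
print(assign_boundary_member((4.0, 1.5), regions))  # negative
```

The sketch requires every region to have nonzero spread in each dimension; a full implementation would need to handle degenerate regions.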

The invention discloses a classification method of micro-droplet data that can perform unsupervised, self-adaptive classification on various kinds of micro-droplet data, including: clearly classified micro-droplet data, micro-droplet data with density fluctuations, micro-droplet data with scatter points, micro-droplet data with incomplete classification morphology, micro-droplet data with rare classifications, and the like. The method has the advantages of accurate classification, suitability for various kinds of micro-droplet data, and no need for training data.

Drawings

FIG. 1 is a diagram of the results of the classification process of the first embodiment: (a) a data graph of embodiment one; (b) carrying out grid division on micro-droplet data, (c) dividing a grid map into different region maps by using a watershed algorithm, (d) determining an optimal classification form parameter map, (e) merging the region maps according to a separability value, (f) judging a scattered region (the center of the scattered region is represented by a diamond shape) map, (g) selecting a main region map, (h) merging the region maps which are not selected as the main region, and (i) carrying out classification of the micro-droplet data according to the merged main region;

FIG. 2 is a process diagram of determining the optimal classification morphological parameters for the data of the first embodiment: given that the classification morphological parameters are a checkerboard distribution of 2 rows and 2 columns, (a-d) the 4 different combinations of row and column numbers are traversed and the corresponding reference points are set for each combination; each reference point is represented by a flag symbol in the figure, with the center of the ellipse at the bottom of the flag marking the position of the reference point (the same applies below); the classification morphological parameters shown by the reference points in panel d are the optimal classification morphological parameters;

FIG. 3 is a process diagram of merging regions according to separability values for the first embodiment: (a) the region map before the process starts, the same as FIG. 1c; (b) the pair of adjacent regions marked by the ellipse has a separability value of about 0.18; (c) the map after merging the pair of adjacent regions marked in panel b; (d) the pair of adjacent regions marked by the ellipse has a separability value of about 0.29; (e) the map after merging the pair of adjacent regions marked in panel d, the same as FIG. 1e;

FIG. 4 is a process diagram of selecting main regions for the data of the first embodiment: (a) (b-c) the process of selecting main regions with priority given to regions not judged to be scattered regions, wherein the dots in panel b are the main regions selected after the first step (bidirectional selection), and the dots in panel c are the main regions selected after the first step (bidirectional selection) followed by the second step (unidirectional selection); at this point the main region belonging to each reference point of the optimal classification morphology has been selected, and this panel is the same as FIG. 1g;

FIG. 5 is a process diagram of merging regions not selected as main regions for the data of the first embodiment: (a) the region map before the process starts, the same as FIG. 1g; (b) the pair of adjacent regions marked by the ellipse contains a region not selected as a main region, and the distance between their centers is about 3.94; (c) the pair of adjacent regions marked in panel b is merged, and the merged region is a main region; (d) the pair of adjacent regions marked by the ellipse contains a region not selected as a main region, and the distance between their centers is about 9.14; (e) the pair of adjacent regions marked in panel d is merged, and the merged region is a main region; at this point all regions are main regions, and this panel is the same as FIG. 1h;

FIG. 6 is a diagram of the results of the classification process of example two: (a) data graphs of example two; (b) carrying out a grid division graph on the micro-droplet data, (c) dividing a grid map into different region graphs by using a watershed algorithm, (d) determining an optimal classification morphological parameter graph, and (e) classifying the micro-droplet data according to the regions;

FIG. 7 is a diagram of the results in the classification process of the third embodiment: (a) data graphs of example three; (b) carrying out grid division drawing on the micro-droplet data, (c) dividing a grid map into different region maps by using a watershed algorithm, (d) determining an optimal classification morphological parameter map, (e) merging the region maps according to separability values, and (f) carrying out classification drawing on the micro-droplet data according to the merged regions;

FIG. 8 is a diagram showing the results of the classification process in the fourth embodiment: (a) data graphs for example four; (b) carrying out a grid division graph on the micro-droplet data, (c) dividing the grid map into different region maps by using a watershed algorithm, (d) determining an optimal classification morphological parameter map, (e) judging a scattered region map, and (f) carrying out a classification graph on the micro-droplet data according to regions which are not judged to be scattered regions;

FIG. 9 is a diagram showing the results of the classification process in the fifth embodiment: (a) data graphs for example five; (b) carrying out a grid division graph on the micro-droplet data, (c) dividing the grid map into different region maps by using a watershed algorithm, (d) determining an optimal classification morphological parameter map, (e) selecting a main region map, (f) merging the region maps which are not selected as the main region, and (g) classifying the micro-droplet data according to the merged region;

fig. 10 is a diagram of the results in the classification process of the sixth embodiment: (a) data graphs for example six; (b) carrying out grid division drawing on micro-droplet data, (c) dividing a grid map into different region maps by using a watershed algorithm, (d) determining an optimal classification morphological parameter map, (e) judging a scattered region map, (f) selecting a main region map, (g) merging the region maps which are not selected as the main region, (h) carrying out classification drawing on the micro-droplet data according to the merged region;

FIG. 11 is a process diagram of selecting main regions for the data of the sixth embodiment: (a) the region map before the process starts (the same as FIG. 10e) together with the reference points of the optimal classification morphological parameters; (b) with priority given to regions not judged to be scattered regions, only 3 main regions are selected in total, that is, no main region can be found for the upper-right reference point of the optimal classification morphological parameters shown in panel a; therefore, (c) the main regions are re-selected from all regions and 4 main regions are found, and this panel is the same as FIG. 10f.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the present invention will be further described below with reference to the following examples, and it is obvious that the described examples are only a part of the examples of the present application, and not all examples. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The first embodiment is as follows: classification process of micro-droplet data

This embodiment uses one set of micro-droplet data to describe the complete implementation flow of the micro-droplet data classification method.

Step 1: inputting micro-droplet data and micro-droplet classification morphological parameters

The input micro-droplet data (fig. 1a) contains a total of 50000 members, each of dimension 2, corresponding to the fluorescence values of two channels. The classification morphological parameters of the micro-droplets are a checkerboard distribution of 2 rows and 2 columns, wherein the number parameter is 4 and, taking the reference point A at the lower left corner as the origin, the relative position parameters are: a reference point B directly above reference point A, a reference point D directly to the right of reference point A, and a reference point C directly to the right of reference point B and directly above reference point D (i.e. the positions of the centers of the ellipses at the bottom of the flags in fig. 2d).

Step 2: dividing the micro-droplet data into grids, wherein all the grids form a grid map

The minimum and maximum values of the data in each dimension are determined and used as the boundaries of the division in that dimension. The number of divisions in each dimension is calculated from the number of distinct values of the projection of the data on that dimension. In this embodiment, the projection of the data on each dimension has 50000 distinct values (log₂50000 ≈ 15.6), and the number of divisions in each dimension is 17. Therefore, the micro-droplet data is divided into 17 × 17 rectangular grids (fig. 1b) bounded by the minimum and maximum values in each dimension, and all 17 × 17 grids form the grid map.
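
The grid division of this step can be sketched in Python as follows. The exact rounding expression is not fully specified in the text, so the form 1 + ⌈log₂ n⌉ used in `n_divisions` is an assumption, chosen because it gives the 17 divisions reported for n = 50000 distinct values; the function names are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple


def n_divisions(values: Sequence[float]) -> int:
    """Number of divisions for one dimension.

    Assumed form: 1 + ceil(log2(number of distinct values)).
    """
    distinct = len(set(values))
    return 1 + math.ceil(math.log2(distinct))


def grid_index(x: float, lo: float, hi: float, k: int) -> int:
    """Map x in [lo, hi] to a grid index 0..k-1; the maximum falls in the last grid."""
    if hi == lo:
        return 0
    return min(int((x - lo) / (hi - lo) * k), k - 1)


def grid_map(points: Sequence[Tuple[float, float]]) -> Dict[Tuple[int, int], int]:
    """Count the members in each grid; the count is the data density of that grid."""
    xs, ys = zip(*points)
    kx, ky = n_divisions(xs), n_divisions(ys)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    return Counter(
        (grid_index(x, x_lo, x_hi, kx), grid_index(y, y_lo, y_hi, ky))
        for x, y in points
    )


print(n_divisions(range(50000)))  # 17
```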

Step 3: dividing the grid map into regions

The number of micro-droplet data members in each grid is counted and used as the data density of that grid. The data density of each grid is negated, and the grid map is divided into 8 regions (fig. 1c) using a watershed algorithm. The grids forming the region boundaries are called boundary grids, and the remaining grids are the internal grids of the different regions. The watershed algorithm ensures that non-adjacent peak grids are internal grids of different regions and that the sum of the data densities in the boundary grids is minimal.

Step 4: determining optimal classification morphological parameters

In this step, all possible combinations of row and column numbers are traversed to determine the optimal classification morphology. There are 4 cases: 1 row and 1 column, 1 row and 2 columns, 2 rows and 1 column, and 2 rows and 2 columns. Reference points are set for each case (in fig. 2, the center of the ellipse at the bottom of each flag is the position of a reference point), wherein the number of reference points equals the product of the number of rows and the number of columns, and the relative position parameters are determined as follows: if the number of rows or columns in a dimension is 1, the projections of all reference points on that dimension lie at the center of the range of the projection of the micro-droplet data on that dimension (i.e. the average of the minimum and maximum values); if the number of rows or columns in a dimension is 2, the projection of each reference point on that dimension lies at the minimum or the maximum of the projection of the micro-droplet data on that dimension; if the number of rows or columns in a dimension is m (m ≥ 3), the projection of each reference point on that dimension lies at the minimum, the maximum, or one of the points dividing the range of the projection of the micro-droplet data on that dimension into m-1 equal parts. According to these rules, the relative position parameters of the reference points for the 4 row-column combinations are shown in FIGS. 2a-d.
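
The placement rule for reference points can be sketched in Python as follows. The function names are hypothetical, and treating columns as positions along the first dimension and rows as positions along the second dimension is an assumption made for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple


def axis_positions(lo: float, hi: float, m: int) -> List[float]:
    """Reference point positions along one dimension for m rows or columns."""
    if m == 1:
        return [(lo + hi) / 2]  # center of the data range
    step = (hi - lo) / (m - 1)  # m >= 2: minimum, maximum, and equal-division points
    return [lo + i * step for i in range(m)]


def reference_points(x_range: Tuple[float, float], y_range: Tuple[float, float],
                     n_cols: int, n_rows: int) -> List[Tuple[float, float]]:
    """All reference points of one row-column combination."""
    xs = axis_positions(x_range[0], x_range[1], n_cols)
    ys = axis_positions(y_range[0], y_range[1], n_rows)
    return [(x, y) for y in ys for x in xs]


def all_combinations(x_range: Tuple[float, float], y_range: Tuple[float, float],
                     max_cols: int, max_rows: int) -> Dict[Tuple[int, int], List[Tuple[float, float]]]:
    """Reference points for every (rows, cols) combination up to the given maxima."""
    return {
        (rows, cols): reference_points(x_range, y_range, cols, rows)
        for rows, cols in product(range(1, max_rows + 1), range(1, max_cols + 1))
    }


combos = all_combinations((0.0, 16.0), (0.0, 16.0), max_cols=2, max_rows=2)
print(sorted(combos))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
print(combos[(1, 1)])  # [(8.0, 8.0)]
```

For a maximum of 2 rows and 2 columns this yields the 4 cases listed above, with 1, 2, 2 and 4 reference points respectively.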

For each combination, the distance from the center of each region to its nearest reference point is calculated, weighted by the number of data members in the region, and summed (called the weighted distance sum, in units of grid side length). The results are: for 1 row and 1 column, the weighted distance sum is 3.91E5; for 1 row and 2 columns, 3.28E5; for 2 rows and 1 column, 2.76E5; for 2 rows and 2 columns, 1.92E5. Since the weighted distance sum is smallest when the classification morphology is 2 × 2, the classification morphological parameters corresponding to 2 rows and 2 columns are the optimal classification morphological parameters (fig. 2d).

Step 5: classifying the micro-droplet data

Since the number of regions after step 3 (8) is greater than the number parameter of the optimal classification morphological parameters from step 4 (4), the regions obtained in step 3 need to be selected and merged.

First, regions are merged based on the separability value. The separability function is defined as

where d₁₂ is the maximum data density at the boundary between two adjacent regions, and d₁ and d₂ are the maximum data densities within the two adjacent regions, respectively. The value range of the separability function is [0, 1]; accordingly, the separability threshold is set to 0.5. Calculated according to the separability function, the two regions shown in the circle of fig. 3b (d₁₂ = 28, d₁ = 36, d₂ = 32) have the lowest separability value, 0.18, which does not reach the threshold 0.5, so the two regions are merged (fig. 3c). After this merging, the two regions shown in the circle of fig. 3d (d₁₂ = 1333, d₁ = 2058, d₂ = 1720) have the lowest separability value, 0.29, which does not reach the threshold 0.5, so the two regions are merged (fig. 3e). After this merging, the separability values between all regions are not less than the threshold 0.5, and the merging process based on separability values ends (fig. 1e).
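The separability formula itself is not reproduced in this text, so the following Python sketch does not state the defined function. It only evaluates two hypothetical candidate forms, 1 - d₁₂/√(d₁d₂) and 1 - 2d₁₂/(d₁ + d₂), both of which are assumptions made for illustration. Each lies in [0, 1] whenever d₁₂ does not exceed d₁ and d₂, and each rounds to the two values reported above.

```python
import math

def sep_geometric(d12, d1, d2):
    """Hypothetical form: boundary density relative to the geometric mean of the peak densities."""
    return 1.0 - d12 / math.sqrt(d1 * d2)

def sep_arithmetic(d12, d1, d2):
    """Hypothetical form: boundary density relative to the arithmetic mean of the peak densities."""
    return 1.0 - 2.0 * d12 / (d1 + d2)

THRESHOLD = 0.5  # separability threshold used in this embodiment

# (d12, d1, d2) of the two region pairs described for figs. 3b and 3d.
pairs = [(28, 36, 32), (1333, 2058, 1720)]
for d12, d1, d2 in pairs:
    g = sep_geometric(d12, d1, d2)
    a = sep_arithmetic(d12, d1, d2)
    # Both pairs fall below the threshold under either form, so they would be merged.
    print(round(g, 2), round(a, 2), g < THRESHOLD, a < THRESHOLD)
```

The two reported values alone do not distinguish between these candidates, so neither should be read as the definition used in the method.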

Then, since the number of regions after step 3 is completed is greater than the number parameter of the optimal classification morphological parameters in step 4, scatter region judgment may optionally be added. Observing the input micro-droplet data (fig. 1a), a series of scatter points can be found in the upper left and upper right corners; in particular, the scatter points in the upper right corner form an independent "scatter region", so scatter region judgment needs to be added in this embodiment. From the data density within each grid, it can be obtained that O = 130/17² ≈ 45% of the grids have a data density other than 0 (O is the proportion of grids occupied by micro-droplet data). Assuming that each scatter point appears at any position in the grid map with the same probability (i.e., follows a uniform distribution), the number of scatter points in each grid follows a binomial distribution. With the total number of grids G = 17², the maximum likelihood estimate of the number of scatter points R satisfies the equation:

(1-1/G)^R＝1-O

Solving gives R = lg(1-O)/lg(1-1/G) ≈ 172. With the scatter density confidence α set to 0.95, the judgment threshold T of the scatter region is:

where n is a random variable whose probability distribution is B(R, 1/G), i.e., a binomial distribution with R trials and a success probability of 1/G per trial. With the scatter density confidence set to 0.95, solving gives T = 5, so a region in which the data density of every grid does not exceed the threshold 5 is judged to be a scatter region. In this embodiment, 1 region is judged to be a scatter region; the data density of all grids in this region is less than or equal to 1, and the region is located at the upper right corner in fig. 1f. To distinguish the region judged to be a scatter region from the other regions, the center of the scatter region is indicated by a diamond, and the centers of the regions not judged to be scatter regions are indicated by crosses (fig. 1f).
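The estimate of R and the threshold T can be checked numerically with the Python sketch below. The expression for T is not reproduced in this text, so the sketch rests on an assumption: the confidence α is taken to apply jointly to all G grids, treated as independent, so that T is the smallest integer t with P(n ≤ t)^G ≥ α. The function names are illustrative.

```python
import math

def estimate_scatter_count(occupied, total):
    """Solve (1 - 1/G)^R = 1 - O for R and round to the nearest integer."""
    o = occupied / total
    return round(math.log(1 - o) / math.log(1 - 1 / total))

def binom_cdf(t, trials, p):
    """P(n <= t) for n ~ B(trials, p)."""
    return sum(math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(t + 1))

def scatter_threshold(r, g, alpha):
    """Smallest t with P(n <= t)^g >= alpha (assumed joint-confidence reading)."""
    t = 0
    while binom_cdf(t, r, 1 / g) ** g < alpha:
        t += 1
    return t

G = 17 ** 2                       # total number of grids
R = estimate_scatter_count(130, G)  # 130 grids have nonzero data density
T = scatter_threshold(R, G, 0.95)
print(R, T)
```

Under this assumed reading, the sketch yields the values R ≈ 172 and T = 5 given in this embodiment.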

Second, the main regions are selected from the regions. Because the optional scatter region judgment has been added, the regions not judged to be scatter regions are considered first when selecting the main regions. Considering all regions not judged to be scatter regions, the first step, bidirectional selection, is carried out: if, for a given reference point A (located at the lower left), the nearest center of a region not judged to be a scatter region is a (located at the lower left), and the reference point nearest to region center a is also A, then the region in which center a is located is marked as the main region belonging to reference point A. In this embodiment (fig. 4a), bidirectional selection picks out 3 main regions (the regions marked with dots in fig. 4b), which belong to reference point A at the lower left, reference point D at the lower right and reference point B at the upper left, respectively. Then the second step, unidirectional selection, is carried out: if no main region belonging to a reference point C (located at the upper right) has been found, and the region whose center c (located at the center) is nearest to that reference point among the regions not judged to be scatter regions is not yet a main region, then the region in which center c is located is marked as the main region belonging to reference point C. In this embodiment, no main region has been found for reference point C at the upper right, and the region not judged to be a scatter region that is nearest to this reference point, whose center c is located at (4105, 5013), is not a main region, so the region in which center c is located is marked as the main region belonging to reference point C at the upper right (the region located at the center is marked with a dot in fig. 4c). At this point, the main region belonging to each reference point in the optimal classification morphology has been picked out from all regions not judged to be scatter regions (fig. 1g), and the selection of main regions is complete.
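The two selection steps can be sketched in Python as follows. The coordinates are hypothetical toy values on a 0..16 grid map rather than the data of this embodiment, and the caller is assumed to pass only the centers of regions not judged to be scatter regions.

```python
import math

def pick_main_regions(refs, centers):
    """Map each reference point name to the name of its main region.

    refs, centers: dicts of name -> (x, y).
    """
    nearest_center = {r: min(centers, key=lambda c: math.dist(centers[c], p))
                      for r, p in refs.items()}
    nearest_ref = {c: min(refs, key=lambda r: math.dist(refs[r], p))
                   for c, p in centers.items()}
    main = {}
    # Step 1, bidirectional selection: reference point and region center pick each other.
    for r, c in nearest_center.items():
        if nearest_ref[c] == r:
            main[r] = c
    # Step 2, unidirectional selection: a reference point without a main region
    # takes its nearest region if that region is not already a main region.
    for r, c in nearest_center.items():
        if r not in main and c not in main.values():
            main[r] = c
    return main

# Hypothetical layout: three regions near corners and one near the middle.
refs = {"A": (0, 0), "B": (0, 16), "C": (16, 16), "D": (16, 0)}
centers = {"a": (2, 2), "b": (2, 14), "d": (14, 2), "c": (7, 7)}
main = pick_main_regions(refs, centers)
print(main)
```

In this toy layout, A, B and D are matched in the first step, and C obtains the central region c only in the second step, mirroring the order described above.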

Then, since 2 regions in the grid map have not been selected as main regions (the regions whose centers are marked with a diamond or a cross in fig. 5a), region merging is required. The center distances of adjacent regions (in units of the grid side length) are calculated pairwise, and the value of the distance function is set equal to the center distance (i.e., it increases with the center distance of the regions). Among all pairs of adjacent regions that include a region not selected as a main region, the pair shown in the circle of fig. 5b has the smallest distance function value, 4.91, so this pair is merged (fig. 5c); after this merging, the pair shown in the circle of fig. 5d has the smallest distance function value, 9.14, so this pair is merged (fig. 5e). After this merging, all regions are main regions (fig. 1h), and region merging is complete.
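The merging loop can be sketched in Python as follows. The region data and adjacency are hypothetical toy values. How the center of a merged region is recomputed is not stated in this text, so the count-weighted mean of the two centers used here is an assumption.

```python
import math

def merge_non_main(regions, adjacency):
    """Merge regions until every region is a main region.

    regions: id -> dict(center=(x, y), count=int, main=bool)
    adjacency: set of frozenset({i, j}) for adjacent regions
    """
    regions = {k: dict(v) for k, v in regions.items()}
    adjacency = set(adjacency)

    def dist(pair):
        i, j = pair
        return math.dist(regions[i]["center"], regions[j]["center"])

    while any(not r["main"] for r in regions.values()):
        # Adjacent pairs containing at least one region not selected as a main region.
        pairs = [tuple(sorted(p)) for p in adjacency
                 if any(not regions[k]["main"] for k in p)]
        if not pairs:
            break  # an isolated non-main region is left as-is in this sketch
        i, j = min(pairs, key=dist)
        keep, drop = (j, i) if regions[j]["main"] and not regions[i]["main"] else (i, j)
        a, b = regions[keep], regions.pop(drop)
        n = a["count"] + b["count"]
        # Assumption: merged center is the count-weighted mean of both centers.
        a["center"] = tuple((a["count"] * u + b["count"] * v) / n
                            for u, v in zip(a["center"], b["center"]))
        a["count"] = n
        a["main"] = a["main"] or b["main"]
        adjacency = {frozenset(keep if k == drop else k for k in p) for p in adjacency}
        adjacency = {p for p in adjacency if len(p) == 2}
    return regions

regions = {
    1: {"center": (2, 2), "count": 100, "main": True},
    2: {"center": (14, 2), "count": 80, "main": True},
    3: {"center": (4, 5), "count": 5, "main": False},
    4: {"center": (13, 6), "count": 3, "main": False},
}
adjacency = {frozenset(p) for p in [(1, 3), (3, 4), (2, 4), (1, 2)]}
merged = merge_non_main(regions, adjacency)
print(sorted(merged), [merged[k]["count"] for k in sorted(merged)])
```

A merged region that contains a main region remains a main region, so the loop ends once every remaining region is a main region.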

Finally, the micro-droplet data is classified. If a micro-droplet data member is located in an internal grid of a main region (numbered i), the region number i is set as the classification result of that data member. If a micro-droplet data member is located in a boundary grid, the micro-droplet data in each main region is assumed to follow a normal distribution; for each relevant adjacent region (numbered i), the covariance matrix of the micro-droplet data in the region (excluding the boundary) is calculated, and this covariance matrix, together with the region center, defines a bivariate normal distribution function p(x|i). The number of data points nᵢ in each region (excluding the boundary) divided by the total number of data points in all regions (excluding the boundary) is used as the prior probability, which gives the mixed normal distribution p(x) = Σᵢ nᵢ p(x|i) / Σᵢ nᵢ. The posterior probability of each such micro-droplet data member is then calculated as p(i|x) = nᵢ p(x|i) / (p(x) Σᵢ nᵢ), and the region number with the maximum posterior probability among all relevant adjacent regions is set as the classification result of that micro-droplet data member. The classification results of all micro-droplet data members can be used for plotting and outputting the micro-droplet data (fig. 1i).
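The posterior calculation for a data member in a boundary grid can be sketched in Python as follows. The region sizes, centers and covariance entries are hypothetical toy values, and the tuple layout (n, center, covariance) is an assumption made for illustration.

```python
import math

def normal_pdf(x, center, cov):
    """Bivariate normal density; cov = (sxx, sxy, syy) of a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - center[0], x[1] - center[1]
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

def posteriors(x, regions):
    """p(i|x) = n_i p(x|i) / sum_j n_j p(x|j) for each relevant adjacent region i.

    regions: id -> (n_i, center, cov), with n_i counted without boundary grids.
    """
    weighted = {i: n * normal_pdf(x, c, cov) for i, (n, c, cov) in regions.items()}
    total = sum(weighted.values())
    return {i: w / total for i, w in weighted.items()}

# Two hypothetical adjacent regions with equal spread and different sizes.
regions = {
    1: (1000, (2.0, 2.0), (1.0, 0.0, 1.0)),
    2: (100, (8.0, 2.0), (1.0, 0.0, 1.0)),
}
post = posteriors((5.0, 2.0), regions)  # a point midway between the two centers
label = max(post, key=post.get)
print(post, label)
```

At the midpoint the two likelihoods are equal, so the prior decides and the point is assigned to the larger region. The sketch does not guard against all densities underflowing to zero for points very far from every center.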

Example two: classification process of clearly classified micro-droplet data

This example illustrates the applicability of the method to clearly classified microdroplet data (the most common microdroplet data).

The clearly classified micro-droplet data (fig. 6a) is divided into grids in step 2 (fig. 6b), and the grid map is divided into 4 regions in step 3 (fig. 6c). Step 4 determines that the optimal classification morphological parameters are a checkerboard distribution of 2 rows and 2 columns (fig. 6d), whose number parameter is equal to the number of regions after step 3 is completed, so the micro-droplet data is classified directly according to the regions (fig. 6e).

Example three: classification process of micro-droplet data with density fluctuation

This example focuses on the role of region merging in processing micro-droplet data with density fluctuations.

The micro-droplet data with density fluctuation (fig. 7a) is divided into grids in step 2 (fig. 7b), and the grid map is divided into 6 regions in step 3 (fig. 7c). Step 4 determines that the optimal classification morphological parameters are a checkerboard distribution of 2 rows and 2 columns (fig. 7d). Since the number of regions after step 3 is completed (6) is greater than the number parameter of the optimal classification morphological parameters in step 4 (4), regions need to be merged. The regions are merged according to their separability values, leaving 4 regions after merging (fig. 7e), and the micro-droplet data is then classified according to the regions (fig. 7f).

Example four: classification process of micro-droplet data with scattered points

This example focuses on the role of scatter region judgment in processing micro-droplet data with scatter points.

The micro-droplet data with scatter points (fig. 8a) is divided into grids in step 2 (fig. 8b), and the grid map is divided into 9 regions in step 3 (fig. 8c). Step 4 determines that the optimal classification morphological parameters are a checkerboard distribution of 2 rows and 2 columns (fig. 8d). Since the number of regions after step 3 is completed (9) is greater than the number parameter of the optimal classification morphological parameters in step 4 (4), regions need to be merged and selected. First, obvious scatter points are observed at the upper left, upper right and lower right of the micro-droplet data (fig. 8a), so scatter region judgment is optionally added, and 5 scatter regions are found (fig. 8e). The main regions are then selected from the 4 regions not judged to be scatter regions, and all 4 of these regions are selected as main regions. Finally, the micro-droplet data is classified according to the regions (fig. 8f).

Example five: classification process of droplet data in incomplete classification form

This example focuses on the role of determining optimal classification morphology parameters in processing microdroplet data for non-complete classification morphology.

The micro-droplet data with an incomplete classification morphology (fig. 9a) is divided into grids in step 2 (fig. 9b), and the grid map is divided into 4 regions in step 3 (fig. 9c). Step 4 determines that the optimal classification morphological parameters are a checkerboard distribution of 1 row and 2 columns (fig. 9d). Since the number of regions after step 3 is completed (4) is greater than the number parameter of the optimal classification morphological parameters in step 4 (2), regions need to be merged and selected. The main regions are selected from the regions (fig. 9e), the regions not selected as main regions are merged (fig. 9f), and the micro-droplet data is classified according to the regions (fig. 9g).

Example six: classification process of microdroplet data with rare data classification

This example focuses on the role of selecting main regions in processing micro-droplet data with rare data classifications.

The micro-droplet data with a rare data classification (fig. 10a) is divided into grids in step 2 (fig. 10b), and the grid map is divided into 6 regions in step 3 (fig. 10c). Step 4 determines that the optimal classification morphological parameters are a checkerboard distribution of 2 rows and 2 columns (fig. 10d). Since the number of regions after step 3 is completed (6) is greater than the number parameter of the optimal classification morphological parameters in step 4 (4), regions need to be merged and selected. First, obvious scatter points are observed in the left-middle part of the micro-droplet data (fig. 10a), so scatter region judgment is optionally added, and 2 scatter regions are found (fig. 10e); however, the scatter region at the upper right corner contains the classification formed by the rare data members (fig. 10a). When the main regions are selected, the regions not judged to be scatter regions are considered first, and as a result only the main regions belonging to 3 reference points of the classification morphology, those at the lower left, lower right and upper left, are found (fig. 11b). Since no main region belonging to the reference point at the upper right is found, the main regions need to be re-selected from all regions, which yields 4 main regions (fig. 11c or fig. 10f), including the rare data classification at the upper right corner. The regions not selected as main regions are then merged (fig. 10g), and finally the micro-droplet data is classified according to the regions (fig. 10h).

In summary, the method for classifying microdroplet data according to the embodiments can accurately perform unsupervised and adaptive classification of various microdroplet data, including: clearly classified microdroplet data (example two), microdroplet data with density fluctuations (example three), microdroplet data with scatter (example four), microdroplet data with an incomplete classification morphology (example five), microdroplet data with rare data classification (example six), and general microdroplet data (example one), etc.

It is to be understood that the invention disclosed is not limited to the particular methodology, protocols, and materials described, as these may vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present invention which will be limited only by the appended claims.

Those skilled in the art will also recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

Claims

1. A method for classifying microdroplet data, the method comprising the steps of:

step 1: inputting microdroplet data and classification morphological parameters of microdroplets, wherein the microdroplet data consists of a plurality of data members, the dimensions of each data member are the same, or the dimensions of each data member are the same after transformation, and the classification morphological parameters are the quantity parameters and the relative position parameters of the microdroplets classified by the microdroplets;

step 2: dividing the micro-droplet data into grids, wherein all the grids form a grid map;

step 3: dividing the grid map into at least one region according to the data density difference of the micro-droplets in each grid, wherein the data density of the micro-droplets in each grid is the number of the members of the micro-droplet data in each grid or the number of the members of the micro-droplet data after being converted;

step 4: setting a reference point combination according to the classification morphological parameters, comparing the reference point combination with the center of each region, and determining the optimal classification morphological parameters, wherein the reference point combination meets the following conditions: firstly, the relative position between any two reference points is one relative position parameter in the classification morphological parameters; secondly, the number of the reference points in the reference point combination does not exceed the number parameter of the classification morphological parameters;

step 5: classifying the micro-droplet data according to the number of the areas, and if the number of the areas in the step 3 is not more than the number parameter of the optimal classification morphological parameter in the step 4, directly classifying the micro-droplet data according to the areas; and if the number of the regions in the step 3 is larger than the number parameter of the optimal classification morphological parameters in the step 4, selecting and/or combining the regions in the step 3 until the number of the selected and/or combined regions does not exceed the number parameter of the optimal classification morphological parameters in the step 4, and classifying the micro-droplet data according to the selected and/or combined regions.

2. The method according to claim 1, wherein the classification morphology parameters in step 1 are given row and column numbers of micro-droplets, classification morphology parameters on equally spaced checkerboard grid points, wherein the number of grid points is a number parameter of the classification morphology parameters, and the relative positions between the grid points are relative position parameters of the classification morphology parameters.

3. The method of claim 1, wherein the grid of step 2 is a checkerboard grid.

4. The method of claim 3, wherein said checkerboard grid is a checkerboard grid that satisfies the following condition: firstly, the side lengths of grids in each dimension of the chessboard grids are the same; second, the respective boundaries of the grid map are determined by the maxima and minima of the microdroplet data in each dimension; thirdly, the division number of each dimension of the chessboard grid is a given positive integer constant or is increased progressively with the number of different values projected on the dimension by the micro-droplet data.

5. The method of claim 4, wherein each dimension of said checkerboard grid is divided by a number that is a logarithm of the number of different values of said microdroplet data projected on that dimension, rounded up.

6. The method of claim 1, wherein the grid atlas is divided into at least one region by: if the grid map has the only peak grid, the grid map is divided into an area, otherwise, at least one boundary grid is searched, other grids in the grid map are internal grids of different areas, and the watershed algorithm ensures that: first, the non-adjacent peak grids are respectively the internal grids of different areas, and second, the sum of the data densities in the boundary grids is minimum, wherein the peak grid refers to a grid whose data density in one grid is not less than that in any one adjacent grid.

7. The method according to claim 1, wherein the optimal classification morphological parameter determination method of step 4 is: and sequentially inspecting the reference point combinations, calculating the distance from the center of each area to the nearest reference point, weighting according to the number of data members in each area, summing, and taking the reference point combination with the minimum nearest distance sum as an optimal classification morphological parameter, wherein the number parameter is the number of the reference points in the selected reference point combination, and the relative position parameter is the relative position in the selected reference point combination.

8. The method of claim 1, wherein the combining of step 5 is performed by: calculating separability values between adjacent areas, and combining the areas with the separability values lower than a separability threshold; the separability value is calculated by a separability function which is a function that varies monotonically with both the data density in the adjacent regions and the data density at the boundary between the adjacent regions.

9. The method of claim 8, wherein the separability function is a function that varies monotonically with the data density within the adjacent region and varies monotonically with the data density at the boundary between adjacent regions.

10. The method of claim 9, wherein the separability function is:

wherein the function value is the separability value, d₁₂ is the maximum data density in the boundary grids between two adjacent regions, and d₁ and d₂ are the maximum data densities of the grids within the two adjacent regions, respectively; and wherein the separability threshold is any value within the value range of the separability function.

11. The method of claim 10, wherein the separability threshold is an average of a maximum and a minimum of the separability function.

12. The method of claim 1, wherein the selecting method of step 5 is: the first step is bidirectional selection: finding the center of the area with the center closest to each reference point in the optimal classification morphological parameters, and if the reference points closest to the center of the area are also the reference points, respectively taking the area as the area affiliated to the reference points, wherein the area is called as a main area; the second step is unidirectional selection: for each reference point for which no primary area is found, if the area having the center closest to the reference point is not determined as a primary area, the area is taken as a primary area belonging to the reference point, each primary area belonging to a different classification, respectively.

13. The method of claim 11, wherein the primary zone selection method is: if the number of the areas after the step 3 is finished is larger than the number parameter of the optimal classification morphological parameters in the step 4, increasing the judgment of a scattered area before selecting a main area, wherein the scattered area is defined as: a region in which, under the given confidence, the data density of all grids is lower than the scatter threshold, and the confidence is a real number which is greater than or equal to 0 and less than 1.

14. The method of claim 13, wherein said confidence level is 0.95; the scatter threshold is the largest positive integer meeting the following conditions: the probability that the data density generated by the determined number of scattered points in any grid according to the given probability distribution model is not higher than the positive integer is not less than the confidence coefficient, wherein the number of the scattered points is determined according to the given probability distribution model and the grid number proportion of the grid spectrum with the data density being higher than 0.

15. The method of claim 14, wherein the probability distribution model is a uniform distribution.

16. The method of claim 13, wherein the scattered region is judged, and different main region selection methods are performed according to the judgment result of the scattered region: and if the number of the regions which are not judged as scattered regions is larger than the number parameter of the optimal classification form parameter, selecting the main region from the regions which are not judged as scattered regions, and otherwise, selecting the main region from all the regions.

17. The method of claim 16, wherein the primary zone selection method is: and if the main area belonging to each reference point in the optimal classification form cannot be selected from all the areas which are not judged as scattered areas, re-selecting the main area from all the areas.

18. The method of claim 12, wherein the merging method of step 5 is: if the area which is not selected as the main area exists, calculating the center distance of the adjacent areas pairwise, constructing a distance function, wherein the distance function is a function which is monotonically increased or decreased along with the center distance, then searching each pair of adjacent areas containing at least one area which is not selected as the main area, merging the pair of areas with the minimum or maximum center distance, if the pair of adjacent areas comprises the main area, the merged area is the main area, otherwise, the merged area is not the main area, and finally repeating the merging process until all the areas are the main areas.

19. The method of claim 18, wherein the distance function is equal to the center distance.

20. The method of claim 18, wherein the combining of step 5 is performed by: placing microdroplet data into a classification representing a region if a microdroplet data member is located in an internal grid of the region; if a member of the microdroplet data is located in the boundary grid, calculating the probability that the member of the microdroplet data belongs to each relevant adjacent area according to the probability distribution of the microdroplet data in the adjacent area relevant to the boundary grid, and placing the member of the microdroplet data in the classification represented by the area with the maximum probability.

21. The method of claim 20, wherein the probability distribution of microdroplet data within the region is a normal distribution.