CN110197193A - A kind of automatic grouping method of multi-parameter stream data - Google Patents
- Publication number
- CN110197193A
- Authority
- CN
- China
- Prior art keywords
- parameter
- data
- matrix
- stream data
- iteration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Public Health (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Pathology (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Epidemiology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
An automatic grouping method for multi-parameter flow cytometry data, comprising the following steps. Step 1, grouping the multi-parameter data based on t-SNE: input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the perplexity, a parameter of the loss function; initialize the sample matrix X, compute the pairwise distances between its rows, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity; enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; iterate the optimization, updating the low-dimensional data; the matrix obtained after the iteration is the principal-component parameter matrix after dimensionality reduction. Step 2, clustering the principal-component parameter matrix with the K-means algorithm: randomly select K cluster centroids as the initial centroids.
Description
Technical field
The present invention relates to a method for detecting human peripheral blood cells by flow cytometry and rapidly and automatically grouping the multi-parameter blood cell data, and belongs to the field of biomedicine.
Technical background
Flow cytometry is a technology that can rapidly analyze or sort cells or other particles in suspension on multiple parameters. With the development of medicine, disease diagnosis has become increasingly detailed and the number of parameters a flow cytometer can detect has multiplied, so fast and accurate analysis of multi-parameter flow cytometry data is key to improving the efficiency of clinical diagnosis. A flow cytometer consists of four main parts: the optical system, the flow chamber and fluidics drive system, the photoelectric detection system, and the signal processing system. Part of the work of the signal processing system is to analyze large volumes of multicolor, multi-parameter flow cytometry data, which is difficult. The traditional way to analyze multicolor flow cytometry data is manual gating in dedicated software, based on the scattered-light or fluorescence characteristics of the cells: the analyst chooses, from experience, two fluorescence signal feature parameters as the horizontal and vertical axes, draws a two-dimensional scatter plot, and delimits the region of the target cell type in the plot for analysis. As the number of cell parameters grows, traditional manual gating can no longer meet the needs of clinical testing. Its main problems are as follows:
(1) Manual gating lacks objectivity. An expert chooses two of the many fluorescence features from experience to draw the scatter plot, and both the gates drawn and the judgment of the cell populations vary from person to person; there is no quantitative standard.
(2) The analysis results are poorly reproducible. There is no standardized, uniform way of drawing manual gates for different data.
(3) Operators need a specialist background. Flow cytometry data is analyzed in dedicated flow cytometer software, which requires medical knowledge that ordinary users do not have, and this limits its use.
(4) Differences between features in multidimensional data cannot be identified accurately. The analysis can display only two features at a time and look for differences between them, whereas the features of multicolor, multi-parameter, high-dimensional flow cytometry data are fully apparent only in the multidimensional space.
(5) The process is cumbersome and inefficient, and wastes considerable resources. Manual analysis consumes labor and time, and its results are often unreliable.
To address the shortcomings of manual gating, some experts and scholars have explored automatic analysis methods for flow cytometry data, but most of these study automatic clustering of cells and few take the distribution of the cell populations into account. For example, the K-means algorithm, the earliest method used for automatic grouping of flow cytometry data, partitions the sample data by computing the Euclidean distances between sample points to achieve clustering; Sugar and Sealfon proposed an unsupervised density contour clustering algorithm based on percolation theory, which finds the data peaks by constructing a histogram of the sample data and achieves fast cluster analysis of cell populations of various shapes in flow cytometry data; Qian et al. proposed a population identification algorithm based on grid-based partitioning and merging, which identifies randomly distributed cell populations in two-dimensional data from the density features of the data; Aghae et al. proposed a method based on hierarchical clustering; and there are also Gaussian mixture models, among others.
Summary of the invention
For cell populations whose distributions are asymmetric and tailed, the present invention proposes an automatic grouping method for multi-parameter flow cytometry data based on t-distributed stochastic neighbor embedding (t-SNE), a manifold-learning algorithm.
The object of the invention is achieved by the following technical solution.
An automatic grouping method for multi-parameter flow cytometry data, characterized by comprising the following steps:
Step 1: grouping the multi-parameter data based on t-SNE:
Input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the perplexity, a parameter of the loss function;
Initialize the sample matrix X, compute the pairwise distances between its rows, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity;
Set the joint probability distribution $p_{ij} = (p_{i|j} + p_{j|i}) / (2n)$, where n is the number of samples, and randomly initialize the low-dimensional data;
Enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; iterate the optimization, updating the low-dimensional data; the matrix obtained when the iteration ends is the principal-component parameter matrix after dimensionality reduction;
Step 2: clustering the principal-component parameter matrix with the K-means algorithm:
Randomly select K cluster centroids as the initial centroids;
For each sample, determine the class it should belong to: compute its distance to each of the K centroids and take the nearest class as the class of the sample. For each class, recompute its centroid. Clustering ends when the iteration limit is reached or the centroids no longer change, or change only very slightly. Repeat this process until convergence to obtain the class labels.
In a further improvement, the dimension d is 2 or 3.
In a further improvement, the perplexity is 30.
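The iteration loop of Step 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it uses the Student-t kernel and gradient of the standard t-SNE formulation, plain gradient descent without momentum or early exaggeration, and the function names `low_dim_affinities`, `kl_divergence`, `tsne_gradient` and `optimise` are hypothetical. `P` is assumed to be a symmetric joint-probability matrix with a zero diagonal that sums to 1.

```python
import math
import random


def low_dim_affinities(Y):
    """Return (Q, W): joint probabilities q_ij and the unnormalised
    Student-t kernel w_ij = 1 / (1 + ||y_i - y_j||^2) for embedding Y."""
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            W[i][j] = W[j][i] = 1.0 / (1.0 + d2)
    total = sum(map(sum, W))
    Q = [[w / total for w in row] for row in W]
    return Q, W


def kl_divergence(P, Y):
    """Cost KL(P || Q) between the high- and low-dimensional distributions."""
    Q, _ = low_dim_affinities(Y)
    n = len(Y)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0.0)


def tsne_gradient(P, Y):
    """Gradient of the cost: dC/dy_i = 4 * sum_j (p_ij - q_ij) w_ij (y_i - y_j)."""
    Q, W = low_dim_affinities(Y)
    n, d = len(Y), len(Y[0])
    grad = [[0.0] * d for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
            for k in range(d):
                grad[i][k] += coeff * (Y[i][k] - Y[j][k])
    return grad


def optimise(P, d=2, n_iter=200, learning_rate=1.0, seed=0):
    """Randomly initialise the low-dimensional data, then take gradient steps."""
    rng = random.Random(seed)
    Y = [[rng.gauss(0.0, 1e-2) for _ in range(d)] for _ in range(len(P))]
    for _ in range(n_iter):
        grad = tsne_gradient(P, Y)
        Y = [[y - learning_rate * g for y, g in zip(yi, gi)]
             for yi, gi in zip(Y, grad)]
    return Y
```

The rows of the list returned by `optimise(P, d=2)` play the role of the dimension-reduced parameter matrix that Step 2 clusters. The learning rate and iteration count here are illustrative defaults, not values taken from the patent.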
With the analysis method of the invention, the results are clearer and more accurate and the operation is simple; the method outperforms manual gating and other computational methods.
Description of the drawings
Fig. 1: Flow chart of t-SNE processing of multi-parameter flow cytometry data;
Fig. 2: Cell clustering with the K-means algorithm;
Fig. 3: Fitted distributions of the four cell populations;
Fig. 4: Cell grouping strategy and expert analysis results;
Fig. 5: Relationship between the t-SNE target dimension and the clustering index;
Fig. 6: Automatic grouping results obtained with the method of the invention;
Fig. 7: Comparison of grouping accuracy for the four cell classes using KPCA and t-SNE;
Fig. 8: Evaluation index results for the method of the invention.
Specific embodiment
For cell populations whose distributions are asymmetric and tailed, the present invention proposes an automatic grouping method for multi-parameter flow cytometry data based on t-distributed stochastic neighbor embedding (t-SNE), a manifold-learning algorithm. t-SNE offers good performance, low computational complexity and good visualization, and has already been applied to image detection, fault monitoring, voice recognition and other areas. Reducing the dimensionality of the raw data with t-SNE retains more of the characteristic information in the data; the feature principal components with the highest contribution are extracted, and the data of the first two or three principal components are used as coordinate axes to draw a visual scatter plot. For automatic clustering, the K-means algorithm is combined with the dimensionality-reduction algorithm to cluster the sample data automatically.
t-SNE assumes that the sample data lie on a statistical manifold and maps the sample points to probability distributions, making the two distributions, one in the high-dimensional space and one in the low-dimensional space, as similar as possible. In the low-dimensional space, the similarity between two points is expressed with a t-distribution instead of a Gaussian distribution. In the experimental samples, the monocyte population and the debris-and-impurity population are asymmetrically distributed and tailed; a representation based on the t-distribution is less affected by outliers and characterizes the overall features of the data better, which is exactly what cell grouping requires.
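For reference, the similarities described above are usually written as follows in the standard t-SNE formulation. The patent text does not spell these formulas out, so the symbols $x_i$ (a high-dimensional sample), $y_i$ (its low-dimensional image) and $\sigma_i$ (the per-sample Gaussian bandwidth set by the perplexity) are assumptions taken from that standard formulation; $p_{ij}$ is the symmetrized joint probability built from $p_{j|i}$:

```latex
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The kernel $(1 + \lVert y_i - y_j \rVert^2)^{-1}$ is a Student t-distribution with one degree of freedom. It decays polynomially rather than exponentially, so points that are far apart still receive a non-negligible similarity, which is the heavy-tail property the paragraph above relies on.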
The main steps of grouping multi-parameter data based on t-SNE are as follows:
Step 1: Input the multi-parameter flow cytometry data to be reduced, and set the target dimension d (the number of principal components to keep) and the loss-function parameter perplexity = 30 (the default value);
Step 2: Initialize the sample matrix X, compute the pairwise distances between its rows, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity;
Step 3: Set the joint probability distribution $p_{ij} = (p_{i|j} + p_{j|i}) / (2n)$, where n is the number of samples, and randomly initialize the low-dimensional data;
Step 4: Start the optimization and enter the iteration loop:
Compute the joint probabilities in the low-dimensional space and compute the gradient;
Iterate the optimization, updating the low-dimensional data; the matrix obtained after the iteration is the new principal-component parameter matrix after dimensionality reduction.
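Steps 2 and 3 above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patent's code: it assumes the Gaussian conditional probabilities of the standard t-SNE formulation, finds each sample's kernel precision by bisection so that the row's perplexity matches the target, and uses the hypothetical names `squared_distances`, `conditional_probabilities` and `joint_probabilities`.

```python
import math


def squared_distances(points):
    """Pairwise squared Euclidean distances between the rows of the sample matrix."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def conditional_probabilities(sq_dists, i, perplexity, tol=1e-5, max_iter=100):
    """Return p_{j|i} for sample i. The precision beta = 1 / (2 * sigma_i^2)
    is found by bisection so that exp(entropy of the row) equals the perplexity."""
    n = len(sq_dists)
    target_entropy = math.log(perplexity)
    # Shift by the nearest-neighbour distance so the largest weight is exactly 1.
    d_min = min(sq_dists[i][j] for j in range(n) if j != i)
    beta, beta_lo, beta_hi = 1.0, 0.0, math.inf
    probs = []
    for _ in range(max_iter):
        weights = [0.0 if j == i else math.exp(-beta * (sq_dists[i][j] - d_min))
                   for j in range(n)]
        total = sum(weights)
        probs = [w / total for w in weights]
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:
            # Row is too flat: increase beta (narrower kernel).
            beta_lo = beta
            beta = beta * 2.0 if beta_hi == math.inf else (beta + beta_hi) / 2.0
        else:
            # Row is too peaked: decrease beta (wider kernel).
            beta_hi = beta
            beta = (beta + beta_lo) / 2.0
    return probs


def joint_probabilities(points, perplexity):
    """Symmetrised joint distribution p_ij = (p_{i|j} + p_{j|i}) / (2n)."""
    n = len(points)
    sq = squared_distances(points)
    cond = [conditional_probabilities(sq, i, perplexity) for i in range(n)]
    return [[(cond[i][j] + cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]
```

The perplexity must be smaller than n - 1 for the bisection to have a solution, so the value of 30 used in the patent presupposes a sample with more than 31 cells; a toy data set needs a correspondingly smaller value.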
The dimension-reduced cell data are clustered with the K-means algorithm, which uses the Euclidean distance to measure the similarity between a sample and each cluster. The algorithm is as follows:
Step 1: With K the number of clusters, randomly select K cluster centroids as the initial centroids;
Step 2: Repeat the following until convergence to obtain the class labels:
1. For each sample, determine the class it should belong to: compute its distance to each of the K centroids and take the nearest class as the sample's new class.
2. For each class, recompute its centroid. Clustering ends when the centroids no longer change, or change only very slightly, or when the iteration limit is reached.
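The two K-means steps can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: the function name `kmeans`, the tolerance `tol`, the `seed` argument and the choice of drawing the initial centroids from the data points themselves are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, tol=1e-12, seed=0):
    """Cluster `points` into k classes; return (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: randomly pick k samples as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2.1: assign every sample to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Step 2.2: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                # An empty class keeps its previous centroid.
                new_centroids.append(centroids[c])
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        # Stop when the centroids no longer move (or barely move).
        if shift <= tol:
            break
    return labels, centroids
```

For example, `kmeans([[0.0, 0.0], [1.0, 0.5], [2.0, 0.0], [20.0, 0.0], [21.0, 0.5], [22.0, 0.0]], k=2)` separates the first three points from the last three. Which of the two groups receives label 0 depends on the random initialization.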
The experimental data were measured on the laboratory's FACSCalibur flow cytometer from the US company BD (Becton, Dickinson and Company). The human peripheral blood cells comprise lymphocytes, neutrophils, monocytes, and cell debris with impurities. The surface molecules CD3, CD19, CD56 and CD5 were labeled with fluorescein isothiocyanate (FITC), phycoerythrin (PE), allophycocyanin (APC) and peridinin-chlorophyll-protein complex (PerCP), respectively, and the experimental sample contains 3800 cells. The flow cytometry data comprise 14 parameters: the pulse height (H), pulse area (A) and pulse width (W) of the forward-scattered light, the side-scattered light and the four fluorescence colors. The distributions of the four cell classes are first fitted on the basis of statistical theory, and the algorithm is then used for the grouping experiment to obtain the grouping results.
Claims (3)
1. An automatic grouping method for multi-parameter flow cytometry data, characterized by comprising the following steps:
Step 1: grouping the multi-parameter data based on t-SNE:
Input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the perplexity, a parameter of the loss function;
Initialize the sample matrix X, compute the pairwise distances between its rows, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity;
Set the joint probability distribution $p_{ij} = (p_{i|j} + p_{j|i}) / (2n)$ and randomly initialize the low-dimensional data;
Enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; iterate the optimization, updating the low-dimensional data; the matrix obtained when the iteration ends is the principal-component parameter matrix after dimensionality reduction;
Step 2: clustering the principal-component parameter matrix with the K-means algorithm:
Randomly select K cluster centroids as the initial centroids;
For each sample, determine the class it should belong to: compute its distance to each of the K centroids and take the nearest class as the class of the sample. For each class, recompute its centroid. Clustering ends when the iteration limit is reached or the centroids no longer change, or change only very slightly. Repeat this process until convergence to obtain the class labels.
2. The automatic grouping method for multi-parameter flow cytometry data according to claim 1, characterized in that the dimension d is 2 or 3.
3. The automatic grouping method for multi-parameter flow cytometry data according to claim 1, characterized in that the perplexity is 30.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910204433.1A CN110197193A (en) | 2019-03-18 | 2019-03-18 | A kind of automatic grouping method of multi-parameter stream data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910204433.1A CN110197193A (en) | 2019-03-18 | 2019-03-18 | A kind of automatic grouping method of multi-parameter stream data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110197193A (en) | 2019-09-03 |
Family
ID=67751769
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910204433.1A Pending CN110197193A (en) | 2019-03-18 | 2019-03-18 | A kind of automatic grouping method of multi-parameter stream data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110197193A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110610212A (en) * | 2019-09-20 | 2019-12-24 | 云南电网有限责任公司电力科学研究院 | Fault classification method and fault classification device for transformer of power distribution network |
CN113188981A (en) * | 2021-04-30 | 2021-07-30 | 天津深析智能科技发展有限公司 | Automatic analysis method of multi-factor cytokine |
CN114545167A (en) * | 2022-02-23 | 2022-05-27 | 四川大学 | Cable terminal partial discharge pulse classification method based on t-SNE algorithm |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106096066A (en) * | 2016-08-17 | 2016-11-09 | 盐城工学院 | The Text Clustering Method embedded based on random neighbor |
CN106548205A (en) * | 2016-10-21 | 2017-03-29 | 北京信息科技大学 | A kind of fast automatic point of group of flow cytometry data and circle door method |
CN106548204A (en) * | 2016-11-01 | 2017-03-29 | 北京信息科技大学 | The fast automatic grouping method of Flow cytometry data |
US20180372726A1 (en) * | 2017-05-16 | 2018-12-27 | The Chinese University Of Hong Kong | Integrative single-cell and cell-free plasma rna analysis |
-
2019
- 2019-03-18 CN CN201910204433.1A patent/CN110197193A/en active Pending
Non-Patent Citations (1)
Title |
---|
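Meng Xiaochen et al.: "Automatic grouping method for flow cytometry data based on the t-distributed stochastic neighbor embedding algorithm", Journal of Biomedical Engineering * |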
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106248559B (en) | A kind of five sorting technique of leucocyte based on deep learning | |
US6944338B2 (en) | System for identifying clusters in scatter plots using smoothed polygons with optimal boundaries | |
CN110197193A (en) | A kind of automatic grouping method of multi-parameter stream data | |
CN104200114B (en) | Flow cytometry data rapid analysis method | |
CN108961208A (en) | A kind of aggregation leucocyte segmentation number system and method | |
Junos et al. | An optimized YOLO‐based object detection model for crop harvesting system | |
US20240044904A1 (en) | System, method, and article for detecting abnormal cells using multi-dimensional analysis | |
JP4521490B2 (en) | Similar pattern search device, similar pattern search method, similar pattern search program, and fraction separation device | |
CN108052886B (en) | A kind of puccinia striiformis uredospore programming count method of counting | |
CA2969912A1 (en) | Automated flow cytometry analysis method and system | |
US20170322137A1 (en) | Method and system for characterizing particles using a flow cytometer | |
CN108509982A (en) | A method of the uneven medical data of two classification of processing | |
CN101981446A (en) | Method and system for analysis of flow cytometry data using support vector machines | |
US9183237B2 (en) | Methods and apparatus related to gate boundaries within a data space | |
CN112347894B (en) | Single plant vegetation extraction method based on transfer learning and Gaussian mixture model separation | |
CN110059568A (en) | Multiclass leucocyte automatic identifying method based on deep layer convolutional neural networks | |
CN112017743B (en) | Automatic generation platform and application of disease risk evaluation report | |
Moraes et al. | A decision-tree approach for the differential diagnosis of chronic lymphoid leukemias and peripheral B-cell lymphomas | |
CN107389536A (en) | Fluidic cell particle classifying method of counting based on density distance center algorithm | |
CN113316713A (en) | Adaptive sorting for particle analyzers | |
CN108257124A (en) | A kind of white blood cell count(WBC) method and system based on image | |
CN106548203A (en) | A kind of fast automatic point of group of multiparameter flow cytometry data and gating method | |
CN108038352A (en) | Combination difference analysis and the method for association rule mining full-length genome key gene | |
CN110163869A (en) | A kind of image repeat element dividing method, smart machine and storage medium | |
CN112348360A (en) | Chinese medicine production process parameter analysis system based on big data technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190903 |