CN110197193A - Automatic grouping method for multi-parameter flow cytometry data - Google Patents

Automatic grouping method for multi-parameter flow cytometry data

Info

Publication number
CN110197193A
Authority
CN
China
Prior art keywords
parameter
data
matrix
stream data
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910204433.1A
Other languages
Chinese (zh)
Inventor
孟晓辰
祝连庆
娄小平
董明利
于明鑫
刘锋
宋言明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University
Priority to CN201910204433.1A
Publication of CN110197193A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

An automatic grouping method for multi-parameter flow cytometry data, comprising the following steps. Step 1, grouping of the multi-parameter data based on t-SNE: input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the perplexity parameter of the loss function; initialize the sample matrix X, compute the corresponding pairwise distance matrix, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity; enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; optimize iteratively, updating the low-dimensional data; the matrix obtained after iteration is the principal component parameter matrix after dimensionality reduction. Step 2, cluster the principal component parameter matrix with the K-means algorithm: randomly select K cluster centroids as the initial centroids.

Description

Automatic grouping method for multi-parameter flow cytometry data
Technical field
The present invention relates to a method for measuring human peripheral blood cells by flow cytometry and rapidly and automatically grouping the resulting multi-parameter blood cell data, and belongs to the field of biomedicine.
Technical background
Flow cytometry is a technique for rapid multi-parameter analysis or sorting of cells or other particles in suspension. As medicine advances and disease diagnosis becomes more detailed, the number of parameters a flow cytometer can detect has multiplied, and fast, accurate analysis of multi-parameter flow cytometry data is key to improving the efficiency of clinical diagnosis. A flow cytometer consists of four main parts: the optical system, the flow chamber and fluidics drive system, the photoelectric detection system, and the signal processing system. Part of the signal processing system's work is analyzing large volumes of multicolor, multi-parameter flow cytometry data, which is difficult. The traditional approach is manual gating in dedicated software, based on the light scattering or fluorescence characteristics of the cells: the analyst chooses, from experience, two fluorescence signal parameters as the horizontal and vertical axes, draws a two-dimensional scatter plot, and delimits the region of the target cell type in the plot for analysis. As the number of cell parameters grows, traditional manual gating can no longer meet clinical testing needs. Its main problems are:
(1) Manual gating lacks objectivity. Experts choose two of the many fluorescence features by experience to draw a scatter plot, and both the gates drawn and the judgment of cell populations vary from person to person, with no quantitative standard.
(2) Results are poorly reproducible. There is no standardized, unified way of drawing manual gates across different data.
(3) Operators need a specialist background. Flow cytometry analysis software is dedicated to the instrument and assumes medical knowledge that many users lack, which limits its use.
(4) Differences in multidimensional data cannot be identified accurately. The analysis can display only two-dimensional features and look for differences there, whereas the features of multicolor, multi-parameter, high-dimensional flow cytometry data can only be shown fully in a multidimensional space.
(5) The process is tedious, inefficient, and wasteful. Manual analysis consumes labor and time, and its results are often unreliable.
To address the shortcomings of manual gating, researchers have explored automatic analysis methods for flow cytometry data, but most of this work studies automatic cell clustering and seldom considers the distribution of the cell populations. For example, K-means, the earliest algorithm used for automatic grouping of flow cytometry data, partitions the sample data by the Euclidean distance between sample points; Sugar and Sealfon proposed an unsupervised density contour clustering algorithm based on percolation theory, which finds data peaks by building histograms of the sample data and quickly clusters cell populations of various shapes; Qian et al. proposed a population identification algorithm based on grid-based partitioning and merging, which identifies randomly distributed cell populations in two-dimensional data from the density characteristics of the data; Aghae et al. proposed a method based on hierarchical clustering; Gaussian mixture models have also been used.
Summary of the invention
To handle cell populations whose distributions are asymmetric and heavy-tailed, the present invention proposes an automatic grouping method for multi-parameter flow cytometry data based on a manifold learning algorithm, t-distributed stochastic neighbor embedding (t-SNE).
The purpose of this patent is achieved through the following technical solution.
An automatic grouping method for multi-parameter flow cytometry data, characterized by comprising the following steps:
Step 1: grouping of the multi-parameter data based on t-SNE:
Input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the perplexity parameter of the loss function;
Initialize the sample matrix X, compute the corresponding pairwise distance matrix, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity;
Let the joint probability distribution be $p_{ij} = (p_{j|i} + p_{i|j}) / (2n)$, and randomly initialize the low-dimensional data;
Enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; optimize iteratively, updating the low-dimensional data; the matrix obtained after iteration is the principal component parameter matrix after dimensionality reduction;
Step 2: cluster the principal component parameter matrix with the K-means algorithm:
Randomly select K cluster centroids as the initial centroids;
For each sample, determine the class it belongs to: compute its distance to each of the K centroids and take the nearest class as the sample's class. For each class, recompute the centroid. Repeat this process until convergence, that is, until the cluster centroids no longer change (or change very little) or the iteration limit is reached; the resulting assignments are the classification labels.
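The probability computation in Step 1 can be sketched as follows. This is a minimal pure-Python illustration, assuming the usual t-SNE construction in which each point's Gaussian precision is found by binary search so that its conditional distribution reaches the fixed perplexity. The function names, the tolerance, and the step limit are hypothetical and are not taken from the patent.

```python
import math

def squared_distances(X):
    """Pairwise squared Euclidean distances between the rows of X."""
    n = len(X)
    return [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]

def conditional_row(dist_row, i, beta):
    """p_{j|i} for point i, with Gaussian precision beta = 1 / (2 * sigma_i^2)."""
    # Shift by the smallest off-diagonal distance so exp() cannot underflow to all zeros.
    d_min = min(d for j, d in enumerate(dist_row) if j != i)
    w = [0.0 if j == i else math.exp(-beta * (d - d_min))
         for j, d in enumerate(dist_row)]
    total = sum(w)
    return [v / total for v in w]

def row_perplexity(p):
    """Perplexity 2^H of one probability row, H being its Shannon entropy in bits."""
    entropy = -sum(v * math.log2(v) for v in p if v > 0.0)
    return 2.0 ** entropy

def conditional_probabilities(X, perplexity=30.0, tol=1e-5, max_steps=100):
    """Binary-search beta for each point so that every row reaches the target perplexity."""
    P = []
    for i, row in enumerate(squared_distances(X)):
        lo, hi, beta = 0.0, math.inf, 1.0
        for _ in range(max_steps):
            p = conditional_row(row, i, beta)
            diff = row_perplexity(p) - perplexity
            if abs(diff) < tol:
                break
            if diff > 0:   # distribution too flat: narrow the Gaussian
                lo = beta
                beta = beta * 2.0 if hi == math.inf else (beta + hi) / 2.0
            else:          # distribution too peaked: widen the Gaussian
                hi = beta
                beta = (lo + beta) / 2.0
        P.append(p)
    return P

def joint_probabilities(P_cond):
    """Symmetrize the conditionals: p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = len(P_cond)
    return [[(P_cond[i][j] + P_cond[j][i]) / (2.0 * n) for j in range(n)]
            for i in range(n)]
```

A row's perplexity can never exceed n - 1, so the search only converges when the target is below that bound; with a perplexity of 30 the sample needs more than 31 points.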
In a further refinement, the dimension d is 2 or 3.
In a further refinement, the perplexity is 30.
With the analysis method of the present invention, the results are clearer and more accurate and the operation is simple; the method outperforms manual gating and other computational methods.
Brief description of the drawings
Fig. 1: Flow chart of t-SNE processing of multi-parameter flow cytometry data;
Fig. 2: Cell clustering with the K-means algorithm;
Fig. 3: Fitted distributions of the four cell populations;
Fig. 4: Cell grouping strategy and expert analysis results;
Fig. 5: Relationship between the t-SNE reduced dimension and the clustering index;
Fig. 6: Automatic grouping results obtained with the method of the invention;
Fig. 7: Comparison of grouping accuracy for the four cell classes using KPCA and t-SNE;
Fig. 8: Evaluation index results for the method of the invention.
Specific embodiment
To handle cell populations whose distributions are asymmetric and heavy-tailed, the present invention proposes an automatic grouping method for multi-parameter flow cytometry data based on a manifold learning algorithm, t-distributed stochastic neighbor embedding (t-SNE). The t-SNE algorithm performs well, has low computational complexity, and gives good visualizations, and it has already been applied to image detection, fault monitoring, voice recognition, and other areas. Reducing the dimension of the raw data with t-SNE retains more of the data's feature information and extracts the feature principal components with the highest contribution; the first two or three components are chosen as coordinate axes to draw a visual scatter plot. For automatic clustering, the K-means algorithm is combined with the dimension reduction algorithm to cluster the sample data automatically.
The t-SNE algorithm assumes that the sample data lie on a statistical manifold and maps the sample points to probability distributions, making the two distributions, one in the high-dimensional space and one in the low-dimensional space, as similar as possible. In the low-dimensional space, the similarity between two points is expressed with a t-distribution in place of a Gaussian distribution. The experimental sample contains monocytes as well as broken cells and impurities, whose populations are asymmetrically distributed with tails; a t-distribution is less affected by outliers and characterizes the overall features of the data better, which suits the needs of cell grouping.
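For reference, the quantities named in this paragraph can be written out as below. The patent does not state these formulas; the block assumes the standard t-SNE formulation, with $x_i$ the high-dimensional samples, $y_i$ their low-dimensional images, $\sigma_i$ the per-point Gaussian bandwidth fixed by the perplexity, and $n$ the number of samples.

```latex
% High-dimensional similarities (Gaussian kernel) and their symmetrization
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Perplexity of the conditional distribution P_i, used to choose \sigma_i
\mathrm{Perp}(P_i) = 2^{H(P_i)},
\qquad
H(P_i) = -\sum_{j} p_{j|i} \log_2 p_{j|i}

% Low-dimensional similarities (Student t kernel with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Loss function (Kullback-Leibler divergence) and its gradient
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
\frac{\partial C}{\partial y_i}
  = 4 \sum_{j} (p_{ij} - q_{ij}) (y_i - y_j) \left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
```

The heavier tails of the Student t kernel let moderately distant points sit far apart in the embedding, which is the property the paragraph relies on for tailed, asymmetric populations.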
The main steps of t-SNE-based grouping of the multi-parameter data are as follows:
Step 1: input the multi-parameter flow cytometry data to be reduced, and set the target dimension d (the number of principal components to keep) and the loss function's perplexity parameter, perplexity = 30 (the default value);
Step 2: initialize the sample matrix X, compute the corresponding pairwise distance matrix, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity;
Step 3: let the joint probability distribution be $p_{ij} = (p_{j|i} + p_{i|j}) / (2n)$, and randomly initialize the low-dimensional data;
Step 4: start the optimization and enter the iteration loop:
Compute the joint probabilities in the low-dimensional space and compute the gradient;
Optimize iteratively, updating the low-dimensional data; the matrix obtained after iteration is the new principal component parameter matrix after dimensionality reduction.
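The iteration loop in Step 4 can be sketched as below. This is a minimal pure-Python illustration under stated assumptions: it takes the joint probability matrix of Step 3 as input, uses the standard t-SNE gradient with a Student t kernel in the low-dimensional space, and applies plain gradient descent, since the patent does not specify the optimizer. `tsne_embed` and its default learning rate and iteration count are hypothetical and tuned only for toy inputs.

```python
import math
import random

def low_dim_affinities(Y):
    """Return the kernel values w_ij = 1 / (1 + ||y_i - y_j||^2) and the normalized q_ij."""
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            W[i][j] = W[j][i] = 1.0 / (1.0 + d2)
    total = sum(map(sum, W))
    # Clamp off-diagonal entries away from zero so the logarithm in the cost stays finite.
    Q = [[0.0 if i == j else max(W[i][j] / total, 1e-12) for j in range(n)]
         for i in range(n)]
    return W, Q

def kl_divergence(P, Q):
    """Cost C = sum over i != j of p_ij * log(p_ij / q_ij)."""
    n = len(P)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0.0)

def tsne_embed(P, d=2, n_iter=300, learning_rate=0.5, seed=0):
    """Minimize KL(P || Q) by plain gradient descent; returns n rows of d coordinates."""
    rng = random.Random(seed)
    n = len(P)
    # Random initialization of the low-dimensional data, close to the origin.
    Y = [[rng.gauss(0.0, 1e-2) for _ in range(d)] for _ in range(n)]
    for _ in range(n_iter):
        W, Q = low_dim_affinities(Y)
        grad = [[0.0] * d for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                coeff = 4.0 * (P[i][j] - Q[i][j]) * W[i][j]
                for k in range(d):
                    grad[i][k] += coeff * (Y[i][k] - Y[j][k])
        # Update all points at once, using gradients from the same iteration.
        for i in range(n):
            for k in range(d):
                Y[i][k] -= learning_rate * grad[i][k]
    return Y
```

Reference implementations usually add momentum and early exaggeration and use much larger learning rates on large samples; those refinements are omitted here to keep the loop readable.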
The dimension-reduced cell data are clustered with the K-means algorithm, which uses Euclidean distance to measure the similarity between a sample and each cluster. The algorithm is as follows:
Step 1: with K the number of clusters, randomly select K cluster centroids as the initial centroids;
Step 2: repeat the following until convergence to obtain the classification labels:
1. For each sample, determine the class it belongs to: compute its distance to each of the K centroids and take the nearest class as the sample's new class;
2. For each class, recompute the centroid; if the iteration limit is reached or the cluster centroids no longer change (or change very little), stop clustering.
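The two K-means steps above can be sketched as follows. This is a minimal pure-Python illustration; the function name `kmeans`, the tolerance `tol`, and the handling of empty clusters are assumptions made for the example and are not specified in the patent.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick K distinct samples at random as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2.1: assign every sample to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Step 2.2: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # assumption: an empty cluster keeps its centroid
        shift = max(sq_dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:  # centroids unchanged or nearly so: stop
            break
    return labels, centroids
```

In the method described here, `points` would be the rows of the dimension-reduced matrix (d = 2 or 3) and `k` the number of expected cell populations.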
The experimental data were measured on the laboratory's FACSCalibur flow cytometer from BD (Becton, Dickinson and Company, USA). The human peripheral blood cells comprise lymphocytes, neutrophils, monocytes, and broken cells with their impurities; the surface molecules CD3, CD19, CD56 and CD5 were labeled with fluorescein isothiocyanate (FITC), phycoerythrin (PE), allophycocyanin (APC), and peridinin-chlorophyll-protein complex (PerCP), respectively. The experimental sample contains 3800 cells. The flow cytometry data comprise 14 parameters, namely the pulse height (H), pulse area (A), and pulse width (W) of the forward scattered light, the side scattered light, and the four-color fluorescence. The distributions of the four cell classes are first fitted on the basis of statistical theory, and grouping experiments are then carried out with the algorithm to obtain the grouping results.

Claims (3)

1. An automatic grouping method for multi-parameter flow cytometry data, characterized by comprising the following steps:
Step 1: grouping of the multi-parameter data based on t-SNE:
Input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the perplexity parameter of the loss function;
Initialize the sample matrix X, compute the corresponding pairwise distance matrix, and compute the conditional probabilities $p_{j|i}$ at the fixed perplexity;
Let the joint probability distribution be $p_{ij} = (p_{j|i} + p_{i|j}) / (2n)$, and randomly initialize the low-dimensional data;
Enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; optimize iteratively, updating the low-dimensional data; the matrix obtained after the iteration ends is the principal component parameter matrix after dimensionality reduction;
Step 2: cluster the principal component parameter matrix with the K-means algorithm:
Randomly select K cluster centroids as the initial centroids;
For each sample, determine the class it belongs to: compute its distance to each of the K centroids and take the nearest class as the sample's class. For each class, recompute the centroid. Repeat this process until convergence, that is, until the cluster centroids no longer change (or change very little) or the iteration limit is reached; the resulting assignments are the classification labels.
2. The automatic grouping method for multi-parameter flow cytometry data according to claim 1, characterized in that the dimension d is 2 or 3.
3. The automatic grouping method for multi-parameter flow cytometry data according to claim 1, characterized in that the perplexity is 30.
CN201910204433.1A 2019-03-18 2019-03-18 A kind of automatic grouping method of multi-parameter stream data Pending CN110197193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910204433.1A CN110197193A (en) 2019-03-18 2019-03-18 A kind of automatic grouping method of multi-parameter stream data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910204433.1A CN110197193A (en) 2019-03-18 2019-03-18 A kind of automatic grouping method of multi-parameter stream data

Publications (1)

Publication Number Publication Date
CN110197193A true CN110197193A (en) 2019-09-03

Family

ID=67751769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910204433.1A Pending CN110197193A (en) 2019-03-18 2019-03-18 A kind of automatic grouping method of multi-parameter stream data

Country Status (1)

Country Link
CN (1) CN110197193A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110610212A (en) * 2019-09-20 2019-12-24 云南电网有限责任公司电力科学研究院 Fault classification method and fault classification device for transformer of power distribution network
CN113188981A (en) * 2021-04-30 2021-07-30 天津深析智能科技发展有限公司 Automatic analysis method of multi-factor cytokine
CN114545167A (en) * 2022-02-23 2022-05-27 四川大学 Cable terminal partial discharge pulse classification method based on t-SNE algorithm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096066A (en) * 2016-08-17 2016-11-09 盐城工学院 The Text Clustering Method embedded based on random neighbor
CN106548205A (en) * 2016-10-21 2017-03-29 北京信息科技大学 A kind of fast automatic point of group of flow cytometry data and circle door method
CN106548204A (en) * 2016-11-01 2017-03-29 北京信息科技大学 The fast automatic grouping method of Flow cytometry data
US20180372726A1 (en) * 2017-05-16 2018-12-27 The Chinese University Of Hong Kong Integrative single-cell and cell-free plasma rna analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
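孟晓辰 et al.: "Automatic grouping method for flow cytometry data based on the t-distributed neighborhood embedding algorithm", Journal of Biomedical Engineering (《生物医学工程学杂志》) *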

Similar Documents

Publication Publication Date Title
CN106248559B (en) A kind of five sorting technique of leucocyte based on deep learning
US6944338B2 (en) System for identifying clusters in scatter plots using smoothed polygons with optimal boundaries
CN110197193A (en) A kind of automatic grouping method of multi-parameter stream data
CN104200114B (en) Flow cytometry data rapid analysis method
CN108961208A (en) A kind of aggregation leucocyte segmentation number system and method
Junos et al. An optimized YOLO‐based object detection model for crop harvesting system
US20240044904A1 (en) System, method, and article for detecting abnormal cells using multi-dimensional analysis
JP4521490B2 (en) Similar pattern search device, similar pattern search method, similar pattern search program, and fraction separation device
CN108052886B (en) A kind of puccinia striiformis uredospore programming count method of counting
CA2969912A1 (en) Automated flow cytometry analysis method and system
US20170322137A1 (en) Method and system for characterizing particles using a flow cytometer
CN108509982A (en) A method of the uneven medical data of two classification of processing
CN101981446A (en) Method and system for analysis of flow cytometry data using support vector machines
US9183237B2 (en) Methods and apparatus related to gate boundaries within a data space
CN112347894B (en) Single plant vegetation extraction method based on transfer learning and Gaussian mixture model separation
CN110059568A (en) Multiclass leucocyte automatic identifying method based on deep layer convolutional neural networks
CN112017743B (en) Automatic generation platform and application of disease risk evaluation report
Moraes et al. A decision-tree approach for the differential diagnosis of chronic lymphoid leukemias and peripheral B-cell lymphomas
CN107389536A (en) Fluidic cell particle classifying method of counting based on density distance center algorithm
CN113316713A (en) Adaptive sorting for particle analyzers
CN108257124A (en) A kind of white blood cell count(WBC) method and system based on image
CN106548203A (en) A kind of fast automatic point of group of multiparameter flow cytometry data and gating method
CN108038352A (en) Combination difference analysis and the method for association rule mining full-length genome key gene
CN110163869A (en) A kind of image repeat element dividing method, smart machine and storage medium
CN112348360A (en) Chinese medicine production process parameter analysis system based on big data technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190903