CN110197193A

CN110197193A - Automatic grouping method for multi-parameter flow cytometry data

Info

Publication number: CN110197193A
Application number: CN201910204433.1A
Authority: CN
Inventors: 孟晓辰; 祝连庆; 娄小平; 董明利; 于明鑫; 刘锋; 宋言明
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2019-03-18
Filing date: 2019-03-18
Publication date: 2019-09-03

Abstract

An automatic grouping method for multi-parameter flow cytometry data, comprising the following steps. Step 1, grouping of multi-parameter data based on t-SNE: input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the loss-function parameter perplexity; initialize the sample matrix X, compute the corresponding distance matrix, and compute the conditional probabilities p_{j|i} using the fixed perplexity; enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; optimize iteratively and update the low-dimensional data; the matrix obtained after iteration is the principal component parameter matrix after dimensionality reduction. Step 2, cluster the principal component parameter matrix data using the K-means algorithm: randomly select K cluster centroids as the initial centroids.

Description

Automatic grouping method for multi-parameter flow cytometry data

Technical field

The present invention relates to a method for detecting human peripheral blood cells by flow cytometry and for rapidly and automatically grouping the multi-parameter blood cell data, and belongs to the field of biomedicine.

Technical background

Flow cytometry is a technique for rapid multi-parameter analysis or sorting of cells or other particles in suspension. As medicine advances and disease diagnosis becomes more refined, the number of parameters a flow cytometer can detect has multiplied, and fast, accurate analysis of multi-parameter flow cytometry data is key to improving the efficiency of clinical diagnosis. A flow cytometer consists of four main parts: the optical system, the flow chamber and fluidics drive system, the photoelectric detection system, and the signal processing system. Part of the work of the signal processing system is to analyze large volumes of multi-color, multi-parameter flow cytometry data, which is difficult. The traditional approach to analyzing multi-color flow data is manual gating in dedicated software, based on the light-scatter or fluorescence characteristics of the cells: the analyst chooses, from experience, two fluorescence signal parameters as the horizontal and vertical axes, draws a two-dimensional scatter plot, and delimits the region of the target cell type in the plot for analysis. As the number of cell parameters grows, traditional manual gating can no longer meet clinical testing needs, and has the following main problems:

(1) Manual gating lacks objectivity. The expert selects two of the many fluorescence features from experience to draw the scatter plot, and both the drawing of gates and the judgment of cell populations vary from person to person, with no quantitative standard.

(2) The reproducibility of analysis results is poor. There is no standard, unified way of drawing manual gates for different data.

(3) The operator needs a specialist background. Flow data analysis software is dedicated flow cytometer software, and the medical knowledge it involves is not something every user has, which limits its use.

(4) Differences between features in multidimensional data cannot be identified accurately. The analysis can display only two-dimensional features and look for differences there, whereas the features of multi-color, multi-parameter, high-dimensional flow data can only be shown fully in a multidimensional space.

(5) The process is cumbersome, inefficient and wasteful of resources. Manual analysis consumes manpower and time, and its results are often unreliable.

To address the shortcomings of manual gating, some experts and scholars have explored automatic analysis methods for flow cytometry data, but most of this work studies automatic clustering of cells, and few methods take the distribution of the cell populations into account. For example, K-means, the earliest algorithm used for automatic grouping of flow cytometry data, partitions the sample data by computing the Euclidean distances between sample points to achieve clustering; Sugar and Sealfon proposed an unsupervised density contour clustering algorithm based on percolation theory, which finds data peaks by constructing a histogram of the sample data and achieves fast cluster analysis of cell populations of various shapes in flow data; Qian et al. proposed a population identification algorithm based on grid-based partitioning and merging, which identifies randomly distributed cell populations in two-dimensional data according to data density features; Aghae et al. proposed a method based on hierarchical clustering; and there are also Gaussian mixture models, among others.

Summary of the invention

For cell populations whose distributions are asymmetric and tailed, the present invention proposes an automatic grouping method for multi-parameter flow cytometry data based on the manifold-learning algorithm t-distributed stochastic neighbor embedding (t-SNE).

The purpose of this patent is achieved through the following technical solutions.

An automatic grouping method for multi-parameter flow cytometry data, characterized by comprising the following steps:

Step 1: grouping of multi-parameter data based on t-SNE:

Input the multi-parameter flow cytometry data to be reduced, and set the target dimension d and the loss-function parameter perplexity;

Initialize the sample matrix X, compute the corresponding distance matrix, and compute the conditional probabilities p_{j|i} using the fixed perplexity;

Let the joint probability distribution be p_{ij} = (p_{i|j} + p_{j|i}) / (2n), where n is the number of samples, and randomly initialize the low-dimensional data;

Enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; optimize iteratively and update the low-dimensional data; the matrix obtained after iteration is the principal component parameter matrix after dimensionality reduction;

Step 2: cluster the principal component parameter matrix data using the K-means algorithm:

Randomly select K cluster centroids as the initial centroids;

For each sample, determine the class to which it belongs: compute its distance to each of the K centroids and select the nearest class as the class of the sample; for each class, recompute its centroid; if the iteration limit is reached or the cluster centroids no longer change (or change very little), terminate the clustering; repeat the above process until convergence to obtain the classification labels.

As a further improvement, the dimension d is 2 or 3.

As a further improvement, the perplexity is 30.
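The role of the perplexity parameter can be illustrated with a short sketch. It assumes the standard t-SNE formulation, which the text above does not spell out: p_{j|i} is a Gaussian kernel over squared Euclidean distances with a per-sample bandwidth, and perplexity is defined as 2^H, where H is the Shannon entropy (in bits) of the distribution p_{.|i}. The function name `conditional_probabilities`, the binary search, the tolerance and the random stand-in data are illustrative choices, not taken from the patent.

```python
import numpy as np

def conditional_probabilities(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Return P with P[i, j] = p_{j|i}; each row is tuned to the target perplexity."""
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of samples.
    sq = np.sum(X ** 2, axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    P = np.zeros((n, n))
    target_entropy = np.log2(perplexity)
    others = ~np.eye(n, dtype=bool)            # mask that excludes j == i
    for i in range(n):
        d_i = D[i, others[i]]                  # distances from sample i to all others
        beta, lo, hi = 1.0, 0.0, np.inf        # beta = 1 / (2 * sigma_i^2)
        for _ in range(max_iter):
            w = np.exp(-(d_i - d_i.min()) * beta)   # shifted for numerical stability
            p = w / w.sum()
            entropy = -np.sum(p * np.log2(p + 1e-12))
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:       # too flat: narrow the kernel
                lo = beta
                beta = beta * 2.0 if hi == np.inf else (beta + hi) / 2.0
            else:                              # too peaked: widen the kernel
                hi = beta
                beta = (beta + lo) / 2.0
        P[i, others[i]] = p
    return P

# Random stand-in data (assumption): 100 events with 14 parameters each.
X_demo = np.random.default_rng(0).normal(size=(100, 14))
P_demo = conditional_probabilities(X_demo, perplexity=30.0)
```

Under this definition, a perplexity of 30 means each sample effectively treats about 30 other samples as its neighbors, so the value must be smaller than the number of samples.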

With the analysis method of the invention, the results are clearer and more accurate, and the operation is simple. The method outperforms manual gating and other computational methods.

Description of the drawings

Fig. 1: flow chart of t-SNE processing of multi-parameter flow cytometry data;

Fig. 2: cell clustering with the K-means algorithm;

Fig. 3: fitted distributions of the four cell populations;

Fig. 4: cell grouping strategy and expert analysis results;

Fig. 5: relationship between the t-SNE target dimension and the clustering indices;

Fig. 6: automatic grouping result obtained with the method of the invention;

Fig. 7: comparison of grouping accuracy for the four cell classes using KPCA and t-SNE;

Fig. 8: evaluation index results calculated for the method of the invention.

Specific embodiment

For cell populations whose distributions are asymmetric and tailed, the present invention proposes an automatic grouping method for multi-parameter flow cytometry data based on the manifold-learning algorithm t-distributed stochastic neighbor embedding (t-SNE). The t-SNE algorithm performs well, has low computational complexity and gives good visualization, and has been applied to image detection, fault monitoring, voice recognition and other areas. Reducing the dimensionality of the raw data with t-SNE retains more of the characteristic information of the data and extracts the feature principal components with the highest contribution; the first two or three principal components are chosen as coordinate axes to draw a visual scatter plot. For automatic clustering, the K-means algorithm is combined with the dimensionality-reduction algorithm to cluster the sample data automatically.

The t-SNE algorithm assumes that the sample data lie on a statistical manifold and maps the sample points to probability distributions, so that the two probability distributions in the high-dimensional and low-dimensional spaces are as similar as possible. In the low-dimensional space, the similarity between two points is expressed with a t-distribution instead of a Gaussian distribution. The experimental sample data include monocytes, broken cells and impurity populations, which are asymmetrically distributed and have tails. A t-distribution is less affected by outliers and better characterizes the overall features of the data, which is exactly what cell grouping requires.
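The effect of replacing the Gaussian with a t-distribution can be seen by comparing the two unnormalized similarity kernels directly. This is a minimal sketch under assumptions: the Gaussian is written with 2σ² = 1 for simplicity, the t-distribution has one degree of freedom as in standard t-SNE, and the function names are illustrative.

```python
import math

def gaussian_kernel(dist):
    """Unnormalized Gaussian similarity exp(-dist^2), assuming 2 * sigma^2 = 1."""
    return math.exp(-dist ** 2)

def student_t_kernel(dist):
    """Unnormalized Student-t similarity (one degree of freedom): 1 / (1 + dist^2)."""
    return 1.0 / (1.0 + dist ** 2)

# Compare how quickly each similarity decays as two points move apart.
distances = [0.0, 1.0, 2.0, 4.0]
table = [(r, gaussian_kernel(r), student_t_kernel(r)) for r in distances]
for r, g, t in table:
    print(f"distance={r:.1f}  gaussian={g:.2e}  student_t={t:.2e}")
```

The Student-t similarity decays polynomially rather than exponentially, so distant points such as those in the tail of a population still carry non-negligible weight in the low-dimensional map.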

The main steps of grouping multi-parameter data based on t-SNE are as follows:

Step 1: input the multi-parameter flow cytometry data to be reduced, and set the target dimension d (the number of principal components retained) and the loss-function parameter perplexity = 30 (default value);

Step 2: initialize the sample matrix X, compute the corresponding distance matrix, and compute the conditional probabilities p_{j|i} using the fixed perplexity;

Step 3: let the joint probability distribution be p_{ij} = (p_{i|j} + p_{j|i}) / (2n), where n is the number of samples, and randomly initialize the low-dimensional data;

Step 4: start the optimization and enter the iteration loop:

Compute the joint probabilities in the low-dimensional space and compute the gradient;

Optimize iteratively and update the low-dimensional data; the matrix obtained after iteration is the new principal component parameter matrix after dimensionality reduction.
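Steps 3 and 4 above can be sketched in NumPy as follows. This is a minimal sketch, not the patent's implementation: it assumes the standard t-SNE loss (the KL divergence between the joint distributions P and Q) and its usual gradient, uses plain gradient descent without momentum or early exaggeration, and all function names, the learning rate and the iteration count are illustrative. To keep the example self-contained, the demo builds p_{j|i} from a Gaussian kernel with one fixed bandwidth on synthetic data, which is a stand-in for the perplexity-based calculation of step 2.

```python
import numpy as np

def joint_from_conditional(P_cond):
    """Step 3: p_ij = (p_{i|j} + p_{j|i}) / (2n), where P_cond[i, j] = p_{j|i}."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)

def low_dim_affinities(Y):
    """Student-t kernel w_ij = 1 / (1 + |y_i - y_j|^2) and q_ij = w_ij / sum(w)."""
    sq = np.sum(Y ** 2, axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def kl_divergence(P, Q):
    """Loss: KL(P || Q), summed over pairs with p_ij > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def tsne_embed(P, d=2, n_iter=500, learning_rate=1.0, seed=0):
    """Step 4: iterate gradient descent on the low-dimensional coordinates Y."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(P.shape[0], d))   # random low-dimensional start
    history = []
    for _ in range(n_iter):
        Q, W = low_dim_affinities(Y)
        history.append(kl_divergence(P, Q))
        # Gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        A = (P - Q) * W
        grad = 4.0 * (np.diag(A.sum(axis=1)) - A) @ Y
        Y = Y - learning_rate * grad
    return Y, history

# Synthetic stand-in data (assumption): 3 populations, 30 events each, 14 parameters.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 14))
X = np.vstack([c + rng.normal(size=(30, 14)) for c in centers])
sq = np.sum(X ** 2, axis=1)
D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
K = np.exp(-D / 28.0)                  # fixed bandwidth, NOT perplexity-calibrated
np.fill_diagonal(K, 0.0)
P_cond = K / K.sum(axis=1, keepdims=True)

P = joint_from_conditional(P_cond)
Y, history = tsne_embed(P, d=2)
```

The rows of `Y` play the role of the reduced principal component parameter matrix that is passed on to clustering, and `history` records the loss at each iteration so that convergence can be checked.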

The cell data after dimensionality reduction are clustered with the K-means algorithm, which uses Euclidean distance to measure the similarity between a sample and each cluster. The algorithm is described as follows:

Step 1: with K the number of clusters, randomly select K cluster centroids as the initial centroids;

Step 2: repeat the following procedure until convergence to obtain the classification labels:

1. For each sample, determine the class to which it belongs: compute its distance to each of the K centroids, then select the nearest class as the new class of the sample;

2. For each class, recompute its centroid; if the iteration limit is reached or the cluster centroids no longer change (or change very little), terminate the clustering.
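The two K-means steps above can be sketched as follows. The sketch makes a few assumptions not stated in the text: the initial centroids are drawn from the samples themselves, an empty class keeps its previous centroid, and "varies very little" is implemented as a tolerance `tol` on the total centroid shift. The function name, the parameter defaults and the synthetic two-dimensional input are illustrative.

```python
import numpy as np

def kmeans(Y, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster the rows of Y into k classes; return (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples at random as the initial centroids.
    centroids = Y[rng.choice(Y.shape[0], size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step 2.1: assign each sample to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 2.2: recompute each centroid as the mean of its members.
        new_centroids = centroids.copy()
        for c in range(k):
            members = Y[labels == c]
            if len(members) > 0:           # an empty class keeps its old centroid
                new_centroids[c] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:                    # centroids unchanged or barely changed
            break
    return labels, centroids

# Synthetic stand-in for a 2-D embedding (assumption): 4 groups of 50 points.
rng = np.random.default_rng(1)
group_centers = np.array([[0.0, 0.0], [12.0, 0.0], [0.0, 12.0], [12.0, 12.0]])
Y_demo = np.vstack([c + rng.normal(size=(50, 2)) for c in group_centers])
labels, centroids = kmeans(Y_demo, k=4)
```

Because the initial centroids are random, different seeds can converge to different partitions; running the function with several seeds and keeping the partition with the smallest within-class distance is a common safeguard.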

The experimental data of the invention were measured on the laboratory's existing FACSCalibur flow cytometer from BD (Becton, Dickinson and Company), USA. The human peripheral blood cells include lymphocytes, neutrophils, monocytes, and broken cells and impurities. The surface molecules CD3, CD19, CD56 and CD5 are labeled with fluorescein isothiocyanate (FITC), phycoerythrin (PE), allophycocyanin (APC) and peridinin-chlorophyll-protein complex (PerCP), respectively. The experimental sample contains 3800 cells. The flow data comprise 14 parameters, namely the pulse height (H), pulse area (A) and pulse width (W) of the forward-scattered light, the side-scattered light and the four-color fluorescence. First, the distributions of the four classes of cells are fitted on the basis of statistical theory; then grouping experiments are carried out with the algorithm to obtain the grouping result.

Claims

1. An automatic grouping method for multi-parameter flow cytometry data, characterized by comprising the following steps:

Step 1: grouping of multi-parameter data based on t-SNE:

Enter the iteration loop: compute the joint probabilities in the low-dimensional space and compute the gradient; optimize iteratively and update the low-dimensional data; the matrix obtained after the iteration ends is the principal component parameter matrix after dimensionality reduction;

Randomly select K cluster centroids as the initial centroids;

For each sample, determine the class to which it belongs: compute its distance to each of the K centroids and select the nearest class as the class of the sample; for each class, recompute its centroid; if the iteration limit is reached or the cluster centroids no longer change (or change very little), terminate the clustering; repeat the above process until convergence to obtain the classification labels.

2. The automatic grouping method for multi-parameter flow cytometry data according to claim 1, characterized in that the dimension d is 2 or 3.

3. The automatic grouping method for multi-parameter flow cytometry data according to claim 1, characterized in that the perplexity is 30.