CN106548204A

CN106548204A - Method for fast automatic grouping of flow cytometry data

Info

Publication number: CN106548204A
Application number: CN201610943348.3A
Authority: CN
Inventors: 张文昌; 祝连庆; 娄小平; 潘志康; 孟晓辰; 刘超; 董明利
Original assignee: Beijing Information Science and Technology University
Current assignee: Beijing Information Science and Technology University
Priority date: 2016-11-01
Filing date: 2016-11-01
Publication date: 2017-03-29

Abstract

The present invention provides a method for fast, automatic grouping of flow cytometry data. The method includes the following steps. Step 1: process the flow cytometry data using principal component analysis, with the following sub-steps: 1) standardize the sample matrix X to obtain the standardized matrix X^*; 2) compute its correlation coefficient matrix and perform eigendecomposition to obtain the eigenvalues (λ₁ ≥ λ₂ ≥ … ≥ λ_p) and their corresponding eigenvectors a₁, a₂, …, a_p; 3) determine the number k of principal components according to their variance contribution rates; 4) from the eigenvectors U = [a₁, a₂, …, a_k] corresponding to the first k principal components, obtain the matrix W = X^*U formed by the k principal component vectors of the sample data. Step 2: cluster the flow cytometry events with the improved K-means algorithm to obtain cluster labels. Step 3: set the principal components with the largest contribution rates as the coordinate axes and draw a scatter plot. Step 4: realize automatic grouping.

Description

A fast and automatic clustering method for flow cytometry data

Technical field

The invention relates to the field of biomedical detection, and in particular to a method for fast, automatic grouping of flow cytometer data.

Background

The flow cytometer has become the most important tool for biological research and clinical diagnosis. Flow cytometry is a technique for multi-parameter, rapid analysis or sorting of suspended cells or other particles. A flow cytometer can measure several physical and chemical properties of a single cell: from each cell it obtains scattered light signals (SC), which represent cell volume and granularity, and multiple fluorescence pulse signals (FL), which represent the content of each antigen, and it extracts characteristic parameters of these signals such as peak value, pulse width and area. The scattered light and fluorescence signals induced by each cell are recorded as a single event, and all events together form the complete flow data of the measured cell population.

Flow cytometry data analysis is one of the difficult parts of flow cytometry; its main purpose is to identify and delimit the cell subpopulations in a sample. The data are usually visualized with two-dimensional scatter plots, each showing the parameters of two measurement channels, which may be forward scattered light (FSC), side scattered light (SSC) or fluorescence signals. A two-dimensional scatter plot, however, can analyze only two parameters at a time, while multi-parameter flow data are high-dimensional and large. If the flow data have n parameters and two of them are chosen as the abscissa and ordinate, the number of scatter plots that can be drawn is n(n − 1)/2. In a scatter plot drawn with arbitrarily chosen axis parameters, the cell subpopulations are usually not clearly separated; the operator needs a high level of expertise and must select specific parameter combinations to obtain a satisfactory grouping result, which makes the process cumbersome and time-consuming.
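To give a sense of how quickly the number of candidate scatter plots grows, the short sketch below counts the unordered pairs of distinct parameters. The function name `scatter_plot_count` is hypothetical, and the assumption that each plot corresponds to an unordered pair of axes is ours; the value n = 11 is the parameter count of the example data set described later in this document.

```python
from math import comb


def scatter_plot_count(n_params: int) -> int:
    """Count the 2-D scatter plots obtainable from n_params parameters.

    Assumption: a plot is an unordered pair of distinct parameters,
    so swapping the two axes does not count as a new plot.
    """
    return comb(n_params, 2)


# Example: the 11-parameter data set used in the embodiment.
print(scatter_plot_count(11))  # 11 * 10 / 2 = 55
```

Even for a modest panel, inspecting every pair by hand is impractical, which is the motivation for choosing the axes automatically.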

Summary of the invention

To solve the above problems, the object of the present invention is to provide a method for fast, automatic grouping of flow cytometer data, the method comprising the following steps. Step 1: process the flow cytometry data using principal component analysis, with the following sub-steps: 1) standardize the sample matrix X to obtain the standardized matrix X^*; 2) compute its correlation coefficient matrix and perform eigendecomposition to obtain the eigenvalues (λ₁ ≥ λ₂ ≥ … ≥ λ_p) and their corresponding eigenvectors a₁, a₂, …, a_p; 3) determine the number k of principal components according to their variance contribution rates; 4) from the eigenvectors U = [a₁, a₂, …, a_k] corresponding to the first k principal components, obtain the matrix W = X^*U formed by the k principal component vectors of the sample data. Step 2: cluster the flow cytometry events with the improved K-means algorithm to obtain cluster labels. Step 3: set the principal components with the largest contribution rates as the coordinate axes and draw a scatter plot. Step 4: realize automatic grouping.

Preferably, step 2 specifically comprises: determining one data point as the first initial cluster center; selecting the data point farthest from the first cluster center as the second cluster center; selecting the data point farthest from the first two cluster centers as the third cluster center; and so on, until n initial cluster centers have been determined; and finally performing iterative computation on the distances from each data point to the cluster centers to achieve clustering.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory, and should not be used to limit the content claimed by the present invention.

Description of drawings

With reference to the accompanying drawings, further objects, functions and advantages of the present invention will be clarified through the following description of its embodiments, wherein:

Fig. 1 is a flowchart of the method of the present invention for fast, automatic grouping of flow cytometer data;

Fig. 2 is a schematic diagram of the result obtained by drawing a two-dimensional scatter plot with the traditional manual grouping method;

Fig. 3 shows the contribution rates and cumulative contribution rates of the principal components obtained with the PCA method of the present invention;

Fig. 4 is a schematic diagram of the grouping result obtained with the method of the present invention.

Detailed description

The objects and functions of the present invention, and the methods for achieving them, will be clarified by reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below; it can be implemented in different forms. The description serves only to help those skilled in the relevant art gain a comprehensive understanding of the specific details of the invention.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar components, or the same or similar steps.

The present invention proposes applying principal component analysis (PCA) to multi-parameter flow data analysis. The flow data undergo dimensionality reduction and feature extraction, and the two principal component variables that best reflect the differences between cell subpopulations are used as the abscissa and ordinate of a two-dimensional scatter plot, on which the sample is grouped and analyzed.

PCA is a commonly used multivariate statistical analysis technique. Based on the principle of variance maximization, it uses a linear transformation to select a small number of important variables in place of the original variables, reducing the data dimension while preserving as much of the useful information in the data as possible. The PCA algorithm first standardizes the sample matrix X to obtain the standardized matrix X^*; it then computes the correlation coefficient matrix and performs eigendecomposition to obtain the eigenvalues (λ₁ ≥ λ₂ ≥ … ≥ λ_p) and their corresponding eigenvectors a₁, a₂, …, a_p; next, it determines the number k of principal components according to their variance contribution rates; finally, from the eigenvectors U = [a₁, a₂, …, a_k] corresponding to the first k principal components, it obtains the matrix W = X^*U formed by the k principal component vectors of the sample data. Multi-parameter flow cytometry data are large and high-dimensional. The PCA method reduces the dimensionality and redundant information of the flow cytometry data; the principal component variables are selected as new feature variables, the coordinate axes are set automatically, and the scatter plot is drawn, realizing automatic grouping.
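The four PCA sub-steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `pca_correlation` and the cumulative contribution threshold of 0.85 are hypothetical (the text does not state how k is chosen from the contribution rates), and every column of X is assumed to have non-zero variance.

```python
import numpy as np


def pca_correlation(X, threshold=0.85):
    """PCA on the correlation matrix of X (rows = events, columns = parameters).

    threshold is an assumed cumulative contribution rate used to pick k.
    Returns the projected data W, sorted eigenvalues, contribution rates and k.
    """
    X = np.asarray(X, dtype=float)

    # 1) Standardize each column to zero mean and unit sample variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # 2) Correlation matrix and its eigendecomposition (symmetric, so eigh).
    R = np.corrcoef(X_std, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 3) Variance contribution rates and the smallest k reaching the threshold.
    contrib = eigvals / eigvals.sum()
    cumulative = np.cumsum(contrib)
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(eigvals))

    # 4) Project the standardized data onto the first k eigenvectors.
    U = eigvecs[:, :k]
    W = X_std @ U
    return W, eigvals, contrib, k
```

With this construction the sample variance of each column of W equals the corresponding eigenvalue, which is why the eigenvalues can be read directly as contribution rates.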

The K-means algorithm is a typical distance-based clustering algorithm that is fast, simple and efficient. The present method uses an improved K-means algorithm to gate cells automatically. The improvement lies mainly in how the positions of the initial cluster centers are determined: the traditional K-means algorithm usually selects n values at random as the initial cluster centers, which makes the clustering result unstable. The present method instead first determines one data point as the first initial cluster center, then selects the data point farthest from the first cluster center as the second cluster center, then selects the data point farthest from the first two cluster centers as the third cluster center, and so on, until n initial cluster centers have been determined; finally, it performs iterative computation on the distances from each data point to the cluster centers to achieve clustering.
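The initialization described above can be sketched as a farthest-point selection followed by standard K-means iterations. Several details are assumptions, since the text does not fix them: the first center is taken to be the point at a caller-supplied index (`first_index`, hypothetical), "farthest from the previous centers" is interpreted as the largest distance to the nearest already chosen center, and Euclidean distance is used. The function names are illustrative only.

```python
import numpy as np


def farthest_point_init(X, n_clusters, first_index=0):
    """Pick initial centers: each new center is the point whose distance
    to its nearest already-chosen center is largest (assumed interpretation)."""
    X = np.asarray(X, dtype=float)
    centers = [X[first_index]]
    # Distance from every point to its nearest chosen center so far.
    min_dist = np.linalg.norm(X - centers[0], axis=1)
    for _ in range(1, n_clusters):
        idx = int(np.argmax(min_dist))
        centers.append(X[idx])
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[idx], axis=1))
    return np.array(centers)


def kmeans(X, n_clusters, first_index=0, max_iter=100):
    """K-means with deterministic farthest-point initialization."""
    X = np.asarray(X, dtype=float)
    centers = farthest_point_init(X, n_clusters, first_index)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Because no random draw is involved, repeated runs on the same data with the same `first_index` give the same labels, which addresses the instability attributed to random initialization.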

The method provided by the present invention groups flow cytometer data automatically, with no need to set the scatter plot axes by hand: the first two or three principal components with the largest contribution rates obtained after processing are automatically set as the coordinate axes. In addition, the improved K-means clustering algorithm is applied to the processed flow data to obtain a classification label for each event, thereby gating the different cell subpopulations. Fig. 1 is a flowchart of the method of the present invention for fast, automatic grouping of flow cytometer data. The automatic grouping results of this method are consistent with traditional manual grouping results, while the analysis time is far shorter than that of manual analysis; this improves both the efficiency of cell grouping and the reliability of the results. The method has good application prospects in multi-parameter flow cytometry data analysis and can also be applied to other fields of biomedical data analysis. Fig. 2 is a schematic diagram of the result obtained by drawing a two-dimensional scatter plot with the traditional manual grouping method. Fig. 3 shows the contribution rates and cumulative contribution rates of the principal components obtained with the PCA method of the present invention. Fig. 4 is a schematic diagram of the grouping result obtained with the method of the present invention. A comparison of Fig. 2 and Fig. 4 shows that the grouping achieved by the present invention is superior to that of the traditional manual grouping method.

Flow cytometry experimental data from human peripheral blood lymphocytes were used as the processing object. The sample contains 4811 cells and three lymphocyte surface differentiation antigens (CD3+, CD19+ and CD56+). The flow data of each cell comprise 11 parameters: pulse height (FITC-H, PE-H, APC-H), pulse area (FSC-A, SSC-A, FITC-A, PE-A, APC-A) and pulse width (FITC-W, PE-W, APC-W).

Table 1: Eigenvalues and eigenvectors of the two principal components with the largest contribution rates (PC1 and PC2)

Table 2: Accuracy of the PCA grouping results

Other embodiments of the invention will be readily apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The description and embodiments are to be considered exemplary only, with the true scope and spirit of the invention defined by the claims.

Claims

1. A method for fast automatic grouping of flow cytometry data, said method comprising the following steps:

Step 1, using the principal component analysis method to process the flow cytometry data, including the following sub-steps:

1) Standardize the sample matrix X to obtain a standardized matrix X^*;

2) Calculate its correlation coefficient matrix and perform eigendecomposition to obtain eigenvalues (λ₁ ≥ λ₂ ≥ … ≥ λ_p) and their corresponding eigenvectors a₁, a₂, …, a_p;

3) Determine the number k of principal components according to the variance contribution rate of the principal components;

4) According to the eigenvectors U = [a₁, a₂, …, a_k] corresponding to the first k principal components, obtain the matrix W = X^*U formed by the k principal component vectors of the sample data;

Step 2, using the improved K-means algorithm to cluster the flow cytometry events to obtain cluster labels;

Step 3, setting the principal components with the largest contribution rates as the coordinate axes to draw a scatter plot;

Step 4, realizing automatic grouping.

2. The method according to claim 1, said step 2 specifically comprising: determining one data point as the first initial cluster center; selecting the data point with the largest distance from the first cluster center as the second cluster center; selecting the data point with the largest distance from the first two cluster centers as the third cluster center; and so on, until n initial cluster centers are determined; and finally performing iterative computation on the distances from each data point to the cluster centers to achieve clustering.