CN110718269A - Life science calculation container pack system and method - Google Patents
- Publication number
- CN110718269A (application CN201910896021.9A)
- Authority
- CN
- China
- Prior art keywords
- module
- data
- sequence data
- sequence
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Data Mining & Analysis (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Genetics & Genomics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Molecular Biology (AREA)
- Analytical Chemistry (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Chemical & Material Sciences (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
The invention relates to the technical field of computing container packs, and in particular to a life science calculation container pack system and method. The system comprises a gene data acquisition module, a gene data analysis module and a data calculation container pack. The gene data analysis module comprises a sequence data mining module, which mines sequence data, and a sequence data clustering module, which clusters and analyzes the sequence data. With this system and method, gene data information can be recorded comprehensively and a complete gene database can be established; the sequence data are mined in depth and the important data in them are extracted; and the sequence data clustering module groups a set of sequence data into several classes of similar objects and analyzes them, yielding knowledge of the inherent structure of the population.
Description
Technical Field
The invention relates to the technical field of computing container packs, in particular to a life science computing container pack system and a life science computing container pack method.
Background
Life science is the science that studies the structure, function, occurrence and development laws of organisms (including plants, animals and microorganisms), and it is a part of natural science. It aims to clarify and control life activities, transform nature, and provide practical services for agriculture, industry, medicine and the like. However, existing life science data storage only stores data: the life science data cannot be analyzed or clustered, important data in the sequence data are difficult to extract, and no understanding of the inherent structure of a population can be obtained.
Disclosure of Invention
It is an object of the present invention to provide a life science calculation container pack system and method that address one or more of the deficiencies set forth in the background above.
In order to achieve the above object, in one aspect, the present invention provides a life science calculation container package system, which includes a gene data acquisition module, a gene data analysis module, and a data calculation container package, wherein the gene data analysis module includes a sequence data mining module and a sequence data clustering module, the sequence data mining module is configured to mine sequence data, and the sequence data clustering module is configured to perform cluster analysis on the sequence data.
Preferably, the sequence data mining module comprises a sequence data representation module, a sequence data training module, a sequence data dimension reduction module, a vector distance included angle calculation module and a sequence data identification module, wherein the sequence data representation module is used for digitally representing the DNA sequence; the sequence data training module is used for training data and setting a distinguishing mark for the training data; the sequence data dimension reduction module is used for carrying out the dimension reduction calculation on the DNA sequence; the vector distance included angle calculation module is used for calculating the distances and included angles of different vectors; and the sequence data identification module is used for identifying and classifying data.
Preferably, the sequence data clustering module comprises a static clustering module, a time series clustering module or an HMM clustering module.
Preferably, the static clustering module comprises a gene similarity measuring module and a hierarchical clustering calculating module, wherein the gene similarity measuring module is used for measuring the similarity of genes, and the hierarchical clustering calculating module is used for performing hierarchical clustering on the genes.
Preferably, the time series clustering module comprises an original data transformation module and a selection expression spectrum module, wherein the original data transformation module is used for transforming original data, and the selection expression spectrum module is used for selecting expression spectrums.
Preferably, the HMM clustering module includes a hidden state set module, an output set module, an initial state probability matrix module, a state transition probability matrix module, and an output probability matrix module, and the hidden state set module is used for including hidden data that cannot be obtained by direct observation; the output set module outputs gene data which can be obtained through direct observation; the initial state probability matrix module is used for representing the probability of each hidden state at an initial moment; the state transition probability matrix module is used for representing the probability of transition from one state to another state; the output probability matrix module is used for representing the probability of outputting a certain output value under a certain state.
Preferably, the gene data acquisition module comprises a scientific naming module, a source species classification module, a reference document entry module and a genome data entry module, wherein the scientific naming module is used for entering the scientific names of gene sequences; the source species classification module is used for recording source species classification information; the reference document entry module is used for recording reference document information; and the genome data entry module is used for recording genome data.
Preferably, the calculation container pack comprises a data storage module, an operation stationing module, a centralized deployment module, an operation module, a data processing module and a node state module, wherein the data storage module is used for storing data; the operation stationing module is responsible for the stationing of Verticals running in the Arda framework; the centralized deployment module is responsible for deploying the Verticals created by the Arda framework in a cluster; the operation module is responsible for services such as starting and closing Verticals running in the Arda framework; the data processing module is used for collecting, copying, migrating and outputting data; and the node state module is responsible for the state of the Verticals running in the Arda framework and the state of each node.
In another aspect, the present invention further provides a life science calculation container pack method, including any one of the above life science calculation container pack systems, comprising the following operation steps:
S1, data acquisition: entering the scientific names of gene sequences through the scientific naming module, recording source species classification information through the source species classification module, recording reference document information through the reference document entry module, and recording genome data through the genome data entry module;
S2, sequence data mining: digitally representing the DNA sequence through the sequence data representation module, training data and setting a distinguishing mark for the training data through the sequence data training module, performing the dimension reduction calculation on the DNA sequence through the sequence data dimension reduction module, calculating the distances and included angles of different vectors through the vector distance included angle calculation module, and identifying and classifying the data through the sequence data identification module;
S3, sequence data clustering: measuring the similarity of genes through the gene similarity measuring module, and performing hierarchical clustering on the genes through the hierarchical clustering calculating module;
S4, operating the calculation container pack: storing data through the data storage module, stationing the Verticals running in the Arda framework through the operation stationing module, deploying the Verticals created by the Arda framework in a cluster through the centralized deployment module, starting and closing the Verticals running in the Arda framework through the operation module, collecting, copying, migrating and outputting data through the data processing module, and tracking the state of the Verticals running in the Arda framework and the state of each node through the node state module.
Compared with the prior art, the invention has the beneficial effects that:
1. In the life science calculation container pack system and method, a gene data acquisition module is adopted to record the scientific naming, source species classification, reference documents and genome data of the gene data, so that gene data information can be recorded comprehensively and a complete gene database is established.
2. In the life science calculation container pack system and method, a sequence data mining module is adopted to carry out sequence data representation, sequence data training, sequence data dimension reduction, vector distance and included angle calculation, and sequence data identification, thereby realizing deep mining of the sequence data and extracting the important data in the sequence data.
3. In the life science calculation container package system and method, a sequence data clustering module is adopted to group a set of sequence data into a plurality of classes consisting of similar objects, and the classes are analyzed to acquire the knowledge of the inherent structure of the population.
Drawings
FIG. 1 is an overall flow block diagram of the present invention;
FIG. 2 is a diagram of a gene data analysis module according to the present invention;
FIG. 3 is a block diagram of sequence data mining according to the present invention;
FIG. 4 is a diagram of a sequence data clustering module of the present invention;
FIG. 5 is a diagram of a static clustering module of the present invention;
FIG. 6 is a block diagram of a temporal clustering module according to the present invention;
FIG. 7 is a diagram of an HMM clustering module according to the present invention;
FIG. 8 is a block diagram of gene data collection according to the present invention;
FIG. 9 is a diagram of a computing container package of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
In one aspect, the present invention provides a life science calculation container package system, as shown in fig. 1 to fig. 3, including a gene data acquisition module, a gene data analysis module and a data calculation container package, wherein the gene data analysis module includes a sequence data mining module and a sequence data clustering module, the sequence data mining module is used for mining sequence data, and the sequence data clustering module is used for performing cluster analysis on the sequence data.
In this embodiment, the sequence data mining module comprises a sequence data representation module, a sequence data training module, a sequence data dimension reduction module, a vector distance included angle calculation module and a sequence data identification module, wherein the sequence data representation module is used for digitally representing a DNA sequence; the sequence data training module is used for training data and setting a distinguishing mark for the training data; the sequence data dimension reduction module is used for carrying out the dimension reduction calculation on the DNA sequence; the vector distance included angle calculation module is used for calculating the distances and included angles of different vectors; and the sequence data identification module is used for identifying and classifying data.
The sequence data mining module is designed on the basis of a manifold learning method. Let $Y \subset R^d$ be a $d$-dimensional domain in Euclidean space, and let $f: Y \to R^D$ be a smooth embedding, where $d < D$. Data points $\{y_i\} \subset Y$ are generated by a random process and then mapped by $f$ to form the observed spatial data $\{x_i = f(y_i)\} \subset R^D$. Here $Y = \{y_1, y_2, \ldots, y_N\}$ is the low-dimensional embedding, $y_i$ ($i = 1, 2, \ldots, N$) are the low-dimensional embedding vectors, and $X = \{x_1, x_2, \ldots, x_N\}$ is the given high-dimensional observation data set. The goal of manifold learning is to reconstruct $f$ and the low-dimensional vectors $y_i$ from the high-dimensional observations $x_i$.
The DNA sequence is formed by combinations of the four base units C, G, A and T, namely cytosine nucleotide C (cytidylic acid, CMP), guanine nucleotide G (guanylic acid, GMP), adenine nucleotide A (adenylic acid, AMP) and thymine nucleotide T (thymidylic acid, TMP). According to a nucleotide conversion formula, C, G, A, T and O are each mapped to a value $G_i$ ($i = 1, 2, \ldots, N$) so that every base is digitized, where the letter "O" indicates that no nucleotide is present. [The nucleotide conversion formula is not reproduced in the source text.]
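The digitization step can be illustrated with a minimal sketch. The conversion formula itself is not available in this text, so the numeric codes in `BASE_CODES`, the helper name `encode_sequence`, and the choice to pad short sequences with "O" are all assumptions made purely for illustration.

```python
# Hypothetical numeric codes: the patent's conversion formula is not
# reproduced in the text, so these values are placeholders only.
BASE_CODES = {"C": 1.0, "G": 2.0, "A": 3.0, "T": 4.0, "O": 0.0}


def encode_sequence(seq: str, length: int) -> list[float]:
    """Map a DNA string to a fixed-length numeric vector.

    Positions beyond the end of the sequence are filled with 'O'
    (no nucleotide), so every sequence yields a vector of `length` values.
    """
    seq = seq.upper()
    if len(seq) > length:
        raise ValueError("sequence is longer than the target length")
    padded = seq + "O" * (length - len(seq))
    try:
        return [BASE_CODES[base] for base in padded]
    except KeyError as exc:
        raise ValueError(f"unknown base: {exc.args[0]!r}") from None


vector = encode_sequence("CGAT", 6)
print(vector)
```

Encoding every sequence to the same length is what allows N sequences to be treated as N points in a common D-dimensional space in the dimension reduction step.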
The sequence data dimension reduction module works as follows. The $N$ DNA sequences, each with $D$ base units, are regarded as a $D$-dimensional sample set $X = [x_1, x_2, \ldots, x_N]$. It is assumed that there are enough data points and that each data point can be linearly represented by its $K$ nearest neighbors. The neighbors are found with an $\varepsilon$-neighborhood by comparing the distance $d_{ij}$ between each $x_i$ and $x_j$ [distance formula not reproduced in the source text].
When $d_{ij} < \varepsilon$, $x_j$ belongs to the neighborhood of $x_i$ and is denoted $x_{ij}$; otherwise $x_j$ does not belong to the neighborhood of $x_i$. For each sample point $x_i$ ($i = 1, 2, \ldots, N$), the weight vector $W_{ij}$ ($j = 1, 2, \ldots, K$) at which $\Phi(W)$ takes its minimum value is calculated by the Lagrange multiplier method [definition of $\Phi(W)$ not reproduced in the source text].
The weight vectors are then substituted into the function $\Phi(Y)$, and the Lagrange multiplier method is used to calculate the low-dimensional embedding $Y = \{y_1, y_2, \ldots, y_N\}$, $y_i \in R^d$, at which $\Phi(Y)$ takes its minimum value [definition of $\Phi(Y)$ not reproduced in the source text].
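The procedure described here matches the general shape of locally linear embedding (LLE). Since the formulas are not available in this text, the block below gives the standard LLE formulation as a point of reference only; it is an assumption that the patent's own formulas take this form. Here $y_{ij}$ denotes the embedding of the neighbor $x_{ij}$, and $I$ is the $d \times d$ identity matrix.

```latex
\begin{align}
d_{ij} &= \lVert x_i - x_j \rVert_2 \\
\Phi(W) &= \sum_{i=1}^{N} \Bigl\lVert x_i - \sum_{j=1}^{K} W_{ij}\, x_{ij} \Bigr\rVert^2,
  \qquad \text{subject to } \sum_{j=1}^{K} W_{ij} = 1 \\
\Phi(Y) &= \sum_{i=1}^{N} \Bigl\lVert y_i - \sum_{j=1}^{K} W_{ij}\, y_{ij} \Bigr\rVert^2,
  \qquad \text{subject to } \sum_{i=1}^{N} y_i = 0,\;
  \frac{1}{N} \sum_{i=1}^{N} y_i y_i^{\mathsf{T}} = I
\end{align}
```

In this formulation the constraints are what make the Lagrange multiplier method applicable: the weights of each point sum to one, and the embedding is centered with unit covariance so that the trivial solution $Y = 0$ is excluded.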
The Euclidean distance $d_{ij}$ between two DNA sequence vectors is called the DNA distance similarity parameter; the cosine of the included angle between two DNA sequence vectors, $\cos\theta_{ij}$, is called the DNA angle similarity parameter; the Euclidean distance $d'_{ij}$ between two low-dimensional embedding vectors is called the DNA low-dimensional distance similarity parameter; and the cosine of the included angle between two low-dimensional embedding vectors, $\cos\theta'_{ij}$, is called the DNA low-dimensional angle similarity parameter. [The corresponding formulas are not reproduced in the source text.]
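As a minimal sketch of the two kinds of similarity parameter, the functions below compute the ordinary Euclidean distance and the cosine of the included angle between two vectors. The function names are illustrative, and it is assumed that the patent's formulas are the usual textbook definitions; the same functions apply to both the original sequence vectors and the low-dimensional embedding vectors.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_of_angle(u, v):
    """Cosine of the included angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("the angle is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


d = euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
c = cosine_of_angle([1.0, 0.0], [0.0, 1.0])
print(d, c)
```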
The DNA sequence data mining method based on manifold learning comprises the following steps:
1) digitally representing the DNA sequences;
2) extracting a part of the data as training data, and setting a distinguishing mark for the training data;
3) taking the training data and the samples to be tested together as the data set X, and performing the DNA sequence dimension reduction calculation to obtain the corresponding low-dimensional embedding Y;
4) calculating the distances and included angles between different vectors in the low-dimensional embedding Y to obtain the DNA low-dimensional distance similarity parameter $d'_{ij}$ and the DNA low-dimensional angle similarity parameter $\cos\theta'_{ij}$;
5) identifying and classifying the samples according to the data obtained in step 4).
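The text does not say which rule the final identification step uses. One simple possibility, shown here strictly as an assumption, is to give each sample to be tested the distinguishing mark of the nearest training vector in the low-dimensional embedding. The function name and the example labels are hypothetical.

```python
import math


def classify_by_nearest(train_vectors, train_labels, sample):
    """Return the label of the training vector closest to `sample`.

    Nearest-neighbour assignment is an illustrative choice; the source
    does not specify the classification rule.
    """
    if not train_vectors or len(train_vectors) != len(train_labels):
        raise ValueError("need one label per training vector")
    nearest = min(
        range(len(train_vectors)),
        key=lambda i: math.dist(train_vectors[i], sample),
    )
    return train_labels[nearest]


# Low-dimensional embedding vectors of two labelled training sequences.
train = [[0.0, 0.0], [5.0, 5.0]]
labels = ["group_a", "group_b"]
predicted = classify_by_nearest(train, labels, [4.0, 4.5])
print(predicted)
```

The angle similarity parameter could be used in the same way by replacing the distance with one minus the cosine of the included angle.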
Example 2
Referring to fig. 4-7, the sequence data clustering module includes a static clustering module, a time series clustering module, or an HMM clustering module.
In this embodiment, the static clustering module includes a gene similarity measurement module and a hierarchical clustering calculation module, the gene similarity measurement module is used for measuring the similarity of genes, and the hierarchical clustering calculation module is used for performing hierarchical clustering on genes.
Further, hierarchical clustering first measures the similarity of two genes, which is in fact a correlation coefficient. Let $G_i$ denote the observed data of gene $G$ under condition $i$. For any two genes $X$ and $Y$, the similarity score is defined by the following formula:
$$S(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - X_{offset}}{\Phi_X} \right) \left( \frac{Y_i - Y_{offset}}{\Phi_Y} \right), \qquad \Phi_G = \sqrt{\sum_{i=1}^{N} \frac{(G_i - G_{offset})^2}{N}}$$
wherein $N$ represents the number of conditions, $G$ stands for $X$ or $Y$, and $G_{offset}$ represents the mean of the observed values of $G$, so that $\Phi_G$ is the standard deviation of $G$ and $S(X, Y)$ is the Pearson correlation coefficient of $X$ and $Y$. Hierarchical clustering generates a phylogenetic tree diagram in which all elements are gathered into one tree. The algorithm for hierarchical clustering of genes comprises the following steps:
1) calculating the similarity score of each pair of genes to obtain an upper triangular similarity matrix, in which each entry is the similarity score of one pair of genes;
2) scanning the similarity matrix to find its maximum value, which identifies the pair of genes with the highest similarity;
3) merging the two genes into a node that is the parent node of both, and calculating the expression profile of that node, so that the two genes are replaced by one;
4) repeating steps 1) to 3) until only one element, the root node, is left.
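The four steps above can be sketched in a few lines of standard-library Python. The similarity score is the Pearson correlation with the mean as offset. The text does not say how the expression profile of a merged node is calculated, so the size-weighted average used here is an assumption, as are the function names and the toy data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two profiles, using the mean as the offset."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / (n * std_x * std_y)


def hierarchical_cluster(profiles):
    """Merge the most similar pair until one root remains.

    `profiles` maps gene name -> expression profile. The result is a
    nested tuple whose leaves are gene names.
    """
    nodes = dict(profiles)
    sizes = {name: 1 for name in nodes}
    while len(nodes) > 1:
        keys = list(nodes)
        # Steps 1 and 2: score every pair and keep the maximum.
        _, a, b = max(
            ((pearson(nodes[p], nodes[q]), p, q)
             for i, p in enumerate(keys) for q in keys[i + 1:]),
            key=lambda item: item[0],
        )
        # Step 3: replace the pair with a parent node whose profile is
        # the size-weighted average of its children (an assumption).
        na, nb = sizes.pop(a), sizes.pop(b)
        pa, pb = nodes.pop(a), nodes.pop(b)
        parent = (a, b)
        nodes[parent] = [(na * u + nb * v) / (na + nb) for u, v in zip(pa, pb)]
        sizes[parent] = na + nb
    return next(iter(nodes))


tree = hierarchical_cluster({
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.5],
})
print(tree)
```

In the toy data, g1 and g2 rise together and are merged first; g3 falls over the same conditions and joins only at the root.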
Specifically, the time sequence clustering module comprises an original data transformation module and an expression spectrum selection module, wherein the original data transformation module is used for transforming original data, and the expression spectrum selection module is used for selecting expression spectrums.
First, a log transformation is applied to the original data: the expression level at each time point is divided by the expression level at the 1st time point, and the logarithm of the ratio is taken. After the transformation, the value at the first time point is always 0, and the values at later time points generally lie between -2 and 2. Because a logarithm has been taken, a value greater than 4 means that the expression level at that time point is far larger than that at the 1st time point, which is likely to be an error; for the robustness of the algorithm, such genes can be screened out. Since the 1st time point serves as the reference, the expression level of a gene at the 1st time point must not be too small.
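A minimal sketch of this transformation follows. The text does not state the base of the logarithm, so base 2 is an assumption, as is applying the threshold of 4 symmetrically to negative values. The function name is illustrative.

```python
import math


def log_ratio_transform(series, max_abs=4.0):
    """Return log2(value / first value) per time point.

    Returns None when the gene should be screened out: a non-positive
    expression level, or a transformed value beyond `max_abs`.
    Base 2 and the symmetric threshold are assumptions.
    """
    if not series or any(v <= 0 for v in series):
        return None
    first = series[0]
    transformed = [math.log2(v / first) for v in series]
    if any(abs(t) > max_abs for t in transformed):
        return None
    return transformed


kept = log_ratio_transform([2.0, 4.0, 1.0, 8.0])
dropped = log_ratio_transform([1.0, 64.0])
print(kept, dropped)
```

The first time point always maps to 0 because it is compared with itself, which is why a very small reference value would inflate every later ratio.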
Wherein, the algorithm steps of the expression profile selection module are as follows:
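1) let $P_1$ be the expression profile that is always adjusted up and down by one unit at successive time points, and let $R = \{P_1\}$ and $L = P \setminus \{P_1\}$, where $P$ is the set of all candidate expression profiles; this initially selects one typical expression profile;
2) let $p \in L$ be the profile whose difference from the profiles already in $R$ is maximum [expression not reproduced in the source text], and set $R = R \cup \{p\}$ and $L = L \setminus \{p\}$; this selects the expression profile that differs most from all the profiles already selected;
3) repeat step 2) a total of m-1 times and then stop, so that m expression profiles are selected in total.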
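The selection loop can be sketched as a greedy procedure. The expression that is maximized is not available in this text; the sketch assumes it is the smallest distance from a candidate to any already selected profile, and it assumes Euclidean distance as the measure of difference. The starting profile is passed in by the caller, and all names and data are illustrative.

```python
import math


def select_profiles(candidates, start, m):
    """Greedily pick `m` mutually distinct profiles, beginning with `start`.

    At each round the candidate whose nearest selected profile is
    farthest away is added (max-min rule, an assumption).
    """
    selected = [start]
    remaining = [p for p in candidates if p != start]
    while len(selected) < m and remaining:
        best = max(
            remaining,
            key=lambda p: min(math.dist(p, r) for r in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [(0, 1, 2), (0, 1, 1), (0, -1, -2), (0, 0, 0)]
chosen = select_profiles(candidates, start=(0, 1, 2), m=3)
print(chosen)
```

Because each round compares a candidate against everything already chosen, the selected set spreads out over the space of possible profiles instead of clustering around the starting profile.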
It is worth noting that the HMM clustering module comprises a hidden state set module, an output set module, an initial state probability matrix module, a state transition probability matrix module and an output probability matrix module, wherein the hidden state set module is used for containing the hidden data, which cannot be obtained through direct observation of the gene data; the output set module outputs the gene data that can be obtained through direct observation; the initial state probability matrix module is used for expressing the probability of each hidden state at the initial moment; the state transition probability matrix module is used for representing the probability of transition from one state to another state; and the output probability matrix module is used for expressing the probability of outputting a certain output value in a certain state.
The general idea of clustering genes based on the HMM method is as follows: given $n$ sequences $S_i$ with index set $I = \{1, 2, \ldots, n\}$ and a given integer $K$, calculate a partition $C = (C_1, C_2, \ldots, C_K)$ of $I$ and $K$ HMM models $L_1, L_2, \ldots, L_K$ such that the objective function takes its maximum value [objective function not reproduced in the source text],
wherein $L(S_i | L_k)$ is the likelihood function, i.e. the probability density of generating the sequence $S_i$ under the model $L_k$. The HMM-based clustering algorithm comprises the following steps:
2) generating a new assignment: each sequence $S_i$ is assigned to the model $k$ for which the likelihood function value $L(S_i | L_k)$ is largest;
3) re-estimating the parameters of the $K$ models according to the initial parameters and the sequence data assigned to each model;
4) terminating the algorithm when the change in the value of the objective function is small enough, the assignment no longer changes, or the preset maximum number of iterations is reached.
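The assignment step can be illustrated with a small discrete HMM. This is a sketch of that one step only: it evaluates the log-likelihood of each sequence under each model with the scaled forward algorithm and picks the best model. Parameter re-estimation and the stopping test are omitted, and the class name, field names and toy parameters are assumptions. The three fields correspond to the initial state, state transition and output probability matrices described above.

```python
import math
from dataclasses import dataclass


@dataclass
class HMM:
    initial: list      # initial[s]: probability of starting in state s
    transition: list   # transition[r][s]: probability of moving r -> s
    emission: list     # emission[s][o]: probability of output o in state s


def log_likelihood(model, sequence):
    """Log-probability of an output sequence (scaled forward algorithm)."""
    n = len(model.initial)
    alpha = [model.initial[s] * model.emission[s][sequence[0]] for s in range(n)]
    total = 0.0
    for t in range(len(sequence)):
        if t > 0:
            alpha = [
                sum(alpha[r] * model.transition[r][s] for r in range(n))
                * model.emission[s][sequence[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0:
            return float("-inf")
        total += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total


def assign(models, sequences):
    """Give each sequence the index of the model with the largest likelihood."""
    return [
        max(range(len(models)), key=lambda k: log_likelihood(models[k], seq))
        for seq in sequences
    ]


stay = [[0.9, 0.1], [0.1, 0.9]]
model_a = HMM([0.5, 0.5], stay, [[0.9, 0.1], [0.8, 0.2]])  # favours output 0
model_b = HMM([0.5, 0.5], stay, [[0.1, 0.9], [0.2, 0.8]])  # favours output 1
assignment = assign([model_a, model_b], [[0, 0, 1, 0], [1, 1, 1, 0]])
print(assignment)
```

Working in log space with per-step rescaling keeps the computation stable for long sequences, where raw probabilities would underflow.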
Example 3
As shown in fig. 8, the gene data acquisition module includes a scientific naming module, a source species classification module, a reference document entry module and a genome data entry module, wherein the scientific naming module is used for entering the scientific name of a gene sequence; the source species classification module is used for recording source species classification information; the reference document entry module is used for recording reference document information; and the genome data entry module is used for recording genome data.
In this embodiment, the scientific name of the gene sequence is entered through the scientific name entry module, the source species classification information is recorded through the source species classification module, the reference document information is recorded through the reference document entry module, and the genome data is recorded through the genome data entry module.
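One way to picture what the four entry modules produce is a single record per gene sequence. The sketch below is a hypothetical data structure: the class name, the field names, the field types and the validation rule are assumptions, and the example values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    scientific_name: str                 # from the scientific naming module
    source_species: list[str]            # classification, broad to narrow
    references: list[str] = field(default_factory=list)  # reference documents
    genome_data: str = ""                # raw sequence text

    def validate(self) -> None:
        """Reject records that lack a name or a species classification."""
        if not self.scientific_name.strip():
            raise ValueError("scientific name is required")
        if not self.source_species:
            raise ValueError("source species classification is required")


record = GeneRecord(
    scientific_name="example gene",
    source_species=["example kingdom", "example species"],
    references=["example reference"],
    genome_data="CGAT",
)
record.validate()
print(record.scientific_name, len(record.genome_data))
```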
Example 4
As shown in fig. 9, the computation container includes a data storage module, an operation stationing module, a centralized deployment module, an operation module, a data processing module, and a node state module, where the data storage module is configured to store data, the operation stationing module is configured to be responsible for Vertical stationing in operation in an Arda framework, the centralized deployment module is configured to be responsible for deployment of Vertical created by the Arda framework in a cluster, the operation module is configured to be responsible for services such as starting and closing of Vertical in operation of the Arda framework, the data processing module is configured to collect, copy, migrate, and output data, and the node state module is configured to be responsible for a state of Vertical in operation in the Arda framework and a state of each node.
In this embodiment, the calculation container pack comprises an infrastructure automation management layer, based on container technology, that is applied to big data clusters; its aim is to simplify the management of automated DevOps deployment. Arda allows users to build a big data system, and a Vertical is created by Arda as the management object, thereby realizing an automated operation and management system for the life cycle management of the big data cluster.
In Arda, for the management of a large data cluster or a virtualization cluster, the unified management is mainly realized by using Vertical as a management object. Creating a big data cluster is to create a Vertical, where a Vertical may contain multiple services, and each service may contain multiple micro-services.
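The containment relationship described above (a Vertical holding services, each service holding micro-services) may be pictured with the following minimal sketch. It is not the Arda implementation: the class names, the `start`, `stop` and `status` methods and the sample service names are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MicroService:
    name: str
    running: bool = False


@dataclass
class Service:
    name: str
    micro_services: List[MicroService] = field(default_factory=list)


@dataclass
class Vertical:
    """Management object standing for one big data cluster (illustrative only)."""
    name: str
    services: List[Service] = field(default_factory=list)

    def _set_running(self, value: bool) -> None:
        for service in self.services:
            for micro in service.micro_services:
                micro.running = value

    def start(self) -> None:
        """Mark every micro-service of every service as running."""
        self._set_running(True)

    def stop(self) -> None:
        """Mark every micro-service of every service as stopped."""
        self._set_running(False)

    def status(self) -> Dict[str, bool]:
        """Map 'service/micro-service' to its running flag."""
        return {
            f"{service.name}/{micro.name}": micro.running
            for service in self.services
            for micro in service.micro_services
        }


# Hypothetical cluster with one service made of two micro-services.
cluster = Vertical(
    name="genomics-cluster",
    services=[Service("storage", [MicroService("metadata"), MicroService("blocks")])],
)
cluster.start()
```

Because the Vertical is the single management object, starting, closing and reporting state all walk the same two-level hierarchy.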
Further, the data storage module is mainly responsible for storing Arda data: information including all association relations of Arda and the states of all components is stored in the group route. At present, data storage is mainly realized through Etcd, but it may also be realized through other database components.
Specifically, the operation stationing module is mainly responsible for stationing the Verticals running in the Arda framework and ensures that each Vertical is kept persistently in the storage system of Arda.
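As a rough sketch of how the state of a Vertical might be kept in a key-value store, the example below uses an in-memory dictionary as a stand-in for Etcd or another database component. The class `InMemoryKV`, the functions `save_vertical` and `load_vertical`, and the `/verticals/` key prefix are all hypothetical; none of them is the Etcd client API or part of Arda.

```python
import json
from typing import Dict, List, Optional


class InMemoryKV:
    """Stand-in for a key-value store such as Etcd; NOT the Etcd client API."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def keys_with_prefix(self, prefix: str) -> List[str]:
        return sorted(k for k in self._data if k.startswith(prefix))


def save_vertical(store: InMemoryKV, name: str, state: dict) -> None:
    """Serialize the state of one Vertical under an assumed '/verticals/' prefix."""
    store.put(f"/verticals/{name}", json.dumps(state, sort_keys=True))


def load_vertical(store: InMemoryKV, name: str) -> Optional[dict]:
    """Return the stored state, or None if the Vertical is unknown."""
    raw = store.get(f"/verticals/{name}")
    return None if raw is None else json.loads(raw)


store = InMemoryKV()
save_vertical(store, "genomics-cluster", {"nodes": {"node-1": "ready"}, "running": True})
restored = load_vertical(store, "genomics-cluster")
```

Hiding the store behind `put` and `get` reflects the statement that Etcd may be replaced by other database components without changing the modules that use it.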
In another aspect, the present invention further provides a life science calculation container pack method using any one of the above life science calculation container pack systems, comprising the following operation steps:
S1, data acquisition: entering the scientific name of the gene sequence through the scientific naming module, recording source species classification information through the source species classification module, recording reference document information through the reference document entry module, and recording genome data through the genome data entry module;
S2, sequence data mining: the DNA sequence is digitally represented by the sequence data representation module; data is trained by the sequence data training module, and a distinguishing mark is set for the training data; the DNA sequence is subjected to dimension reduction calculation by the sequence data dimension reduction module; the distances and included angles of different vectors are calculated by the vector distance included angle calculation module; and the data is identified and classified by the sequence data identification module;
S3, sequence data clustering: measuring the similarity of the genes by the gene similarity measuring module, and carrying out hierarchical clustering on the genes by the hierarchical clustering calculating module;
S4, running the calculation container pack: the data storage module stores data; the operation stationing module stations the Verticals running in the Arda framework; the centralized deployment module deploys the Verticals created by the Arda framework in a cluster; the operation module starts and closes the Verticals running in the Arda framework; the data processing module collects, copies, migrates and outputs data; and the node state module is responsible for the state of the Verticals running in the Arda framework and the state of each node.
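Parts of steps S2 and S3 can be sketched as follows, under stated assumptions. The method does not specify how a DNA sequence is digitally represented, which distance is used, or which linkage the hierarchical clustering applies; the sketch assumes normalized k-mer frequency vectors, Euclidean distance and average linkage, and it omits the training, dimension reduction and identification modules. All function names are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence


def kmer_vector(seq: str, k: int = 2) -> List[float]:
    """Represent a DNA sequence as normalized k-mer frequencies (assumed encoding)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:          # windows containing other symbols are skipped
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]


def distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle(u: Sequence[float], v: Sequence[float]) -> float:
    """Included angle between two non-zero vectors, in radians."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cosine)))


def hierarchical_clusters(vectors: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Average-linkage agglomerative clustering; returns groups of input indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a: List[int], b: List[int]) -> float:
        return sum(distance(vectors[i], vectors[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > max(n_clusters, 1):
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge the closest pair
        del clusters[y]
    return [sorted(c) for c in clusters]


sequences = ["AAAAAAAA", "AAAAAAAT", "GCGCGCGC", "GCGCGCGA"]
vectors = [kmer_vector(s) for s in sequences]
groups = hierarchical_clusters(vectors, n_clusters=2)
```

In this toy input the two A-rich sequences and the two GC-rich sequences end up in separate groups, and the vectors of the first and third sequences share no k-mer, so their included angle is a right angle.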
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and the description merely illustrate preferred embodiments of the invention and are not intended to limit it. The scope of the invention is defined by the appended claims and equivalents thereof.
Claims (9)
1. The life science calculation container pack system comprises a gene data acquisition module, a gene data analysis module and a data calculation container pack, wherein the gene data analysis module comprises a sequence data mining module and a sequence data clustering module, the sequence data mining module is used for mining sequence data, and the sequence data clustering module is used for clustering the sequence data.
2. The life science computing pod system of claim 1, wherein: the sequence data mining module comprises a sequence data representation module, a sequence data training module, a sequence data dimension reduction module, a vector distance included angle calculation module and a sequence data identification module, and the sequence data representation module is used for digitally representing a DNA sequence; the sequence data training module is used for training data and setting a distinguishing mark for the training data; the sequence data dimension reduction module is used for carrying out dimension reduction calculation on the DNA sequence; the vector distance included angle calculation module is used for calculating the distance and included angle of different vectors; the sequence data identification module is used for identifying and classifying data.
3. The life science computing pod system of claim 1, wherein: the sequence data clustering module comprises a static clustering module, a time sequence clustering module or an HMM clustering module.
4. The life science computing pod system of claim 3, wherein: the static clustering module comprises a gene similarity measuring module and a hierarchical clustering calculating module, wherein the gene similarity measuring module is used for measuring the similarity of genes, and the hierarchical clustering calculating module is used for carrying out hierarchical clustering on the genes.
5. The life science computing pod system of claim 3, wherein: the time sequence clustering module comprises an original data transformation module and a selection expression spectrum module, wherein the original data transformation module is used for transforming original data, and the selection expression spectrum module is used for selecting expression spectrums.
6. The life science computing pod system of claim 3, wherein: the HMM clustering module comprises a hidden state set module, an output set module, an initial state probability matrix module, a state transition probability matrix module and an output probability matrix module, wherein the hidden state set module is used for containing hidden data and cannot obtain gene data through direct observation; the output set module outputs gene data which can be obtained through direct observation; the initial state probability matrix module is used for representing the probability of each hidden state at an initial moment; the state transition probability matrix module is used for representing the probability of transition from one state to another state; the output probability matrix module is used for representing the probability of outputting a certain output value under a certain state.
7. The life science computing pod system of claim 5, wherein: the gene data acquisition module comprises a scientific naming module, a source species classification module, a reference document entry module and a genome data entry module, wherein the scientific naming module is used for entering scientific names of gene sequences, and the source species classification module is used for recording source species classification information; the reference document entry module is used for recording reference document information; the genome data entry module is used for recording genome data.
8. The life science computing pod system of claim 1, wherein: the computing container comprises a data storage module, a running and stationing module, a centralized deployment module, an operation module, a data processing module and a node state module, wherein the data storage module is used for storing data, the running and stationing module is used for being responsible for Vertical stationing running in an Arda framework, the centralized deployment module is used for being responsible for deploying Vertical created by the Arda framework in a cluster, the operation module is used for being responsible for starting and closing services of Vertical in the operation of the Arda framework, the data processing module is used for collecting, copying, migrating and outputting data, and the node state module is used for being responsible for the state of the Vertical running in the Arda framework and the state of each node.
9. A life science computing container pack method comprising the life science computing container pack system as claimed in any one of claims 1 to 8, the operational steps of which are as follows:
S1, data acquisition: entering the scientific name of the gene sequence through the scientific naming module, recording source species classification information through the source species classification module, recording reference document information through the reference document entry module, and recording genome data through the genome data entry module;
S2, sequence data mining: the DNA sequence is digitally represented by the sequence data representation module; data is trained by the sequence data training module, and a distinguishing mark is set for the training data; the DNA sequence is subjected to dimension reduction calculation by the sequence data dimension reduction module; the distances and included angles of different vectors are calculated by the vector distance included angle calculation module; and the data is identified and classified by the sequence data identification module;
S3, sequence data clustering: measuring the similarity of the genes by the gene similarity measuring module, and carrying out hierarchical clustering on the genes by the hierarchical clustering calculating module;
S4, running the calculation container pack: the data storage module stores data; the operation stationing module stations the Verticals running in the Arda framework; the centralized deployment module deploys the Verticals created by the Arda framework in a cluster; the operation module starts and closes the Verticals running in the Arda framework; the data processing module collects, copies, migrates and outputs data; and the node state module is responsible for the state of the Verticals running in the Arda framework and the state of each node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910896021.9A CN110718269A (en) | 2019-09-22 | 2019-09-22 | Life science calculation container pack system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110718269A | 2020-01-21 |
Family
ID=69210704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910896021.9A (Pending; published as CN110718269A) | Life science calculation container pack system and method | 2019-09-22 | 2019-09-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110718269A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108920900A (en) * | 2018-06-21 | 2018-11-30 | Fuzhou University | Unsupervised extreme learning machine feature extraction system and method for gene expression profile data |
Non-Patent Citations (3)
Title |
---|
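宋英: "Research on clustering algorithms in data mining technology", Scientific Consulting (Science and Technology Management) * |
李铭轩: "Research on containerized big data applications based on Arda", Telecommunications Technology * |
熊赟: "Research on biological sequence pattern mining and clustering", China Doctoral Dissertations Full-text Database, Information Science and Technology * |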
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Talukder et al. | Interpretation of deep learning in genomics and epigenomics | |
Chang et al. | A robust dynamic niching genetic algorithm with niche migration for automatic clustering problem | |
Olteanu et al. | On-line relational and multiple relational SOM | |
Hassanien et al. | Computational intelligence techniques in bioinformatics | |
CN106991296B (en) | Integrated classification method based on randomized greedy feature selection | |
CN114218292B (en) | Multi-element time sequence similarity retrieval method | |
CN110222745A (en) | A kind of cell type identification method based on similarity-based learning and its enhancing | |
CN112908414B (en) | Large-scale single-cell typing method, system and storage medium | |
CN110110100A (en) | Across the media Hash search methods of discrete supervision decomposed based on Harmonious Matrix | |
Raimundo et al. | Machine learning for single-cell genomics data analysis | |
CN110110739A (en) | A kind of domain self-adaptive reduced-dimensions method based on samples selection | |
CN116630718A (en) | Prototype-based low-disturbance image type incremental learning algorithm | |
CN104966106A (en) | Biological age step-by-step predication method based on support vector machine | |
CN112766400A (en) | Semi-supervised classification integration method for high-dimensional data based on multiple data transformation spaces | |
CN116701979A (en) | Social network data analysis method and system based on limited k-means | |
Noble et al. | Integrating information for protein function prediction | |
CN112668633B (en) | Adaptive graph migration learning method based on fine granularity field | |
CN110718269A (en) | Life science calculation container pack system and method | |
Jian-Xiang et al. | Application of genetic algorithm in document clustering | |
CN112071362B (en) | Method for detecting protein complex fusing global and local topological structures | |
CN115661504A (en) | Remote sensing sample classification method based on transfer learning and visual word package | |
CN114334168A (en) | Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy | |
CN111127184B (en) | Distributed combined credit evaluation method | |
CN114117040A (en) | Text data multi-label classification method based on label specific features and relevance | |
Bhat et al. | OTU clustering: A window to analyse uncultured microbial world |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200121 |