CN110718269A - Life science calculation container pack system and method - Google Patents

Life science calculation container pack system and method

Info

Publication number
CN110718269A
Authority
CN
China
Prior art keywords
module
data
sequence data
sequence
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910896021.9A
Other languages
Chinese (zh)
Inventor
周会群
王玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xinyida Computing Technology Co Ltd
Original Assignee
Nanjing Xinyida Computing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xinyida Computing Technology Co Ltd
Priority to CN201910896021.9A
Publication of CN110718269A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention relates to the technical field of computing container packs, and in particular to a life science calculation container pack system and method. The system comprises a gene data acquisition module, a gene data analysis module and a data calculation container pack. The gene data analysis module comprises a sequence data mining module, which mines sequence data, and a sequence data clustering module, which performs cluster analysis on the sequence data. With this system and method, gene data information can be recorded comprehensively and a complete gene database can be established; the sequence data can be mined in depth so that the important data they contain are extracted; and the sequence data clustering module groups a set of sequence data into several classes of similar objects and analyzes them, yielding knowledge of the inherent structure of the population.

Description

Life science calculation container pack system and method
Technical Field
The invention relates to the technical field of computing container packs, in particular to a life science computing container pack system and a life science computing container pack method.
Background
Life science is the branch of natural science that studies the structure, function, occurrence and development laws of organisms (including plants, animals and microorganisms). It aims to clarify and control life activities, transform nature, and provide practical services for agriculture, industry, medicine and other fields. However, existing life science data storage only stores data: the data cannot be analyzed or clustered, important data in sequence data are difficult to extract, and knowledge of the inherent structure of a population cannot be obtained.
Disclosure of Invention
It is an object of the present invention to provide a life science calculation container pack system and method that address one or more of the deficiencies set forth in the background above.
In order to achieve the above object, in one aspect, the present invention provides a life science calculation container package system, which includes a gene data acquisition module, a gene data analysis module, and a data calculation container package, wherein the gene data analysis module includes a sequence data mining module and a sequence data clustering module, the sequence data mining module is configured to mine sequence data, and the sequence data clustering module is configured to perform cluster analysis on the sequence data.
Preferably, the sequence data mining module comprises a sequence data representation module, a sequence data training module, a sequence data dimension reduction module, a vector distance included angle calculation module and a sequence data identification module, wherein the sequence data representation module is used for digitally representing the DNA sequence; the sequence data training module is used for training data and setting a distinguishing mark for the training data; the sequence data dimension reduction module is used for carrying out dimension reduction calculation on the DNA sequence; the vector distance included angle calculation module is used for calculating the distance and included angle between different vectors; and the sequence data identification module is used for identifying and classifying data.
Preferably, the sequence data clustering module comprises a static clustering module, a time series clustering module or an HMM clustering module.
Preferably, the static clustering module comprises a gene similarity measuring module and a hierarchical clustering calculating module, wherein the gene similarity measuring module is used for measuring the similarity of genes, and the hierarchical clustering calculating module is used for performing hierarchical clustering on the genes.
Preferably, the time series clustering module comprises an original data transformation module and an expression profile selection module, wherein the original data transformation module is used for transforming original data, and the expression profile selection module is used for selecting expression profiles.
Preferably, the HMM clustering module includes a hidden state set module, an output set module, an initial state probability matrix module, a state transition probability matrix module, and an output probability matrix module. The hidden state set module holds the hidden states, which cannot be obtained by direct observation; the output set module holds the output gene data, which can be obtained by direct observation; the initial state probability matrix module is used for representing the probability of each hidden state at the initial moment; the state transition probability matrix module is used for representing the probability of transition from one state to another state; and the output probability matrix module is used for representing the probability of outputting a certain output value in a certain state.
Preferably, the gene data acquisition module comprises a scientific naming module, a source species classification module, a reference document entry module and a genome data entry module, wherein the scientific naming module is used for entering the scientific names of gene sequences; the source species classification module is used for recording source species classification information; the reference document entry module is used for recording reference document information; and the genome data entry module is used for recording genome data.
Preferably, the calculation container pack includes a data storage module, an operation stationing module, a centralized deployment module, an operation module, a data processing module, and a node state module, where the data storage module is configured to store data, the operation stationing module is responsible for stationing the Verticals running in the Arda framework, the centralized deployment module is responsible for deploying the Verticals created by the Arda framework in a cluster, the operation module is responsible for services such as starting and closing Verticals during operation of the Arda framework, the data processing module is configured to collect, copy, migrate, and output data, and the node state module is responsible for the state of the Verticals running in the Arda framework and the state of each node.
In another aspect, the present invention further provides a life science calculation container pack method, which uses any one of the above life science calculation container pack systems and comprises the following operation steps:
S1, data acquisition: entering the scientific names of gene sequences through the scientific naming module, recording source species classification information through the source species classification module, recording reference document information through the reference document entry module, and recording genome data through the genome data entry module;
S2, sequence data mining: digitally representing the DNA sequence through the sequence data representation module, training data and setting a distinguishing mark for the training data through the sequence data training module, performing dimension reduction calculation on the DNA sequence through the sequence data dimension reduction module, calculating the distance and included angle between different vectors through the vector distance included angle calculation module, and identifying and classifying the data through the sequence data identification module;
S3, clustering sequence data: measuring the similarity of genes with the gene similarity measuring module, and performing hierarchical clustering on the genes with the hierarchical clustering calculating module;
S4, operating the calculation container pack: the data storage module stores data, the operation stationing module stations the Verticals running in the Arda framework, the centralized deployment module deploys the Verticals created by the Arda framework in a cluster, the operation module starts and closes the Verticals running in the Arda framework, the data processing module collects, copies, migrates and outputs data, and the node state module is responsible for the state of the Verticals running in the Arda framework and the state of each node.
Compared with the prior art, the invention has the beneficial effects that:
1. In the life science calculation container pack system and method, a gene data acquisition module is adopted to record the scientific naming, source species classification, reference documents and genome data of gene data, so that gene data information can be recorded comprehensively and a complete gene database is established.
2. In the life science calculation container system and method, a sequence data mining module is adopted to finish sequence data representation, sequence data training, sequence data dimension reduction, vector distance included angle calculation and sequence data identification, realize the deep mining of the sequence data and extract important data in the sequence data.
3. In the life science calculation container package system and method, a sequence data clustering module is adopted to group a set of sequence data into a plurality of classes consisting of similar objects, and the classes are analyzed to acquire the knowledge of the inherent structure of the population.
Drawings
FIG. 1 is an overall flow block diagram of the present invention;
FIG. 2 is a diagram of a gene data analysis module according to the present invention;
FIG. 3 is a block diagram of sequence data mining according to the present invention;
FIG. 4 is a diagram of a sequence data clustering module of the present invention;
FIG. 5 is a diagram of a static clustering module of the present invention;
FIG. 6 is a block diagram of a time series clustering module according to the present invention;
FIG. 7 is a diagram of an HMM clustering module according to the present invention;
FIG. 8 is a block diagram of gene data collection according to the present invention;
FIG. 9 is a diagram of a computing container package of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
In one aspect, the present invention provides a life science calculation container package system, as shown in fig. 1 to fig. 3, including a gene data acquisition module, a gene data analysis module and a data calculation container package, wherein the gene data analysis module includes a sequence data mining module and a sequence data clustering module, the sequence data mining module is used for mining sequence data, and the sequence data clustering module is used for performing cluster analysis on the sequence data.
In this embodiment, the sequence data mining module comprises a sequence data representation module, a sequence data training module, a sequence data dimension reduction module, a vector distance included angle calculation module and a sequence data identification module, wherein the sequence data representation module is used for digitally representing a DNA sequence; the sequence data training module is used for training data and setting a distinguishing mark for the training data; the sequence data dimension reduction module is used for carrying out dimension reduction calculation on the DNA sequence; the vector distance included angle calculation module is used for calculating the distance and included angle between different vectors; and the sequence data identification module is used for identifying and classifying data.
The sequence data mining module is designed on the basis of a manifold learning method. Suppose Y is a d-dimensional domain contained in the Euclidean space R^d, and let f: Y → R^D be a smooth embedding, where d < D. Data points {y_i} ⊂ Y are generated by a random process and then mapped by f to form the observed data {x_i = f(y_i)} ⊂ R^D. Y = {y_1, y_2, …, y_N} is called the low-dimensional embedding, y_i (i = 1, 2, …, N) are the low-dimensional embedding vectors, and X = {x_1, x_2, …, x_N} is the given high-dimensional observation data set. Reconstructing f and the low-dimensional vectors y_i from the high-dimensional observed data points x_i is the goal of manifold learning.
The DNA sequence is a sequence formed by combinations of the four base units C, G, A and T, namely cytosine nucleotide C (cytidylic acid, CMP), guanine nucleotide G (guanylic acid, GMP), adenine nucleotide A (adenylic acid, AMP) and thymine nucleotide T (thymidylic acid, TMP). C, G, A, T and O are each mapped to a value G_i (i = 1, 2, …, N) according to a nucleotide conversion formula, so that every base is digitized; here the English letter "O" indicates that no nucleotide is present. The nucleotide conversion formula is:
The sequence data dimension reduction module works as follows. N DNA sequences, each with D base units, are regarded as a D-dimensional sample set X = [x_1, x_2, …, x_N]. Assuming that there are enough data points and that each data point can be linearly represented by its K neighbors, the neighbors are found using ε-neighborhoods: the distance d_ij between each pair x_i and x_j is calculated as follows:
$$d_{ij} = \lVert x_i - x_j \rVert = \sqrt{\sum_{k=1}^{D} \left( x_{ik} - x_{jk} \right)^2}$$
Further, when d_ij < ε, x_j belongs to the neighborhood of x_i and is denoted x_ij; otherwise x_j does not belong to the neighborhood of x_i. For each sample point x_i (i = 1, 2, …, N), the weight vector W_ij (j = 1, 2, …, k) corresponding to the minimum value of Φ(W) is calculated by the Lagrange multiplier method, where
Figure BDA0002210292090000054
Substituting the weight vectors into the function Φ(Y), the low-dimensional embedding Y = {y_1, y_2, …, y_N}, y_i ∈ R^d, for which Φ(Y) takes its minimum value is calculated by the Lagrange multiplier method.
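The procedure described here matches the standard locally linear embedding (LLE) method. As a reading aid, the two cost functions can be written in their usual LLE form; this is an assumption based on the standard method, not a transcription of the formulas in the original filing. Here x_{ij} denotes the j-th neighbor of x_i, y_{ij} the embedded counterpart of that neighbor, and the constraints are the customary LLE normalizations.

```latex
\Phi(W) = \sum_{i=1}^{N} \Bigl\lVert x_i - \sum_{j=1}^{k} W_{ij}\, x_{ij} \Bigr\rVert^2,
\qquad \text{subject to } \sum_{j=1}^{k} W_{ij} = 1

\Phi(Y) = \sum_{i=1}^{N} \Bigl\lVert y_i - \sum_{j=1}^{k} W_{ij}\, y_{ij} \Bigr\rVert^2,
\qquad \text{subject to } \sum_{i=1}^{N} y_i = 0,\quad \frac{1}{N}\sum_{i=1}^{N} y_i y_i^{\mathsf{T}} = I
```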
The Euclidean distance d_ij between two DNA sequence vectors is called the DNA distance similarity parameter; the cosine cos θ_ij of the angle between two DNA sequence vectors is called the DNA angle similarity parameter; the Euclidean distance d_ij' between two low-dimensional embedding vectors is called the DNA low-dimensional distance similarity parameter; and the cosine cos θ_ij' of the angle between two low-dimensional embedding vectors is called the DNA low-dimensional angle similarity parameter. They are computed as follows:
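$$d_{ij} = \lVert x_i - x_j \rVert$$
$$\cos\theta_{ij} = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$$
$$d_{ij}' = \lVert y_i - y_j \rVert$$
$$\cos\theta_{ij}' = \frac{y_i \cdot y_j}{\lVert y_i \rVert \, \lVert y_j \rVert}$$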
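The four similarity parameters reduce to two operations, a Euclidean distance and the cosine of the angle between two vectors, applied either to the high-dimensional sequence vectors or to their low-dimensional embeddings. A minimal Python sketch follows; the function names `dna_distance` and `dna_angle_cosine` are illustrative and do not come from the source.

```python
import math


def dna_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dna_angle_cosine(x, y):
    """Cosine of the angle between two equal-length, non-zero numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero vector")
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (norm_x * norm_y)


# The same two functions serve both the original vectors x_i and the
# low-dimensional embedding vectors y_i.
x_i = [1.0, 2.0, 3.0, 4.0]
x_j = [1.0, 2.0, 4.0, 4.0]
d_ij = dna_distance(x_i, x_j)
cos_ij = dna_angle_cosine(x_i, x_j)
```

A smaller distance and a cosine closer to 1 both indicate more similar sequences.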
The DNA sequence data mining method based on manifold learning comprises the following steps:
1) digitally representing the DNA sequences;
2) extracting a part of the data as training data and setting a distinguishing mark for the training data;
3) taking the training data and the samples to be tested together as a data set X and performing the DNA sequence dimension reduction calculation to obtain the various low-dimensional embeddings Y;
4) calculating the distances and included angles between different vectors in the low-dimensional embedding Y to obtain the DNA low-dimensional distance similarity parameter d_ij' and the DNA low-dimensional angle similarity parameter cos θ_ij';
5) identifying and classifying according to the data obtained in step 4).
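The steps above can be illustrated with a simplified Python sketch. Several things in it are assumptions: the numeric codes in `BASE_CODE` are hypothetical (the source's conversion formula is not reproduced here), the dimension reduction of step 3) is omitted for brevity, and the nearest-labelled-neighbor rule in `classify` is just one plausible way to carry out step 5), which the source does not specify.

```python
import math

# Hypothetical numeric codes for the bases; "O" stands for "no nucleotide".
BASE_CODE = {"C": 1.0, "G": 2.0, "A": 3.0, "T": 4.0, "O": 0.0}


def encode(seq, length):
    """Step 1: map a DNA string to a fixed-length numeric vector, padding with 'O'."""
    padded = seq.upper().ljust(length, "O")[:length]
    return [BASE_CODE[base] for base in padded]


def euclidean(x, y):
    """Distance similarity parameter between two encoded sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify(sample, training):
    """Steps 4-5: return the mark of the labelled training vector nearest to the sample."""
    label, _ = min(
        ((mark, euclidean(sample, vector)) for mark, vector in training),
        key=lambda pair: pair[1],
    )
    return label


# Step 2: training data with distinguishing marks (toy sequences, not real data).
training = [
    ("class_a", encode("ACGTACGT", 8)),
    ("class_b", encode("TTTTGGGG", 8)),
]
sample = encode("ACGTACGA", 8)
predicted = classify(sample, training)
```

In a fuller version, the encoded vectors would first be passed through the dimension reduction step and the distances computed on the resulting low-dimensional vectors.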
Example 2
Referring to fig. 4-7, the sequence data clustering module includes a static clustering module, a time series clustering module, or an HMM clustering module.
In this embodiment, the static clustering module includes a gene similarity measurement module and a hierarchical clustering calculation module, the gene similarity measurement module is used for measuring the similarity of genes, and the hierarchical clustering calculation module is used for performing hierarchical clustering on genes.
Further, for hierarchical clustering, the similarity of two genes is measured first; this similarity is in fact a correlation coefficient. Let G_i denote the observed value of gene G under condition i. For any two genes X and Y, the similarity score is defined by the following formula:

$$S(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - X_{\mathrm{offset}}}{\Phi_X} \right) \left( \frac{Y_i - Y_{\mathrm{offset}}}{\Phi_Y} \right)$$

where

$$\Phi_G = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( G_i - G_{\mathrm{offset}} \right)^2}$$

Here N is the number of conditions, G stands for X or Y, and G_offset is the mean of the observed values of G, so that Φ_G is the standard deviation of G and S(X, Y) is the Pearson correlation coefficient of X and Y. Hierarchical clustering generates a tree diagram (dendrogram) in which all elements are gathered into a single tree. The algorithm for hierarchical clustering of genes comprises the following steps:
1) calculating the similarity score of each pair of genes to obtain an upper triangular similarity matrix, in which each entry is the similarity score of one pair of genes;
2) scanning the similarity matrix to find its maximum value, which identifies the pair of genes with the highest similarity;
3) combining these two genes into one node, which becomes the parent node of the two genes, and calculating the expression profile of this node, so that the two genes are replaced by a single gene;
4) repeating steps 1) to 3) until only one element, the root node, is left.
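A compact Python sketch of this agglomerative procedure is given below. It is illustrative only: the source does not say how the expression profile of a merged node is computed, so the element-wise average used here is an assumption, as are the names `pearson` and `hierarchical_cluster` and the toy profiles.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (population standard deviation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n * std_x * std_y)


def hierarchical_cluster(profiles):
    """Build a tree (nested tuples) from a non-empty dict of gene name -> profile."""
    nodes = [(name, list(values)) for name, values in profiles.items()]
    while len(nodes) > 1:
        # Steps 1-2: score every pair and keep the most similar one.
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                score = pearson(nodes[i][1], nodes[j][1])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        (tree_i, prof_i), (tree_j, prof_j) = nodes[i], nodes[j]
        # Step 3: merge into a parent node; averaging the profiles is an assumption.
        merged = ((tree_i, tree_j), [(a + b) / 2 for a, b in zip(prof_i, prof_j)])
        nodes = [node for k, node in enumerate(nodes) if k not in (i, j)] + [merged]
    # Step 4: the single remaining node is the root.
    return nodes[0][0]


profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
tree = hierarchical_cluster(profiles)
```

The two rising profiles g1 and g2 are merged first, and the falling profile g3 joins them at the root.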
Specifically, the time series clustering module comprises an original data transformation module and an expression profile selection module, wherein the original data transformation module is used for transforming original data, and the expression profile selection module is used for selecting expression profiles.
First, the original data are log-transformed: the expression level at each time point is divided by the expression level at the first time point, and the logarithm of this ratio is taken. After the transformation, the value at the first time point is always 0, and the values at later time points generally lie between -2 and 2. Because these are logarithms, a value greater than 4 means that the expression level at that time point is far larger than at the first time point, which is likely to be an error; for the robustness of the algorithm, such genes can be screened out. Since the first time point serves as the reference, the expression level of a gene at the first time point must not be too small.
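The transformation and the screening rule can be sketched in Python as follows. The base of the logarithm is not stated in the source, so base 2 is assumed here; the threshold `min_reference` for "not too small" is likewise a hypothetical value chosen for illustration.

```python
import math


def log_ratio_transform(series):
    """Log2 of each time point's expression relative to the first time point."""
    first = series[0]
    return [math.log2(value / first) for value in series]


def keep_gene(series, max_log_ratio=4.0, min_reference=1.0):
    """Screen out genes with an unreliable reference point or an implausibly large ratio."""
    if series[0] < min_reference:
        return False  # reference expression too small (hypothetical threshold)
    if any(value <= 0 for value in series):
        return False  # the logarithm is undefined for non-positive expression
    return all(ratio <= max_log_ratio for ratio in log_ratio_transform(series))


expression = [10.0, 20.0, 5.0, 40.0]
transformed = log_ratio_transform(expression)
```

With these toy values the transformed series starts at 0 and the later points fall inside the typical range of -2 to 2.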
Wherein, the algorithm steps of the expression profile selection module are as follows:
1) let P_1 be the expression profile that is always adjusted up or down by one unit between successive time points, and let R = {P_1} and L = P \ {P_1}, where P is the set of candidate expression profiles; this selects an initial typical expression profile;
2) choose p ∈ L such that
Figure BDA0002210292090000081
is maximal, and set R = R ∪ {p} and L = L \ {p}; this selects the expression profile that differs most from all the expression profiles already selected;
3) repeat step 2) a total of m-1 times, so that m expression profiles are selected in total.
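The greedy selection can be sketched in Python under some stated assumptions. The criterion maximized in step 2) is not legible in this text, so the sketch assumes the common farthest-point rule (pick the profile whose minimum distance to the already selected profiles is largest). It also assumes Euclidean distance, candidate profiles that start at 0 and change by at most `c` units per step, and a first profile that rises by one unit at every step. All names are illustrative.

```python
import itertools
import math


def candidate_profiles(n_points, c=1):
    """All profiles that start at 0 and change by an integer in [-c, c] per step."""
    profiles = []
    for deltas in itertools.product(range(-c, c + 1), repeat=n_points - 1):
        profile = [0]
        for delta in deltas:
            profile.append(profile[-1] + delta)
        profiles.append(tuple(profile))
    return profiles


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def select_profiles(candidates, m, dist=euclidean):
    """Greedily pick m mutually distinct profiles (farthest-point rule, an assumption)."""
    first = tuple(range(len(candidates[0])))  # P_1: up one unit at every step
    selected = [first]                        # the set R
    remaining = [p for p in candidates if p != first]  # the set L
    while len(selected) < m and remaining:
        # Step 2: the profile whose nearest selected profile is farthest away.
        best = max(remaining, key=lambda p: min(dist(p, q) for q in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


chosen = select_profiles(candidate_profiles(3), m=3)
```

For three time points this picks the steadily rising profile, then the steadily falling one, then the flat one.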
It is worth noting that the HMM clustering module comprises a hidden state set module, an output set module, an initial state probability matrix module, a state transition probability matrix module and an output probability matrix module. The hidden state set module holds the hidden states, which cannot be obtained by direct observation; the output set module holds the output gene data, which can be obtained by direct observation; the initial state probability matrix module represents the probability of each hidden state at the initial moment; the state transition probability matrix module represents the probability of transition from one state to another state; and the output probability matrix module represents the probability of outputting a certain output value in a certain state.
The general idea of clustering genes with the HMM method is as follows: given n sequences S_i with index set I = {1, 2, …, n} and a specified integer K, compute a partition C = (C_1, C_2, …, C_K) of I and K HMM models L_1, L_2, …, L_K such that the objective function takes its maximum value:
Figure BDA0002210292090000082
where L(S_i | L_k) is the likelihood function, i.e. the probability density of generating the sequence S_i under the model L_k. The HMM-based clustering algorithm comprises the following steps:
1) give K initial HMM models
Figure BDA0002210292090000083
2) generate a new assignment: assign each sequence S_i to the model k for which its likelihood function value is largest;
3) based on the initial parameters
Figure BDA0002210292090000091
and the sequence data assigned to each model, re-estimate the parameters of the K models
Figure BDA0002210292090000092
4) terminate the algorithm when the change in the value of the objective function is small enough, the assignment no longer changes, or the preset maximum number of iterations is reached.
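The assignment step 2) can be sketched in Python with a scaled forward algorithm that evaluates the log-likelihood of a discrete observation sequence under each model. This is a sketch under assumptions: each model is represented as a tuple (pi, A, B) of initial, transition and output probabilities, the objective is taken to be the sum of log-likelihoods over the assignment, and the re-estimation of step 3) (for example by Baum-Welch) is left out. All names and the toy parameters are illustrative.

```python
import math


def log_likelihood(seq, pi, A, B):
    """Scaled forward algorithm: log P(seq | model) for a non-empty discrete sequence."""
    n_states = len(pi)
    alpha = [pi[s] * B[s][seq[0]] for s in range(n_states)]
    log_p = 0.0
    for t in range(len(seq)):
        if t > 0:
            alpha = [
                sum(alpha[r] * A[r][s] for r in range(n_states)) * B[s][seq[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


def assign(sequences, models):
    """Step 2: give each sequence the index of the model with the largest likelihood."""
    return [
        max(range(len(models)), key=lambda k: log_likelihood(seq, *models[k]))
        for seq in sequences
    ]


def objective(sequences, models, assignment):
    """Assumed objective: total log-likelihood of the sequences under their models."""
    return sum(log_likelihood(seq, *models[k]) for seq, k in zip(sequences, assignment))


# Two toy 2-state models over the symbols {0, 1}: one favors 0, the other favors 1.
A = [[0.9, 0.1], [0.1, 0.9]]
model_low = ([0.5, 0.5], A, [[0.9, 0.1], [0.8, 0.2]])
model_high = ([0.5, 0.5], A, [[0.1, 0.9], [0.2, 0.8]])
sequences = [[0, 0, 0, 1, 0], [1, 1, 0, 1, 1]]
assignment = assign(sequences, [model_low, model_high])
```

A full clustering loop would alternate this assignment with parameter re-estimation and stop on the conditions listed in step 4).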
Example 3
As shown in fig. 8, the gene data acquisition module includes a scientific naming module, a source species classification module, a reference document entry module and a genome data entry module. The scientific naming module is used for entering the scientific name of a gene sequence; the source species classification module is used for recording source species classification information; the reference document entry module is used for recording reference document information; and the genome data entry module is used for recording genome data.
In this embodiment, the scientific name of the gene sequence is entered through the scientific naming module, the source species classification information is recorded through the source species classification module, the reference document information is recorded through the reference document entry module, and the genome data is recorded through the genome data entry module.
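One way to picture the record produced by these four modules is a small data class with one field per module. The Python sketch below is illustrative only: the class name `GeneRecord`, its field names and the placeholder values are assumptions, not part of the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeneRecord:
    """One acquired gene entry; each field corresponds to one acquisition sub-module."""
    scientific_name: str                                              # scientific naming module
    source_classification: List[str] = field(default_factory=list)   # source species classification module
    references: List[str] = field(default_factory=list)              # reference document entry module
    genome_data: str = ""                                             # genome data entry module

    def is_complete(self) -> bool:
        """True when every sub-module has contributed some information."""
        return bool(
            self.scientific_name
            and self.source_classification
            and self.references
            and self.genome_data
        )


# Placeholder values for illustration, not real data.
record = GeneRecord(
    scientific_name="example gene name",
    source_classification=["example kingdom", "example genus", "example species"],
    references=["example reference"],
    genome_data="ACGTACGT",
)
```

A completeness check of this kind is one simple way to support the stated goal of recording gene data comprehensively.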
Example 4
As shown in fig. 9, the calculation container pack includes a data storage module, an operation stationing module, a centralized deployment module, an operation module, a data processing module, and a node state module, where the data storage module is configured to store data, the operation stationing module is responsible for stationing the Verticals running in the Arda framework, the centralized deployment module is responsible for deploying the Verticals created by the Arda framework in a cluster, the operation module is responsible for services such as starting and closing Verticals during operation of the Arda framework, the data processing module is configured to collect, copy, migrate, and output data, and the node state module is responsible for the state of the Verticals running in the Arda framework and the state of each node.
In this embodiment, the calculation container pack comprises an infrastructure automation management layer for big data clusters based on container technology. Its aim is to simplify the management of automated DevOps deployment: Arda allows users to build a big data system, and a Vertical is created by Arda as the management object, thereby realizing an automated operation and management system for the life cycle management of big data clusters.
In Arda, the management of a big data cluster or a virtualization cluster is unified mainly by using the Vertical as the management object. Creating a big data cluster means creating a Vertical; a Vertical may contain multiple services, and each service may contain multiple micro-services.
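The containment hierarchy described here (a Vertical holds services, and each service holds micro-services) can be sketched as plain Python data classes. This is not Arda's actual API: the class names, fields and the `start`/`stop` methods are hypothetical and only mirror the relationships and the start/close services named in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MicroService:
    name: str


@dataclass
class Service:
    name: str
    micro_services: List[MicroService] = field(default_factory=list)


@dataclass
class Vertical:
    """Hypothetical management object standing for one big data cluster."""
    name: str
    services: List[Service] = field(default_factory=list)
    running: bool = False

    def start(self) -> None:
        """Mirror of the operation module's 'start' service."""
        self.running = True

    def stop(self) -> None:
        """Mirror of the operation module's 'close' service."""
        self.running = False

    def micro_service_count(self) -> int:
        return sum(len(service.micro_services) for service in self.services)


# Illustrative names only.
cluster = Vertical(
    name="example-cluster",
    services=[
        Service("storage", [MicroService("metadata"), MicroService("blocks")]),
        Service("compute", [MicroService("scheduler")]),
    ],
)
cluster.start()
```

Treating the whole cluster as one object is what lets the other modules (deployment, stationing, node state) operate on a single handle.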
Further, the data storage module is mainly responsible for storing Arda data; information including all association relations of Arda and the states of all components is stored in the group route. At present, data storage is mainly implemented with etcd, but it may also be implemented with other database components.
Specifically, the operation stationing module is mainly responsible for stationing the Verticals running in the Arda framework, ensuring that a Vertical can run persistently in the storage system of Arda.
In another aspect, the present invention further provides a life science calculation container pack method, which uses any one of the above life science calculation container pack systems and comprises the following operation steps:
S1, data acquisition: entering the scientific names of gene sequences through the scientific naming module, recording source species classification information through the source species classification module, recording reference document information through the reference document entry module, and recording genome data through the genome data entry module;
S2, sequence data mining: digitally representing the DNA sequence through the sequence data representation module, training data and setting a distinguishing mark for the training data through the sequence data training module, performing dimension reduction calculation on the DNA sequence through the sequence data dimension reduction module, calculating the distance and included angle between different vectors through the vector distance included angle calculation module, and identifying and classifying the data through the sequence data identification module;
S3, clustering sequence data: measuring the similarity of genes with the gene similarity measuring module, and performing hierarchical clustering on the genes with the hierarchical clustering calculating module;
S4, operating the calculation container pack: the data storage module stores data, the operation stationing module stations the Verticals running in the Arda framework, the centralized deployment module deploys the Verticals created by the Arda framework in a cluster, the operation module starts and closes the Verticals running in the Arda framework, the data processing module collects, copies, migrates and outputs data, and the node state module is responsible for the state of the Verticals running in the Arda framework and the state of each node.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (9)

1. A life science calculation container pack system, comprising a gene data acquisition module, a gene data analysis module and a data calculation container pack, wherein the gene data analysis module comprises a sequence data mining module and a sequence data clustering module, the sequence data mining module is used for mining sequence data, and the sequence data clustering module is used for clustering the sequence data.
2. The life science calculation container pack system of claim 1, wherein: the sequence data mining module comprises a sequence data representation module, a sequence data training module, a sequence data dimension reduction module, a vector distance included angle calculation module and a sequence data identification module; the sequence data representation module is used for digitally representing a DNA sequence; the sequence data training module is used for training data and setting a distinguishing mark for the training data; the sequence data dimension reduction module is used for performing dimensionality reduction on the DNA sequence; the vector distance included angle calculation module is used for calculating the distances and included angles between different vectors; and the sequence data identification module is used for identifying and classifying data.
3. The life science calculation container pack system of claim 1, wherein: the sequence data clustering module comprises a static clustering module, a time sequence clustering module or an HMM clustering module.
4. The life science calculation container pack system of claim 3, wherein: the static clustering module comprises a gene similarity measuring module and a hierarchical clustering calculating module, wherein the gene similarity measuring module is used for measuring the similarity of genes, and the hierarchical clustering calculating module is used for performing hierarchical clustering of the genes.
5. The life science calculation container pack system of claim 3, wherein: the time sequence clustering module comprises an original data transformation module and an expression profile selection module, wherein the original data transformation module is used for transforming original data, and the expression profile selection module is used for selecting expression profiles.
6. The life science calculation container pack system of claim 3, wherein: the HMM clustering module comprises a hidden state set module, an output set module, an initial state probability matrix module, a state transition probability matrix module and an output probability matrix module; the hidden state set module contains the hidden data, namely gene data that cannot be obtained through direct observation; the output set module outputs gene data that can be obtained through direct observation; the initial state probability matrix module is used for representing the probability of each hidden state at the initial moment; the state transition probability matrix module is used for representing the probability of transition from one state to another state; and the output probability matrix module is used for representing the probability of outputting a given output value in a given state.
7. The life science calculation container pack system of claim 5, wherein: the gene data acquisition module comprises a scientific naming module, a source species classification module, a reference document entry module and a genome data entry module; the scientific naming module is used for entering scientific names of gene sequences; the source species classification module is used for recording source species classification information; the reference document entry module is used for recording reference document information; and the genome data entry module is used for recording genome data.
8. The life science calculation container pack system of claim 1, wherein: the data calculation container pack comprises a data storage module, a running and stationing module, a centralized deployment module, an operation module, a data processing module and a node state module; the data storage module is used for storing data; the running and stationing module is responsible for stationing the Vertical running in the Arda framework; the centralized deployment module is responsible for deploying the Vertical created by the Arda framework in a cluster; the operation module is responsible for starting and closing the services of the Vertical running in the Arda framework; the data processing module is used for collecting, copying, migrating and outputting data; and the node state module is responsible for the state of the Vertical running in the Arda framework and the state of each node.
9. A life science calculation container pack method using the life science calculation container pack system as claimed in any one of claims 1 to 8, comprising the following operating steps:
S1, data acquisition: entering the scientific names of gene sequences through the scientific naming module, recording source species classification information through the source species classification module, recording reference document information through the reference document entry module, and recording genome data through the genome data entry module;
S2, sequence data mining: digitally representing the DNA sequence through the sequence data representation module, training the data and setting a distinguishing mark for the training data through the sequence data training module, performing dimensionality reduction on the DNA sequence through the sequence data dimension reduction module, calculating the distances and included angles between different vectors through the vector distance included angle calculation module, and identifying and classifying the data through the sequence data identification module;
S3, sequence data clustering: measuring the similarity of genes through the gene similarity measuring module, and performing hierarchical clustering of the genes through the hierarchical clustering calculating module;
S4, operating the calculation container pack: storing data through the data storage module; stationing the Vertical running in the Arda framework through the running and stationing module; deploying the Vertical created by the Arda framework in a cluster through the centralized deployment module; starting and closing the services of the Vertical running in the Arda framework through the operation module; collecting, copying, migrating and outputting data through the data processing module; and managing the state of the Vertical running in the Arda framework and the state of each node through the node state module.
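By way of non-limiting illustration of the five HMM elements recited in claim 6 (hidden state set, output set, initial state probabilities, state transition probabilities and output probabilities), the following Python sketch stores them in one structure and scores an observed sequence with the standard forward algorithm. The claims do not state how the HMM clustering module assigns data to clusters; assigning a sequence to the model with the highest likelihood is an assumption made here, and the state names, output symbols and probabilities are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HMM:
    states: list      # hidden state set (not directly observable)
    outputs: list     # output set (directly observable values)
    initial: dict     # initial state probabilities: state -> P(state at the initial moment)
    transition: dict  # state transition probabilities: state -> {next state -> P}
    emission: dict    # output probabilities: state -> {output value -> P}


def forward_likelihood(model, observed):
    """Probability of an observed output sequence under the model (forward algorithm)."""
    if not observed:
        return 1.0
    alpha = {s: model.initial[s] * model.emission[s][observed[0]] for s in model.states}
    for symbol in observed[1:]:
        alpha = {
            s: sum(alpha[r] * model.transition[r][s] for r in model.states)
            * model.emission[s][symbol]
            for s in model.states
        }
    return sum(alpha.values())


def assign_cluster(models, observed):
    """Index of the model that gives the observed sequence the highest likelihood (assumed rule)."""
    scores = [forward_likelihood(m, observed) for m in models]
    return scores.index(max(scores))


# Hypothetical toy models; every number below is made up for illustration.
model_a = HMM(
    states=["active", "silent"],
    outputs=["up", "down"],
    initial={"active": 0.6, "silent": 0.4},
    transition={"active": {"active": 0.8, "silent": 0.2},
                "silent": {"active": 0.3, "silent": 0.7}},
    emission={"active": {"up": 0.7, "down": 0.3},
              "silent": {"up": 0.2, "down": 0.8}},
)
model_b = HMM(
    states=["active", "silent"],
    outputs=["up", "down"],
    initial={"active": 0.2, "silent": 0.8},
    transition=model_a.transition,
    emission=model_a.emission,
)

profile = ["up", "up", "up"]
cluster_index = assign_cluster([model_a, model_b], profile)
```

Here `model_a` is more likely to start in the hypothetical "active" state, which mostly emits "up", so the all-"up" profile is assigned to it. In practice the model parameters would be estimated from data rather than written by hand.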
Application CN201910896021.9A, priority date 2019-09-22, filing date 2019-09-22: Life science calculation container pack system and method (Pending), published as CN110718269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910896021.9A CN110718269A (en) 2019-09-22 2019-09-22 Life science calculation container pack system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910896021.9A CN110718269A (en) 2019-09-22 2019-09-22 Life science calculation container pack system and method

Publications (1)

Publication Number Publication Date
CN110718269A (en) 2020-01-21

Family

ID=69210704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910896021.9A Pending CN110718269A (en) 2019-09-22 2019-09-22 Life science calculation container pack system and method

Country Status (1)

Country Link
CN (1) CN110718269A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108920900A (en) * 2018-06-21 2018-11-30 Fuzhou University Unsupervised extreme learning machine feature extraction system and method for gene expression profile data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
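Song Ying (宋英): "Research on clustering algorithms in data mining technology", Science Consulting (Science and Technology Management) *
Li Mingxuan (李铭轩): "Research on containerized big data applications based on Arda", Telecommunications Technology *
Xiong Yun (熊赟): "Research on biological sequence pattern mining and clustering", China Excellent Doctoral Dissertations Full-text Database, Information Science and Technology Series *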

Similar Documents

Publication Publication Date Title
Talukder et al. Interpretation of deep learning in genomics and epigenomics
Chang et al. A robust dynamic niching genetic algorithm with niche migration for automatic clustering problem
Olteanu et al. On-line relational and multiple relational SOM
Hassanien et al. Computational intelligence techniques in bioinformatics
CN106991296B (en) Integrated classification method based on randomized greedy feature selection
CN114218292B (en) Multi-element time sequence similarity retrieval method
CN110222745A (en) A kind of cell type identification method based on similarity-based learning and its enhancing
CN112908414B (en) Large-scale single-cell typing method, system and storage medium
CN110110100A (en) Across the media Hash search methods of discrete supervision decomposed based on Harmonious Matrix
Raimundo et al. Machine learning for single-cell genomics data analysis
CN110110739A (en) A kind of domain self-adaptive reduced-dimensions method based on samples selection
CN116630718A (en) Prototype-based low-disturbance image type incremental learning algorithm
CN104966106A (en) Biological age step-by-step predication method based on support vector machine
CN112766400A (en) Semi-supervised classification integration method for high-dimensional data based on multiple data transformation spaces
CN116701979A (en) Social network data analysis method and system based on limited k-means
Noble et al. Integrating information for protein function prediction
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
CN110718269A (en) Life science calculation container pack system and method
Jian-Xiang et al. Application of genetic algorithm in document clustering
CN112071362B (en) Method for detecting protein complex fusing global and local topological structures
CN115661504A (en) Remote sensing sample classification method based on transfer learning and visual word package
CN114334168A (en) Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy
CN111127184B (en) Distributed combined credit evaluation method
CN114117040A (en) Text data multi-label classification method based on label specific features and relevance
Bhat et al. OTU clustering: A window to analyse uncultured microbial world

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200121