CN114512185B - Donkey population natural selection classification system for variable data dimension reduction input - Google Patents

Info

Publication number
CN114512185B
CN114512185B CN202210038022.1A CN202210038022A CN114512185B CN 114512185 B CN114512185 B CN 114512185B CN 202210038022 A CN202210038022 A CN 202210038022A CN 114512185 B CN114512185 B CN 114512185B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210038022.1A
Other languages
Chinese (zh)
Other versions
CN114512185A (en)
Inventor
彭绍亮
黄浩
刘凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-01-13
Filing date
2022-01-13
Publication date
2024-04-05
Application filed by Hunan University
Priority to CN202210038022.1A
Publication of CN114512185A
Application granted
Publication of CN114512185B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Abstract

The invention belongs to the field of bioinformatics data mining and discloses a donkey population natural selection classification system for variable data dimension reduction input. The system comprises an input module, a donkey genome sequence data processing module, and a classification module. The input module acquires donkey genome sequence data. The donkey genome sequence data processing module comprises a donkey genome sequence data preprocessing unit and a donkey genome sequence data fusion unit, which process the donkey genome sequence data acquired by the input module and convert it into mutation site fusion data. The classification module comprises a model construction unit and a model prediction unit; the model construction unit constructs a natural selection classification model using a convolutional neural network, the mutation site fusion data serves as a dimension-reduced input, and natural selection classification is then performed on donkey populations. By mining the genome data of donkey populations, the system can analyze how natural selection has affected them, and the model has few parameters and high accuracy.

Description

Donkey population natural selection classification system for variable data dimension reduction input
Technical Field
The invention relates to the field of genome data mining of biological populations, in particular to a natural selection classification system based on a neural network.
Background
Population genetics is the branch of life science that studies the genetic characteristics and genetic laws of biological populations. In agricultural production, it has great economic value for pest and disease management and for selective breeding; in medicine, it contributes greatly to understanding how diseases spread; and it is of great scientific significance for protecting and studying biodiversity.
At present, several natural selection classification systems for Drosophila and fish have appeared in China and abroad. These systems process the mutation site position information and the variation matrix data separately, forming a multi-input network, and incorporate the position information through a fully connected network. This leads to an excessive number of parameters and to overfitting, which reduces model accuracy and makes the systems difficult to apply in production practice. In addition, there are few natural selection classification systems for donkey populations, and the existing ones do not solve the problem well. These problems limit the efficiency of donkey agricultural production and selective breeding, and hinder the scientific rearing of donkeys.
Disclosure of Invention
To address the difficult training and low accuracy caused by oversized neural networks applied to natural selection classification, and the scarcity of research in China and abroad on natural selection in donkey populations, the invention provides a donkey population natural selection classification system for variable data dimension reduction input. The system is intended to explore how donkey populations are affected by natural selection, to improve model accuracy, and to be applicable to production practice. It outputs a natural selection classification result from mutation site fusion data and a neural network, using a convolutional neural network (CNN) to make the natural selection judgment.
The technical scheme adopted by the invention is as follows:
a donkey population natural selection classification system for variable data dimension reduction input comprises: an input module, a donkey genome sequence data processing module, and a classification module;
the input module is used for acquiring donkey genome sequence data;
the donkey genome sequence data processing module is connected with the input module and is used for processing the genome sequence data acquired by the input module and outputting mutation site position data and variation matrix data of the donkey genome sequence;
the donkey genome sequence data processing module comprises: a donkey genome sequence data preprocessing unit and a donkey genome sequence data fusion unit;
the donkey genome sequence data preprocessing unit is used for dividing and cleaning donkey genome sequence data;
the donkey genome sequence data preprocessing unit comprises: a donkey genome sequence data slice divider, a mutation site position calculator, a donkey genome sequence data converter, and a donkey genome sequence data cleaner;
the donkey genome sequence data slice divider is used for dividing a donkey genome sequence into a plurality of fragments of equal size;
the mutation site position calculator is used for calculating the relative position of each mutation site of a donkey genome fragment within that fragment;
the donkey genome sequence data converter is used for converting the divided donkey genome fragment data into a binary (0/1) data matrix, namely the variation matrix data, wherein 0 represents the ancestral gene and 1 represents the variant gene;
the donkey genome sequence data cleaner is used for deleting data that are too short or too long and for merging data of repeated sites, the merged result being obtained by an OR operation on the repeated site data;
the donkey genome sequence data fusion unit is used for carrying out operation on the mutation site position data and the mutation matrix data to generate mutation site fusion data, and the specific operation process is as follows:
$$M_{ij} = H_{ij} \cdot \mathrm{pos}_j$$
wherein $H_{ij}$ is the element in row $i$ and column $j$ of the variation matrix, $\mathrm{pos}_j$ is the relative position of the $j$-th mutation site within the fragment interval, $\cdot$ denotes multiplication, and $M_{ij}$ is the element in row $i$ and column $j$ of the mutation site fusion data;
the classification module comprises: a donkey genome sequence model construction unit and a donkey genome sequence model classification unit;
the donkey genome sequence model construction unit uses a convolutional neural network to construct the natural selection model, and the model is built in the following order: an Input layer, a CNN layer, and a Dropout layer are called to build the classification model; the CNN layer is used for learning vector representations, the Dropout layer is used for preventing the model from overfitting, the weights are then adjusted according to each feature, and finally the weights and the feature value vector are multiplied and summed to produce the output;
the calculation process of the CNN layer is as follows:
$$V = \operatorname{conv}(W, X) + b$$
wherein $W$ is the weight matrix, $X$ is the mutation site fusion data, $b$ is the bias, $\operatorname{conv}$ is the convolution function, $V$ is the output of the convolution function, and $Y$ is the output obtained by applying the activation function to $V$;
the donkey genome sequence model classification unit inputs the mutation site fusion data obtained by the donkey genome sequence data processing module into the model constructed by the donkey genome sequence model construction unit, trains the model with the training set data, and reads the test set into the trained model for natural selection classification; the natural selection prediction and classification process is performed in the following order: (1) reading in the mutation site fusion data output by the donkey genome sequence data processing module and dividing the data into a training set and a test set at a ratio of 8:2; (2) encoding the discrete data with one-hot encoding to obtain a vector representation of the mutation site data; (3) inputting the training set data, converted into the vector representation, into the model for training; and (4) reading the test set data into the trained model for natural selection classification.
Compared with the prior art, the invention has the following beneficial effects:
The donkey population natural selection classification system for variable data dimension reduction input can analyze and classify natural selection in donkey populations. The mutation site fusion data reduces the input size and therefore the number of neural network parameters, and the neural network automatically extracts and analyzes the mutation site fusion data of the donkey population to give accurate and reliable results. The system has application value for analyzing the natural selection acting on the mutation sites of donkey populations, and for donkey agricultural production, selective breeding, and scientific rearing.
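The size reduction described above can be illustrated with simple arithmetic. The sketch below is only an illustration under stated assumptions: the function names, the example dimensions (100 rows, 1000 mutation sites), and the 64-unit fully connected branch standing in for the prior-art position branch are all hypothetical and do not come from the patent.

```python
def input_sizes(n_rows: int, n_sites: int) -> tuple:
    """Count input values for a two-input design versus a single fused input."""
    separate = n_rows * n_sites + n_sites  # variation matrix plus position vector
    fused = n_rows * n_sites               # one fused matrix, positions folded in
    return separate, fused


def dense_branch_parameters(n_sites: int, units: int) -> int:
    """Weights plus biases of one fully connected layer over the position vector."""
    return n_sites * units + units


separate, fused = input_sizes(n_rows=100, n_sites=1000)
print(separate, fused)                    # 101000 100000
print(dense_branch_parameters(1000, 64))  # 64064
```

Under these assumed sizes, a model that consumes only the fused matrix does not need the separate position branch and its parameters; the actual saving depends on the architecture being compared.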
Drawings
Fig. 1 shows the donkey population natural selection classification system for variable data dimension reduction input.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Fig. 1 shows a donkey population natural selection classification system with variable data dimension reduction input according to an embodiment of the invention.
Referring to Fig. 1, the donkey population natural selection classification system for variable data dimension reduction input provided by an embodiment of the invention comprises: an input module, a donkey genome sequence data processing module, and a classification module;
the input module is used for acquiring donkey genome sequence data;
the donkey genome sequence data processing module is connected with the input module and is used for processing the genome sequence data acquired by the input module and outputting mutation site position data and variation matrix data of the donkey genome sequence;
the donkey genome sequence data processing module comprises: a donkey genome sequence data preprocessing unit and a donkey genome sequence data fusion unit;
the donkey genome sequence data preprocessing unit is used for dividing and cleaning donkey genome sequence data;
the donkey genome sequence data preprocessing unit comprises: a donkey genome sequence data slice divider, a mutation site position calculator, a donkey genome sequence data converter, and a donkey genome sequence data cleaner;
the donkey genome sequence data slice divider is used for dividing a donkey genome sequence into a plurality of fragments of equal size;
the mutation site position calculator is used for calculating the relative position of each mutation site of a donkey genome fragment within that fragment;
the donkey genome sequence data converter is used for converting the divided donkey genome fragment data into a binary (0/1) data matrix, namely the variation matrix data, wherein 0 represents the ancestral gene and 1 represents the variant gene;
the donkey genome sequence data cleaner is used for deleting data that are too short or too long and for merging data of repeated sites, the merged result being obtained by an OR operation on the repeated site data;
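The four preprocessing components can be sketched as small functions. This is a minimal illustration, not the patented implementation: the function names, the representation of haplotypes as strings of alleles at the variant sites, the scaling of relative positions to [0, 1), and the reading of "too short or too long" as a bound on the number of sites per fragment are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def slice_windows(genome_length: int, window: int) -> List[Tuple[int, int]]:
    """Slice divider: equal-size half-open fragments; an incomplete tail is dropped."""
    return [(s, s + window) for s in range(0, genome_length - window + 1, window)]


def relative_positions(sites: List[int], start: int, end: int) -> List[float]:
    """Position calculator: position of each site within its fragment (assumed [0, 1))."""
    return [(p - start) / (end - start) for p in sites]


def to_variation_matrix(haplotypes: List[str], ancestral: str) -> List[List[int]]:
    """Converter: 0 where a haplotype carries the ancestral state, 1 otherwise."""
    return [[0 if a == anc else 1 for a, anc in zip(h, ancestral)] for h in haplotypes]


def merge_duplicate_sites(
    sites: List[int], columns: List[List[int]]
) -> Tuple[List[int], List[List[int]]]:
    """Cleaner: merge columns recorded at the same site with an element-wise OR."""
    merged: Dict[int, List[int]] = {}
    for pos, col in zip(sites, columns):
        if pos in merged:
            merged[pos] = [a | b for a, b in zip(merged[pos], col)]
        else:
            merged[pos] = list(col)
    ordered = sorted(merged)
    return ordered, [merged[p] for p in ordered]


def keep_fragment(n_sites: int, min_sites: int, max_sites: int) -> bool:
    """Cleaner: drop fragments with too few or too many sites (assumed criterion)."""
    return min_sites <= n_sites <= max_sites


print(slice_windows(250, 100))                          # [(0, 100), (100, 200)]
print(relative_positions([110, 150, 175], 100, 200))    # [0.1, 0.5, 0.75]
print(to_variation_matrix(["ACG", "ATG"], "ACG"))       # [[0, 0, 0], [0, 1, 0]]
```

Here each entry of `columns` holds one site's 0/1 values across all rows of the variation matrix, so merging two records of the same site marks a row as variant if either record does.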
the donkey genome sequence data fusion unit is used for carrying out operation on the mutation site position data and the mutation matrix data to generate mutation site fusion data, and the specific operation process is as follows:
$$M_{ij} = H_{ij} \cdot \mathrm{pos}_j$$
wherein $H_{ij}$ is the element in row $i$ and column $j$ of the variation matrix, $\mathrm{pos}_j$ is the relative position of the $j$-th mutation site within the fragment interval, $\cdot$ denotes multiplication, and $M_{ij}$ is the element in row $i$ and column $j$ of the mutation site fusion data;
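The fusion formula amounts to scaling each column of the variation matrix by the relative position of its site. A minimal sketch follows; the function name `fuse` and the toy values are illustrative and not taken from the patent.

```python
from typing import List


def fuse(H: List[List[int]], pos: List[float]) -> List[List[float]]:
    """Return M with M[i][j] = H[i][j] * pos[j]."""
    if any(len(row) != len(pos) for row in H):
        raise ValueError("each row of H needs one entry per mutation site")
    return [[h * p for h, p in zip(row, pos)] for row in H]


H = [[0, 1, 1],
     [1, 0, 1]]             # 2 rows, 3 mutation sites
pos = [0.1, 0.5, 0.75]      # relative positions of the 3 sites
print(fuse(H, pos))         # [[0.0, 0.5, 0.75], [0.1, 0.0, 0.75]]
```

One consequence worth noting: a site whose relative position is exactly 0 yields 0 in the fused matrix whether the entry is ancestral or variant, so an implementation may prefer a position convention that avoids zero.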
the classification module comprises: a donkey genome sequence model construction unit and a donkey genome sequence model classification unit;
the donkey genome sequence model construction unit uses a convolutional neural network to construct the natural selection model, and the model is built in the following order: an Input layer, a CNN layer, and a Dropout layer are called to build the classification model; the CNN layer is used for learning vector representations, the Dropout layer is used for preventing the model from overfitting, the weights are then adjusted according to each feature, and finally the weights and the feature value vector are multiplied and summed to produce the output;
the calculation process of the CNN layer is as follows:
$$V = \operatorname{conv}(W, X) + b$$
wherein $W$ is the weight matrix, $X$ is the mutation site fusion data, $b$ is the bias, $\operatorname{conv}$ is the convolution function, $V$ is the output of the convolution function, and $Y$ is the output obtained by applying the activation function to $V$;
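The CNN layer computation and the Dropout layer that follows it can be sketched as a forward pass in plain Python. Several choices here are assumptions, since the text does not fix them: a single kernel with a scalar bias, "valid" padding with stride 1, ReLU as the activation function, and inverted dropout. All names are illustrative.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def conv2d_valid(W: Matrix, X: Matrix) -> Matrix:
    """Slide kernel W over X (stride 1, no padding) and sum the element-wise products."""
    kh, kw = len(W), len(W[0])
    out_h, out_w = len(X) - kh + 1, len(X[0]) - kw + 1
    return [
        [sum(W[a][c] * X[i + a][j + c] for a in range(kh) for c in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


def cnn_layer(W: Matrix, X: Matrix, b: float) -> Tuple[Matrix, Matrix]:
    """V = conv(W, X) + b, then Y = activation(V); ReLU is assumed here."""
    V = [[v + b for v in row] for row in conv2d_valid(W, X)]
    Y = [[max(0.0, v) for v in row] for row in V]
    return V, Y


def dropout(Y: Matrix, rate: float, rng: random.Random) -> Matrix:
    """Inverted dropout for training: zero with probability `rate`, rescale survivors."""
    keep = 1.0 - rate
    return [[(y / keep if rng.random() < keep else 0.0) for y in row] for row in Y]


X = [[0.0, 0.5, 0.75],
     [0.1, 0.0, 0.75]]      # toy mutation site fusion data
W = [[1.0, -1.0]]           # hypothetical 1x2 kernel
V, Y = cnn_layer(W, X, b=0.1)
Y_train = dropout(Y, rate=0.5, rng=random.Random(0))
```

As in most deep learning frameworks, the "convolution" here is computed without flipping the kernel. At prediction time dropout is normally disabled, which corresponds to `rate=0.0` in this sketch.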
the donkey genome sequence model classification unit inputs the mutation site fusion data obtained by the donkey genome sequence data processing module into the model constructed by the donkey genome sequence model construction unit, trains the model with the training set data, and reads the test set into the trained model for natural selection classification; the natural selection prediction and classification process is performed in the following order: (1) reading in the mutation site fusion data output by the donkey genome sequence data processing module and dividing the data into a training set and a test set at a ratio of 8:2; (2) encoding the discrete data with one-hot encoding to obtain a vector representation of the mutation site data; (3) inputting the training set data, converted into the vector representation, into the model for training; and (4) reading the test set data into the trained model for natural selection classification.
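Steps (1) and (2) of the classification process can be sketched as follows. This is an illustration under assumptions: the text does not say whether the data are shuffled before the 8:2 split or exactly which discrete values are one-hot encoded, so the sketch shuffles with a fixed seed and encodes class labels. The label names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_8_2(
    samples: Sequence, labels: Sequence, train_fraction: float = 0.8, seed: int = 0
) -> Tuple[list, list, list, list]:
    """Shuffle indices reproducibly, then cut into training and test sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    train, test = idx[:cut], idx[cut:]
    return (
        [samples[i] for i in train], [labels[i] for i in train],
        [samples[i] for i in test], [labels[i] for i in test],
    )


def one_hot(values: Sequence[str], classes: Sequence[str]) -> List[List[int]]:
    """Encode each discrete value as a vector with a single 1 at its class index."""
    index = {c: k for k, c in enumerate(classes)}
    return [[1 if index[v] == k else 0 for k in range(len(classes))] for v in values]


classes = ["neutral", "selected"]                 # hypothetical class names
samples = list(range(10))                         # stand-ins for fused matrices
labels = ["neutral", "selected"] * 5
x_train, y_train, x_test, y_test = split_8_2(samples, labels)
print(len(x_train), len(x_test))                  # 8 2
print(one_hot(["selected", "neutral"], classes))  # [[0, 1], [1, 0]]
```

The training portion would then be passed to the model for fitting and the held-out portion used only for the final natural selection classification, as in steps (3) and (4).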
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (1)

1. A donkey population natural selection classification system for variable data dimension reduction input, characterized in that the classification system comprises: an input module, a donkey genome sequence data processing module, and a classification module;
the input module is used for acquiring donkey genome sequence data;
the donkey genome sequence data processing module is connected with the input module and is used for processing the genome sequence data acquired by the input module and outputting mutation site position data and variation matrix data of the donkey genome sequence;
the donkey genome sequence data processing module comprises: a donkey genome sequence data preprocessing unit and a donkey genome sequence data fusion unit;
the donkey genome sequence data preprocessing unit is used for dividing and cleaning donkey genome sequence data;
the donkey genome sequence data preprocessing unit comprises: a donkey genome sequence data slice divider, a mutation site position calculator, a donkey genome sequence data converter, and a donkey genome sequence data cleaner;
the donkey genome sequence data slice divider is used for dividing a donkey genome sequence into a plurality of fragments of equal size;
the mutation site position calculator is used for calculating the relative position of each mutation site of a donkey genome fragment within that fragment;
the donkey genome sequence data converter is used for converting the divided donkey genome fragment data into a binary (0/1) data matrix, namely the variation matrix data, wherein 0 represents the ancestral gene and 1 represents the variant gene;
the donkey genome sequence data cleaner is used for deleting data that are too short or too long and for merging data of repeated sites, the merged result being obtained by an OR operation on the repeated site data;
the donkey genome sequence data fusion unit is used for carrying out operation on the mutation site position data and the mutation matrix data to generate mutation site fusion data, and the specific operation process is as follows:
$$M_{ij} = H_{ij} \cdot \mathrm{pos}_j$$
wherein $H_{ij}$ is the element in row $i$ and column $j$ of the variation matrix, $\mathrm{pos}_j$ is the relative position of the $j$-th mutation site within the fragment interval, $\cdot$ denotes multiplication, and $M_{ij}$ is the element in row $i$ and column $j$ of the mutation site fusion data;
the classification module comprises: a donkey genome sequence model construction unit and a donkey genome sequence model classification unit;
the donkey genome sequence model construction unit uses a convolutional neural network to construct the natural selection model, and the model is built in the following order: an Input layer, a CNN layer, and a Dropout layer are called to build the classification model; the CNN layer is used for learning vector representations, the Dropout layer is used for preventing the model from overfitting, the weights are then adjusted according to each feature, and finally the weights and the feature value vector are multiplied and summed to produce the output;
the calculation process of the CNN layer is as follows:
$$V = \operatorname{conv}(W, X) + b$$
wherein $W$ is the weight matrix, $X$ is the mutation site fusion data, $b$ is the bias, $\operatorname{conv}$ is the convolution function, $V$ is the output of the convolution function, and $Y$ is the output obtained by applying the activation function to $V$;
the donkey genome sequence model classification unit inputs the mutation site fusion data obtained by the donkey genome sequence data processing module into the model constructed by the donkey genome sequence model construction unit, trains the model with the training set data, and reads the test set into the trained model for natural selection classification; the natural selection prediction and classification process is performed in the following order: (1) reading in the mutation site fusion data output by the donkey genome sequence data processing module and dividing the data into a training set and a test set at a ratio of 8:2; (2) encoding the discrete data with one-hot encoding to obtain a vector representation of the mutation site data; (3) inputting the training set data, converted into the vector representation, into the model for training; and (4) reading the test set data into the trained model for natural selection classification.
CN202210038022.1A 2022-01-13 2022-01-13 Donkey population natural selection classification system for variable data dimension reduction input Active CN114512185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210038022.1A CN114512185B (en) 2022-01-13 2022-01-13 Donkey population natural selection classification system for variable data dimension reduction input

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210038022.1A CN114512185B (en) 2022-01-13 2022-01-13 Donkey population natural selection classification system for variable data dimension reduction input

Publications (2)

Publication Number Publication Date
CN114512185A (en) 2022-05-17
CN114512185B (en) 2024-04-05

Family

ID=81549378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210038022.1A Active CN114512185B (en) 2022-01-13 2022-01-13 Donkey population natural selection classification system for variable data dimension reduction input

Country Status (1)

Country Link
CN (1) CN114512185B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108052796A (en) * 2017-12-26 2018-05-18 Yunnan University Global human mtDNA development tree classification querying methods based on integrated study
CN108509860A (en) * 2018-03-09 2018-09-07 Xidian University Hoh Xil Tibetan antelope detection method based on convolutional neural networks
CN110111901A (en) * 2019-05-16 2019-08-09 Hunan University Transportable patient classification system based on RNN neural network
CN112182247A (en) * 2020-10-15 2021-01-05 Huazhong Agricultural University Genetic population map construction method and system, storage medium and electronic equipment
CN113128685A (en) * 2021-04-25 2021-07-16 Hunan University Natural selection classification and population scale change analysis system based on neural network
WO2021211840A1 (en) * 2020-04-15 2021-10-21 Chan Zuckerberg Biohub, Inc. Local-ancestry inference with machine learning model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318806A1 (en) * 2018-04-12 2019-10-17 Illumina, Inc. Variant Classifier Based on Deep Neural Networks
US12009060B2 (en) * 2018-12-14 2024-06-11 Merck Sharp & Dohme Llc Identifying biosynthetic gene clusters

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
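Donkey genomes provide new insights into domestication and selection for coat color; Wang C et al.; Nature Communications; 20201231; full text *
Population genomics methods: from classical statistics to supervised learning; Shi Yi; Li Haipeng; Scientia Sinica Vitae; 20190325(04); full text *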

Also Published As

Publication number Publication date
CN114512185A (en) 2022-05-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant