CN111883212B - Construction method and construction device of DNA fingerprint spectrum and terminal equipment - Google Patents

Construction method and construction device of DNA fingerprint spectrum and terminal equipment

Info

Publication number
CN111883212B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010102817.5A
Other languages
Chinese (zh)
Other versions
CN111883212A (en)
Inventor
邹枚伶
王文泉
江思容
夏志强
张辰笈
孙倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Tropical Bioscience and Biotechnology Chinese Academy of Tropical Agricultural Sciences
Original Assignee
Institute of Tropical Bioscience and Biotechnology Chinese Academy of Tropical Agricultural Sciences
Application filed by Institute of Tropical Bioscience and Biotechnology Chinese Academy of Tropical Agricultural Sciences
Priority to CN202010102817.5A
Publication of CN111883212A
Priority to LU102543A (LU102543B1)
Application granted
Publication of CN111883212B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00: ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20: Screening of libraries
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application is applicable to the technical field of data processing, and provides a method, a device and a terminal device for constructing a DNA fingerprint, wherein the method comprises the following steps: acquiring M single-base polymorphic sites of a target species and base information on each single-base polymorphic site; screening N core sites from the M single-base polymorphic sites according to the base information on each single-base polymorphic site, wherein N is less than or equal to M; inputting the N core loci into a preset genetic algorithm model to construct the DNA fingerprint of the target species. By the method, the creation efficiency of the DNA fingerprint is effectively improved, the adaptivity of the creation method is effectively improved, and the accuracy of the constructed DNA fingerprint is ensured.

Description

Construction method and construction device of DNA fingerprint spectrum and terminal equipment
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a method and a device for constructing a DNA fingerprint and a terminal device.
Background
A DNA (deoxyribonucleic acid) fingerprint is a general term for the specific pattern of DNA fragments shown by a DNA sample that has been treated with a particular molecular marker technique. DNA fingerprinting was first used to identify individuals in criminal investigations and paternity testing; with the development of biotechnology, it is now also widely used to identify biological species.
However, the existing method for constructing the DNA fingerprint has complicated steps, is difficult to realize automatic analysis and consumes long time; in addition, the existing method for constructing the DNA fingerprint is easily influenced by various factors, for example, when the operation conditions or the operation method of the molecular genetic marker and the like are changed, the constructed DNA fingerprint is often not accurate enough, and the method has poor adaptability.
Disclosure of Invention
The embodiment of the application provides a method and a device for constructing a DNA fingerprint and a terminal device, and can solve the problems of long time consumption and poor adaptability of the existing method for constructing the DNA fingerprint.
In a first aspect, an embodiment of the present application provides a method for constructing a DNA fingerprint, including:
acquiring M single-base polymorphic sites of a target species and base information on each single-base polymorphic site;
screening N core sites from the M single-base polymorphic sites according to the base information on each single-base polymorphic site, wherein N is less than or equal to M;
inputting the N core loci into a preset genetic algorithm model to construct the DNA fingerprint of the target species.
In one possible implementation manner of the first aspect, the screening N core sites from the M single-base polymorphic sites according to the base information on each single-base polymorphic site includes:
obtaining a reference genome of the target species;
respectively comparing the base information on each single-base polymorphic site with the base information on the corresponding gene site in the reference genome, and respectively determining a variation label corresponding to each single-base polymorphic site according to the comparison result;
and selecting N core sites from the M single-base polymorphic sites according to a preset screening condition and the variation label corresponding to each single-base polymorphic site.
In a possible implementation manner of the first aspect, the comparing the base information at each single-base polymorphic site with the base information at the corresponding genetic site in the reference genome, and determining the variation label corresponding to each single-base polymorphic site according to the comparison result respectively includes:
for each single-base polymorphic site, if the base information on the single-base polymorphic site is the same as the base information on the corresponding genetic site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a first label;
if the base information on the single-base polymorphic site is different from the base information on the corresponding genetic site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a second label;
and if the single-base polymorphic site lacks base information, determining that a variation tag corresponding to the single-base polymorphic site is a third tag.
In a possible implementation manner of the first aspect, the N core loci include at least one single-base polymorphic site corresponding to the first tag, at least one single-base polymorphic site corresponding to the second tag, and at least one single-base polymorphic site corresponding to the third tag;
the screening conditions include:
in the N core sites, the ratio of the number of single-base polymorphic sites corresponding to the third tag to N is less than a first preset value, and [Figure BDA0002387439770000031] and [Figure BDA0002387439770000032] are each less than a second preset value;
wherein a is the number of single-base polymorphic sites corresponding to the first tag in the N core sites, and b is the number of single-base polymorphic sites corresponding to the second tag in the N core sites.
In a possible implementation manner of the first aspect, the inputting the N core loci into a preset genetic algorithm model to construct a DNA fingerprint of the target species includes:
acquiring preset iteration times and the number of sites, wherein the number of the sites is less than or equal to N;
taking the iteration number and the site number as parameters of the genetic algorithm, and constructing a DNA fingerprint map of the target species according to the genetic algorithm and the N core sites based on the parameters;
the DNA fingerprint of the target species comprises DNA fingerprints of a plurality of varieties of the target species, the DNA fingerprint of each variety comprises all locus combinations corresponding to the variety, each locus combination comprises L core loci and variation labels corresponding to the core loci, and L is the number of the loci.
In a possible implementation manner of the first aspect, after inputting the N core loci into a preset genetic algorithm model to construct a DNA fingerprint of the target species, the method further includes:
obtaining a sample to be identified belonging to the target species, and obtaining base information on the N core loci of the sample to be identified as target information;
comparing each target information with base information on a corresponding gene locus in the reference genome respectively, and determining the DNA fingerprint of the sample to be identified according to the comparison result;
and determining the variety corresponding to the target fingerprint in the DNA fingerprint of the target species as the variety of the sample to be identified, wherein the target fingerprint is a DNA fingerprint matched with the DNA fingerprint of the sample to be identified.
In a second aspect, an embodiment of the present application provides an apparatus for constructing a DNA fingerprint, including:
an acquisition unit for acquiring M single-base polymorphic sites of a target species and base information on each single-base polymorphic site;
a screening unit, configured to screen N core sites from the M single-base polymorphic sites according to base information on each single-base polymorphic site, where N is less than or equal to M;
and the construction unit is used for inputting the N core sites into a preset genetic algorithm model to construct the DNA fingerprint of the target species.
In a possible implementation manner of the second aspect, the screening unit includes:
an acquisition module for acquiring a reference genome of the target species;
a comparison module for comparing the base information of each single base polymorphic site with the base information of the corresponding gene site in the reference genome, and determining the variation label corresponding to each single base polymorphic site according to the comparison result;
and the screening module is used for selecting N core sites from the M single-base polymorphic sites according to preset screening conditions and the variation label corresponding to each single-base polymorphic site.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for constructing a DNA fingerprint map according to any one of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium in which a computer program is stored, where the computer program, when executed by a processor, implements the method for constructing a DNA fingerprint according to any one of the above first aspects.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method for constructing a DNA fingerprint according to any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Compared with the prior art, the embodiment of the application has the advantages that:
In the embodiment of the application, M single-base polymorphic sites of a target species and the base information on each single-base polymorphic site are acquired, and N core sites are screened from the M single-base polymorphic sites according to that base information, where N is less than or equal to M. Because single-base polymorphism molecular genetic marker technology is efficient and accurate, using it for marking improves marking efficiency, and the resulting single-base polymorphic sites provide a reliable data basis for the subsequent DNA fingerprint. The N core sites are then input into a preset genetic algorithm model to construct the DNA fingerprint of the target species. A genetic algorithm can process multiple pieces of information at the same time, is efficient, and is self-adaptive and self-learning, so using it further improves the efficiency of constructing the DNA fingerprint and the adaptability of the method. In this way, the efficiency of creating the DNA fingerprint and the adaptability of the creation method are improved, and the accuracy of the constructed DNA fingerprint is ensured.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the embodiments or the prior-art descriptions are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a DNA fingerprinting construction system provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for constructing a DNA fingerprint provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a method for identifying a variety provided in an embodiment of the present application;
FIG. 4 is a block diagram of a DNA fingerprinting device provided in the embodiments of the present application;
FIG. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination", or "in response to a detection".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
An application scenario of the method for constructing the DNA fingerprint provided by the embodiment of the present application is introduced first. FIG. 1 is a schematic diagram of a DNA fingerprint construction system provided in the embodiments of the present application. As shown in FIG. 1, the construction system may include a database 11 and a processor 12, where the database is communicatively coupled to the processor.
Before constructing a DNA fingerprint, a single nucleotide polymorphism (SNP) molecular genetic marker technology may be used to mark the DNA molecules of each species and obtain a plurality of single-base polymorphic sites for each species. Here, SNP refers to a DNA sequence polymorphism caused by variation of a single base at the genome level. Such variation usually includes transitions (e.g., C changes to T, or A changes to G), transversions (e.g., A changes to C or T, or C changes to A or G), deletions, and insertions. SNPs are biallelic markers and are suitable for rapid, large-scale marking. The single-base polymorphisms of each species can be stored as a file in VCF format (i.e., one species corresponds to one VCF file), and the VCF files can then be stored in the database.
When the DNA fingerprint of the target species needs to be constructed, the processor acquires a file in a VCF format of the target species from the database, analyzes the file to obtain a plurality of single base polymorphism sites of the target species, and then constructs the DNA fingerprint of the target species by using the construction method of the DNA fingerprint.
Fig. 2 shows a schematic flow chart of a method for constructing a DNA fingerprint provided in an embodiment of the present application, which may include the following steps, by way of example and not limitation:
S201, obtaining M single-base polymorphic sites of a target species and base information on each single-base polymorphic site.
In practical applications, the VCF file of the target species is obtained first. A VCF file typically has two parts: annotation information (lines usually beginning with ##) and genotype information (i.e., variation information). All annotation information in the file is removed so that only the genotype information remains, from which the single-base polymorphic sites of the target species and the base information on each single-base polymorphic site can be obtained.
Either all single-base polymorphic sites of the target species or only a subset of high-quality single-base polymorphic sites can be obtained from the VCF file. M is equal to the number of single-base polymorphic sites finally obtained.
For example, assume the target species has 100,000 single-base polymorphic sites in total. If only 20,000 high-quality single-base polymorphic sites are obtained, then M is 20,000.
Obtaining only a subset of high-quality single-base polymorphic sites can improve the accuracy of the constructed DNA fingerprint.
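The extraction described in S201 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the application: the function name `parse_vcf_genotypes`, the `min_qual` threshold on the QUAL column (one possible reading of "high-quality"), and the example records are all assumptions.

```python
def parse_vcf_genotypes(vcf_text, min_qual=None):
    """Return (sample_names, sites) parsed from VCF text.

    Lines starting with '##' (annotation information) are dropped.
    Each site is a dict with 'chrom', 'pos', 'ref', 'alt' and
    'genotypes' (one GT string per sample column).
    """
    samples, sites = [], []
    for line in vcf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("##"):
            continue  # skip blank and annotation lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        qual = fields[5]
        # Hypothetical quality filter: keep sites with QUAL >= min_qual.
        if min_qual is not None and (qual == "." or float(qual) < min_qual):
            continue
        gt_index = fields[8].split(":").index("GT")
        genotypes = [col.split(":")[gt_index] for col in fields[9:]]
        sites.append({
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4],
            "genotypes": genotypes,
        })
    return samples, sites


# Made-up example records, for illustration only.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tvarietyA\tvarietyB",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t1/1",
    "chr1\t200\t.\tC\tT\t10\tPASS\t.\tGT:DP\t0/1:8\t./.:0",
])

samples, all_sites = parse_vcf_genotypes(EXAMPLE_VCF)
_, good_sites = parse_vcf_genotypes(EXAMPLE_VCF, min_qual=30)
```

Here `len(all_sites)` plays the role of M when every site is kept, and `len(good_sites)` plays that role when only the filtered subset is kept.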
S202, according to the base information of each single-base polymorphic site, screening N core sites from the M single-base polymorphic sites.
Wherein N is less than or equal to M.
In practical applications, the VCF file includes genotype information for each single-base polymorphic site, and this genotype information can be used as the base information. Typical genotype information includes 0/0, 0/1, and 1/1, where 0/0 indicates that the site in a sample of the target species is homozygous and identical to the site in the reference genome; 0/1 indicates that the site in the sample is heterozygous, i.e., partially identical to the site in the reference genome; and 1/1 indicates that the site in the sample is a homozygous variant, i.e., completely different from the site in the reference genome.
The variation label of each single-base polymorphism site can be directly determined according to the genotype information in the file in the VCF format. For example: 0/0 corresponds to a first tag and 1/1 corresponds to a second tag.
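A minimal Python sketch of this mapping is shown below. The label strings, the treatment of `./.` as missing base information, and returning `None` for heterozygous genotypes (which the text does not assign to a label) are assumptions made for illustration.

```python
FIRST, SECOND, THIRD = "first", "second", "third"  # hypothetical label encodings


def label_from_genotype(gt):
    """Map a VCF GT string such as '0/0' or '1|1' to a variation label."""
    alleles = gt.replace("|", "/").split("/")  # accept phased and unphased GT
    if any(allele == "." for allele in alleles):
        return THIRD   # missing base information
    if all(allele == "0" for allele in alleles):
        return FIRST   # identical to the reference genome
    if all(allele != "0" for allele in alleles):
        return SECOND  # every allele differs from the reference
    return None        # heterozygous: no label is specified in the text


labels = [label_from_genotype(gt) for gt in ["0/0", "1/1", "./.", "0/1"]]
```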
Of course, the base information may also be nucleotide base information such as A, G, C or T.
In one embodiment, step S202 of screening N core sites from the M single-base polymorphic sites according to the base information on each single-base polymorphic site includes:
S11, obtaining a reference genome of the target species.
In practical applications, a reference genome of a target species is preset. The reference genome includes genetic information of the target species under the non-variant condition.
S12, comparing the base information on each single-base polymorphic site with the base information on the corresponding gene site in the reference genome, and determining the variation label corresponding to each single-base polymorphic site according to the comparison result.
The alignment results include the following cases:
the base information on the single-base polymorphic site is completely the same as the base information on the corresponding gene site in the reference genome (neither allele at the site is mutated), partially the same (only one allele at the site is mutated), or completely different (both alleles at the site are mutated), or the single-base polymorphic site lacks base information.
Optionally, step S12 may include:
I. For each single-base polymorphic site, if the base information on the single-base polymorphic site is the same as the base information on the corresponding gene site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a first label.
II. If the base information on the single-base polymorphic site is different from the base information on the corresponding gene site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a second label.
"Different" here means completely different, i.e., both alleles are mutated.
III. If the single-base polymorphic site lacks base information, determining that the variation label corresponding to the single-base polymorphic site is a third label.
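Rules I to III can be sketched in Python for the case where the base information is given as nucleotide letters. The data layout (a dict from site identifier to a pair of alleles), the function names, and the label strings are assumptions, and a partially identical site returns `None` because the text assigns it no label.

```python
FIRST, SECOND, THIRD = "first", "second", "third"  # hypothetical label encodings


def label_site(sample_alleles, ref_base):
    """Apply rules I-III to one site; sample_alleles is e.g. ('A', 'A')."""
    if not sample_alleles or any(not base for base in sample_alleles):
        return THIRD   # rule III: base information is missing
    alleles = [base.upper() for base in sample_alleles]
    ref = ref_base.upper()
    if all(base == ref for base in alleles):
        return FIRST   # rule I: same as the reference genome
    if all(base != ref for base in alleles):
        return SECOND  # rule II: both alleles are mutated
    return None        # partially the same: not covered by rules I-III


def label_sites(sample, reference):
    """Label every reference site; sites absent from the sample count as missing."""
    return {site: label_site(sample.get(site), ref) for site, ref in reference.items()}


# Made-up example data, for illustration only.
reference = {"chr1:100": "A", "chr1:200": "C", "chr1:300": "G", "chr1:400": "T"}
sample = {"chr1:100": ("A", "A"), "chr1:200": ("T", "T"), "chr1:400": ("T", "C")}
site_labels = label_sites(sample, reference)
```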
S13, selecting N core sites from the M single-base polymorphic sites according to preset screening conditions and the variation labels corresponding to the single-base polymorphic sites.
Wherein, the screening condition can be adjusted according to actual need.
Optionally, the N core loci include at least one single-base polymorphic site corresponding to the first tag, at least one single-base polymorphic site corresponding to the second tag, and at least one single-base polymorphic site corresponding to the third tag.
The screening conditions in step S13 include:
in the N core sites, the ratio of the number of single-base polymorphic sites corresponding to the third tag to N is less than a first preset value, and [Figure BDA0002387439770000091] and [Figure BDA0002387439770000092] are each less than a second preset value.
Wherein a is the number of single-base polymorphic sites corresponding to the first tag in the N core sites, and b is the number of single-base polymorphic sites corresponding to the second tag in the N core sites.
For example, the first preset value may be set to 0.1, and the second preset value may be set to 0.83. Assuming that 80 core sites are selected from 20,000 single-base polymorphic sites according to the above screening conditions, the number of single-base polymorphic sites corresponding to the third tag (i.e., sites lacking base information) among the 80 core sites is less than 8, and [Figure BDA0002387439770000093] and [Figure BDA0002387439770000094] are each less than 66.4.
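The screening condition can be checked with a short Python sketch. The two formula images are not available in this text, so the sketch assumes they are a/N and b/N; that reading is consistent with the worked example, where 0.83 × 80 = 66.4, but it remains an assumption. The function name, parameter names, and label strings are likewise hypothetical.

```python
FIRST, SECOND, THIRD = "first", "second", "third"  # hypothetical label encodings


def meets_screening_condition(labels, first_preset=0.1, second_preset=0.83,
                              require_all_labels=True):
    """Check a candidate set of N core-site labels against the screening condition.

    Assumes the image-only formulas are a/N and b/N, where a and b are
    the counts of first-label and second-label sites.
    """
    n = len(labels)
    if n == 0:
        return False
    a = labels.count(FIRST)
    b = labels.count(SECOND)
    missing = labels.count(THIRD)
    # Optional requirement: at least one site of each label type.
    if require_all_labels and min(a, b, missing) == 0:
        return False
    return (missing / n < first_preset
            and a / n < second_preset
            and b / n < second_preset)


# 80 core sites, as in the worked example: fewer than 8 missing, a and b below 66.4.
ok = meets_screening_condition([FIRST] * 40 + [SECOND] * 35 + [THIRD] * 5)
too_many_missing = meets_screening_condition([FIRST] * 40 + [SECOND] * 32 + [THIRD] * 8)
too_uniform = meets_screening_condition([FIRST] * 70 + [SECOND] * 5 + [THIRD] * 5)
```

How candidate sets of N sites are proposed before this check is not described in the text, so the sketch covers only the acceptance test.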
S203, inputting the N core loci into a preset genetic algorithm model to construct the DNA fingerprint of the target species.
Optionally, the step S203 of constructing a DNA fingerprint of the target species according to a genetic algorithm and the N core loci includes:
S21, acquiring a preset number of iterations and a preset number of sites, wherein the number of sites is less than or equal to N.
The number of sites refers to the number of single-base polymorphic sites contained in one site combination.
S22, taking the iteration number and the number of the sites as parameters of the genetic algorithm, and constructing the DNA fingerprint of the target species according to the genetic algorithm and the N core sites based on the parameters.
The DNA fingerprint of the target species comprises DNA fingerprints of a plurality of varieties of the target species, the DNA fingerprint of each variety comprises all locus combinations corresponding to the variety, each locus combination comprises L core loci and variation labels corresponding to the core loci, and L is the number of the loci.
A genetic algorithm requires an initial population; in the embodiment of the present application, the initial population is the N core sites. In the first iteration, individuals of length L are randomly selected from the initial population and bred to obtain a first batch of offspring; the fitness of each first-batch offspring is calculated, and the offspring with high fitness are retained. The retained offspring are then used as parents in the second iteration, and so on until the preset number of iterations is reached.
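The iteration described above can be sketched in Python. The text does not define the fitness function, so this sketch uses a hypothetical one (the number of distinct label patterns a site combination produces across varieties); the population size, the fixed random seed, the function names, and the example data are also assumptions.

```python
import random


def fitness(combo, labels_by_variety):
    """Hypothetical fitness: how many distinct label patterns the combination yields."""
    patterns = {tuple(labels[i] for i in combo) for labels in labels_by_variety.values()}
    return len(patterns)


def crossover(parent1, parent2, point):
    """Single-point crossover: exchange the parts after the crossover point."""
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]


def run_ga(labels_by_variety, n_sites, combo_len, iterations, pop_size=20, seed=0):
    """Evolve combinations of combo_len site indices (requires combo_len >= 2)."""
    rng = random.Random(seed)
    # Initial individuals: random combinations of L site indices out of N.
    population = [tuple(rng.sample(range(n_sites), combo_len)) for _ in range(pop_size)]
    for _ in range(iterations):
        offspring = []
        for _ in range(pop_size):
            parent1, parent2 = rng.sample(population, 2)
            point = rng.randrange(1, combo_len)  # random crossover point
            offspring.extend(crossover(parent1, parent2, point))
        # Retain the fittest offspring as parents of the next iteration.
        offspring.sort(key=lambda combo: fitness(combo, labels_by_variety), reverse=True)
        population = offspring[:pop_size]
    return population


# Made-up label strings for three varieties over N = 5 core sites.
labels_by_variety = {"v1": "aabba", "v2": "abbaa", "v3": "bbaab"}
final_population = run_ga(labels_by_variety, n_sites=5, combo_len=3,
                          iterations=5, pop_size=6, seed=1)
```

Note that crossover can place the same site index twice in one combination; a fuller implementation might repair or penalize such individuals, which this sketch leaves to the fitness ranking.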
For example, assume there are N = 5 core sites whose variation labels are abbaa, and L = 3. Three individuals, abb, bba, and baa, can be selected from the initial population. These three individuals are bred pairwise by crossover (a crossover point is chosen at random, and the parts before and after it are exchanged between the two parents): crossing abb and bba yields the offspring bbb and aba, crossing abb and baa yields bbb and aaa, and crossing bba and baa yields bba and baa. The offspring with higher fitness are retained as parents for the next iteration.
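The worked example can be reproduced with a few lines of Python. The sketch assumes the three individuals are the contiguous length-3 windows of abbaa and that the crossover point falls after the first position; both assumptions are consistent with the offspring listed in the example, but neither is stated in the text, and the function names are hypothetical.

```python
def windows(labels, length):
    """All contiguous substrings of the given length."""
    return [labels[i:i + length] for i in range(len(labels) - length + 1)]


def single_point_crossover(parent1, parent2, point):
    """Exchange the parts before the crossover point between two parents."""
    return parent2[:point] + parent1[point:], parent1[:point] + parent2[point:]


individuals = windows("abbaa", 3)
offspring = {
    (p1, p2): single_point_crossover(p1, p2, point=1)
    for i, p1 in enumerate(individuals)
    for p2 in individuals[i + 1:]
}
for parents, children in offspring.items():
    print(parents, "->", children)
```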
In the embodiment of the application, M single-base polymorphic sites of a target species and the base information on each single-base polymorphic site are acquired, and N core sites are screened from the M single-base polymorphic sites according to that base information, where N is less than or equal to M. Because single-base polymorphism molecular genetic marker technology is efficient and accurate, using it for marking improves marking efficiency, and the resulting single-base polymorphic sites provide a reliable data basis for the subsequent DNA fingerprint. The N core sites are then input into a preset genetic algorithm model to construct the DNA fingerprint of the target species. A genetic algorithm can process multiple pieces of information at the same time, is efficient, and is self-adaptive and self-learning, so using it to construct the DNA fingerprint improves both efficiency and the adaptability of the method. In this way, the efficiency of creating the DNA fingerprint and the adaptability of the creation method are improved, and the accuracy of the constructed DNA fingerprint is ensured.
Fig. 3 shows a schematic flow chart of a variety identification method provided in an embodiment of the present application, which may include, by way of example and not limitation, the following steps:
S301, obtaining a preset DNA fingerprint of the target species.
The predetermined DNA fingerprint of the target species is the DNA fingerprint constructed in the example of FIG. 2.
S302, obtaining a sample to be identified belonging to the target species, and obtaining base information on the N core loci of the sample to be identified, wherein the base information is target information.
The sample to be identified belongs to the target species, but it is not known to which variety the sample to be identified belongs.
S303, comparing each target information with the base information on the corresponding gene locus in the reference genome respectively, and determining the DNA fingerprint of the sample to be identified according to the comparison result.
S304, determining the variety corresponding to the target fingerprint in the DNA fingerprint of the target species as the variety of the sample to be identified.
Wherein the target fingerprint is a DNA fingerprint matched with the DNA fingerprint of the sample to be identified.
The matching here may mean that all the site combinations are the same, or that the ratio of the same site combinations is greater than a predetermined value.
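The two matching rules can be sketched in Python as follows. The representation of a fingerprint (a set of site combinations, each a tuple of (site, label) pairs), the choice of the variety's combination count as the denominator of the ratio, and all names are assumptions made for illustration.

```python
def match_ratio(sample_combos, variety_combos):
    """Fraction of a variety's site combinations that also occur in the sample."""
    if not variety_combos:
        return 0.0
    return len(sample_combos & variety_combos) / len(variety_combos)


def identify_variety(sample_combos, fingerprints, threshold=None):
    """Return the best-matching variety, or None if nothing matches.

    threshold=None requires all site combinations to be the same;
    otherwise the ratio of identical combinations must exceed threshold.
    """
    best_variety, best_ratio = None, 0.0
    for variety, combos in fingerprints.items():
        ratio = match_ratio(sample_combos, combos)
        matched = (sample_combos == combos) if threshold is None else (ratio > threshold)
        if matched and ratio > best_ratio:
            best_variety, best_ratio = variety, ratio
    return best_variety


# Made-up fingerprints for two varieties, for illustration only.
fingerprints = {
    "A": {(("s1", "a"), ("s2", "b")), (("s3", "a"), ("s4", "a"))},
    "B": {(("s1", "b"), ("s2", "b")), (("s3", "a"), ("s4", "b"))},
}
exact_sample = {(("s1", "a"), ("s2", "b")), (("s3", "a"), ("s4", "a"))}
partial_sample = {(("s1", "b"), ("s2", "b")), (("s3", "b"), ("s4", "b"))}

exact_result = identify_variety(exact_sample, fingerprints)
strict_result = identify_variety(partial_sample, fingerprints)
relaxed_result = identify_variety(partial_sample, fingerprints, threshold=0.4)
```

With the strict rule the partially matching sample is left unidentified, while the ratio rule with a threshold of 0.4 assigns it to variety B, whose fingerprint shares half of its combinations with the sample.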
In the embodiment of the application, the DNA fingerprint of the sample to be identified is determined, the target fingerprint matched with the DNA fingerprint of the sample to be identified of unknown variety is determined in the constructed DNA fingerprint spectrum, and the variety corresponding to the target fingerprint is determined as the variety identification of the sample to be identified. By the method, the rapid identification of the variety can be realized, and the efficiency of variety identification is improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Corresponding to the method for constructing a DNA fingerprint described in the above embodiments, fig. 4 shows a block diagram of a DNA fingerprint constructing apparatus provided in the embodiments of the present application, and for convenience of explanation, only the parts related to the embodiments of the present application are shown.
Referring to fig. 4, the apparatus includes:
an obtaining unit 41, configured to obtain the M single-base polymorphic sites of the target species and the base information at each single-base polymorphic site;
a screening unit 42, configured to screen N core sites from the M single-base polymorphic sites according to the base information at each single-base polymorphic site, where N is less than or equal to M; and
a constructing unit 43, configured to input the N core sites into a preset genetic algorithm model to construct a DNA fingerprint of the target species.
Optionally, the screening unit 42 includes:
an obtaining module, configured to obtain a reference genome of the target species;
a comparison module, configured to compare the base information at each single-base polymorphic site with the base information at the corresponding gene site in the reference genome, and to determine the variation label corresponding to each single-base polymorphic site according to the comparison result; and
a screening module, configured to select N core sites from the M single-base polymorphic sites according to preset screening conditions and the variation label corresponding to each single-base polymorphic site.
Optionally, the alignment module is further configured to:
for each single-base polymorphic site, if the base information at the single-base polymorphic site is the same as the base information at the corresponding gene site in the reference genome, determine that the variation label corresponding to the single-base polymorphic site is a first label;
if the base information at the single-base polymorphic site is different from the base information at the corresponding gene site in the reference genome, determine that the variation label corresponding to the single-base polymorphic site is a second label; and
if the single-base polymorphic site lacks base information, determine that the variation label corresponding to the single-base polymorphic site is a third label.
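The three-way labelling rule above can be sketched in a few lines. The names `VariationTag`, `variation_tag`, and `tag_sites` are hypothetical, and the set of symbols treated as "lacking base information" (`None`, an empty string, `N`, `-`, `.`) is an assumption; the source does not say how missing data is encoded.

```python
from enum import Enum
from typing import List, Optional, Sequence


class VariationTag(Enum):
    FIRST = 1   # base is the same as in the reference genome
    SECOND = 2  # base differs from the reference genome
    THIRD = 3   # base information is missing


# Assumed encodings of a missing call; not specified in the source.
MISSING = {"", "N", "-", "."}


def variation_tag(sample_base: Optional[str], reference_base: str) -> VariationTag:
    """Assign a variation tag to one single-base polymorphic site."""
    if sample_base is None or sample_base.strip().upper() in MISSING:
        return VariationTag.THIRD
    if sample_base.strip().upper() == reference_base.strip().upper():
        return VariationTag.FIRST
    return VariationTag.SECOND


def tag_sites(
    sample_bases: Sequence[Optional[str]], reference_bases: Sequence[str]
) -> List[VariationTag]:
    """Tag every site of one sample against the reference genome."""
    if len(sample_bases) != len(reference_bases):
        raise ValueError("sample and reference must cover the same sites")
    return [variation_tag(s, r) for s, r in zip(sample_bases, reference_bases)]


tags = tag_sites(["A", "G", None, "t"], ["A", "C", "T", "T"])
print([t.name for t in tags])
```

The comparison is case-insensitive so that a lower-case call such as `t` is still treated as identical to a reference `T`; that choice is likewise an assumption of this sketch.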
Optionally, the N core loci include at least one single-base polymorphic site corresponding to the first tag, at least one single-base polymorphic site corresponding to the second tag, and at least one single-base polymorphic site corresponding to the third tag.
Optionally, the screening conditions include:
in the N core sites, the ratio of the number of the single-base polymorphic sites corresponding to the third tag to the N is less than a first preset value, and
Figure BDA0002387439770000131
and
Figure BDA0002387439770000132
respectively smaller than a second preset value;
wherein a is the number of single-base polymorphic sites corresponding to the first tag in the N core sites, and b is the number of single-base polymorphic sites corresponding to the second tag in the N core sites.
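A check of these screening conditions might look like the sketch below. The two expressions compared against the second preset value appear only as figures in this text and are not reproduced here, so the sketch takes them as caller-supplied functions of (a, b, N) rather than guessing their form. The function name `passes_screening`, the integer tag codes, and the requirement that each tag occur at least once (taken from the optional condition on the N core sites) are assumptions of this illustration.

```python
from typing import Callable, Sequence

# Tag codes (hypothetical): 1 = first tag, 2 = second tag, 3 = third tag.
Expression = Callable[[int, int, int], float]  # called as expr(a, b, n)


def passes_screening(
    tags: Sequence[int],
    first_preset: float,
    second_preset: float,
    expr_1: Expression,
    expr_2: Expression,
) -> bool:
    """Check a candidate set of N core sites, given one tag code per site."""
    n = len(tags)
    a = sum(1 for t in tags if t == 1)
    b = sum(1 for t in tags if t == 2)
    missing = sum(1 for t in tags if t == 3)
    # Optional condition: at least one site for each of the three tags.
    if n == 0 or min(a, b, missing) == 0:
        return False
    # Ratio of third-tag sites to N must be below the first preset value.
    if missing / n >= first_preset:
        return False
    # Both figure-defined expressions must be below the second preset value.
    return expr_1(a, b, n) < second_preset and expr_2(a, b, n) < second_preset


# Arbitrary stand-ins for the two figure expressions, NOT the patent's formulas.
def stand_in_1(a: int, b: int, n: int) -> float:
    return a / n


def stand_in_2(a: int, b: int, n: int) -> float:
    return b / n


candidate = [1, 1, 2, 2, 1, 2, 3, 1, 2, 1]  # a = 5, b = 4, one missing, N = 10
accepted = passes_screening(candidate, 0.2, 0.6, stand_in_1, stand_in_2)
rejected = passes_screening(candidate, 0.1, 0.6, stand_in_1, stand_in_2)
```

Keeping the two expressions pluggable means the check can be completed with the actual formulas from the figures without changing the rest of the logic.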
Optionally, the constructing unit 43 includes:
a parameter acquisition module, configured to acquire a preset number of iterations and a number of sites, where the number of sites is less than or equal to N; and
a construction module, configured to take the number of iterations and the number of sites as parameters of the genetic algorithm and, based on these parameters, to construct the DNA fingerprint of the target species according to the genetic algorithm and the N core sites.
The DNA fingerprint of the target species comprises DNA fingerprints of a plurality of varieties of the target species, the DNA fingerprint of each variety comprises all locus combinations corresponding to the variety, each locus combination comprises L core loci and variation labels corresponding to the core loci, and L is the number of the loci.
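The source names only two parameters of the genetic algorithm, the number of iterations and the number of sites L, and does not disclose the fitness function or the genetic operators. The sketch below is therefore one possible reading, under stated assumptions: fitness is the number of distinct tag patterns that the chosen L sites produce across varieties, selection is by tournament, crossover resamples L sites from the union of two parents, and the best individual is carried over unchanged. The names `fitness`, `select_sites`, and `build_fingerprints`, the parameters `pop_size` and `seed`, and the simplification to a single site combination per variety are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

TagMatrix = Dict[str, List[int]]  # variety name -> tag code at each of N core sites


def fitness(sites: Sequence[int], tag_matrix: TagMatrix) -> int:
    """Number of distinct tag patterns the chosen sites yield across varieties."""
    return len({tuple(tags[i] for i in sites) for tags in tag_matrix.values()})


def select_sites(tag_matrix: TagMatrix, num_sites: int, iterations: int,
                 pop_size: int = 20, seed: int = 0) -> Tuple[int, ...]:
    """Search for num_sites core sites that best separate the varieties."""
    rng = random.Random(seed)
    n = len(next(iter(tag_matrix.values())))
    if not 0 < num_sites <= n:
        raise ValueError("num_sites must be between 1 and N")

    def score(ind: Tuple[int, ...]) -> int:
        return fitness(ind, tag_matrix)

    population = [tuple(sorted(rng.sample(range(n), num_sites)))
                  for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(iterations):
        next_gen = [best]  # elitism: the best individual always survives
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            # Crossover: draw the child's sites from the union of both parents.
            child = rng.sample(sorted(set(p1) | set(p2)), num_sites)
            # Mutation: occasionally swap one site for an unused one.
            unused = [i for i in range(n) if i not in child]
            if unused and rng.random() < 0.2:
                child[rng.randrange(num_sites)] = rng.choice(unused)
            next_gen.append(tuple(sorted(child)))
        population = next_gen
        best = max(population, key=score)
    return best


def build_fingerprints(tag_matrix: TagMatrix, sites: Sequence[int]):
    """One site combination per variety: (site index, tag code) for each site."""
    return {v: tuple((i, tags[i]) for i in sites) for v, tags in tag_matrix.items()}


# Hypothetical tag matrix: 4 varieties, N = 6 core sites.
tag_matrix = {
    "variety_A": [1, 1, 2, 1, 3, 1],
    "variety_B": [1, 2, 2, 1, 1, 1],
    "variety_C": [1, 1, 1, 1, 1, 2],
    "variety_D": [1, 2, 1, 1, 3, 2],
}
chosen = select_sites(tag_matrix, num_sites=2, iterations=30)
fingerprints = build_fingerprints(tag_matrix, chosen)
```

A fixed `seed` keeps runs reproducible. Because the best individual is retained in every generation, the score of the returned site set never falls below that of the best initial candidate.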
Optionally, the apparatus 4 further comprises:
an information acquisition unit, configured to obtain, after the N core sites are input into the preset genetic algorithm model to construct the DNA fingerprint of the target species, a sample to be identified belonging to the target species, and to obtain the base information at the N core sites of the sample to be identified, where this base information is the target information;
a comparison unit, configured to compare each piece of target information with the base information at the corresponding gene site in the reference genome, and to determine the DNA fingerprint of the sample to be identified according to the comparison result; and
an identification unit, configured to determine the variety corresponding to the target fingerprint in the DNA fingerprint of the target species as the variety of the sample to be identified, where the target fingerprint is a DNA fingerprint that matches the DNA fingerprint of the sample to be identified.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
In addition, the DNA fingerprint constructing apparatus shown in fig. 4 may be a software unit, a hardware unit, or a combined software and hardware unit built into an existing terminal device; it may also be integrated into the terminal device as an independent add-on component, or exist as an independent terminal device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: at least one processor 50 (only one is shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and operable on the at least one processor 50, wherein the processor 50 executes the computer program 52 to implement the steps in any of the above-mentioned embodiments of the DNA fingerprint construction method.
The terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that fig. 5 is only an example of the terminal device 5 and does not constitute a limitation on it; the terminal device 5 may include more or fewer components than those shown, combine some components, or use different components, and may further include, for example, an input-output device and a network access device.
The processor 50 may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor or the like.
In some embodiments, the memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. In other embodiments, the memory 51 may be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the terminal device 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing an operating system, application programs, a boot loader (BootLoader), data, and other programs, such as the program code of the computer program. The memory 51 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a mobile terminal, enables the mobile terminal to implement the steps in the above method embodiments.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to a terminal device, a recording medium, a computer memory, a Read-Only Memory (ROM), a Random-Access Memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In certain jurisdictions, in accordance with legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunications signals.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (7)

1. A method for constructing a DNA fingerprint is characterized by comprising the following steps:
acquiring M single-base polymorphic sites of a target species and base information on each single-base polymorphic site;
screening N core sites from the M single-base polymorphic sites according to the base information on each single-base polymorphic site, wherein N is less than or equal to M, and the method comprises the following steps: obtaining a reference genome of the target species; respectively comparing the base information on each single-base polymorphic site with the base information on the corresponding gene site in the reference genome, and respectively determining the variation label corresponding to each single-base polymorphic site according to the comparison result, wherein the method comprises the following steps: for each single-base polymorphic site, if the base information on the single-base polymorphic site is the same as the base information on the corresponding genetic site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a first label; if the base information on the single-base polymorphic site is different from the base information on the corresponding genetic site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a second label; if the single-base polymorphic site lacks base information, determining that a variation tag corresponding to the single-base polymorphic site is a third tag; selecting N core sites from M single-base polymorphic sites according to a preset screening condition and a variation label corresponding to each single-base polymorphic site; the N core loci comprise at least one single-base polymorphic site corresponding to the first tag, at least one single-base polymorphic site corresponding to the second tag and at least one single-base polymorphic site corresponding to the third tag;
inputting the N core loci into a preset genetic algorithm model to construct the DNA fingerprint of the target species.
2. The method for constructing a DNA fingerprint according to claim 1 wherein said N core loci comprise at least one single base polymorphic site corresponding to said first tag, at least one single base polymorphic site corresponding to said second tag, and at least one single base polymorphic site corresponding to a third tag;
the screening conditions include:
in the N core sites, the ratio of the number of the single-base polymorphic sites corresponding to the third tag to the N is less than a first preset value, and
Figure FDA0003272706490000021
and
Figure FDA0003272706490000022
respectively smaller than a second preset value;
wherein a is the number of single-base polymorphic sites corresponding to the first tag in the N core sites, and b is the number of single-base polymorphic sites corresponding to the second tag in the N core sites.
3. The method for constructing a DNA fingerprint according to claim 1, wherein the inputting the N core loci into a predetermined genetic algorithm model to construct the DNA fingerprint of the target species comprises:
acquiring preset iteration times and the number of sites, wherein the number of the sites is less than or equal to N;
taking the iteration number and the site number as parameters of the genetic algorithm, and constructing a DNA fingerprint map of the target species according to the genetic algorithm and the N core sites based on the parameters;
the DNA fingerprint of the target species comprises DNA fingerprints of a plurality of varieties of the target species, the DNA fingerprint of each variety comprises all locus combinations corresponding to the variety, each locus combination comprises L core loci and variation labels corresponding to the core loci, and L is the number of the loci.
4. The method of claim 3, wherein after inputting the N core loci into a predetermined genetic algorithm model to construct the DNA fingerprint of the target species, the method further comprises:
obtaining a sample to be identified belonging to the target species, and obtaining base information on the N core loci of the sample to be identified as target information;
comparing each target information with base information on a corresponding gene locus in the reference genome respectively, and determining the DNA fingerprint of the sample to be identified according to the comparison result;
and determining the variety corresponding to the target fingerprint in the DNA fingerprint of the target species as the variety of the sample to be identified, wherein the target fingerprint is a DNA fingerprint matched with the DNA fingerprint of the sample to be identified.
5. A DNA fingerprint map construction device is characterized by comprising:
an acquisition unit for acquiring M single-base polymorphic sites of a target species and base information on each single-base polymorphic site;
a screening unit, configured to screen N core sites from the M single-base polymorphic sites according to base information on each single-base polymorphic site, where N is less than or equal to M; the screening unit includes: an acquisition module for acquiring a reference genome of the target species; a comparison module for comparing the base information of each single base polymorphic site with the base information of the corresponding gene site in the reference genome, and determining the variation label corresponding to each single base polymorphic site according to the comparison result; the screening module is used for selecting N core sites from the M single-base polymorphic sites according to preset screening conditions and variation labels corresponding to the single-base polymorphic sites; the alignment module is further configured to: for each single-base polymorphic site, if the base information on the single-base polymorphic site is the same as the base information on the corresponding genetic site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a first label; if the base information on the single-base polymorphic site is different from the base information on the corresponding genetic site in the reference genome, determining that the variation label corresponding to the single-base polymorphic site is a second label; if the single-base polymorphic site lacks base information, determining that a variation tag corresponding to the single-base polymorphic site is a third tag; the N core loci comprise at least one single-base polymorphic site corresponding to the first tag, at least one single-base polymorphic site corresponding to the second tag and at least one single-base polymorphic site corresponding to the third tag;
and the construction unit is used for inputting the N core sites into a preset genetic algorithm model to construct the DNA fingerprint of the target species.
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the computer program.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.
CN202010102817.5A 2020-02-19 2020-02-19 Construction method and construction device of DNA fingerprint spectrum and terminal equipment Active CN111883212B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010102817.5A CN111883212B (en) 2020-02-19 2020-02-19 Construction method and construction device of DNA fingerprint spectrum and terminal equipment
LU102543A LU102543B1 (en) 2020-02-19 2021-02-17 Method and apparatus for constructing DNA fingerprint, and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010102817.5A CN111883212B (en) 2020-02-19 2020-02-19 Construction method and construction device of DNA fingerprint spectrum and terminal equipment

Publications (2)

Publication Number Publication Date
CN111883212A CN111883212A (en) 2020-11-03
CN111883212B true CN111883212B (en) 2021-11-26

Family

ID=73153983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010102817.5A Active CN111883212B (en) 2020-02-19 2020-02-19 Construction method and construction device of DNA fingerprint spectrum and terminal equipment

Country Status (2)

Country Link
CN (1) CN111883212B (en)
LU (1) LU102543B1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108486266A (en) * 2018-02-06 2018-09-04 北京市农林科学院 The molecular labeling of DCIPThe chloroplast of maize genome and the application in cultivar identification
CN108913797A (en) * 2018-06-22 2018-11-30 中国农业科学院蔬菜花卉研究所 The method that GBS obtains Chinese cabbage group genome SNP building finger-print
CN110241252A (en) * 2019-07-30 2019-09-17 中国农业科学院郑州果树研究所 SNP marker for constructing peach DNA fingerprinting combines and application and method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020086289A1 (en) * 1999-06-15 2002-07-04 Don Straus Genomic profiling: a rapid method for testing a complex biological sample for the presence of many types of organisms
US8898021B2 (en) * 2001-02-02 2014-11-25 Mark W. Perlin Method and system for DNA mixture analysis
US20160004814A1 (en) * 2012-09-05 2016-01-07 University Of Washington Through Its Center For Commercialization Methods and compositions related to regulation of nucleic acids
CN110527736B (en) * 2019-08-19 2022-09-27 中国农业科学院作物科学研究所 SNP marker combination for rice germplasm resource and variety identification and application thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
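Title
"Construction of fingerprints of new tomato varieties using InDel markers"; 胡陶铸 et al.; Journal of Shanghai Jiao Tong University (Agricultural Science); 20190430; Vol. 37, No. 2; Sections 1-3 *
"Research on single nucleotide polymorphisms in crop genetics and breeding"; 邹枚伶 et al.; Anhui Agricultural Science Bulletin; 20121231; full text *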

Also Published As

Publication number Publication date
CN111883212A (en) 2020-11-03
LU102543B1 (en) 2021-11-22
LU102543A1 (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN109994155B (en) Gene variation identification method, device and storage medium
US20190139624A1 (en) Identifying ancestral relationships using a continuous stream of input
CN115312121B (en) Target gene locus detection method, device, equipment and computer storage medium
WO2019001168A1 (en) Sequencing data result analysis method and apparatus, and sequencing library construction and sequencing method
CN110021355B (en) Haploid typing and variation detection method and device for diploid genome sequencing segment
CN114999573A (en) Genome variation detection method and detection system
CN115198023B (en) Hainan cattle liquid-phase breeding chip and application thereof
CN115083521B (en) Method and system for identifying tumor cell group in single cell transcriptome sequencing data
Liu Bioinformatics in aquaculture: principles and methods
Normand et al. An introduction to high-throughput sequencing experiments: design and bioinformatics analysis
CN107967411B (en) Method and device for detecting off-target site and terminal equipment
CN110782946A (en) Method and device for identifying repeated sequence, storage medium and electronic equipment
CN108182348B (en) DNA methylation data detection method and device based on seed sequence information
KR102572274B1 (en) An apparatus for analyzing nucleic sequencing data and a method for operating it
CN111883212B (en) Construction method and construction device of DNA fingerprint spectrum and terminal equipment
CN113981070B (en) Method, device, equipment and storage medium for detecting embryo chromosome microdeletion
KR20210040714A (en) Method and appartus for detecting false positive variants in nucleic acid sequencing analysis
CN110942806A (en) Blood type genotyping method and device and storage medium
WO2014119914A1 (en) Method for providing information about gene sequence-based personal marker and apparatus using same
CN111627492A (en) Cancer genome Hi-C data simulation method and device and electronic equipment
CN111161798A (en) Reassembling method and reassembling device for metagenome and terminal equipment
CN110570908A (en) Sequencing sequence polymorphic identification method and device, storage medium and electronic equipment
CN109741788A (en) A kind of SNP site analysis method and system
CN112562786B (en) Method, device and storage medium for assembling genome based on genetic population
CN113284552B (en) Screening method and device for micro haplotypes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant