CN114300045A - Semi-supervised SNP (single nucleotide polymorphism) typing method and device based on control group and electronic equipment - Google Patents


Info

Publication number
CN114300045A
Authority
CN
China
Prior art keywords
sample
control group
sample data
snp
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210002889.1A
Other languages
Chinese (zh)
Inventor
杨智
李冬
余海
贺贤汉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Bori Technology Co ltd
Original Assignee
Hangzhou Bori Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Bori Technology Co ltd filed Critical Hangzhou Bori Technology Co ltd
Priority to CN202210002889.1A priority Critical patent/CN114300045A/en
Publication of CN114300045A publication Critical patent/CN114300045A/en
Pending legal-status Critical Current

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a control group-based semi-supervised SNP typing method, a corresponding device, and electronic equipment, and relates to the technical field of genotyping. The method comprises: determining category centers based on pre-collected sample data, where the sample data comprises control group samples and samples to be tested; performing adaptive cluster analysis on the sample data according to the category centers to generate a clustering result; and classifying the clustering result according to the genotype characteristics of the SNP to determine the SNP type of each sample to be tested. The method alleviates the technical problem of poor typing caused by irregularly distributed genotype clusters, improves typing accuracy, and is easy to implement.

Description

Semi-supervised SNP (single nucleotide polymorphism) typing method and device based on control group and electronic equipment
Technical Field
The invention relates to the technical field of genotyping, in particular to a control group-based semi-supervised SNP (single nucleotide polymorphism) typing method, a control group-based semi-supervised SNP typing device and electronic equipment.
Background
A single nucleotide polymorphism (SNP) is a DNA sequence polymorphism caused by variation of a single nucleotide at the genome level. It is the most common type of heritable human variation, accounting for over 90% of all known polymorphisms. SNPs are widespread in the human genome, and several typing technologies have been developed, including direct sequencing, the amplification curve method, and high-resolution melting curve analysis (HRM). Among them, the amplification curve method is widely used because it is simple to operate, fast, high-throughput, and easy to interpret.
The amplification curve method generally collects the fluorescence intensity of each sample for each allele at the end of amplification and bases the analysis on these values (hence the name "end-point method"). The most common analysis method is cluster analysis. Experiments show, however, that the clustering result depends strongly on how irregular the cluster distribution is: when the clusters are irregularly distributed, clustering usually performs poorly, and obtaining a good result then requires labeled data, with more labeled data needed the more irregular the distribution.
That is, the conventional SNP genotyping technology has the technical problem of poor genotyping effect due to irregular distribution of genotype clusters.
Disclosure of Invention
The invention aims to provide a control group-based semi-supervised SNP typing method, a control group-based semi-supervised SNP typing device and electronic equipment, so as to solve the technical problem of poor typing effect caused by irregular distribution of genotype clusters in the prior art.
In a first aspect, the embodiments of the present invention provide a control group-based semi-supervised SNP typing method, including:
determining a category center based on pre-collected sample data; the sample data comprises a control group sample and a sample to be detected;
according to the category center, carrying out self-adaptive clustering analysis on the sample data to generate a clustering result;
and classifying the clustering result according to the genotype characteristics of the SNP to determine the genotype of the sample to be detected.
In some possible embodiments, before the step of determining the category center based on pre-collected sample data, the method further includes:
collecting fluorescence intensity data of each channel endpoint aiming at the SNP locus;
and preprocessing the end point fluorescence intensity data to generate sample data.
In some possible embodiments, the step of determining the category center based on pre-collected sample data includes:
determining the known category of the control group sample based on the pre-collected sample data; known classes of control group samples include: blank samples, homozygotes, and heterozygotes;
the first class center is determined based on the known class of the control group samples.
In some possible embodiments, the step of determining the category center based on pre-collected sample data further includes:
determining the number of unknown classes based on pre-collected sample data and the predetermined total number of classes;
and determining the second class center of the unknown class based on the minimum distance maximum principle.
In some possible embodiments, the step of performing adaptive cluster analysis on the sample data according to the category center to generate a cluster result includes:
and performing self-adaptive clustering analysis on the sample data according to the first category center and the second category center in combination with a predetermined set mode to generate an optimal clustering result.
In some possible embodiments, the setting mode is either "allowed" or "not allowed";
when the setting mode of the control group is "allowed", the category to which each control group sample belongs may change during clustering; when the setting mode of the control group is "not allowed", the category of each control group sample may not change during clustering.
In a second aspect, the embodiments of the present invention provide a control group-based semi-supervised SNP typing device, including:
the class center determining module is used for determining the class center of the control group sample based on the pre-collected sample data; the sample data comprises a control group sample and a sample to be detected;
the cluster analysis module is used for carrying out self-adaptive cluster analysis on the sample data according to the category center to generate a cluster result;
and the classification module is used for classifying the clustering result according to the genotype characteristics of the SNP to determine the genotype of the sample to be detected.
In some possible embodiments, the device further comprises: an acquisition module, used for collecting the endpoint fluorescence intensity data of each channel for the SNP locus, and for preprocessing the endpoint fluorescence intensity data to generate the sample data.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps of the method according to any one of the first aspect when executing the computer program.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing machine executable instructions that, when invoked and executed by a processor, cause the processor to perform the method of any of the first aspects.
The invention provides a semi-supervised SNP typing method, a semi-supervised SNP typing device and electronic equipment based on a control group, wherein the method comprises the following steps: determining a category center based on pre-collected sample data; the sample data comprises a control group sample and a sample to be detected; according to the category center, carrying out self-adaptive clustering analysis on the sample data to generate a clustering result; and classifying the clustering result according to the genotype characteristics of the SNP to determine the SNP typing of the sample to be detected. The method relieves the technical problem of poor typing effect caused by irregular distribution of the genotype cluster, and achieves the technical effects of improving the typing accuracy and being easy to realize.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic flow chart of a control-based semi-supervised SNP typing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the typing result of a control group-based semi-supervised SNP typing method according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a control group-based semi-supervised SNP typing device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
For SNPs (single nucleotide polymorphisms), several typing technologies have been developed, such as direct sequencing, the amplification curve method, and HRM (high-resolution melting curve analysis). Among them, the amplification curve method is favored for its simple operation, high speed, high throughput, and easily interpreted results, although it targets only 1 to 2 known sites. It generally uses an end-point approach: the fluorescence intensity of each sample for each allele at the end of amplification is collected and then analyzed. The most common analysis method is cluster analysis. Experiments show, however, that the clustering result depends strongly on how irregular the cluster distribution is: when the clusters are irregularly distributed, clustering usually performs poorly, and obtaining a good result then requires labeled data, with more labeled data needed the more irregular the distribution.
The labeled data are the control group, which usually comprises blank, negative, or positive controls; in an SNP typing experiment these are blank control samples, homozygote control samples, or heterozygote control samples. A given experiment may include one or more control categories, and the number of samples in each control category is not fixed. If the experiment goes as expected, the control samples present their specified genotypes, and using the known control data to classify the genotypes of the unknown samples is a conventional semi-supervised learning problem. If the experiment does not go as expected, some control samples may not present their specified genotypes; the final classification then depends on the experimental outcome, which makes this a more specialized semi-supervised learning problem.
This scheme therefore provides a control group-based semi-supervised SNP typing method. Its main idea is to make full use of the control group information: the initial category centers of the control group categories are calculated, the remaining category centers are determined by the minimum distance maximum (max-min distance) principle, cluster analysis is then carried out under a control group setting mode chosen according to the experiment, and the genotypes of the unknown samples are determined. This alleviates the poor typing caused by irregularly distributed genotype clusters in SNP typing, and the method is easy to understand and implement.
To facilitate understanding of the present embodiment, first, a control-based semi-supervised SNP typing method disclosed in the embodiments of the present invention is described in detail, referring to a schematic flow chart of a control-based semi-supervised SNP typing method shown in fig. 1, where the method may be executed by an electronic device and mainly includes the following steps S110 to S130:
S110: determining a category center based on pre-collected sample data; the sample data comprises a control group sample and a sample to be detected;
wherein the sample data is pre-collected real-time fluorescence quantitative PCR amplification end point fluorescence intensity data of each channel. The category centers are used for representing all category centers in the sample data set, and comprise a first category center corresponding to a known category determined according to the control group and a second category center corresponding to an unknown category determined according to the minimum distance maximum principle.
S120: according to the category center, carrying out self-adaptive clustering analysis on the sample data to generate a clustering result;
S130: classifying the clustering result according to the genotype characteristics of the SNP to determine the genotype of the sample to be detected.
The invention provides a semi-supervised SNP typing method based on a control group, which comprises the following steps: determining a category center based on pre-collected sample data; the sample data comprises a control group sample and a sample to be detected; according to the category center, carrying out self-adaptive clustering analysis on the sample data to generate a clustering result; and classifying the clustering result according to the genotype characteristics of the SNP to determine the SNP typing of the sample to be detected. The method relieves the technical problem of poor typing effect caused by irregular distribution of the genotype cluster, and achieves the technical effects of improving the typing accuracy and being easy to realize.
In an embodiment, before the step S110, the method further includes:
S21: collecting the endpoint fluorescence intensity data of each channel for the SNP locus;
S22: preprocessing the endpoint fluorescence intensity data to generate the sample data.
In the embodiment of the invention, the real-time fluorescence quantitative PCR amplification endpoint fluorescence intensity data of each channel are collected, and the effects of baseline, inter-channel crosstalk, well-to-well error, and the like are removed before the endpoint fluorescence intensity is determined. Preprocessing also includes normalization. Global normalization is used: the maximum and minimum endpoint fluorescence intensities across all channels are determined first, and min-max normalization is then applied with these two values.
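The global normalization described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `normalize_endpoints`, the tuple-per-sample layout, and the handling of a plate with identical readings are assumptions, and baseline, crosstalk, and well corrections are taken to have been applied already.

```python
def normalize_endpoints(samples):
    """Globally min-max normalize endpoint fluorescence intensities.

    samples: list of per-sample tuples, one value per channel.
    One minimum and one maximum are taken over all channels together,
    so the relative scale between channels is preserved.
    """
    values = [v for sample in samples for v in sample]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumed fallback: every reading is identical, so map all to 0.
        return [tuple(0.0 for _ in sample) for sample in samples]
    return [tuple((v - lo) / span for v in sample) for sample in samples]


# Hypothetical two-channel readings (channel 1, channel 2) for four wells.
raw = [(120.0, 100.0), (2100.0, 150.0), (180.0, 1900.0), (1500.0, 1400.0)]
normalized = normalize_endpoints(raw)
```

Using a single minimum and maximum for both channels, instead of normalizing each channel separately, keeps a weak channel weak after scaling, which matters when genotype is later read from the ratio between channels.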
In one embodiment, the step S110 includes:
S31: determining the known categories of the control group samples based on the pre-collected sample data; the known categories of control group samples include: blank samples, homozygotes, and heterozygotes;
S32: determining the first category center based on the known categories of the control group samples.
In the embodiment of the present invention, the initial center of each category in the control group (i.e., the first category center) is calculated first. SNP typing is based on two channels; for multi-channel data, the two-channel analysis can be carried out on any pair of channels. Taking the heterozygote control group as an example, assume there are $n$ known heterozygote control samples, where the endpoint fluorescence intensities of channels 1 and 2 for sample $i$ are $(X_{i1}, X_{i2})$. The initial category center $(X_1, X_2)$ is then:
$$X_1 = \frac{1}{n}\sum_{i=1}^{n} X_{i1}, \qquad X_2 = \frac{1}{n}\sum_{i=1}^{n} X_{i2}$$
The first category centers of the other known categories are determined in the same way.
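As a sketch of this step, the snippet below computes one initial center per known control category, assuming the center is the per-channel arithmetic mean of that category's control samples. The function name `control_centers`, the label strings, and the sample values are hypothetical.

```python
from collections import defaultdict


def control_centers(controls):
    """Compute one initial center per known control category.

    controls: list of (label, point) pairs, where point is a tuple of
    normalized endpoint intensities, one value per channel.
    Returns {label: center}, each center being the per-channel mean
    (an assumption about how the initial center is defined).
    """
    grouped = defaultdict(list)
    for label, point in controls:
        grouped[label].append(point)
    centers = {}
    for label, points in grouped.items():
        n = len(points)
        dims = len(points[0])
        centers[label] = tuple(sum(p[d] for p in points) / n for d in range(dims))
    return centers


# Hypothetical control wells: two heterozygotes and one blank.
controls = [
    ("heterozygote", (0.6, 0.7)),
    ("heterozygote", (0.8, 0.7)),
    ("blank", (0.0, 0.1)),
]
centers = control_centers(controls)
```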
In an embodiment, the step S110 further includes:
determining the number of unknown classes based on pre-collected sample data and the predetermined total number of classes;
and determining the center of the second class of the unknown class based on the minimum distance maximum principle.
In the embodiment of the invention, let the number of known control group categories be m; the total number of categories C satisfies C ≥ m. The possible genotypes are blank sample, homozygote 1, homozygote 2, and heterozygote, i.e., 4 categories, and 1 to 2 additional unknown categories may be added depending on the number of samples. To find the best classification, cluster analysis is run for each category number C = m, …, 6 (C a positive integer) in turn, and the result is then selected according to an evaluation index.
Assume there are 3 known initial category centers. If C = 3, all initial category centers are already determined. If C > 3, taking C = 4 as an example, one more category center is needed beyond the 3 known ones, and it is chosen by the minimum distance maximum principle: for every sample, the shortest distance to the 3 initial category centers is calculated; these shortest distances are then compared across all samples, and the position of the sample with the largest shortest distance becomes the 4th initial cluster center. Further initial cluster centers are determined in the same way.
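The selection rule in this paragraph can be sketched as below. The function name `add_centers_maxmin` and the sample coordinates are hypothetical, and Euclidean distance is an assumption, since the text does not name a distance measure.

```python
import math


def add_centers_maxmin(points, centers, total):
    """Extend `centers` to `total` entries by the minimum distance maximum rule.

    points: list of sample coordinate tuples.
    centers: initial centers already known from the control group.
    """
    centers = list(centers)
    while len(centers) < total:
        # Shortest distance from each sample to any existing center.
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        # The sample with the largest shortest distance becomes a new center.
        best = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[best])
    return centers


# Three known centers and one sample that is far from all of them.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.9, 0.9)]
known = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
all_centers = add_centers_maxmin(points, known, total=4)
```

Because each new center is appended before the next pass, a fifth or sixth center is chosen relative to all centers found so far, which matches the "and so on" in the text.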
In one embodiment, the step S120 includes: and performing self-adaptive clustering analysis on the sample data according to the first category center and the second category center in combination with a predetermined set mode to generate an optimal clustering result.
The predetermined setting mode is either "allowed" or "not allowed";
when the setting mode of the control group is "allowed", the category to which each control group sample belongs may change during clustering; when the setting mode of the control group is "not allowed", the category of each control group sample may not change during clustering.
In the embodiment of the invention, the control group setting mode takes one of two values, according to whether the category of each control group sample may change during clustering: "allowed" or "not allowed". The setting mode can be chosen according to the experimental outcome. If manual inspection of the endpoint fluorescence intensities of each channel for every known control group shows them to be as expected, the mode can be set to "not allowed". If manual inspection reveals unexpected results that stem from problems with the test procedure, method, consumables, or the like, the experiment should, to be safe, be redone. If other factors such as an unknown mutation are involved, however, the mode can, with caution, be set to "allowed", meaning that the genotype of each control group sample may change during clustering.
In the embodiment of the present invention, in the optimization stage, when the control group setting mode is "allowed", the k-means clustering method may be used: the distance between each sample and each cluster center is calculated, each sample is assigned to its nearest cluster center, the cluster centers are recalculated, and the process iterates until a termination condition is met, for example the number of iterations reaches a specified limit, the cluster centers no longer change, or the sum of squared distances reaches a minimum.
When the control group setting mode is "not allowed", the distances between the control group samples and the centers need not be recalculated after new cluster centers are computed; each control group sample stays in its original category, and every other sample is reassigned to the cluster center nearest to it.
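Both setting modes can be expressed with one k-means-style loop, sketched below under stated assumptions: the function name `cluster`, the `fixed` mapping used to pin control samples, Euclidean distance, and the choice to keep an empty cluster's old center are all illustrative and not taken from the patent.

```python
import math


def cluster(points, centers, fixed=None, max_iter=100):
    """k-means-style clustering with optionally pinned control samples.

    points: list of coordinate tuples.
    centers: initial cluster centers (list of tuples).
    fixed: {point_index: cluster_index} for control samples whose category
           must not change ("not allowed" mode); None for "allowed" mode.
    Returns (assignments, centers).
    """
    fixed = fixed or {}
    centers = list(centers)
    assign = None
    for _ in range(max_iter):
        new_assign = []
        for i, p in enumerate(points):
            if i in fixed:
                # Pinned control sample: keep its given category.
                new_assign.append(fixed[i])
            else:
                # Free sample: assign to the nearest center.
                nearest = min(range(len(centers)),
                              key=lambda k: math.dist(p, centers[k]))
                new_assign.append(nearest)
        if new_assign == assign:
            break  # assignments are stable, so the centers are too
        assign = new_assign
        for k in range(len(centers)):
            members = [points[i] for i, a in enumerate(assign) if a == k]
            if members:  # an empty cluster keeps its previous center
                dims = len(members[0])
                centers[k] = tuple(sum(m[d] for m in members) / len(members)
                                   for d in range(dims))
    return assign, centers


# Hypothetical data: sample 4 sits slightly nearer to cluster 0.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.9, 0.1), (0.45, 0.0)]
start = [(0.0, 0.0), (1.0, 0.0)]
allowed, _ = cluster(points, start)                    # "allowed" mode
not_allowed, _ = cluster(points, start, fixed={4: 1})  # sample 4 pinned
```

In "allowed" mode sample 4 drifts to cluster 0, its nearest center; when it is pinned as a control of cluster 1 it stays there and pulls that center toward itself instead.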
The silhouette coefficient is used as the evaluation index. Because the candidate category numbers are C = m, …, 6 (C a positive integer), cluster analysis is carried out for each value in turn, and the clustering result with the highest silhouette coefficient is selected.
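A sketch of this selection step follows. It uses the standard definition of the silhouette coefficient (mean over samples of (b - a) / max(a, b), with a the mean distance within the sample's own cluster and b the smallest mean distance to another cluster); the function names, the score of 0 for singleton clusters, and the example labelings are assumptions.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient of a labeling; singleton clusters score 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_clustering(points, candidates):
    """candidates: {C: labels}. Returns (C, score) with the highest silhouette."""
    scored = {c: silhouette(points, labels) for c, labels in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical data with three tight groups, labeled with C = 2 and C = 3.
points = [(0.0, 0.0), (0.0, 0.1), (1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.0, 1.1)]
candidates = {2: [0, 0, 1, 1, 0, 0], 3: [0, 0, 1, 1, 2, 2]}
best_c, best_score = best_clustering(points, candidates)
```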
According to the biological characteristics of each genotype, for a homozygote the endpoint fluorescence intensity of one channel is markedly higher than that of the other, so the sample lies close to the coordinate axis of that channel; for a heterozygote the endpoint fluorescence intensities of the two channels are relatively close, so the sample lies between the two orthogonal coordinate axes, near their bisector. The cluster centers are therefore considered together: a cluster center close to a coordinate axis is labeled as the homozygote of the channel corresponding to that axis, and otherwise as a heterozygote.
As a specific example, PCR amplification experiments were performed on multiple template reagents using the Bori fluorescent quantitative PCR detection system, and SNP typing was performed using the above method. One SNP site with data for two alleles is taken as the example, and the silhouette coefficient threshold is set to 0.90. The experiment comprises 17 samples: 8 belong to known control groups, with 2 samples each in the blank, homozygote 1, homozygote 2, and heterozygote control groups, and the remaining 9 are unknown samples. Expert review of the experimental results judged the control group genotypes to be as expected, so the control group setting mode is set to "not allowed".
The analysis followed the above procedure. Because the total number of samples is small, only C = 4 and C = 5 were analyzed in turn; C = 4 gave the higher silhouette coefficient, 0.99, so that clustering result was taken, as shown in fig. 2. For comparison, the genotypes of several control samples can be deliberately altered. If the control group setting mode is then "not allowed", most samples are marked as unknown because their individual silhouette coefficients are low, and the overall silhouette coefficient is below 0.5, which is clearly unsatisfactory. If instead the mode is "allowed", iterative optimization still reaches the classification of fig. 2 even though the initial cluster centers are wrong. In other words, judged purely by clustering quality and ignoring other factors, the "allowed" mode is more robust.
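One way to turn this geometric rule into code is to look at the angle of each cluster center in the two-channel plane. The sketch below is only an illustration: the function name `genotype_of_center`, the 30-degree axis tolerance, and the radius test used to recognize a blank cluster near the origin are all assumptions that the text does not specify, and the inputs are taken to be normalized, non-negative intensities.

```python
import math


def genotype_of_center(center, blank_radius=0.2, axis_angle_deg=30.0):
    """Label a cluster center from its position in the two-channel plane.

    center: (channel 1 intensity, channel 2 intensity), normalized to [0, 1].
    blank_radius and axis_angle_deg are hypothetical thresholds.
    """
    x1, x2 = center
    if math.hypot(x1, x2) < blank_radius:
        return "blank"  # almost no signal in either channel (assumed rule)
    # 0 degrees is the channel 1 axis, 90 degrees is the channel 2 axis.
    angle = math.degrees(math.atan2(x2, x1))
    if angle < axis_angle_deg:
        return "homozygote 1"
    if angle > 90.0 - axis_angle_deg:
        return "homozygote 2"
    return "heterozygote"


# Hypothetical cluster centers after clustering.
labels = [genotype_of_center(c)
          for c in [(0.9, 0.05), (0.05, 0.9), (0.7, 0.7), (0.02, 0.03)]]
```

In practice the thresholds would be tuned to the assay, and a center pinned by the control group would simply inherit that control's genotype.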
The invention provides a semi-supervised SNP typing method based on a control group, which comprises the following steps: determining the class center of the control group sample based on the pre-collected sample data; the sample data comprises a control group sample and a sample to be detected; according to the class center of the samples in the control group, carrying out self-adaptive clustering analysis on the sample data to generate a clustering result; and classifying the clustering result according to the genotype characteristics of the SNP to determine the SNP typing of the sample to be detected. The method relieves the technical problem of poor typing effect caused by irregular distribution of the genotype cluster, and achieves the technical effects of improving the typing accuracy and being easy to realize.
The embodiment of the invention provides a semi-supervised SNP typing device based on a control group, and referring to fig. 3, the device comprises:
a category center determining module 310, configured to determine a category center based on sample data acquired in advance; the sample data comprises a control group sample and a sample to be detected;
the cluster analysis module 320 is used for performing adaptive cluster analysis on the sample data according to the category center to generate a cluster result;
and the classification module 330 is configured to classify the clustering result according to the genotype characteristics of the SNP, and determine the genotype of the sample to be detected.
In one embodiment, the apparatus further comprises: the acquisition module is used for acquiring fluorescence intensity data of each channel endpoint aiming at the SNP locus; and preprocessing the end point fluorescence intensity data to generate sample data.
The control group-based semi-supervised SNP typing device provided by the embodiment of the application may be specific hardware on a piece of equipment, or software or firmware installed on it. The device has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the device embodiments omit a detail, reference may be made to the corresponding content in the method embodiments. Those skilled in the art will understand that, for convenience and brevity, the specific working processes of the systems, devices, and units described above follow the corresponding processes in the method embodiments and are not repeated here. The control group-based semi-supervised SNP typing device provided by the embodiment of the application has the same technical features as the control group-based semi-supervised SNP typing method provided by the above embodiment, so it solves the same technical problems and achieves the same technical effects.
The embodiment of the application further provides an electronic device, and specifically, the electronic device comprises a processor and a storage device; the storage means has stored thereon a computer program which, when executed by the processor, performs the method of any of the above described embodiments.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device 400 includes: a processor 40, a memory 41, a bus 42 and a communication interface 43, wherein the processor 40, the communication interface 43 and the memory 41 are connected through the bus 42; the processor 40 is arranged to execute executable modules, such as computer programs, stored in the memory 41.
The Memory 41 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 43 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, etc. may be used.
The bus 42 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 4, but that does not indicate only one bus or one type of bus.
The memory 41 stores a program, which the processor 40 executes upon receiving an execution instruction. The method performed by the apparatus defined by the flow disclosed in any of the foregoing embodiments of the present invention may be applied to, or implemented by, the processor 40.
The processor 40 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 40 or by instructions in the form of software. The processor 40 may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 41; the processor 40 reads the information in the memory 41 and completes the steps of the above method in combination with its hardware.
Corresponding to the method, the embodiment of the application also provides a computer readable storage medium storing machine executable instructions which, when called and executed by a processor, cause the processor to execute the steps of the method.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
It should be noted that: like reference numbers and letters indicate like items in the figures, and thus once an item is defined in a figure, it need not be further defined or explained in subsequent figures, and moreover, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A semi-supervised SNP typing method based on a control group is characterized by comprising the following steps:
determining a category center based on pre-collected sample data; the sample data comprises a control group sample and a sample to be detected;
according to the category center, performing self-adaptive clustering analysis on the sample data to generate a clustering result;
and classifying the clustering result according to the genotype characteristics of the SNP to determine the genotype of the sample to be detected.
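The final step of claim 1 can be illustrated with a minimal sketch. It assumes two-channel, normalised signals in which the x value tracks allele 1 and the y value tracks allele 2, and it labels each cluster by the distance and angle of its center from the origin. The function names, the blank radius and the angle band are hypothetical illustration values, not taken from the claims.

```python
import math

def label_cluster(center, blank_radius=0.2, het_band=(30.0, 60.0)):
    """Map a cluster center (x = allele-1 signal, y = allele-2 signal) to a call.

    blank_radius and het_band are assumed thresholds for illustration only.
    """
    x, y = center
    # A center close to the origin carries almost no signal in either channel.
    if math.hypot(x, y) < blank_radius:
        return "blank"
    angle = math.degrees(math.atan2(y, x))
    if angle < het_band[0]:
        return "homozygous allele 1"
    if angle > het_band[1]:
        return "homozygous allele 2"
    return "heterozygous"

def type_samples(assignments, centers):
    """Give every sample the call of the cluster it was assigned to."""
    cluster_calls = [label_cluster(c) for c in centers]
    return [cluster_calls[k] for k in assignments]
```

For example, type_samples([0, 1], [(1.0, 0.05), (0.8, 0.8)]) labels the first sample as homozygous for allele 1 and the second as heterozygous.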
2. The method of claim 1, wherein prior to the step of determining a class center based on pre-collected sample data, the method further comprises:
collecting endpoint fluorescence intensity data of each channel for the SNP locus;
and preprocessing the endpoint fluorescence intensity data to generate the sample data.
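Claim 2 does not state what the preprocessing consists of. The sketch below shows one plausible choice, offered only as an assumption: subtract the mean signal of the blank control wells from each channel, then scale each channel by its largest absolute value. The function name and data layout are hypothetical.

```python
def preprocess(raw, blank_ids):
    """Baseline-correct and scale two-channel endpoint intensities.

    raw:       dict mapping sample id -> (channel_1, channel_2) intensity
    blank_ids: ids of blank control wells used as the baseline (assumed)
    """
    if not blank_ids:
        raise ValueError("at least one blank control is needed for the baseline")
    n = len(blank_ids)
    base1 = sum(raw[i][0] for i in blank_ids) / n
    base2 = sum(raw[i][1] for i in blank_ids) / n
    # Subtract the blank baseline from every sample.
    corrected = {s: (a - base1, b - base2) for s, (a, b) in raw.items()}
    # Scale each channel independently; fall back to 1.0 if a channel is flat.
    scale1 = max(abs(a) for a, _ in corrected.values()) or 1.0
    scale2 = max(abs(b) for _, b in corrected.values()) or 1.0
    return {s: (a / scale1, b / scale2) for s, (a, b) in corrected.items()}
```

With this scaling the blank wells land near the origin, which keeps the later distance computations comparable across the two channels.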
3. The method of claim 1, wherein the step of determining the class center based on pre-collected sample data comprises:
determining the known category of the control group sample based on the pre-collected sample data; known classes of the control group samples include: blank samples, homozygotes, and heterozygotes;
determining a first class center based on the known class of the control group samples.
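As a minimal sketch of claim 3, the first class centers can be taken as the per-class mean of the control group samples. Using the mean is an assumption, since the claim only says the centers are determined from the known classes. The function name and data layout are hypothetical.

```python
from collections import defaultdict

def first_class_centers(points, control_labels):
    """Average the control samples of each known class.

    points:         dict mapping sample id -> (x, y) preprocessed signal
    control_labels: dict mapping control sample id -> known class name
    """
    groups = defaultdict(list)
    for sample_id, label in control_labels.items():
        groups[label].append(points[sample_id])
    return {
        label: (
            sum(p[0] for p in members) / len(members),
            sum(p[1] for p in members) / len(members),
        )
        for label, members in groups.items()
    }
```

Samples to be detected are present in points but are ignored here, because only the control group carries known classes.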
4. The method of claim 3, wherein the step of determining the class center based on pre-collected sample data further comprises:
determining the number of unknown classes based on pre-collected sample data and the predetermined total number of classes;
and determining a second class center for each unknown class based on the maximum-minimum distance principle.
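The selection in claim 4 can be sketched as follows, assuming Euclidean distance and assuming that candidate centers are drawn from the samples themselves. The number of unknown classes is taken as the predetermined total minus the number of known centers. The function name is hypothetical.

```python
import math

def second_class_centers(points, known_centers, total_classes):
    """Pick centers for unknown classes by the max-min distance rule.

    Each new center is the sample whose distance to its nearest existing
    center is the largest among all samples.
    """
    if not known_centers:
        raise ValueError("at least one known class center is required")
    n_unknown = max(total_classes - len(known_centers), 0)
    centers = list(known_centers)
    new_centers = []
    for _ in range(n_unknown):
        best = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        new_centers.append(best)
        centers.append(best)  # later picks must also stay far from this one
    return new_centers
```

Because every pick is added to the list of existing centers, successive second class centers are pushed apart from each other as well as from the control-derived centers.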
5. The method of claim 4, wherein the step of performing adaptive cluster analysis on the sample data according to the class center to generate a cluster result comprises:
and performing self-adaptive clustering analysis on the sample data according to the first class center and the second class center in combination with a predetermined setting mode to generate an optimal clustering result.
6. The method of claim 5, wherein the setting mode includes allowing changes and disallowing changes;
when the setting mode of the control group is that changes are allowed, the class to which each sample in the control group belongs is allowed to change during clustering; and when the setting mode of the control group is that changes are not allowed, the class to which each sample in the control group belongs is not allowed to change during clustering.
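The effect of the setting mode can be sketched with a k-means-style loop. This loop is a stand-in chosen for illustration, since the claims do not specify the adaptive clustering procedure itself. All names are hypothetical; allow_change plays the role of the setting mode.

```python
import math

def cluster(points, centers, control_labels, allow_change, max_iter=100):
    """Assign samples to centers, optionally locking control samples.

    points:         list of (x, y) samples
    centers:        list of initial (x, y) class centers
    control_labels: dict mapping point index -> class index for controls
    allow_change:   False keeps every control sample in its known class
    """
    centers = [tuple(c) for c in centers]
    assign = [None] * len(points)
    for _ in range(max_iter):
        new_assign = []
        for i, p in enumerate(points):
            if not allow_change and i in control_labels:
                new_assign.append(control_labels[i])  # locked control sample
            else:
                new_assign.append(
                    min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                )
        if new_assign == assign:
            break  # assignments are stable
        assign = new_assign
        # Recompute each center as the mean of its current members.
        for k in range(len(centers)):
            members = [points[i] for i, a in enumerate(assign) if a == k]
            if members:
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assign, centers
```

With points [(0.0, 0.0), (1.0, 0.0), (0.4, 0.0), (0.9, 0.1)], initial centers [(0.0, 0.0), (1.0, 0.0)] and controls {0: 0, 1: 1, 2: 1}, the third sample stays in class 1 when changes are not allowed and moves to class 0 when they are.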
7. A semi-supervised SNP typing device based on a control group, comprising:
the class center determining module is used for determining a class center based on pre-collected sample data; the sample data comprises a control group sample and a sample to be detected;
the cluster analysis module is used for carrying out self-adaptive cluster analysis on the sample data according to the category center to generate a cluster result;
and the classification module is used for classifying the clustering result according to the genotype characteristics of the SNP to determine the genotype of the sample to be detected.
8. The apparatus of claim 7, further comprising:
the acquisition module is used for acquiring fluorescence intensity data of each channel endpoint aiming at the SNP locus; and preprocessing the end point fluorescence intensity data to generate sample data.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program operable on the processor, and wherein the processor implements the steps of the method of any of claims 1 to 6 when executing the computer program.
10. A computer readable storage medium having stored thereon machine executable instructions which, when invoked and executed by a processor, cause the processor to execute the method of any of claims 1 to 6.
CN202210002889.1A 2022-01-04 2022-01-04 Semi-supervised SNP (single nucleotide polymorphism) typing method and device based on control group and electronic equipment Pending CN114300045A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210002889.1A CN114300045A (en) 2022-01-04 2022-01-04 Semi-supervised SNP (single nucleotide polymorphism) typing method and device based on control group and electronic equipment

Publications (1)

Publication Number Publication Date
CN114300045A true CN114300045A (en) 2022-04-08

Family

ID=80974975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210002889.1A Pending CN114300045A (en) 2022-01-04 2022-01-04 Semi-supervised SNP (single nucleotide polymorphism) typing method and device based on control group and electronic equipment

Country Status (1)

Country Link
CN (1) CN114300045A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116525000A (en) * 2023-07-04 2023-08-01 北京市农林科学院 Crop variety genotyping method and device compatible with multiple fluorescent signal platforms
CN116525000B (en) * 2023-07-04 2023-09-26 北京市农林科学院 Crop variety genotyping method and device compatible with multiple fluorescent signal platforms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination