WO2020001663A2 - Gene sequencing result type detection method and apparatus, device, and storage medium

Info

Publication number
WO2020001663A2
Authority
WO
WIPO (PCT)
Prior art keywords
type
sequencing result
gene
peak shape
feature data
Prior art date
Application number
PCT/CN2019/101096
Other languages
French (fr)
Chinese (zh)
Other versions
WO2020001663A3 (en
Inventor
赵文妍
段广有
金亮
闵文波
顾思健
葛毅
廖国娟
Original Assignee
苏州金唯智生物科技有限公司
Priority date
Filing date
Publication date
Application filed by 苏州金唯智生物科技有限公司
Publication of WO2020001663A2
Publication of WO2020001663A3

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • The present application relates to the technical field of gene detection, for example, to a method, an apparatus, a device, and a storage medium for detecting the type of a gene sequencing result.
  • Gene sequencing technology is the technology for determining the sequence of deoxyribonucleic acid (DNA).
  • In molecular biology research, sequence analysis of DNA is the basis for studying and modifying a target gene.
  • During DNA sequencing, the sequencing results may be unusable owing to various factors. Therefore, it is necessary to determine whether the DNA sequencing results are usable.
  • Whether a gene sequencing result is good or bad is mainly judged manually, and there may be thousands to millions of sequencing results every day. A large number of technicians are needed to view and analyze the peak maps of the sequencing results, which wastes a great deal of manpower, material resources, and financial resources and is inefficient. In addition, because the judgment is manual, inconsistent judgment criteria between people cause sequencing results to be misjudged, making the determination results inaccurate.
  • The embodiments of the present application provide a method, an apparatus, a device, and a storage medium for detecting the type of a gene sequencing result, so as to determine the type of a gene sequencing result automatically, save manpower, material resources, and financial resources, and improve determination efficiency and accuracy.
  • An embodiment of the present application provides a method for detecting the type of a gene sequencing result, including:
  • obtaining a peak shape map of the sequencing result of the gene to be tested;
  • extracting feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map;
  • inputting the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
  • An embodiment of the present application further provides an apparatus for detecting the type of a gene sequencing result, the apparatus including:
  • a peak shape acquisition module configured to obtain a peak shape map of the sequencing result of the gene to be tested;
  • a feature extraction module configured to extract feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map;
  • a type detection module configured to input the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
  • An embodiment of the present application further provides a computer device, the device including:
  • one or more processors;
  • a memory configured to store one or more programs;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement the method for detecting the type of a gene sequencing result described in any embodiment of the present application.
  • An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for detecting the type of a gene sequencing result described in any embodiment of the present application is implemented.
  • FIG. 1 is a schematic flowchart of a method for detecting a type of a gene sequencing result provided in Embodiment 1 of the present application;
  • FIG. 2 is a schematic flowchart of a method for establishing a type detection model provided in Embodiment 2 of the present application;
  • FIG. 3 is a schematic structural diagram of a detection device for a type of gene sequencing result provided in Embodiment 3 of the present application;
  • FIG. 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application.
  • FIG. 1 is a schematic flowchart of a method for detecting a type of a gene sequencing result provided in Embodiment 1 of the present application.
  • The method is applicable to determining the quality of gene sequencing results.
  • The method may be performed by an apparatus for detecting the type of a gene sequencing result.
  • The apparatus may be implemented in hardware and/or software and may be integrated in a computer or in any terminal with data processing functions.
  • The method includes the following steps:
  • The peak shape map obtained in this embodiment may be the peak shape map of a first-generation sequencing result of the gene to be tested.
  • A peak shape map file corresponding to the sequencing result can be obtained by performing first-generation sequencing on the gene to be tested, and the peak shape map can then be obtained from that file.
  • The peak shape map file is an ab1 format file.
  • The quality of the peak shape map directly affects whether the gene sequencing result can be used. For example, under normal circumstances, if there is no abnormal peak shape in the peak shape map, the corresponding gene sequencing result is usable; if there is an abnormal peak shape in the peak shape map, the corresponding gene sequencing result is abnormal, and the sequencing result needs to be analyzed or the gene re-sequenced directly.
  • The feature data is data that reflects the characteristics of the peaks in the peak shape map and/or the sequence information corresponding to the sequencing result of the gene to be tested.
  • The feature data includes, but is not limited to, the base sequence, base quality values, peak width values, the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity.
  • A specific script file is run to extract the feature data from the peak shape map.
  • The script file includes different feature extraction algorithms corresponding to different feature data.
  • The feature extraction algorithms contained in the script can be used to extract the corresponding feature data.
  • The purpose of extracting the feature data is to provide a basis for the subsequent step, in which a type detection model determines the type of the sequencing result of the gene to be tested, so that the type of the gene sequencing result is determined automatically.
  • Using the feature data as the basis for determining the type of the gene sequencing result also improves accuracy compared with determination by manual observation of the peak shapes.
  • Extracting feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map includes: extracting feature data from the peak shape map file; and/or processing the peak shape map according to a preset feature extraction algorithm to obtain the corresponding feature data.
  • Feature data can be extracted directly from the peak shape map file obtained by first-generation sequencing, such as sequence information, base quality values, and peak width values; alternatively, one or more items of feature data corresponding to the peak shape map can be obtained through one or more preset feature extraction algorithms, such as the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity.
  • For example, the sequence information, base quality values, and peak width values contained in the parameter data of the ab1 format file are extracted directly from the file; as another example, for the peak shape map of the sequencing result of the gene to be tested, a script file containing a base signal intensity extraction algorithm is run to obtain the signal intensity of each base in the peak shape map.
  • The feature data is input into a type detection model, and a type that matches the sequencing result of the gene to be tested is obtained.
  • The type detection model recognizes the input feature data in order to identify the type that matches the sequencing result of the gene to be tested.
  • The types that match the sequencing result of the gene to be tested may include a normal type and an abnormal type.
  • The normal type may be a type in which there is no abnormal peak shape in the peak shape map of the sequencing result of the gene to be tested.
  • The abnormal type may be a type in which there is an abnormal peak shape in the peak shape map of the gene sequencing result.
  • The abnormal type may include one or more of the following types: a poly structure or repeated sequence type, a spectral pull-up type, a diffusion type, a bubble type, an attenuation type, a peak set type, a no-signal type, a primer problem type, and an interruption type.
  • For example, after recognizing the input feature data, the type detection model outputs either the normal type or the abnormal type. As another example, after recognizing the input feature data, the type detection model outputs any one of the normal type, the poly structure or repeated sequence type, the spectral pull-up type, the diffusion type, the bubble type, the attenuation type, the peak set type, the no-signal type, the primer problem type, or the interruption type.
  • The type output by the type detection model is identified based on the feature data.
  • The working principle of the type detection model may be as follows: when feature data is input, the model recognizes it, determines which type the input feature data belongs to, and then outputs that type. For example, the average base quality, the length of the longest contiguous base fragment whose base quality is greater than 20, and the average signal intensity are extracted from the peak shape map of a first-generation sequencing result (taking a peak shape map of the diffusion type as an example) and input into the type detection model. The model identifies and analyzes these three items of feature data and determines that the type of the gene sequencing result is the diffusion type. This shows that the gene sequencing result is abnormal and the sequencing failed: the peaks in the peak shape map are too wide to be recognized. A possible cause is an excessively high DNA concentration, which in turn provides a basis for adjusting the sequencing scheme when the gene is re-sequenced.
  • In the technical solution of this embodiment, the feature data corresponding to the sequencing result of the gene to be tested is extracted from the obtained peak shape map, the feature data is input into the type detection model, and the type that matches the sequencing result of the gene to be tested is finally obtained. Using the feature data extracted from the peak shape map together with the type detection model solves the problems of the manual determination method in the related art, namely wasted manpower, material, and financial resources, low efficiency, and low determination accuracy. It realizes automatic determination of the type of a gene sequencing result, saves manpower, material, and financial resources, and improves determination efficiency and accuracy.
  • FIG. 2 is a schematic flowchart of a method for establishing a type detection model provided in Embodiment 2 of the present application. This embodiment builds on the foregoing embodiment. Before the feature data is input into the type detection model to obtain a type that matches the sequencing result of the gene to be tested, the method further includes the following steps:
  • The peak shape map samples may be obtained from first-generation sequencing results.
  • The peak shape map samples may be selected from a database of historical sequencing results. For example, multiple peak shape maps of different types are selected from the historical results of first-generation sequencing, and these peak shape maps are then classified and marked with the corresponding classification labels, yielding peak shape map samples of standard gene sequencing results of multiple different types. In an embodiment, there are multiple peak shape map samples for each type of standard gene sequencing result.
  • The obtained peak shape maps may be classified by manual determination. For example, if manual inspection finds no abnormal peak shape in a peak shape map, the map is determined to be of the normal type and is marked with the normal type label, serving as a peak shape map sample of the standard gene sequencing result corresponding to the normal type.
  • The other peak shape maps, which contain abnormal peak shapes, are marked with the abnormal type label and serve as peak shape map samples of the standard gene sequencing results corresponding to the abnormal type.
  • A feature data sample is data that reflects the peak characteristics in a peak shape map sample and/or the sequence information corresponding to a standard gene sequencing result.
  • The feature data sample includes, but is not limited to, the base sequence, base quality values, peak width values, the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity.
  • Feature data may be extracted directly from the file of an obtained peak shape map sample, such as sequence information, base quality values, and peak width values; alternatively, one or more items of feature data corresponding to the peak shape map sample may be obtained through one or more preset feature extraction algorithms, such as the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity.
  • For example, the sequence information, base quality values, and peak width values contained in the parameter data of the ab1 format file of a peak shape map sample are extracted directly from the file; as another example, for a peak shape map sample, a script file containing a base signal intensity extraction algorithm is run to obtain the signal intensity of each base in the peak shape map.
  • S230: Train the set classification algorithm model using the feature data samples to obtain the type detection model.
  • The classification algorithm model in this embodiment may be a training model based on a multi-classification algorithm.
  • The multi-classification algorithm includes, but is not limited to, the gradient boosting tree classification (Gradient Boosting Classifier) algorithm, the decision tree classification (Decision Tree Classifier) algorithm, the extremely randomized tree classification (Extra Tree Classifier) algorithm, and the random forest classification (Random Forest Classifier) algorithm.
  • Training the classification algorithm model may be a process of adjusting multiple model parameters. After continuous training, the optimal model parameters are obtained, and the classification algorithm model with the optimal model parameters is the final model to be obtained.
  • The classification algorithm model is trained using the multiple feature data samples, and its model parameters are adjusted continuously so that the model acquires the ability to determine the type of input feature data, thereby yielding the type detection model.
  • The classification algorithm model can be chosen according to the kinds of data in the feature data samples. For example, for feature data samples comprising three kinds of data (the average base quality, the length of the longest contiguous base fragment whose base quality is greater than 20, and the average signal intensity), a model based on any of the Gradient Boosting Classifier, Decision Tree Classifier, Extra Tree Classifier, and Random Forest Classifier algorithms can be selected as the set classification algorithm model.
  • Training the set classification algorithm model using the feature data samples to obtain a type detection model includes: training a plurality of different classification algorithm models using the feature data samples; obtaining the recognition accuracy of each of the plurality of different classification algorithm models after a set number of trainings; and determining the classification algorithm model with the highest recognition accuracy as the type detection model.
  • Multiple different classification algorithm models may be trained at the same time, and the model with the highest recognition accuracy among them is selected as the type detection model.
  • For example, feature data samples comprising three kinds of data (the average base quality, the length of the longest contiguous base fragment whose base quality is greater than 20, and the average signal intensity) are used to train the classification algorithm models, and after a set number of trainings the recognition accuracy corresponding to each of the multiple classification algorithm models is obtained.
  • Classification algorithm models whose recognition accuracy is above 60% are all optional classification algorithm models.
  • The classification algorithm model with the highest recognition accuracy in Table 1, for example the training model based on the Gradient Boosting Classifier algorithm, may also be selected as the training model used when the input feature data consists of the average base quality, the length of the longest contiguous base fragment whose base quality is greater than 20, and the average signal intensity.
  • In the technical solution of this embodiment, peak shape map samples of standard gene sequencing results corresponding to multiple types are obtained, the corresponding feature data is extracted from the peak shape map samples as feature data samples, and the set classification algorithm model is then trained using the feature data samples to obtain the type detection model. This establishes the type detection model, provides a model basis for automatically determining the type of a gene sequencing result, and improves determination accuracy.
  • FIG. 3 is a schematic structural diagram of an apparatus for detecting the type of a gene sequencing result provided in Embodiment 3 of the present application.
  • The apparatus for detecting the type of a gene sequencing result includes a peak shape acquisition module 310, a feature extraction module 320, and a type detection module 330. Each module is described below.
  • The peak shape acquisition module 310 is configured to obtain a peak shape map of the sequencing result of the gene to be tested; the feature extraction module 320 is configured to extract feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map; the type detection module 330 is configured to input the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
  • The apparatus may further include: a sample acquisition module configured to acquire peak shape map samples of standard gene sequencing results corresponding to multiple types before the feature data is input into the type detection model to obtain the type that matches the sequencing result of the gene to be tested;
  • a data extraction module configured to extract, from the peak shape map sample of each standard gene sequencing result, the feature data corresponding to that standard gene sequencing result as feature data samples;
  • a model training module configured to train a set classification algorithm model using the feature data samples to obtain the type detection model.
  • The model training module is configured to: train a plurality of different classification algorithm models using the feature data samples; obtain the recognition accuracy of each of the plurality of different classification algorithm models after a set number of trainings; and determine the classification algorithm model with the highest recognition accuracy as the type detection model.
  • The feature extraction module 320 is configured to: extract feature data from the peak shape map file; and/or process the peak shape map according to a preset feature extraction algorithm to obtain the corresponding feature data.
  • The types include a normal type and an abnormal type.
  • The abnormal type includes one or more of the following types: a poly structure or repeated sequence type, a spectral pull-up type, a diffusion type, a bubble type, an attenuation type, a peak set type, a no-signal type, a primer problem type, and an interruption type.
  • The above apparatus can execute the method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing the method.
  • FIG. 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application.
  • The computer device provided in this embodiment includes a processor 41 and a memory 42.
  • There may be one or more processors in the computer device.
  • One processor 41 is used as an example.
  • The processor 41 and the memory 42 in the computer device may be connected through a bus or in other ways; a bus connection is used as an example.
  • The processor 41 of the computer device in this embodiment integrates the apparatus for detecting the type of a gene sequencing result provided in the above embodiment.
  • The memory 42 in the computer device serves as a computer-readable storage medium and may be configured to store one or more programs.
  • The programs may be software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for detecting the type of a gene sequencing result in the embodiments of the present application (for example, the modules of the detection apparatus, including the peak shape acquisition module 310, the feature extraction module 320, and the type detection module 330).
  • The processor 41 executes the various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 42, thereby implementing the method for detecting the type of a gene sequencing result in the foregoing method embodiment.
  • The memory 42 may include a program storage area and a data storage area.
  • The program storage area may store an operating system and the application programs required for at least one function; the data storage area may store data created through use of the device, and the like.
  • The memory 42 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • The memory 42 may include memory located remotely from the processor 41, and such remote memory may be connected to the device through a network. Examples of such a network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • When the one or more programs included in the computer device are executed by the one or more processors 41, the programs perform the following operations: obtaining a peak shape map of the sequencing result of the gene to be tested; extracting the feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map; and inputting the feature data into the type detection model to obtain a type that matches the sequencing result of the gene to be tested.
  • Embodiment 5 of the present application further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by an apparatus for detecting the type of a gene sequencing result, it implements the method for detecting the type of a gene sequencing result provided in Embodiment 1 of the present application. The method includes: obtaining a peak shape map of the sequencing result of the gene to be tested; extracting feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map; and inputting the feature data into the type detection model to obtain a type that matches the sequencing result of the gene to be tested.
  • When the computer program stored on the computer-readable storage medium provided in the embodiments of the present application is executed, it is not limited to the method operations described above, and can also implement the relevant operations in the method for detecting the type of a gene sequencing result provided by any embodiment of the present application.
  • The present application may be implemented by software together with general-purpose hardware, and may also be implemented by hardware alone.
  • The technical solution of the present application can be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a computer's floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and which includes multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in any embodiment of this application.
  • The units and modules included in the apparatus are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be realized; in addition, the names of the functional units are only for ease of distinguishing them from one another and are not used to limit the protection scope of the present application.
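
The model selection described for Embodiment 2 (train several classification algorithm models on the feature data samples, then keep the one with the highest recognition accuracy) can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation. The function names, the two toy classifiers, and the sample values are all hypothetical, and scoring on a separate held-out set is an assumption, since the text only says that accuracy is obtained after a set number of trainings. In practice the candidates could be any objects exposing `fit` and `predict`, for example scikit-learn's `GradientBoostingClassifier`, `DecisionTreeClassifier`, `ExtraTreesClassifier`, and `RandomForestClassifier`.

```python
from collections import Counter


class MajorityClass:
    """Toy baseline (stand-in for a real classifier): always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)


class NearestCentroid:
    """Toy classifier (stand-in for a real classifier): predicts the label of the closest class mean."""

    def fit(self, X, y):
        counts = Counter(y)
        sums = {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        self.centroids_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def squared_distance(row, centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))

        return [
            min(self.centroids_, key=lambda label: squared_distance(row, self.centroids_[label]))
            for row in X
        ]


def accuracy(model, X, y):
    """Fraction of samples whose predicted type equals the labeled type."""
    predictions = model.predict(X)
    return sum(1 for p, t in zip(predictions, y) if p == t) / len(y)


def select_type_detection_model(candidates, X_train, y_train, X_val, y_val):
    """Train every candidate and return the name, model, and scores of the most accurate one."""
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy(model, X_val, y_val)
    best_name = max(scores, key=scores.get)
    return best_name, candidates[best_name], scores


# Made-up feature rows: [average base quality, longest Q>20 run, average signal intensity].
X_train = [[40, 600, 900], [38, 550, 850], [12, 20, 300], [10, 15, 250], [39, 580, 880]]
y_train = ["normal", "normal", "diffusion", "diffusion", "normal"]
X_val = [[37, 500, 800], [11, 18, 280]]
y_val = ["normal", "diffusion"]

candidates = {"majority": MajorityClass(), "nearest_centroid": NearestCentroid()}
best_name, best_model, scores = select_type_detection_model(
    candidates, X_train, y_train, X_val, y_val
)
```

Because the selection function relies only on `fit` and `predict`, swapping the toy classifiers for real multi-class models requires no change to the selection logic.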

Abstract

Disclosed are a gene sequencing result type detection method and apparatus, a device, and a storage medium. The method comprises: obtaining a peak map of a gene sequencing result to be detected; according to the peak map, extracting characteristic data corresponding to the gene sequencing result to be detected; inputting the characteristic data into a type detection model, and obtaining a type matched with the gene sequencing result to be detected.

Description

Detection method, apparatus, device, and storage medium for gene sequencing result type
This application claims priority to Chinese patent application No. 201810675765.3, filed with the Chinese Patent Office on June 27, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of gene detection, for example, to a method, an apparatus, a device, and a storage medium for detecting the type of a gene sequencing result.
Background
As an important topic in promoting human development, gene testing is receiving increasing attention from many countries around the world. In particular, gene sequencing technology is entering a period of rapid development.
Gene sequencing technology is the technology for determining the sequence of deoxyribonucleic acid (DNA). In molecular biology research, sequence analysis of DNA is the basis for studying and modifying a target gene. During DNA sequencing, the sequencing results may be unusable owing to various factors. Therefore, it is necessary to determine whether the DNA sequencing results are usable.
Whether a gene sequencing result is good or bad is mainly judged manually, and there may be thousands to millions of sequencing results every day. A large number of technicians are needed to view and analyze the peak maps of the sequencing results, which wastes a great deal of manpower, material resources, and financial resources and is inefficient. In addition, because the judgment is manual, inconsistent judgment criteria between people cause sequencing results to be misjudged, making the determination results inaccurate.
Summary of the invention
The embodiments of the present application provide a method, an apparatus, a device, and a storage medium for detecting the type of a gene sequencing result, so as to determine the type of a gene sequencing result automatically, save manpower, material resources, and financial resources, and improve determination efficiency and accuracy.
An embodiment of the present application provides a method for detecting the type of a gene sequencing result, including:
obtaining a peak shape map of the sequencing result of the gene to be tested;
extracting feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map;
inputting the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
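The feature-extraction step of the method above can be illustrated with a minimal Python sketch. It computes three of the features this document names: the average base quality, the length of the longest contiguous run of bases whose quality is greater than 20, and the average signal intensity. The function names and the input layout (a base string, per-base quality values, per-base peak positions, and one trace per channel) are hypothetical, and taking a base's signal intensity as the value of its own channel at its peak position is an assumption; the application does not specify the extraction algorithm. Reading these arrays from an actual trace file is left out.

```python
from statistics import mean

QUALITY_THRESHOLD = 20  # the text refers to base quality greater than 20


def average_base_quality(qualities):
    """Mean of the per-base quality values (0.0 for an empty read)."""
    return mean(qualities) if qualities else 0.0


def longest_high_quality_run(qualities, threshold=QUALITY_THRESHOLD):
    """Length of the longest contiguous run of bases with quality above the threshold."""
    best = current = 0
    for quality in qualities:
        current = current + 1 if quality > threshold else 0
        best = max(best, current)
    return best


def base_signal_intensities(bases, peak_positions, traces):
    """Assumed definition: value of the called base's channel at its peak position.

    Bases without a matching channel (for example N) are skipped.
    """
    return [
        traces[base][position]
        for base, position in zip(bases, peak_positions)
        if base in traces
    ]


def extract_features(bases, qualities, peak_positions, traces):
    """Bundle the three features into one record for a downstream model."""
    signals = base_signal_intensities(bases, peak_positions, traces)
    return {
        "avg_quality": average_base_quality(qualities),
        "longest_q20_run": longest_high_quality_run(qualities),
        "avg_signal": mean(signals) if signals else 0.0,
    }
```

For a four-base read `"ACGT"` with qualities `[10, 30, 35, 15]`, the longest run above the threshold covers the two middle bases, so `longest_q20_run` is 2 and `avg_quality` is 22.5.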
An embodiment of the present application further provides an apparatus for detecting the type of a gene sequencing result, the apparatus including:
a peak shape acquisition module configured to obtain a peak shape map of the sequencing result of the gene to be tested;
a feature extraction module configured to extract feature data corresponding to the sequencing result of the gene to be tested according to the peak shape map;
a type detection module configured to input the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
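One way to picture how the three modules cooperate is the following Python sketch. The class name, field names, and the three demo functions are hypothetical; in particular, the threshold rule in `demo_detect` is only a placeholder standing in for a trained type detection model, not the model described in this application.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class TypeDetectionApparatus:
    """Wires the three modules together; each module is a plain callable."""

    acquire_peak_map: Callable[[str], Any]               # peak shape acquisition module
    extract_features: Callable[[Any], Dict[str, float]]  # feature extraction module
    detect_type: Callable[[Dict[str, float]], str]       # type detection module

    def run(self, source: str) -> str:
        """Run acquisition, feature extraction, and type detection in order."""
        peak_map = self.acquire_peak_map(source)
        features = self.extract_features(peak_map)
        return self.detect_type(features)


def demo_acquire(source: str) -> Dict[str, list]:
    # Placeholder: a real module would read the peak shape map file at `source`.
    return {"qualities": [30, 32, 8, 40]}


def demo_extract(peak_map: Dict[str, list]) -> Dict[str, float]:
    qualities = peak_map["qualities"]
    return {"avg_quality": sum(qualities) / len(qualities)}


def demo_detect(features: Dict[str, float]) -> str:
    # Placeholder rule standing in for a trained type detection model.
    return "normal" if features["avg_quality"] > 20 else "abnormal"


apparatus = TypeDetectionApparatus(demo_acquire, demo_extract, demo_detect)
result = apparatus.run("sample.ab1")  # hypothetical file name
```

Keeping each module behind a callable means any one of them can be replaced, for instance by a trained classifier in place of `demo_detect`, without touching the other two.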
本申请实施例还提供了一种计算机设备,该设备包括:An embodiment of the present application further provides a computer device, and the device includes:
一个或多个处理器;One or more processors;
存储器,设置为存储一个或多个程序;Memory, set to store one or more programs;
当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如本申请实施例中任一所述的基因测序结果类型的检测方法。When the one or more programs are executed by the one or more processors, the one or more processors enable the method for detecting a type of gene sequencing result described in any one of the embodiments of the present application.
本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如本申请实施例中任一所述的基因测序结果类型的检测方法。An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for detecting a type of gene sequencing result described in any of the embodiments of the present application is implemented.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a method for detecting the type of a gene sequencing result provided in Embodiment 1 of the present application;
FIG. 2 is a schematic flowchart of a method for establishing a type detection model provided in Embodiment 2 of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for detecting the type of a gene sequencing result provided in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application.
Detailed Description
The present application is described below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the present application and do not limit it. For ease of description, the drawings show only the structures relevant to the present application rather than all structures.
Embodiment 1
FIG. 1 is a schematic flowchart of a method for detecting the type of a gene sequencing result provided in Embodiment 1 of the present application. The method is applicable to determining whether a gene sequencing result is good or bad. It may be performed by an apparatus for detecting the type of a gene sequencing result; the apparatus may consist of hardware and/or software and may be integrated into a computer or any terminal with data processing capability. The method includes the following steps:
S110. Obtain a peak shape diagram of the sequencing result of the gene to be tested.
The peak shape diagram obtained in this embodiment may be the one produced by first-generation sequencing of the gene to be tested. For example, first-generation sequencing of the gene to be tested yields a peak shape diagram file corresponding to the sequencing result, and the peak shape diagram is then obtained from that file. In one embodiment, the peak shape diagram file is an ab1 format file.
In one embodiment, because different gene sequences may correspond to different peak shape diagrams, the quality of the peak shape diagram directly affects whether the gene sequencing result can be used. Typically, if the peak shape diagram contains no abnormal peak shape, the corresponding gene sequencing result is usable; if it contains an abnormal peak shape, the corresponding gene sequencing result is abnormal, and the result must be analyzed or the sequencing repeated.
S120. Extract, according to the peak shape diagram, feature data corresponding to the sequencing result of the gene to be tested.
In this embodiment, the feature data is data that reflects the characteristics of the peaks in the peak shape diagram and/or sequence information data corresponding to the sequencing result of the gene to be tested. For example, the feature data includes, but is not limited to, the base sequence, base quality values, peak width values, the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity.
For example, the feature data in the peak shape diagram is extracted by running a specific script file. In one embodiment, the script file contains different feature extraction algorithms corresponding to different feature data, and running the script applies those algorithms to extract the corresponding feature data.
The purpose of extracting the feature data is to give the type detection model in the subsequent step a basis for determining the type of the sequencing result of the gene to be tested, so that the type is determined automatically. Using feature data as the basis for the determination, rather than manual inspection of the peak shape diagram, also improves the accuracy of the determination.
In one embodiment, extracting the feature data corresponding to the sequencing result of the gene to be tested according to the peak shape diagram includes: extracting feature data from the peak shape diagram file; and/or processing the peak shape diagram with a preset feature extraction algorithm to obtain the corresponding feature data.
For example, feature data such as sequence information, base quality values, and peak width values may be extracted directly from the peak shape diagram file produced by first-generation sequencing. Alternatively, one or more preset feature extraction algorithms may be applied to obtain one or more items of feature data for the peak shape diagram, such as the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity. As a practical example, the sequence information, base quality values, and peak width values contained in the parameter data of an ab1 format file are extracted directly from the file. As another example, a script file containing a base signal intensity extraction algorithm is run on the peak shape diagram of the sequencing result of the gene to be tested to obtain the signal intensity of each base in the peak shape diagram.
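The derived features named above can be sketched as follows. This is a minimal illustration, assuming the per-base quality values and per-base signal intensities have already been read from the peak shape diagram file; the function names and the sample values are hypothetical and are not taken from the application.

```python
from statistics import mean


def longest_high_quality_run(qualities, threshold=20):
    """Length of the longest contiguous run of bases with quality > threshold."""
    best = current = 0
    for q in qualities:
        # Extend the current run on a high-quality base, otherwise reset it.
        current = current + 1 if q > threshold else 0
        best = max(best, current)
    return best


def extract_features(qualities, signal_intensities):
    """Summarise one read as the three features discussed in the text."""
    return {
        "mean_base_quality": mean(qualities),
        "longest_q20_run": longest_high_quality_run(qualities),
        "mean_signal_intensity": mean(signal_intensities),
    }


# Hypothetical per-base values for a 10-base read (illustration only).
qualities = [12, 25, 30, 41, 18, 35, 36, 40, 22, 9]
signals = [100, 220, 310, 400, 90, 350, 360, 380, 240, 50]
features = extract_features(qualities, signals)
```

With these sample values the longest run of bases above quality 20 covers the four bases with qualities 35, 36, 40, and 22.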
S130. Input the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
In this embodiment, the type detection model identifies the input feature data in order to determine the type that matches the sequencing result of the gene to be tested. In one embodiment, the types that may match the sequencing result include a normal type and an abnormal type. The normal type may be one in which the peak shape diagram of the sequencing result contains no abnormal peak shape, and the abnormal type may be one in which the peak shape diagram contains an abnormal peak shape. In one embodiment, the abnormal type may include one or more of: a poly structure or repeat sequence type, a spectral pull-up type, a diffuse type, a bubble type, an attenuation type, an overlapping-peak type, a no-signal type, a primer problem type, and an interruption type. For example, after identifying the input feature data, the type detection model may output either the normal type or the abnormal type. As another example, it may output any one of the normal type, the poly structure or repeat sequence type, the spectral pull-up type, the diffuse type, the bubble type, the attenuation type, the overlapping-peak type, the no-signal type, the primer problem type, or the interruption type. The type that the model outputs is determined from the feature data.
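The result types listed above can be represented as a small enumeration. The sketch below is illustrative only: the class name, member names, and helper functions are assumptions made here, and the English labels are translations rather than official terminology.

```python
from enum import Enum


class ResultType(Enum):
    """Result types named in the text (English labels are translations)."""
    NORMAL = "normal"
    POLY_OR_REPEAT = "poly structure or repeat sequence"
    SPECTRAL_PULL_UP = "spectral pull-up"
    DIFFUSE = "diffuse"
    BUBBLE = "bubble"
    ATTENUATION = "attenuation"
    OVERLAPPING_PEAKS = "overlapping peaks"
    NO_SIGNAL = "no signal"
    PRIMER_PROBLEM = "primer problem"
    INTERRUPTION = "interruption"


def is_abnormal(result_type):
    """Every type other than NORMAL counts as abnormal."""
    return result_type is not ResultType.NORMAL


def to_binary_label(result_type):
    """Collapse the fine-grained types to the two-class normal/abnormal output."""
    return "abnormal" if is_abnormal(result_type) else "normal"
```

The same enumeration serves both output modes described in the text: the fine-grained type is the enum member itself, and the two-class output is obtained with `to_binary_label`.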
In one embodiment, the type detection model works as follows: when feature data is input, the model identifies the feature data, determines which type it belongs to, and outputs that type. For example, the average base quality, the length of the longest contiguous base fragment with base quality greater than 20, and the average signal intensity are extracted from the peak shape diagram of a first-generation sequencing result (here, a diffuse-type peak shape diagram) and input into the type detection model. After analyzing these three items of feature data, the model reports that the gene sequencing result is of the diffuse type. The result is therefore abnormal and the sequencing has failed: the peaks in the diagram are too wide to be resolved, possibly because the DNA concentration was too high. This information provides a basis for adjusting the sequencing protocol when the gene is sequenced again.
In the technical solution of this embodiment, feature data corresponding to the sequencing result of the gene to be tested is extracted from the obtained peak shape diagram and input into the type detection model, which yields the type that matches the sequencing result. By using feature data extracted from the peak shape diagram together with the type detection model, the solution addresses the waste of manpower, material resources, and financial resources, the low efficiency, and the low accuracy caused by manual determination in the related art. It determines the type of a gene sequencing result automatically, saves manpower, material resources, and financial resources, and improves the efficiency and accuracy of the determination.
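Steps S110 to S130 can be pictured as a small pipeline. The sketch below is illustrative only: `detect_result_type`, `toy_extract`, and `toy_model` are hypothetical names, the dictionary layout of `trace` is an assumption, and the threshold rule merely stands in for a trained type detection model. It is not the model described in this application.

```python
def detect_result_type(trace, extract_features, model):
    """Run the two processing steps on an already obtained trace (S110)."""
    features = extract_features(trace)  # S120: feature extraction
    return model(features)              # S130: type detection


def toy_extract(trace):
    """Hypothetical extractor: mean of the per-base signal intensities."""
    signals = trace["signals"]
    return {"mean_signal_intensity": sum(signals) / len(signals)}


def toy_model(features):
    """Placeholder rule, NOT a trained model: the cutoff of 50 is arbitrary."""
    if features["mean_signal_intensity"] < 50:
        return "no signal"
    return "normal"


weak = detect_result_type({"signals": [10, 20, 30]}, toy_extract, toy_model)
strong = detect_result_type({"signals": [300, 400]}, toy_extract, toy_model)
```

Passing the extractor and the model in as callables keeps the pipeline independent of which features or which classifier are eventually chosen.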
Embodiment 2
FIG. 2 is a schematic flowchart of a method for establishing a type detection model applicable to Embodiment 2 of the present application. This embodiment builds on the foregoing embodiment. Before the feature data is input into the type detection model to obtain the type that matches the sequencing result of the gene to be tested, the method further includes the following steps:
S210. Obtain peak shape diagram samples of standard gene sequencing results corresponding to each of multiple types.
In one embodiment, the peak shape diagram samples may be obtained from first-generation sequencing results. In one embodiment, the samples may be selected from a database of historical sequencing results. For example, multiple peak shape diagrams of different types are selected from historical first-generation sequencing results, classified, and marked with the corresponding classification labels, which yields peak shape diagram samples of standard gene sequencing results for multiple different types. In one embodiment, there are multiple peak shape diagram samples for each type of standard gene sequencing result. In one embodiment, the obtained peak shape diagrams may be classified manually. For example, if manual inspection finds no abnormal peak shape in a peak shape diagram, the diagram is judged to be of the normal type and is marked with the normal type label, serving as a peak shape diagram sample of a standard gene sequencing result for the normal type. Peak shape diagrams that contain abnormal peak shapes are marked with the abnormal type label and serve as peak shape diagram samples of standard gene sequencing results for the abnormal type.
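The labelling step can be sketched as grouping sample identifiers by their manually assigned label and checking that every type has more than one sample. The sample identifiers, labels, and function names below are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def group_by_label(labeled_samples):
    """Map each label to the list of sample ids carrying that label."""
    groups = defaultdict(list)
    for sample_id, label in labeled_samples:
        groups[label].append(sample_id)
    return dict(groups)


def underrepresented_labels(groups, minimum=2):
    """Labels with fewer than `minimum` samples, sorted for stable output."""
    return sorted(label for label, ids in groups.items() if len(ids) < minimum)


# Hypothetical manually labelled historical results.
labeled = [
    ("run_001", "normal"),
    ("run_002", "diffuse"),
    ("run_003", "normal"),
    ("run_004", "no signal"),
    ("run_005", "diffuse"),
]
groups = group_by_label(labeled)
too_few = underrepresented_labels(groups)
```

Here the "no signal" type has only one sample, so it would be flagged before training.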
S220. Extract, from the peak shape diagram sample of each standard gene sequencing result, the feature data corresponding to that standard gene sequencing result, as a feature data sample.
In one embodiment, the feature data sample is data that reflects the characteristics of the peaks in the peak shape diagram sample and/or sequence information data corresponding to the standard gene sequencing result. For example, the feature data sample includes, but is not limited to, the base sequence, base quality values, peak width values, the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity.
For example, feature data such as sequence information, base quality values, and peak width values may be extracted directly from the obtained peak shape diagram sample file. Alternatively, one or more preset feature extraction algorithms may be applied to obtain one or more items of feature data for the peak shape diagram sample, such as the length of the longest contiguous base fragment in the sequence whose base quality is greater than 20, the signal intensities of multiple bases in the sequence, and the average signal intensity. As a practical example, the sequence information, base quality values, and peak width values contained in the parameter data of the ab1 format file of a peak shape diagram sample are extracted directly from the file. As another example, a script file containing a base signal intensity extraction algorithm is run on the peak shape diagram sample to obtain the signal intensity of each base in the peak shape diagram.
S230. Train a set classification algorithm model with the feature data samples to obtain the type detection model.
In this embodiment, the classification algorithm model may be a training model built on a multi-class classification algorithm. Such algorithms include, but are not limited to, the Gradient Boosting Classifier algorithm, the Decision Tree Classifier algorithm, the Extra Tree Classifier algorithm, and the Random Forest Classifier algorithm. In one embodiment, training the classification algorithm model may be a process of adjusting multiple model parameters: through repeated training the optimal model parameters are obtained, and the classification algorithm model with the optimal parameters is the model finally sought. For example, after multiple feature data samples of multiple types have been obtained, they are used to train the classification algorithm model, and the model parameters are adjusted continually until the model is able to determine the type of input feature data, which yields the type detection model.
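Training on labelled feature vectors can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid classifier written with the standard library only; it is not the Gradient Boosting or tree-based models named in the text, and the function names and feature values are hypothetical.

```python
from collections import defaultdict
from math import dist


def train_nearest_centroid(samples, labels):
    """Fit one centroid (feature-wise mean) per label."""
    grouped = defaultdict(list)
    for vector, label in zip(samples, labels):
        grouped[label].append(vector)
    return {
        label: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Hypothetical feature vectors:
# (mean base quality, longest run with quality > 20, mean signal intensity).
# Features are left unscaled for brevity; real use would normally scale them.
samples = [(45, 600, 900), (42, 550, 800), (15, 40, 1500), (12, 20, 1700)]
labels = ["normal", "normal", "diffuse", "diffuse"]

centroids = train_nearest_centroid(samples, labels)
```

Here "training" reduces to computing two centroids; the iterative parameter adjustment described in the text corresponds to the fitting procedure of whichever classifier is actually used.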
In one embodiment, the set classification algorithm model may be chosen according to the kinds of data used in the feature data samples. For example, for feature data samples containing three kinds of data, namely the average base quality, the length of the longest contiguous base fragment with base quality greater than 20, and the average signal intensity, a training model built on any one of the Gradient Boosting Classifier, Decision Tree Classifier, Extra Tree Classifier, or Random Forest Classifier algorithms may be chosen as the set classification algorithm model.
In one embodiment, training the set classification algorithm model with the feature data samples to obtain the type detection model includes: training multiple different classification algorithm models with the feature data samples; obtaining the recognition accuracy of each of the different classification algorithm models after a set number of training rounds; and determining the classification algorithm model with the highest recognition accuracy as the type detection model.
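The selection step can be sketched as scoring each candidate and keeping the most accurate one. In the sketch below the candidates are hypothetical, already trained models represented as plain callables, and the evaluation data is invented; in practice the accuracy would normally be measured on samples held out from training.

```python
def accuracy(model, samples, labels):
    """Fraction of samples for which the model predicts the true label."""
    correct = sum(1 for x, y in zip(samples, labels) if model(x) == y)
    return correct / len(labels)


def select_best_model(candidates, samples, labels):
    """Score every candidate and return the best name with all scores."""
    scores = {
        name: accuracy(model, samples, labels)
        for name, model in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Hypothetical candidates: a constant baseline and a one-feature rule.
candidates = {
    "always_normal": lambda x: "normal",
    "quality_rule": lambda x: "normal" if x[0] > 20 else "abnormal",
}
eval_samples = [(45,), (12,), (40,), (8,)]
eval_labels = ["normal", "abnormal", "normal", "abnormal"]

best_name, scores = select_best_model(candidates, eval_samples, eval_labels)
```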
In one embodiment, to improve the recognition accuracy of the type detection model, multiple different classification algorithm models may be trained at the same time, and the one with the highest recognition accuracy is selected as the type detection model. For example, feature data samples containing three kinds of data, namely the average base quality, the length of the longest contiguous base fragment with base quality greater than 20, and the average signal intensity, are used to train the classification algorithm models based on the algorithms listed in Table 1 below. After a set number of training rounds, the recognition accuracy of each classification algorithm model is obtained.
Table 1. Algorithms used with the specific feature data, and the recognition accuracy of the classification algorithm model based on each algorithm after training

Algorithm                          Recognition accuracy
Gradient Boosting Classifier       86.3%
Bernoulli NB                       49.7%
Decision Tree Classifier           80.5%
Extra Tree Classifier              79.4%
Gaussian NB                        72.7%
K Neighbors Classifier             74.1%
Label Propagation                  17.2%
Label Spreading                    17.2%
Linear Discriminant Analysis       71.7%
Linear SVC                         71.7%
MLP Classifier                     61.0%
Nearest Centroid                   34.2%
Quadratic Discriminant Analysis    73.5%
Random Forest Classifier           85.1%
SVC                                52.5%
AdaBoost Classifier                63.5%
Gaussian Process Classifier        41.6%
In Table 1 above, every classification algorithm model with a recognition accuracy of 60% or higher is a selectable classification algorithm model. In one embodiment, the model with the highest recognition accuracy in Table 1, for example the training model based on the Gradient Boosting Classifier algorithm, may be selected as the training model to use when the input feature data consists of the average base quality, the length of the longest contiguous base fragment with base quality greater than 20, and the average signal intensity.
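The two selection rules can be checked directly against the figures in Table 1. In the sketch below the accuracies are copied from the table, while the dictionary and function names are hypothetical.

```python
# Recognition accuracies (percent) copied from Table 1.
TABLE_1_ACCURACY = {
    "Gradient Boosting Classifier": 86.3,
    "Bernoulli NB": 49.7,
    "Decision Tree Classifier": 80.5,
    "Extra Tree Classifier": 79.4,
    "Gaussian NB": 72.7,
    "K Neighbors Classifier": 74.1,
    "Label Propagation": 17.2,
    "Label Spreading": 17.2,
    "Linear Discriminant Analysis": 71.7,
    "Linear SVC": 71.7,
    "MLP Classifier": 61.0,
    "Nearest Centroid": 34.2,
    "Quadratic Discriminant Analysis": 73.5,
    "Random Forest Classifier": 85.1,
    "SVC": 52.5,
    "AdaBoost Classifier": 63.5,
    "Gaussian Process Classifier": 41.6,
}


def selectable_models(accuracies, threshold=60.0):
    """Models whose recognition accuracy is at or above the threshold."""
    return {name: acc for name, acc in accuracies.items() if acc >= threshold}


def most_accurate(accuracies):
    """Name of the model with the highest recognition accuracy."""
    return max(accuracies, key=accuracies.get)


selectable = selectable_models(TABLE_1_ACCURACY)
best = most_accurate(TABLE_1_ACCURACY)
```

Eleven of the seventeen models clear the 60% threshold, and the Gradient Boosting Classifier has the highest accuracy, consistent with the text.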
In the technical solution of this embodiment, before the type detection model is used to determine the type of the sequencing result of the gene to be tested, peak shape diagram samples of standard gene sequencing results corresponding to multiple types are obtained, and the corresponding feature data is extracted from each sample as a feature data sample. The feature data samples are then used to train the set classification algorithm model to obtain the type detection model. This establishes the type detection model, provides the model basis for automatically determining the type of a gene sequencing result, and improves the accuracy of the determination.
Embodiment 3
FIG. 3 is a schematic structural diagram of an apparatus for detecting the type of a gene sequencing result provided in Embodiment 3 of the present application. Referring to FIG. 3, the apparatus includes a peak shape acquisition module 310, a feature extraction module 320, and a type detection module 330. Each module is described below.
The peak shape acquisition module 310 is configured to obtain a peak shape diagram of the sequencing result of a gene to be tested. The feature extraction module 320 is configured to extract, according to the peak shape diagram, feature data corresponding to the sequencing result of the gene to be tested. The type detection module 330 is configured to input the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
In one embodiment, the apparatus may further include: a sample acquisition module, configured to obtain, before the feature data is input into the type detection model to obtain the type that matches the sequencing result of the gene to be tested, peak shape diagram samples of standard gene sequencing results corresponding to multiple types; a data extraction module, configured to extract, from the peak shape diagram sample of each standard gene sequencing result, the feature data corresponding to that standard gene sequencing result, as a feature data sample; and a model training module, configured to train a set classification algorithm model with the feature data samples to obtain the type detection model.
In one embodiment, the model training module is configured to: train multiple different classification algorithm models with the feature data samples; obtain the recognition accuracy of each of the different classification algorithm models after a set number of training rounds; and determine the classification algorithm model with the highest recognition accuracy as the type detection model.
In one embodiment, the feature extraction module 320 is configured to: extract feature data from the peak shape diagram file; and/or process the peak shape diagram with a preset feature extraction algorithm to obtain the corresponding feature data.
In one embodiment, the types include a normal type and an abnormal type.
In one embodiment, the abnormal type includes one or more of: a poly structure or repeat sequence type, a spectral pull-up type, a diffuse type, a bubble type, an attenuation type, an overlapping-peak type, a no-signal type, a primer problem type, and an interruption type.
The above product can perform the method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to performing the method.
Embodiment 4
FIG. 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application. As shown in FIG. 4, the computer device provided in this embodiment includes a processor 41 and a memory 42. The computer device may have one or more processors; FIG. 4 takes one processor 41 as an example. The processor 41 and the memory 42 in the computer device may be connected by a bus or by other means; FIG. 4 takes a bus connection as an example.
In this embodiment, the processor 41 of the computer device integrates the apparatus for detecting the type of a gene sequencing result provided in the foregoing embodiment. In addition, the memory 42 in the computer device, as a computer-readable storage medium, may be configured to store one or more programs. The programs may be software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for detecting the type of a gene sequencing result in the embodiments of the present application (for example, the modules of the apparatus shown in FIG. 3: the peak shape acquisition module 310, the feature extraction module 320, and the type detection module 330). By running the software programs, instructions, and modules stored in the memory 42, the processor 41 executes the various functional applications and data processing of the device, thereby implementing the method for detecting the type of a gene sequencing result in the foregoing method embodiments.
The memory 42 may include a program storage area and a data storage area. In one embodiment, the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created through use of the device. In addition, the memory 42 may include high-speed random access memory and may further include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 42 may include memory located remotely from the processor 41, and such remote memory may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
Furthermore, when the one or more programs included in the computer device are executed by the one or more processors 41, the programs perform the following operations: obtaining a peak shape diagram of the sequencing result of a gene to be tested; extracting, according to the peak shape diagram, feature data corresponding to the sequencing result of the gene to be tested; and inputting the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
Embodiment 5
Embodiment 5 of the present application further provides a computer-readable storage medium storing a computer program which, when executed by an apparatus for detecting the type of a gene sequencing result, implements the method for detecting the type of a gene sequencing result provided in Embodiment 1 of the present application. The method includes: obtaining a peak shape diagram of the sequencing result of a gene to be tested; extracting, according to the peak shape diagram, feature data corresponding to the sequencing result of the gene to be tested; and inputting the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
For the computer-readable storage medium provided in the embodiments of the present application, the stored computer program, when executed, is not limited to the method operations described above and may also implement the related operations of the method for detecting the type of a gene sequencing result provided by any embodiment of the present application.
From the above description of the embodiments, those skilled in the art will understand that the present application may be implemented by software together with general-purpose hardware, or by hardware alone. On this understanding, the technical solution of the present application may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disc, and includes multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the method described in any embodiment of the present application.
In the above embodiment of the apparatus for detecting the type of a gene sequencing result, the units and modules are divided only according to functional logic. The division is not limited to the one described, as long as the corresponding functions can be achieved. In addition, the names of the functional units serve only to distinguish them from one another and do not limit the scope of protection of the present application.

Claims (10)

  1. A method for detecting a type of a gene sequencing result, comprising:
    obtaining a peak shape diagram of a sequencing result of a gene to be tested;
    extracting feature data corresponding to the sequencing result of the gene to be tested according to the peak shape diagram; and
    inputting the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
  2. The method according to claim 1, further comprising, before inputting the feature data into the type detection model to obtain the type that matches the sequencing result of the gene to be tested:
    obtaining peak shape diagram samples of standard gene sequencing results respectively corresponding to a plurality of types;
    extracting, from the peak shape diagram sample corresponding to each standard gene sequencing result, feature data corresponding to that standard gene sequencing result as a feature data sample; and
    training a set classification algorithm model using the feature data samples to obtain the type detection model.
  3. The method according to claim 2, wherein training the set classification algorithm model using the feature data samples to obtain the type detection model comprises:
    training a plurality of different classification algorithm models using the feature data samples;
    obtaining recognition accuracy rates respectively corresponding to the plurality of different classification algorithm models after a set number of training rounds; and
    determining the classification algorithm model with the highest recognition accuracy rate as the type detection model.
  4. The method according to any one of claims 1 to 3, wherein extracting the feature data corresponding to the sequencing result of the gene to be tested according to the peak shape diagram comprises at least one of the following:
    extracting feature data from a file of the peak shape diagram; and processing the peak shape diagram according to a preset feature extraction algorithm to obtain corresponding feature data.
  5. The method according to any one of claims 1 to 4, wherein the types comprise a normal type and an abnormal type.
  6. The method according to claim 5, wherein the abnormal type comprises at least one of: a poly structure or repeated sequence type, a spectral pull-up type, a dispersion type, a bubble type, an attenuation type, an overlapping peak type, a no signal type, a primer problem type, and an interruption type.
  7. A device for detecting a type of a gene sequencing result, comprising:
    a peak shape acquisition module, configured to obtain a peak shape diagram of a sequencing result of a gene to be tested;
    a feature extraction module, configured to extract feature data corresponding to the sequencing result of the gene to be tested according to the peak shape diagram; and
    a type detection module, configured to input the feature data into a type detection model to obtain a type that matches the sequencing result of the gene to be tested.
  8. The device according to claim 7, further comprising:
    a sample acquisition module, configured to obtain peak shape diagram samples of standard gene sequencing results respectively corresponding to a plurality of types, before the feature data is input into the type detection model to obtain the type that matches the sequencing result of the gene to be tested;
    a data extraction module, configured to extract, from the peak shape diagram sample of each standard gene sequencing result, feature data corresponding to that standard gene sequencing result as a feature data sample; and
    a model training module, configured to train a set classification algorithm model using the feature data samples to obtain the type detection model.
  9. A computer device, comprising:
    at least one processor; and
    a memory, configured to store at least one program;
    wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the method for detecting a type of a gene sequencing result according to any one of claims 1 to 6.
  10. A computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the method for detecting a type of a gene sequencing result according to any one of claims 1 to 6.
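
The training and selection steps recited in claims 2 and 3 (train several classification algorithm models on the feature data samples, compare their recognition accuracy, keep the most accurate one) can be illustrated with the sketch below. It is an assumption-laden example rather than the applicant's implementation: the two candidate classifiers, the one-dimensional feature vectors, and the function name `select_type_detection_model` are hypothetical, and each candidate is trained in a single pass instead of a set number of training rounds.

```python
from statistics import mean


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


class NearestCentroid:
    """Candidate 1: classify by the closest per-type mean vector."""

    def fit(self, samples, labels):
        self.centroids = {}
        for label in set(labels):
            members = [s for s, l in zip(samples, labels) if l == label]
            self.centroids[label] = [mean(col) for col in zip(*members)]
        return self

    def predict(self, features):
        return min(
            self.centroids,
            key=lambda label: squared_distance(features, self.centroids[label]),
        )


class OneNearestNeighbour:
    """Candidate 2: classify by the single closest training sample."""

    def fit(self, samples, labels):
        self.training = list(zip(samples, labels))
        return self

    def predict(self, features):
        nearest = min(
            self.training, key=lambda pair: squared_distance(features, pair[0])
        )
        return nearest[1]


def accuracy(model, samples, labels):
    hits = sum(model.predict(s) == l for s, l in zip(samples, labels))
    return hits / len(labels)


def select_type_detection_model(candidates, train, validation):
    """Train every candidate and return the one with the highest accuracy."""
    scores = {
        name: accuracy(model.fit(*train), *validation)
        for name, model in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, candidates[best_name], scores


# Made-up feature data samples, used only to exercise the selection logic.
train = ([[0.0], [10.0], [4.0], [5.0]],
         ["normal", "normal", "abnormal", "abnormal"])
validation = ([[0.5], [9.5], [4.4]], ["normal", "normal", "abnormal"])

candidates = {
    "nearest_centroid": NearestCentroid(),
    "one_nearest_neighbour": OneNearestNeighbour(),
}
best_name, best_model, scores = select_type_detection_model(
    candidates, train, validation
)
```

Measuring accuracy on samples held out from training, as done here, is one reasonable reading of "recognition accuracy rate"; the claims do not specify how the rate is measured.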
PCT/CN2019/101096 2018-06-27 2019-08-16 Gene sequencing result type detection method and apparatus, device, and storage medium WO2020001663A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810675765.3A CN110718270B (en) 2018-06-27 2018-06-27 Method, device, equipment and storage medium for detecting type of gene sequencing result
CN201810675765.3 2018-06-27

Publications (2)

Publication Number Publication Date
WO2020001663A2 true WO2020001663A2 (en) 2020-01-02
WO2020001663A3 WO2020001663A3 (en) 2020-02-13

Family

ID=68985854

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/101096 WO2020001663A2 (en) 2018-06-27 2019-08-16 Gene sequencing result type detection method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110718270B (en)
WO (1) WO2020001663A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117473444B (en) * 2023-12-27 2024-03-01 北京诺赛基因组研究中心有限公司 Sanger sequencing result quality inspection method based on CNN and SVM

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101225441B (en) * 2007-12-05 2010-12-01 浙江大学 Method for detecting genetic constitution of clone-specific T lymphocyte TCR BV CDR3
US9760676B2 (en) * 2014-10-21 2017-09-12 uBiome, Inc. Method and system for microbiome-derived diagnostics and therapeutics for endocrine system conditions
CN104462870A (en) * 2015-01-09 2015-03-25 苏州大学 Method and device for identifying human gene promoter
WO2016114009A1 (en) * 2015-01-16 2016-07-21 国立研究開発法人国立がん研究センター Fusion gene analysis device, fusion gene analysis method, and program
CN106845156B (en) * 2017-01-11 2019-03-22 张渠 Classification method, apparatus and system based on blood platelet difference expression gene label
CN107066836A (en) * 2017-06-15 2017-08-18 上海思路迪生物医学科技有限公司 Genetic test management method and system
CN107463797B (en) * 2017-07-26 2021-04-09 广州达安临床检验中心有限公司 Biological information analysis method and device for high-throughput sequencing, equipment and storage medium

Also Published As

Publication number Publication date
WO2020001663A3 (en) 2020-02-13
CN110718270A (en) 2020-01-21
CN110718270B (en) 2023-10-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19826771; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19826771; Country of ref document: EP; Kind code of ref document: A2)