CN116110496A - Method, device, equipment and storage medium for rapidly detecting joint sequence - Google Patents


Info

Publication number
CN116110496A
Authority
CN
China
Prior art keywords
data
sequence
sequenced
linker
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310011410.5A
Other languages
Chinese (zh)
Inventor
陈实富
许明炎
彭敏琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Hepulos Medical System Technology Co ltd
Original Assignee
Shenzhen Hepulos Medical System Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Hepulos Medical System Technology Co ltd filed Critical Shenzhen Hepulos Medical System Technology Co ltd
Priority to CN202310011410.5A priority Critical patent/CN116110496A/en
Publication of CN116110496A publication Critical patent/CN116110496A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Abstract

The invention belongs to the technical field of bioinformatics and discloses a method, a device, equipment and a storage medium for rapidly detecting an adapter (linker) sequence. The method comprises: examining the sequencing data to determine its data type; selecting the adapter sequence detection strategy that corresponds to that data type; and rapidly detecting the adapter in the sequencing data according to the selected strategy. In this way, the detection strategy is determined by the type of the sequencing data, and the adapter sequence of the sequencing data is detected according to that strategy.

Description

Method, device, equipment and storage medium for rapidly detecting joint sequence
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a method, a device, equipment and a storage medium for rapidly detecting a joint sequence.
Background
High-throughput sequencing data analysis begins with quality control, which includes removing adapter (linker) sequences, removing low-quality sequences, and so on. An adapter is a sequence artificially added to both ends of the insert during library construction so that the library can be sequenced on the instrument; it is read whenever the sequencing read length exceeds the insert length. Since only the sequencing result of the insert is of interest, the adapter sequence must be removed first. Two factors must be considered when removing it. First, because of the sequencing error rate, the adapter as read may differ from the original adapter by several bases, so base mismatches must be tolerated. Second, because the insert length varies within a certain range and adapters are present at both ends, a read may contain only part of the original adapter sequence. cutadapt is software currently used for quality filtering of high-throughput sequencing data, and it can effectively remove adapters at the 5' and 3' ends. However, cutadapt is an adapter-trimming tool based on sequence matching: it requires the adapter sequences as input, cannot detect adapters automatically, and can only remove the adapters it is given. The sequencing data therefore has to be inspected first with FastQC to see which adapters are present, and only then can cutadapt remove them, which makes the process cumbersome and time-consuming.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the invention is to provide a method, a device, equipment and a storage medium for rapidly detecting an adapter sequence, so as to solve the technical problem in the prior art that the adapter sequence cannot be detected automatically before high-throughput sequencing data analysis, which makes the operation complex and time-consuming.
In order to achieve the above object, the present invention provides a method for rapid detection of an adapter sequence, the method comprising the following steps:
examining the sequencing data to obtain the data type of the sequencing data;
selecting an adapter sequence detection strategy of the corresponding type according to the data type;
and rapidly detecting the adapter of the sequencing data according to the adapter sequence detection strategy.
Optionally, examining the sequencing data to obtain its data type includes:
examining the ends of the sequencing data, wherein when the sequencing data is sequenced from a single end, its data type is single-end data;
when the sequencing data is sequenced from both ends, its data type is paired-end data.
Optionally, selecting an adapter sequence detection strategy of the corresponding type according to the data type includes:
when the sequencing data is single-end data, setting the adapter sequence detection strategy to the single-end adapter sequence detection strategy;
and when the sequencing data is paired-end data, setting the adapter sequence detection strategy to the paired-end adapter sequence detection strategy.
Optionally, rapidly detecting the adapter of the sequencing data according to the adapter sequence detection strategy includes:
when the adapter sequence detection strategy is the single-end strategy, computing the k-mers (short subsequences, rendered as "monomer units" in the literal translation) of a preset amount of the sequencing data, and counting the occurrence frequency of each k-mer;
taking the k-mers whose occurrence frequency is higher than a preset frequency as candidate adapter subsequences;
sorting the candidate adapter subsequences by occurrence frequency;
and extending the candidate adapter subsequences to rapidly detect the adapter of the sequencing data.
Optionally, extending the candidate adapter subsequences to rapidly detect the adapter of the sequencing data includes:
converting the sequencing data into a nucleotide tree, and determining the dominant child nodes of the nucleotide tree;
extending forward along the nucleotide tree as long as a dominant child node exists;
when the extension can reach the tail of the sequencing data, determining that the candidate adapter subsequence is a valid adapter, and obtaining the complete adapter sequence by reverse extension.
Optionally, rapidly detecting the adapter of the sequencing data according to the adapter sequence detection strategy further includes:
when the adapter sequence detection strategy is the paired-end strategy, acquiring the total sequence length of the DNA and the sequence length of the sequencing data;
determining the overlapping region from the total sequence length of the DNA and the sequence length of the sequencing data;
and determining the adapter sequence from the overlapping region.
Optionally, after rapidly detecting the adapter of the sequencing data according to the adapter sequence detection strategy, the method further comprises:
when the rapid detection result for the sequencing data does not meet a preset condition, disabling automatic adapter sequence detection;
providing an interface for setting a specific adapter sequence, acquiring the specific adapter sequence entered through that interface, and trimming the specific adapter sequence.
In addition, in order to achieve the above object, the present invention also provides an adapter sequence rapid detection device, comprising:
a data detection module, used for examining the sequencing data to obtain the data type of the sequencing data;
a strategy selection module, used for selecting an adapter sequence detection strategy of the corresponding type according to the data type;
and an adapter detection module, used for rapidly detecting the adapter of the sequencing data according to the adapter sequence detection strategy.
In addition, in order to achieve the above object, the present invention also provides adapter sequence rapid detection equipment comprising: a memory, a processor, and an adapter sequence rapid detection program stored on the memory and executable on the processor, the program being configured to implement the steps of the adapter sequence rapid detection method described above.
In addition, in order to achieve the above object, the present invention also provides a storage medium on which an adapter sequence rapid detection program is stored; when executed by a processor, the program implements the steps of the adapter sequence rapid detection method described above.
The method examines the sequencing data to obtain its data type, selects the adapter sequence detection strategy of the corresponding type according to the data type, and rapidly detects the adapter of the sequencing data according to that strategy. In this way, the detection strategy is determined by the type of the sequencing data, and the adapter sequence of the sequencing data is detected according to that strategy.
Drawings
FIG. 1 is a schematic structural diagram of the adapter sequence rapid detection equipment in the hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a first embodiment of the adapter sequence rapid detection method of the present invention;
FIG. 3 is a flow chart of a second embodiment of the adapter sequence rapid detection method of the present invention;
FIG. 4 is a schematic diagram of the adapter sequence in paired-end sequencing according to an embodiment of the adapter sequence rapid detection method of the present invention;
FIG. 5 is a block diagram of a first embodiment of the adapter sequence rapid detection device of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the adapter sequence rapid detection equipment in the hardware operating environment according to an embodiment of the present invention.
As shown in FIG. 1, the adapter sequence rapid detection equipment may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the structure shown in FIG. 1 does not limit the adapter sequence rapid detection equipment, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in FIG. 1, the memory 1005, as one type of storage medium, may contain an operating system, a network communication module, a user interface module, and an adapter sequence rapid detection program.
In the adapter sequence rapid detection equipment shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 may be provided in the equipment, which calls the adapter sequence rapid detection program stored in the memory 1005 through the processor 1001 and executes the adapter sequence rapid detection method provided by the embodiments of the present invention.
An embodiment of the present invention provides a method for rapidly detecting an adapter sequence. Referring to FIG. 2, FIG. 2 is a flow chart of a first embodiment of the adapter sequence rapid detection method of the present invention.
In this embodiment, the adapter sequence rapid detection method includes the following steps:
Step S10: examining the sequencing data to obtain the data type of the sequencing data.
It should be noted that the method of this embodiment is executed by an adapter sequence rapid detection device, which has data processing, data communication and program execution functions. The device may be an integrated controller, a control computer, or another device with similar functions; this embodiment places no limitation on it.
It is understood that the sequencing data is derived from deoxyribonucleic acid (DNA), so examining the sequencing data actually means examining the base sequence of the DNA. Before sequencing, the DNA molecules have to be fragmented and amplified, and adapter sequences are artificially introduced in this process. Because the adapter sequences differ, the type of sequencing data finally generated also differs; the data types can be classified into single-end data and paired-end data.
In a specific implementation, the adapter sequence rapid detection device performs a preliminary check on the input sequencing data. The check examines the 5' end and the 3' end of the data; because the adapter sequence is located differently in single-end and paired-end data, the device can determine whether the data currently input is single-end or paired-end.
Step S20: selecting an adapter sequence detection strategy of the corresponding type according to the data type.
It should be noted that an adapter sequence detection strategy is a method for examining the sequencing data. The strategies are stored in the storage medium of the adapter sequence detection device; once the device has determined the data type of the sequencing data, it can call the detection strategy that corresponds to that type.
In a specific implementation, when the device determines that the sequencing data is single-end data, it calls the single-end adapter sequence detection strategy to examine the data; when it determines that the sequencing data is paired-end data, it calls the paired-end adapter sequence detection strategy.
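Steps S10 and S20 amount to a type check followed by a dispatch, which can be sketched as follows. This is a minimal illustration and not the patented implementation: the names `DataType`, `detect_data_type`, `select_strategy` and the two placeholder strategy functions are hypothetical, and the rule "one input file means single-end, two files mean paired-end" is an assumption made for the example (the text itself only says that the ends of the data are examined).

```python
from enum import Enum
from typing import Callable, Dict, Sequence


class DataType(Enum):
    SINGLE_END = "single-end"
    PAIRED_END = "paired-end"


def detect_data_type(read_files: Sequence[str]) -> DataType:
    """Assumed rule: one read file is single-end, two read files are paired-end."""
    if len(read_files) == 1:
        return DataType.SINGLE_END
    if len(read_files) == 2:
        return DataType.PAIRED_END
    raise ValueError("expected one or two read files")


def single_end_strategy(read_files: Sequence[str]) -> str:
    # Placeholder for the k-mer counting and tree extension of steps S301-S304.
    return "single-end strategy"


def paired_end_strategy(read_files: Sequence[str]) -> str:
    # Placeholder for the overlap-based detection used for paired-end data.
    return "paired-end strategy"


# Strategies are stored once and looked up by data type (step S20).
STRATEGIES: Dict[DataType, Callable[[Sequence[str]], str]] = {
    DataType.SINGLE_END: single_end_strategy,
    DataType.PAIRED_END: paired_end_strategy,
}


def select_strategy(data_type: DataType) -> Callable[[Sequence[str]], str]:
    """Return the detection strategy registered for the given data type."""
    return STRATEGIES[data_type]


if __name__ == "__main__":
    files = ["sample_R1.fq", "sample_R2.fq"]  # hypothetical file names
    strategy = select_strategy(detect_data_type(files))
    print(strategy(files))
```

Keeping the strategies in a lookup table keeps the selection step independent of how each strategy is implemented.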
Step S30: rapidly detecting the adapter of the sequencing data according to the adapter sequence detection strategy.
In a specific implementation, when the single-end strategy is used, the adapter sequence can be detected from high-frequency reads: the tails of the high-frequency reads are assembled to examine the end of the single-end data and thereby determine its adapter sequence. When the paired-end strategy is used, the adapter sequence can be found quickly from the overlapping part of the paired-end data.
In this embodiment, the sequencing data is examined to obtain its data type, an adapter sequence detection strategy of the corresponding type is selected according to the data type, and the adapter of the sequencing data is rapidly detected according to that strategy. In this way, the detection strategy is determined by the type of the sequencing data, and the adapter sequence is detected according to that strategy.
Referring to FIG. 3, FIG. 3 is a flow chart of a second embodiment of the adapter sequence rapid detection method of the present invention.
Based on the first embodiment above, step S30 of the adapter sequence rapid detection method of this embodiment further includes:
Step S301: when the adapter sequence detection strategy is the single-end strategy, computing the k-mers (short subsequences, rendered as "monomer units" in the literal translation) of a preset amount of the sequencing data, and counting the occurrence frequency of each k-mer.
Step S302: taking the k-mers whose occurrence frequency is higher than the preset frequency as candidate adapter subsequences.
Step S303: sorting the candidate adapter subsequences by occurrence frequency.
Step S304: extending the candidate adapter subsequences to rapidly detect the adapter of the sequencing data.
It should be noted that a k-mer is a short base sequence within the sequencing data; the number of bases in a k-mer is not fixed, and different k-mers need not have the same length. A candidate adapter subsequence is a k-mer that may be part of an adapter sequence. The preset frequency is a threshold determined in advance: a k-mer whose frequency reaches it is regarded as high-frequency, which gives high accuracy.
In a specific implementation, when single-end data is processed, a certain number of k-mers in the high-frequency reads are computed. Each high-frequency read corresponds to one reaction tank, and each high-frequency read contains a certain number of k-mers of the sequencing data. The number of bases in each k-mer may be set to 10 or to another value according to the actual situation. While the k-mers are computed, the number of occurrences and the complexity of each sequence are recorded together with the total number of k-mers computed, which yields the occurrence frequency of each k-mer. Each frequency is then compared with the preset frequency, which can be set reasonably according to the actual situation and is preferably greater than 0.0001; this embodiment uses 0.0001 as an example. K-mers whose occurrence frequency is higher than 0.0001 are retained, and among these the low-complexity sequences are deleted and not counted. In addition, k-mers with fewer bases are compared with k-mers with more bases: if a shorter k-mer is part of a longer one, the shorter k-mer is deleted and the longer one is kept. The k-mers that remain are the candidate adapter subsequences. All candidates are sorted by occurrence frequency and then extended base by base in that order until the true complete adapter is found, which achieves rapid detection of the adapter of the sequencing data.
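Steps S301 to S303 can be sketched as follows, under stated assumptions. The k-mer length of 10 and the 0.0001 threshold are the example values from the text; the names `candidate_kmers`, `is_low_complexity` and `drop_contained` are hypothetical, and the complexity test used here (counting positions where the base changes) is an assumed stand-in, because the text does not define how complexity is measured.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def is_low_complexity(kmer: str, min_changes: int = 3) -> bool:
    """Assumed test: a k-mer is low-complexity if adjacent bases rarely differ."""
    changes = sum(1 for a, b in zip(kmer, kmer[1:]) if a != b)
    return changes < min_changes


def candidate_kmers(
    reads: Iterable[str], k: int = 10, min_freq: float = 0.0001
) -> List[Tuple[str, float]]:
    """Count k-mers, keep frequent and complex ones, sort by frequency (S301-S303)."""
    counts: Counter = Counter()
    total = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
            total += 1
    if total == 0:
        return []
    kept = [
        (kmer, n / total)
        for kmer, n in counts.items()
        if n / total > min_freq and not is_low_complexity(kmer)
    ]
    # Highest frequency first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


def drop_contained(seqs: Iterable[str]) -> List[str]:
    """Drop any sequence that is a substring of a longer retained sequence."""
    kept: List[str] = []
    for seq in sorted(set(seqs), key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    demo_reads = ["TTTTTTTTTTTT", "ACGTTGCAAC", "ACGTTGCAAC"]  # made-up reads
    print(candidate_kmers(demo_reads, k=10, min_freq=0.1))
    print(sorted(drop_contained(["ACGT", "CG", "TTT"])))
```

With a single fixed `k`, no k-mer can contain another, so `drop_contained` only matters when k-mers of several lengths are pooled, as the text allows.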
Further, in order to find the true complete adapter, the method further comprises the following steps:
converting the sequencing data into a nucleotide tree, and determining the dominant child nodes of the nucleotide tree;
extending forward along the nucleotide tree as long as a dominant child node exists;
when the extension can reach the tail of the sequencing data, determining that the candidate adapter subsequence is a valid adapter, and obtaining the complete adapter sequence by reverse extension.
It should be noted that the nucleotide tree is a classification and regression tree built from a complete set of sequences, in which each nucleotide is a node and each sequence forms a path from root to leaf. A dominant child node is a child node whose occurrence probability in the resulting tree is greater than 90%.
In a specific implementation, a set of sequencing data is converted into a tree on the basis of the candidate adapter subsequences obtained above, with the bases of the sequencing data as the nodes of the tree, and the occurrence probability of each node is counted. The path is extended node by node in the forward direction. If the extension can reach the tail of the sequencing data, the current adapter is valid, and once the adapter has been determined to be valid, the complete adapter sequence is obtained by extending in the reverse direction.
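The tree-based extension can be sketched as follows. This is one possible reading of the steps above and not the patented implementation: the names `Node`, `build_tree`, `walk` and `extend_seed` are hypothetical, the 90% dominance threshold is taken from the text, and anchoring each read at the first exact occurrence of the candidate is an assumption made for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class Node:
    """One nucleotide in the tree; count is the number of sequences passing through."""

    def __init__(self) -> None:
        self.count = 0
        self.children: Dict[str, "Node"] = {}


def build_tree(sequences: Iterable[str]) -> Node:
    """Insert every sequence as a root-to-leaf path, counting visits per node."""
    root = Node()
    for seq in sequences:
        node = root
        node.count += 1
        for base in seq:
            node = node.children.setdefault(base, Node())
            node.count += 1
    return root


def walk(root: Node, dominance: float = 0.9) -> Tuple[str, bool]:
    """Follow dominant children; report the path and whether a leaf was reached."""
    path: List[str] = []
    node = root
    while node.children:
        total = sum(child.count for child in node.children.values())
        base, child = max(node.children.items(), key=lambda kv: kv[1].count)
        if child.count / total <= dominance:
            break  # no dominant child: stop extending
        path.append(base)
        node = child
    return "".join(path), not node.children


def extend_seed(reads: Iterable[str], seed: str, dominance: float = 0.9) -> Optional[str]:
    """Extend a candidate forward to the read tails, then backward; None if invalid."""
    after: List[str] = []
    before: List[str] = []
    for read in reads:
        pos = read.find(seed)
        if pos >= 0:
            after.append(read[pos + len(seed):])
            before.append(read[:pos][::-1])  # reversed so the walk moves backward
    if not after:
        return None
    forward, reached_tail = walk(build_tree(after), dominance)
    if not reached_tail:
        return None  # the candidate does not extend to the tail: not a valid adapter
    backward, _ = walk(build_tree(before), dominance)
    return backward[::-1] + seed + forward


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # made-up adapter used only for this demo
    demo_reads = ["GATTACA" + adapter, "CCGTA" + adapter, "TTGCAT" + adapter]
    print(extend_seed(demo_reads, "GGAAG"))
```

The backward walk stops where the reads start to disagree, which in this demo is the boundary between the variable insert and the shared adapter.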
Further, when the adapter sequence of the sequencing data is rapidly detected and the sequencing data is paired-end data, the method further comprises the following steps:
when the adapter sequence detection strategy is the paired-end strategy, acquiring the total sequence length of the DNA and the sequence length of the sequencing data;
determining the overlapping region from the total sequence length of the DNA and the sequence length of the sequencing data;
and determining the adapter sequence from the overlapping region.
The overlapping region is the portion in which the bases of the two reads pair with each other. For example, when one strand of a paired-end read pair is CTGGCTCTACT..AGTAATTCC and the other strand is AATTCCCTGCTCTACT..AGT, the overlapping portion is CTGGCTCTACT..AGT.
In a specific implementation, referring to FIG. 4, FIG. 4 is a schematic diagram of the adapter sequence in paired-end sequencing. The adapter sequence rapid detection device first determines the total sequence length of the DNA and the sequence length of the paired-end data; for convenience of description, the total sequence length of the DNA is denoted T and the sequence length of the paired-end data is denoted S. When paired-end data is processed, the relationship between T and S is determined first. When T ≤ S, that is, the total sequence length of the DNA is less than or equal to the paired-end read length, the entire region is overlapping region and no adapter sequence exists. If S < T < 2S, the length of the overlapping region is 2S - T; the overlapping portion is the portion of the sequencing data to be sequenced, and the non-overlapping portion is the adapter sequence. Here "adapter" is a generic term: the specific adapter sequence depends on how the sequencing library was prepared. If T ≥ 2S, the two reads do not overlap at all. With this method, the overlapping portion of each read pair is found and the bases outside the overlapping region are the adapter bases, so an adapter can be found even when it contributes only one or two bases.
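The length arithmetic above, together with the "overlap versus remainder" split, can be sketched as follows. This is a minimal illustration under assumptions: `overlap_length`, `shared_overlap` and `split_pair` are hypothetical names, the reads are compared by exact string matching in the same orientation as the example in the text (overlap at the start of one strand and at the end of the other), and the demo sequences are made up rather than taken from the elided example.

```python
from typing import Tuple


def overlap_length(total_len: int, read_len: int) -> int:
    """Overlap length for fragment length T and read length S, per the three cases."""
    if total_len <= read_len:
        return total_len  # assumed reading: the whole fragment is overlapped
    if total_len < 2 * read_len:
        return 2 * read_len - total_len
    return 0  # T >= 2S: the two reads do not overlap


def shared_overlap(strand1: str, strand2: str) -> str:
    """Longest string that starts strand1 and ends strand2 (exact match only)."""
    for n in range(min(len(strand1), len(strand2)), 0, -1):
        if strand1[:n] == strand2[-n:]:
            return strand1[:n]
    return ""


def split_pair(strand1: str, strand2: str) -> Tuple[str, str, str]:
    """Return (overlap, remainder of strand1, remainder of strand2)."""
    overlap = shared_overlap(strand1, strand2)
    n = len(overlap)
    return overlap, strand1[n:], strand2[:len(strand2) - n]


if __name__ == "__main__":
    # Made-up pair: a 14-base overlap plus 6 non-overlapping bases on each strand.
    s1 = "CTGGCTCTACTAGT" + "AATTCC"
    s2 = "AATTCC" + "CTGGCTCTACTAGT"
    overlap, rest1, rest2 = split_pair(s1, s2)
    print(overlap, rest1, rest2)
    # With S = 20 and an overlap of 14, T = 2S - 14 = 26.
    print(overlap_length(26, 20))
```

Exact matching is used only to keep the sketch short; real reads contain sequencing errors, so a practical comparison would have to tolerate some mismatches.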
Further, after the adapter sequence of the sequencing data has been rapidly detected, the method further comprises the following steps:
when the rapid detection result for the sequencing data does not meet a preset condition, disabling automatic adapter sequence detection;
providing an interface for setting a specific adapter sequence, acquiring the specific adapter sequence entered through that interface, and trimming the specific adapter sequence.
It should be noted that the preset condition is the acceptance threshold for the rapid detection result of the sequencing data. The specific condition has to be set according to the accuracy of the rapid detection and is not limited in this embodiment.
In a specific implementation, after the adapter sequence of the sequencing data has been rapidly detected, the detection result can be evaluated to determine whether it meets expectations; one example of a failure is that a specific adapter sequence that was set is not detected. In that case the current automatic detection method is disabled and an interface is provided through which a specific adapter sequence can be entered into the adapter sequence rapid detection device. When the device receives the specific adapter sequence, it matches that sequence against the sequencing data and trims the adapter wherever it is found. The interface can be used for both single-end and paired-end data. The specific adapter sequence can be entered in the format "-a <adapter sequence>"; this embodiment places no limitation on the format.
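The manual fallback can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the `-a` flag mirrors the input format named in the text, while the `--adapter` long name, the function names, the 10% mismatch allowance and the minimum overlap of 3 bases are hypothetical choices. The trimming function tolerates mismatches and a partial adapter at the 3' end, the two factors raised in the Background.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface with an optional user-supplied adapter (-a)."""
    parser = argparse.ArgumentParser(description="adapter trimming sketch")
    parser.add_argument("-a", "--adapter", default=None,
                        help="specific adapter sequence to trim")
    return parser


def resolve_adapter(detected: Optional[str], user_adapter: Optional[str]) -> Optional[str]:
    """Use the automatic result when available, otherwise the user-supplied adapter."""
    if detected:
        return detected
    return user_adapter  # automatic detection is treated as disabled


def trim_adapter(read: str, adapter: str,
                 max_mismatch_rate: float = 0.1, min_overlap: int = 3) -> str:
    """Cut the read at the first position where the adapter, or its prefix, matches."""
    for start in range(len(read) - min_overlap + 1):
        window = read[start:start + len(adapter)]
        # zip stops at the shorter string, so a truncated adapter at the tail also matches.
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= max_mismatch_rate * len(window):
            return read[:start]
    return read


if __name__ == "__main__":
    args = build_parser().parse_args(["-a", "AGATCGGAAGAGC"])  # made-up adapter
    adapter = resolve_adapter(detected=None, user_adapter=args.adapter)
    if adapter is not None:
        print(trim_adapter("ACGTACGTTTAGATCGGAAG", adapter))
```

Scanning from the 5' end and cutting at the first acceptable match removes everything from the adapter onward, which matches the situation where the adapter follows the insert at the 3' end of the read.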
In this embodiment, single-end data is extended forward in a tree structure, and the validity of an adapter sequence is judged by whether the extension reaches the tail of the sequencing data; only when the adapter sequence is judged valid is it extended in the reverse direction, which determines the adapter sequence in the single-end data. For paired-end data, whether an overlapping region exists can be determined by comparing the sequence length of the sequencing data with the total sequence length of the DNA. When an overlapping region exists, its position can be calculated directly, and by the nature of paired-end data the non-overlapping region is the adapter sequence.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a joint sequence rapid detection program, and the joint sequence rapid detection program realizes the steps of the joint sequence rapid detection method when being executed by a processor.
Referring to fig. 5, fig. 5 is a block diagram showing the construction of a first embodiment of the rapid splice sequencing device of the present invention.
As shown in fig. 5, the device for rapidly detecting a linker sequence according to the embodiment of the present invention includes:
the data detection module 10 is used for detecting the data to be sequenced to obtain the data type of the data to be sequenced;
a policy selection module 20, configured to select a corresponding type of connector sequence detection policy according to the data type;
and the joint detection module 30 is used for rapidly detecting the joint of the data to be sequenced according to the joint sequence detection strategy.
In the embodiment, the data type of the data to be sequenced is obtained by detecting the data to be sequenced; selecting a corresponding type of joint sequence detection strategy according to the data type; according to the method, the device and the system, the connector of the data to be sequenced is rapidly detected according to the connector sequence detection strategy, the corresponding connector sequence detection strategy is determined according to the type of the data to be sequenced, and the connector sequence of the data to be sequenced is detected according to the connector sequence detection strategy.
In an embodiment, the data detection module 10 is further configured to detect an end of the data to be sequenced, where the data type of the data to be sequenced is single-ended data when the end of the data to be sequenced is single-ended; when the tail ends of the data to be sequenced are double-ended, the data type of the data to be sequenced is double-ended data.
In an embodiment, the policy selection module 20 is further configured to adjust the splice sequence detection policy to a single-ended splice sequence detection policy when the data to be sequenced is single-ended data; and when the data to be sequenced is double-ended data, adjusting the linker sequence detection strategy to be a double-ended linker sequence detection strategy.
In an embodiment, the joint detection module 30 is further configured to calculate a preset number of monomer units to be tested for sequence data and count occurrence frequencies of the monomer units when the joint sequence detection strategy is a single-ended joint sequence detection strategy; setting the monomer units corresponding to the occurrence frequency higher than a preset frequency as candidate linker subsequences; ordering the candidate joint subsequences according to the frequency of occurrence; and extending the candidate connector subsequence, and rapidly detecting the connector of the data to be sequenced.
In one embodiment, the adaptor detection module 30 is further configured to convert the data to be tested into a nucleotide tree, and determine dominant child nodes of the nucleotide tree; forward extending the nucleotide tree in the presence of the dominant child node; when it is possible to extend to the tail of the data to be sequenced, the candidate linker subsequence is determined as a valid linker, and the complete linker sequence is obtained by reverse extension.
In an embodiment, the linker detection module 30 is further configured to, when the linker sequence detection strategy is the double-ended linker sequence detection strategy, obtain the total DNA sequence length and the length of the data to be sequenced; determine an overlapping region according to the total DNA sequence length and the length of the data to be sequenced; and determine the linker sequence according to the overlapping region.
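The double-ended case can be illustrated with the following sketch, which assumes a particular reading of the overlap step: when the DNA fragment is shorter than the read length, read 1 and the reverse complement of read 2 overlap across the whole fragment, and whatever follows the fragment in each read is linker sequence. The function names and the `min_overlap` and `max_mismatches` parameters are hypothetical, and the case where the fragment is longer than the read is not handled.

```python
from typing import Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def find_insert_length(
    read1: str, read2: str, min_overlap: int = 8, max_mismatches: int = 0
) -> Optional[int]:
    """Return the fragment length if the start of read1 matches the end of
    the reverse complement of read2, trying the longest fragment first."""
    rc2 = reverse_complement(read2)
    max_len = min(len(read1), len(rc2))
    # Start below the read length so that some linker bases remain.
    for insert_len in range(max_len - 1, min_overlap - 1, -1):
        left = read1[:insert_len]
        right = rc2[len(rc2) - insert_len:]
        mismatches = sum(a != b for a, b in zip(left, right))
        if mismatches <= max_mismatches:
            return insert_len
    return None


def linkers_from_pair(
    read1: str, read2: str, min_overlap: int = 8
) -> Optional[Tuple[str, str]]:
    """Return the linker bases found past the fragment in each read."""
    insert_len = find_insert_length(read1, read2, min_overlap)
    if insert_len is None:
        return None
    return read1[insert_len:], read2[insert_len:]


# Toy pair: a 14-base fragment read with 20-base reads (made-up sequences).
fragment = "ACGTTGCAAGGCTA"
read1 = fragment + "AGATCG"
read2 = reverse_complement(fragment) + "GGTTCC"
print(linkers_from_pair(read1, read2))
```

A practical detector would aggregate the linker bases over many read pairs before reporting a sequence; the sketch handles a single pair only.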
In an embodiment, the linker detection module 30 is further configured to disable automatic linker sequence detection when the rapid detection result for the data to be sequenced does not meet a preset condition; and to provide a setting interface for a specific linker sequence, obtain the specific linker sequence entered through the setting interface, and trim the specific linker sequence.
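A minimal sketch of this fallback is given below. The embodiment does not state the preset condition, so the sketch assumes a minimum detected length; `resolve_linker`, `trim_linker`, `min_length`, and `min_match` are hypothetical names, and trimming is done by exact matching only.

```python
from typing import Optional


def resolve_linker(
    detected: Optional[str], user_linker: Optional[str], min_length: int = 8
) -> Optional[str]:
    """Use the detected linker if it passes the assumed condition (a minimum
    length); otherwise fall back to the user-specified linker."""
    if detected is not None and len(detected) >= min_length:
        return detected
    return user_linker


def trim_linker(read: str, linker: str, min_match: int = 4) -> str:
    """Cut the linker, and everything after it, from the read."""
    pos = read.find(linker)
    if pos >= 0:
        return read[:pos]
    # The read may end partway through the linker: look for a linker prefix
    # of at least min_match bases at the end of the read.
    for n in range(min(len(linker), len(read)) - 1, min_match - 1, -1):
        if read.endswith(linker[:n]):
            return read[:len(read) - n]
    return read


linker = resolve_linker(detected=None, user_linker="AGATCGGAAG")
print(trim_linker("ACGTACGTAGATCGGAAG", linker))
print(trim_linker("ACGTACGTAGATCG", linker))
```

Requiring at least `min_match` bases for a partial match reduces the chance of trimming genuine sample bases that happen to resemble the start of the linker.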
It should be understood that the foregoing is illustrative only and does not limit the technical solution of the present invention; in specific applications, those skilled in the art may configure it as needed.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present invention, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
The sequence numbers of the foregoing embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium (e.g., read-only memory (ROM)/RAM, magnetic disk, or optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods according to the embodiments of the present invention.
The foregoing describes only preferred embodiments of the present invention and does not limit the patent scope of the invention; any equivalent structure or equivalent process derived from the content disclosed herein, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims (10)

1. A method for rapid detection of a linker sequence, characterized by comprising the following steps:
detecting the data to be sequenced to obtain the data type of the data to be sequenced;
selecting a linker sequence detection strategy of the corresponding type according to the data type;
and rapidly detecting the linker of the data to be sequenced according to the linker sequence detection strategy.
2. The method for rapidly detecting a linker sequence according to claim 1, wherein the detecting the data to be sequenced to obtain a data type of the data to be sequenced comprises:
detecting the ends of the data to be sequenced, wherein when the data to be sequenced is single-ended, the data type of the data to be sequenced is single-ended data;
when the data to be sequenced is double-ended, the data type of the data to be sequenced is double-ended data.
3. The method for rapid detection of a linker sequence according to claim 1, wherein the selecting a corresponding type of linker sequence detection strategy according to the data type comprises:
when the data to be sequenced is single-ended data, adjusting the linker sequence detection strategy to be a single-ended linker sequence detection strategy;
and when the data to be sequenced is double-ended data, adjusting the linker sequence detection strategy to be a double-ended linker sequence detection strategy.
4. The method for rapid detection of a linker sequence according to claim 1, wherein the rapid detection of the linker of the data to be sequenced according to the linker sequence detection strategy comprises:
when the linker sequence detection strategy is a single-ended linker sequence detection strategy, calculating a preset number of monomer units of the data to be sequenced, and counting the occurrence frequency of the monomer units;
setting the monomer units whose occurrence frequency is higher than a preset frequency as candidate linker subsequences;
sorting the candidate linker subsequences according to the occurrence frequency;
and extending the candidate linker subsequences, and rapidly detecting the linker of the data to be sequenced.
5. The method for rapid detection of a linker sequence according to claim 4, wherein the extending the candidate linker subsequence and rapidly detecting the linker of the data to be sequenced comprises:
converting the data to be sequenced into a nucleotide tree, and determining dominant child nodes of the nucleotide tree;
forward extending the nucleotide tree in the presence of the dominant child node;
when it is possible to extend to the tail of the data to be sequenced, the candidate linker subsequence is determined as a valid linker, and the complete linker sequence is obtained by reverse extension.
6. The method for rapid detection of a linker sequence according to claim 1, wherein the rapid detection of the linker of the data to be sequenced according to the linker sequence detection strategy further comprises:
when the linker sequence detection strategy is a double-ended linker sequence detection strategy, acquiring the total DNA sequence length and the length of the data to be sequenced;
determining an overlapping region according to the total DNA sequence length and the length of the data to be sequenced;
and determining the linker sequence according to the overlapping region.
7. The method for rapid detection of a linker sequence according to any one of claims 1 to 6, further comprising, after rapid detection of the linker of the data to be sequenced according to the linker sequence detection strategy:
when the rapid detection result of the data to be sequenced does not meet the preset condition, disabling automatic linker sequence detection;
providing a setting interface for a specific linker sequence, acquiring the specific linker sequence entered through the setting interface, and trimming the specific linker sequence.
8. A linker sequence rapid detection device, characterized in that the linker sequence rapid detection device comprises:
the data detection module is used for detecting the data to be sequenced to obtain the data type of the data to be sequenced;
the strategy selection module is used for selecting a linker sequence detection strategy of the corresponding type according to the data type;
and the linker detection module is used for rapidly detecting the linker of the data to be sequenced according to the linker sequence detection strategy.
9. A device for rapid detection of a linker sequence, the device comprising: a memory, a processor, and a linker sequence rapid detection program stored on the memory and executable on the processor, the linker sequence rapid detection program configured to implement the steps of the linker sequence rapid detection method of any one of claims 1 to 7.
10. A storage medium having stored thereon a linker sequence rapid detection program which, when executed by a processor, performs the steps of the linker sequence rapid detection method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310011410.5A CN116110496A (en) 2023-01-05 2023-01-05 Method, device, equipment and storage medium for rapidly detecting joint sequence

Publications (1)

Publication Number Publication Date
CN116110496A true CN116110496A (en) 2023-05-12

Family

ID=86260943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310011410.5A Pending CN116110496A (en) 2023-01-05 2023-01-05 Method, device, equipment and storage medium for rapidly detecting joint sequence

Country Status (1)

Country Link
CN (1) CN116110496A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017054302A1 (en) * 2015-09-30 2017-04-06 中国农业科学院深圳农业基因组研究所 Sequencing library, and preparation and use thereof
CN108229103A (en) * 2018-01-15 2018-06-29 臻和(北京)科技有限公司 The processing method and processing device of Circulating tumor DNA repetitive sequence
CN109439729A (en) * 2018-12-27 2019-03-08 上海鲸舟基因科技有限公司 Detect connector, connector mixture and the correlation method of low frequency variation
CN109576347A (en) * 2018-12-06 2019-04-05 深圳海普洛斯医疗器械有限公司 The sequence measuring joints of the label containing unimolecule and the construction method of sequencing library
CN110283937A (en) * 2019-05-31 2019-09-27 上海奥根诊断技术有限公司 Detection primer group, detection reagent and sequencing library for the infection of sense organ post-transplantation
US20200273576A1 (en) * 2019-02-26 2020-08-27 Tempus Systems and methods for using sequencing data for pathogen detection
CA3111019A1 (en) * 2019-05-31 2020-12-03 Freenome Holdings, Inc. Methods and systems for high-depth sequencing of methylated nucleic acid
CN113990393A (en) * 2021-12-28 2022-01-28 北京优迅医疗器械有限公司 Data processing method and device for gene detection and electronic equipment
CN114277091A (en) * 2021-09-17 2022-04-05 广东省人民医院 Method for constructing high-quality immune repertoire library

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
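SHIFU CHEN et al.: "fastp: an ultra-fast all-in-one FASTQ preprocessor", 《BIOINFORMATICS》, vol. 34, no. 17, pages 884 *
SHIFU CHEN: "Data analysis methods for circulating tumor DNA sequencing", 《China Doctoral Dissertations Full-text Database, Information Science and Technology》, no. 2, pages 1 - 124 *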

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination