CN112309501A - Gene comparison technology - Google Patents

Info

Publication number
CN112309501A
Authority
CN
China
Prior art keywords
gene sequence
sub
gene
detected
reference gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911046513.5A
Other languages
Chinese (zh)
Inventor
方涛
陈夏捷
董晓文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to PCT/CN2020/106498 (WO2021023142A1)
Priority to JP2022506634A (JP7286872B2)
Priority to EP20849621.6A (EP4006908A4)
Publication of CN112309501A
Priority to US17/587,507 (US20220238185A1)
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

A gene comparison technology that can be applied to a computer system comprising an optical computing chip. During gene comparison, a first group of gene segments may be obtained from a gene database according to a gene sequence to be detected, where the first group of gene segments includes a plurality of reference gene segments that match partial bases of the gene sequence to be detected. After the first group of gene segments is obtained, the gene sequence to be detected and the plurality of reference gene segments in the first group of gene segments can be input into the optical computing chip for optical comparison. The technology can greatly improve gene comparison speed and reduce the number of comparisons.

Description

Gene comparison technology
Technical Field
The application relates to the technical field of optics, in particular to a gene comparison technology.
Background
Deoxyribonucleic acid (DNA) is a major chemical component of chromosomes and the material of which genes are made. A gene is a DNA sequence carrying genetic information, also called a genetic factor, and is the basic structural and functional unit of genetic material that controls biological traits. A gene expresses the genetic information it carries by directing protein synthesis, thereby controlling the traits of the individual organism. With the advent of DNA sequencing technology, and up to the completion of the Human Genome Project (HGP), the volume of DNA sequence data grew exponentially. DNA sequence comparison is a prerequisite for gene identification, information analysis, structure prediction, and similar tasks: comparing DNA sequences finds the identical or differing sites and regions across multiple sequences, which helps determine the homology, variation points, and origin of the gene to be detected.
With the rapid development of new generation DNA sequencing technology, the explosive accumulation rate of DNA sequencing data is far greater than the processing rate thereof. In the face of the big data analysis task in the field of biological information and the integration of data with various dimensions, a fast and convenient DNA comparison method is urgently needed.
Disclosure of Invention
The gene comparison technology provided by the application can improve the DNA comparison efficiency.
In a first aspect, embodiments of the present invention provide a gene comparison method applied in a computer system that includes an optical computing chip. According to the method, during gene comparison, a processor of the computer system can obtain a first group of gene segments from a gene database according to a gene sequence to be detected, and input the gene sequence to be detected and a plurality of reference gene segments in the first group of gene segments into the optical computing chip for optical comparison. The gene database comprises a plurality of reference gene segments of a reference gene sequence, and the first group of gene segments comprises a plurality of reference gene segments that match partial bases of the gene sequence to be detected.
The gene comparison method provided by the embodiment of the invention combines database search with optical autocorrelation comparison. A preliminary match of the gene sequence to be detected against the constructed gene database screens out a first group of reference gene segments that may match the gene sequence to be detected. Screening with the gene database in this way greatly reduces the number of reference gene segments that need detailed comparison. In addition, after the first group of reference gene segments is obtained, the optical computing chip optically compares the gene sequence to be detected with a plurality of reference gene segments in the first group. Because the comparison is performed optically, it is faster than gene comparison performed electrically. The gene comparison method provided by the embodiment of the invention therefore greatly improves comparison efficiency.
In practical application, the processor may obtain the first group of gene segments from the database according to a part of bases of the gene sequence to be detected. For example, a first group of gene segments is obtained from the database according to the first m bases and the last n bases of the gene sequence to be detected, wherein the value of m and the value of n are both greater than 0, and the sum of m and n is less than the number of bases in the gene sequence to be detected. Generally, the values of m and n can be determined according to the length of the gene sequence to be detected, the length of the reference gene sequence and other factors.
In one possible implementation, the database may be a key-value database, where the keys are partial bases of the plurality of reference gene segments of the reference gene sequence, and the values are the positions of those reference gene segments in the reference gene sequence.
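The key-value screening described above can be illustrated with a minimal sketch. It assumes that reference gene segments are cut with a sliding window of fixed length and that a key is the first m bases concatenated with the last n bases; the text does not specify either detail, and the names `make_key`, `build_index` and `screen` are hypothetical.

```python
from collections import defaultdict


def make_key(segment: str, m: int, n: int) -> str:
    """Key = first m bases + last n bases (requires m > 0, n > 0, m + n < length)."""
    if m <= 0 or n <= 0 or m + n >= len(segment):
        raise ValueError("need m > 0, n > 0 and m + n < len(segment)")
    return segment[:m] + segment[-n:]


def build_index(reference: str, seg_len: int, m: int, n: int) -> dict:
    """Map each key to the start positions of the reference segments that share it.

    Assumption: segments are taken at every offset with a fixed length seg_len.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - seg_len + 1):
        index[make_key(reference[pos:pos + seg_len], m, n)].append(pos)
    return dict(index)


def screen(index: dict, read: str, m: int, n: int) -> list:
    """Return the positions of reference segments whose key equals the read's key."""
    return index.get(make_key(read, m, n), [])


reference = "ACGTACGTTGCAACGT"
index = build_index(reference, seg_len=6, m=2, n=2)
candidates = screen(index, "ACGAAC", m=2, n=2)  # read differs from the reference in the middle
```

Only the segments returned by `screen` would then be passed on for detailed comparison; the middle bases are deliberately ignored at this stage.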
In a possible implementation manner, the method further includes: when it is determined, according to the output result of the optical computing chip, that the similarity between the gene sequence to be detected and a first gene segment in the first group of gene segments is smaller than a first threshold and larger than a second threshold, obtaining a plurality of sub-reference gene sequences from the reference gene sequence, where each sub-reference gene sequence is a part of the reference gene sequence; and inputting the gene sequence to be detected and a first sub-reference gene sequence of the plurality of sub-reference gene sequences into the optical computing chip for optical comparison, to obtain a first similarity between the gene sequence to be detected and the first sub-reference gene sequence.
In the embodiment of the present invention, a similarity between the gene sequence to be detected and at least one gene segment in the first group of gene segments that is smaller than the first threshold and larger than the second threshold indicates that a matching reference gene segment for the gene sequence to be detected is likely to be found in the reference gene sequence, and that further comparison is required. The gene sequence to be detected can therefore be optically compared with the plurality of sub-reference gene sequences of the reference gene sequence, so that a reference gene segment matching at least part of the gene sequence to be detected can be found quickly.
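The threshold decision above can be sketched as follows. The `similarity` function is a software stand-in for the output of the optical computing chip, and the handling of scores outside the band between the two thresholds (treated here as "match" and "discard") is an assumption; the text only states what happens inside the band. All names are hypothetical.

```python
def similarity(a: str, b: str) -> float:
    """Stand-in for the optical comparison: fraction of positions with equal bases."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def classify(score: float, first_threshold: float, second_threshold: float) -> str:
    """Three-way decision on a similarity score."""
    if score >= first_threshold:
        return "match"    # assumption: at or above the first threshold is a match
    if score > second_threshold:
        return "refine"   # between the thresholds: compare with sub-reference sequences
    return "discard"      # assumption: at or below the second threshold is dropped


def split_reference(reference: str, sub_len: int) -> list:
    """Cut the reference into consecutive sub-reference sequences with start positions."""
    return [(pos, reference[pos:pos + sub_len])
            for pos in range(0, len(reference), sub_len)]


decision = classify(similarity("ACGTAC", "ACGAAC"), first_threshold=0.9, second_threshold=0.5)
sub_references = split_reference("ACGTACGTTGCA", sub_len=4)
```

In this example five of six bases agree, so the score falls between the two thresholds and the read would go on to be compared with each sub-reference sequence in turn.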
In yet another possible implementation manner, the method may further include determining that the first similarity is greater than a third threshold and smaller than a fourth threshold, and, in response to the determination, obtaining a first to-be-detected sub-gene sequence and a second to-be-detected sub-gene sequence from the gene sequence to be detected, where the fourth threshold is not greater than the first threshold, and the first and second to-be-detected sub-gene sequences share some of their bases. The first to-be-detected sub-gene sequence and the first sub-reference gene sequence are then input into the optical computing chip for optical comparison to obtain a second similarity, and the second to-be-detected sub-gene sequence and the first sub-reference gene sequence are input into the optical computing chip for optical comparison to obtain a third similarity. In this manner, when the similarity between the gene sequence to be detected and the first sub-reference gene sequence meets the preset condition, the gene sequence to be detected can be split further, and the resulting first and second to-be-detected sub-gene sequences are each compared with the first sub-reference gene sequence, so that the partial segments of the gene sequence to be detected that match the first sub-reference gene sequence can be located as soon as possible. In addition, this maximum-similarity matching method tolerates base deletions, so the deleted or variant part of the gene sequence to be detected can be located precisely.
In practical applications, the first to-be-detected sub-gene sequence may include bases of a first preset length taken from the head of the gene sequence to be detected toward its tail, and the second to-be-detected sub-gene sequence may include bases of the first preset length taken from the tail of the gene sequence to be detected toward its head, such that the two sub-gene sequences partially overlap.
In yet another possible implementation, the method further includes recording the position of the first sub-reference gene sequence in the reference gene sequence when the second similarity is greater than the fourth threshold. In this way, when the second similarity between the first to-be-detected sub-gene sequence and the first sub-reference gene sequence is greater than the fourth threshold, the first sub-reference gene sequence can be determined to be the maximum-similarity match of the first to-be-detected sub-gene sequence; recording the position of the first sub-reference gene sequence in the reference gene sequence then yields the maximum-similarity matching segment of the first to-be-detected sub-gene sequence.
In yet another possible implementation manner, the method further includes: when the third similarity is greater than the third threshold and smaller than the fourth threshold, obtaining a first to-be-detected sub-gene sequence unit and a second to-be-detected sub-gene sequence unit from the second to-be-detected sub-gene sequence, inputting the first to-be-detected sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison, and inputting the second to-be-detected sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison. The first and second to-be-detected sub-gene sequence units share some of their bases. In this manner, if the result of matching the second to-be-detected sub-gene sequence against the first sub-reference gene sequence still does not reach the maximum-similarity matching standard, the second to-be-detected sub-gene sequence can be split and compared again; by searching recursively in this way, the maximum-similarity matching segment of at least part of the second to-be-detected sub-gene sequence can be located quickly. The maximum-similarity matching method tolerates base deletions, which allows gene deletions and gene variation points to be located accurately.
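The recursive split-and-compare search can be sketched as below. Several details are assumptions rather than statements of the text: the sliding-window score stands in for the optical comparison, the preset length is taken as three quarters of the current sequence, both halves are recursed symmetrically, and recursion stops at a hypothetical minimum length `min_len` (which should be at least 2).

```python
def best_window_similarity(query: str, sub_ref: str) -> float:
    """Stand-in for the optical comparison: best match fraction over all offsets."""
    if not query or len(query) > len(sub_ref):
        return 0.0
    best = 0
    for offset in range(len(sub_ref) - len(query) + 1):
        window = sub_ref[offset:offset + len(query)]
        best = max(best, sum(q == r for q, r in zip(query, window)))
    return best / len(query)


def split_overlapping(seq: str, part_len: int) -> tuple:
    """Head part and tail part of equal length that overlap in the middle."""
    if not len(seq) / 2 < part_len < len(seq):
        raise ValueError("part_len must exceed half the length and be less than the length")
    return seq[:part_len], seq[-part_len:]


def locate(query: str, sub_ref: str, third_threshold: float,
           fourth_threshold: float, min_len: int = 4) -> list:
    """Return fragments of query whose similarity to sub_ref exceeds fourth_threshold."""
    score = best_window_similarity(query, sub_ref)
    if score > fourth_threshold:
        return [query]                      # maximum-similarity match found
    if score <= third_threshold or len(query) <= min_len:
        return []                           # too dissimilar, or too short to split
    # Assumption: the preset length is 3/4 of the current sequence length.
    head, tail = split_overlapping(query, len(query) * 3 // 4)
    return (locate(head, sub_ref, third_threshold, fourth_threshold, min_len)
            + locate(tail, sub_ref, third_threshold, fourth_threshold, min_len))


# The last four bases of this read differ from the sub-reference sequence.
fragments = locate("ACGTACGTAAAA", "ACGTACGTTGCA", third_threshold=0.5, fourth_threshold=0.85)
```

Because the two parts overlap, a matching region that straddles the midpoint of the read is still fully contained in at least one part, which is one plausible reason for the overlap requirement.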
In yet another possible implementation manner, the method further includes: inputting the gene sequence to be detected and a second sub-reference gene sequence in the sub-reference gene sequences into the optical computing chip for optical comparison to obtain a fourth similarity between the gene sequence to be detected and the second sub-reference gene sequence, and inputting the gene sequence to be detected and a third sub-reference gene sequence in the sub-reference gene sequences into the optical computing chip for optical comparison to obtain a fifth similarity between the gene sequence to be detected and the third sub-reference gene sequence, wherein the third sub-reference gene sequence is a sub-reference gene sequence continuous with the second sub-reference gene sequence. And when the sum of the fourth similarity and the fifth similarity is larger than the first threshold value, obtaining a fourth sub-reference gene sequence according to the second sub-reference gene sequence and the third sub-reference gene sequence, and inputting the gene sequence to be detected and the fourth sub-reference gene sequence into the optical computing chip for optical comparison. Wherein the fourth sub-reference gene sequence comprises a partial base of the second sub-reference gene sequence and a partial base of the third sub-reference gene sequence.
In this manner, when it is determined that the similarity between the gene sequence to be detected and the second sub-reference gene sequence does not meet the condition for further matching against the second sub-reference gene sequence, but the sum of the similarity between the gene sequence to be detected and the second sub-reference gene sequence and the similarity between the gene sequence to be detected and the third sub-reference gene sequence is larger than the first threshold, the position of the sub-reference gene sequence can be adjusted in time: the fourth sub-reference gene sequence is obtained by taking contiguous parts from the second sub-reference gene sequence and the third sub-reference gene sequence. The maximum-similarity matching segment of the gene sequence to be detected can then be found in the fourth sub-reference gene sequence as soon as possible, without continuing to compare the gene sequence to be detected with the sub-reference gene sequences that follow the third sub-reference gene sequence. Adjusting the sub-reference gene sequence in time according to partial comparison results raises the probability and speed of obtaining the maximum-similarity gene segment and reduces the number of comparisons.
It is understood that, in practical applications, the fourth sub-reference gene sequence may be composed of parts of the second sub-reference gene sequence and the third sub-reference gene sequence that are taken according to the ratio of the fourth similarity to the fifth similarity.
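One way to build the fourth sub-reference gene sequence from two contiguous sub-reference sequences is sketched below. It assumes that the fourth sequence has the same length as the second, and that each neighbour contributes a share of bases proportional to its similarity; the text only says the parts are taken "according to the ratio", so this proportional rule and the name `merged_sub_reference` are hypothetical.

```python
from typing import Optional


def merged_sub_reference(second: str, third: str, s4: float, s5: float,
                         first_threshold: float) -> Optional[str]:
    """Return the tail of `second` joined to the head of `third`, or None.

    The window is only shifted when the summed similarity exceeds first_threshold.
    """
    total = s4 + s5
    if total <= first_threshold or total == 0:
        return None
    length = len(second)  # assumption: the result is as long as the second sequence
    take_second = min(round(length * s4 / total), len(second))
    take_third = min(length - take_second, len(third))
    # Tail of the second sequence and head of the third are contiguous in the reference.
    return second[len(second) - take_second:] + third[:take_third]


fourth = merged_sub_reference("AAAACCCC", "GGGGTTTT", s4=0.25, s5=0.75, first_threshold=0.9)
```

Here the read resembles the third sub-reference sequence three times as strongly as the second, so the shifted window takes two bases from the end of the second and six from the start of the third.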
In yet another possible implementation manner, the method further includes determining, according to an output result of the optical computing chip, that a second gene segment in the first group of gene segments matches the gene sequence to be detected, and recording a position of the second gene segment in the reference gene sequence.
In another possible implementation manner, inputting the gene sequence to be detected and the plurality of reference gene segments in the first group of gene segments into the optical computing chip for optical comparison includes: optically encoding the gene sequence to be detected and each of the plurality of reference gene segments in the first group of gene segments, and inputting the optical code of the gene sequence to be detected and the optical codes of the plurality of reference gene segments in the first group of gene segments into the optical computing chip for optical comparison. In practical applications, the gene sequence to be detected and the reference gene segments can be optically encoded according to the intensity information of light and/or the spatial information of light.
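The two kinds of encoding mentioned above can be illustrated in software. The particular intensity levels and the one-slot-per-base spatial layout below are hypothetical choices made only for this sketch; they are not the encodings of the embodiment. The inner product of two spatial codes counts the positions where the bases agree, which is the kind of correlation an optical comparison would measure.

```python
# Hypothetical intensity level for each base (intensity encoding).
INTENSITY = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}
# Hypothetical spatial slot for each base (spatial encoding, one-hot per base).
CHANNEL = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_intensity(seq: str) -> list:
    """One light intensity value per base."""
    return [INTENSITY[base] for base in seq]


def encode_spatial(seq: str) -> list:
    """Four spatial slots per base, with light only in the slot of that base."""
    code = []
    for base in seq:
        slots = [0, 0, 0, 0]
        slots[CHANNEL[base]] = 1
        code.extend(slots)
    return code


def overlap(code_a: list, code_b: list) -> int:
    """Inner product of two spatial codes = number of positions with the same base."""
    return sum(x * y for x, y in zip(code_a, code_b))


matches = overlap(encode_spatial("ACGT"), encode_spatial("ACGA"))
```

Dividing `matches` by the sequence length gives a similarity between 0 and 1 that can be compared against the thresholds.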
In a second aspect, embodiments of the present invention provide a gene comparison apparatus, including a processor and an optical computing chip. The processor is configured to obtain a first group of gene segments from a database according to a gene sequence to be detected, where the database comprises a plurality of reference gene segments of a reference gene sequence, and the first group of gene segments comprises a plurality of reference gene segments that match partial bases of the gene sequence to be detected. The optical computing chip is connected to the processor and is configured to optically compare the gene sequence to be detected with the plurality of reference gene segments in the first group of gene segments.
In one possible implementation, the processor may obtain the first set of gene segments from the database according to partial bases of the gene sequence to be detected. For example, a first group of gene segments is obtained from the database according to the first m bases and the last n bases of the gene sequence to be detected, wherein the value of m and the value of n are both greater than 0, and the sum of m and n is less than the number of bases in the gene sequence to be detected. Specifically, the database may be a key-value (key-value) database, wherein a key value is a partial base of a plurality of reference gene segments of the reference gene sequence, and a value is a position of the plurality of reference gene segments in the reference gene sequence.
In a possible implementation manner, the processor is further configured to determine, according to an output result of the optical computing chip, that a similarity between the gene sequence to be detected and a first gene segment in the first group of gene segments is smaller than a first threshold and larger than a second threshold, and obtain a plurality of sub-reference gene sequences from the reference gene sequence, where each sub-reference gene sequence is a part of the reference gene sequence. The optical computing chip is further used for optically comparing the gene sequence to be detected with a first sub-reference gene sequence in the plurality of sub-reference gene sequences to obtain a first similarity between the gene sequence to be detected and the first sub-reference gene sequence.
In yet another possible implementation manner, the processor is further configured to determine that the first similarity is greater than a third threshold and smaller than a fourth threshold, where the fourth threshold is not greater than the first threshold, and, in response to the determination, obtain a first to-be-detected sub-gene sequence and a second to-be-detected sub-gene sequence from the gene sequence to be detected, where the first and second to-be-detected sub-gene sequences share some of their bases. The optical computing chip is further configured to optically compare the first to-be-detected sub-gene sequence with the first sub-reference gene sequence to obtain a second similarity, and to optically compare the second to-be-detected sub-gene sequence with the first sub-reference gene sequence to obtain a third similarity.
In yet another possible implementation manner, the processor is further configured to record a position of the first sub-reference gene sequence in the reference gene sequence when the second similarity is greater than the fourth threshold.
In yet another possible implementation manner, the processor is further configured to obtain a first to-be-detected sub-gene sequence unit and a second to-be-detected sub-gene sequence unit from the second to-be-detected sub-gene sequence when the third similarity is greater than the third threshold and smaller than the fourth threshold, where the first and second to-be-detected sub-gene sequence units share some of their bases. The optical computing chip is further configured to optically compare the first to-be-detected sub-gene sequence unit with the first sub-reference gene sequence, and to optically compare the second to-be-detected sub-gene sequence unit with the first sub-reference gene sequence.
In yet another possible implementation manner, the optical computing chip is further configured to optically compare the gene sequence to be tested with a second sub-reference gene sequence in the plurality of sub-reference gene sequences; and optically comparing the gene sequence to be detected with a third sub-reference gene sequence in the plurality of sub-reference gene sequences, wherein the third sub-reference gene sequence is a sub-reference gene sequence continuous with the second sub-reference gene sequence. The processor is further configured to: determining that the sum of a fourth similarity between the gene sequence to be tested and the second sub-reference gene sequence and a fifth similarity between the gene sequence to be tested and the third sub-reference gene sequence is greater than the first threshold; obtaining a fourth sub-reference gene sequence according to the second sub-reference gene sequence and the third sub-reference gene sequence; and inputting the gene sequence to be detected and the fourth sub-reference gene sequence into the optical computing chip for optical comparison. Wherein the fourth sub-reference gene sequence comprises a partial base of the second sub-reference gene sequence and a partial base of the third sub-reference gene sequence.
In yet another possible implementation manner, the processor is further configured to determine, according to an output result of the optical computing chip, that a second gene segment in the first group of gene segments matches the gene sequence to be detected, and record a position of the second gene segment in the reference gene sequence.
In another possible implementation manner, the processor is further configured to optically encode the gene sequence to be detected and the plurality of reference gene segments in the first group of gene segments, and to input the optical code of the gene sequence to be detected and the optical codes of the plurality of reference gene segments in the first group of gene segments into the optical computing chip for optical comparison.
In a third aspect, an embodiment of the present invention provides a comparison apparatus, including a processor and an optical computing chip. The processor is configured to obtain a first group of reference objects from a database according to a first object to be matched, where the first group of reference objects comprises a plurality of reference objects that share partial features with the first object. The optical computing chip is connected to the processor and is configured to optically compare the first object with the plurality of reference objects.
The comparison device provided by the embodiment of the invention combines two modes of database searching and optical comparison, and can greatly reduce the number of reference objects needing detailed comparison after the reference objects to be compared are screened by the database. Moreover, the comparison is carried out by adopting the optical computing chip, so that the comparison speed can be greatly improved. The comparison device provided by the embodiment of the invention can be used in gene detection scenes and can also be applied in various scenes in which mass data comparison is required.
In a possible implementation manner, the processor is further configured to determine, according to an output result of the optical computing chip, that a similarity between the first object and a first reference object in the first group of reference objects is smaller than a first threshold and larger than a second threshold, and to obtain a plurality of sub-reference objects from a standard object, where each sub-reference object is a part of the standard object. The optical computing chip is further configured to optically compare the first object with a first sub-reference object of the plurality of sub-reference objects, to obtain a first similarity between the first object and the first sub-reference object.
In yet another possible implementation manner, the processor is further configured to determine that the first similarity is greater than a third threshold and smaller than a fourth threshold, and in response to the determination, obtain a first sub-object and a second sub-object according to the first object. Wherein the fourth threshold is not greater than the first threshold, and the partial data of the first sub-object and the second sub-object are the same. The optical computing chip is further configured to optically compare the first sub-object with the first sub-reference object to obtain a second similarity, and optically compare the second sub-object with the first sub-reference object to obtain a third similarity.
In yet another possible implementation manner, the processor is further configured to record the position of the first sub-reference object in the standard object when the second similarity is greater than the fourth threshold.
In a fourth aspect, the present application further provides a comparison apparatus, including functional modules, such as an obtaining module, a comparison module, a result processing module, and a determining module, for implementing the first aspect or any one of the possible implementation manners of the first aspect.
In a fifth aspect, the present application further provides a computer program product, which includes program code, where the program code includes instructions to be executed by a computer to implement the gene comparison method described in the first aspect or any one of the possible implementation manners of the first aspect.
In a sixth aspect, the present application further provides a computer-readable storage medium for storing program code, where the program code includes instructions to be executed by a computer to implement the gene comparison method described in the first aspect or any one of the possible implementation manners of the first aspect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
FIG. 1 is a schematic structural diagram of a gene comparison apparatus according to an embodiment of the present invention;
FIG. 2A is a schematic diagram of a gene database according to an embodiment of the present invention;
FIG. 2B is a schematic diagram of an optical encoding according to an embodiment of the present invention;
FIG. 3A is a schematic structural diagram of an optical computing chip according to an embodiment of the present invention;
FIG. 3B is a schematic structural diagram of another optical computing chip according to an embodiment of the present invention;
FIG. 3C is a schematic diagram of an optical comparison scheme according to an embodiment of the present invention;
FIG. 4 is a flowchart of a gene comparison method according to an embodiment of the present invention;
FIGS. 5A, 5B, 5C and 5D are examples of optical encoding according to an embodiment of the present invention;
FIG. 6 is a flowchart of another gene comparison method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a sub-reference gene sequence and a gene sequence to be detected according to an embodiment of the present invention;
FIG. 8 is a flowchart of another gene comparison method according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a comparison apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of another comparison apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of a portion of the invention and not all embodiments.
As described above, the rapid development of deoxyribonucleic acid (DNA) sequencing technology has led to explosive growth in DNA sequencing data, so increasing the speed of DNA comparison is a technical problem in urgent need of a solution. In the prior art, search is usually accelerated by building an index of the reference gene sequence in a computer system. Indexing, in essence, improves lookup efficiency by optimizing the data structure. Index optimization, however, has become a bottleneck, and building many complex indexes at the same time is costly to implement. The efficiency of this gene comparison approach therefore cannot keep up with the massive growth of DNA sequencing data. The gene comparison scheme provided by the embodiment of the invention can greatly improve gene comparison speed and can perform gene comparison quickly even on massive gene sequencing data.
In order to understand the present solution more clearly, several technical terms related to the embodiments of the present invention are described first.
Gene: refers to genetic information that controls a biological trait, usually carried by a DNA sequence. A gene can also be regarded as a basic genetic unit, i.e., a segment of DNA or Ribonucleic acid (RNA) sequence that has functionality. The process of clarifying the sequence itself is called gene sequencing.
The gene sequence to be tested: also known as a read, is a small sequencing fragment, a type of sequencing data generated by high-throughput sequencing platforms. In the process of sequencing a whole genome, hundreds of reads are generated, and the reads are then spliced together to obtain the complete sequence of the genome.
Reference gene sequence (which may also be referred to as reference sequence): is a standard sequence that has been validated and edited. The reference gene sequence may provide a basis for functional annotation of the human genome, and provides a stable reference point for mutation analysis, gene expression research and polymorphism discovery.
Base pair: is the chemical structure that forms DNA, RNA monomers, and encodes genetic information. Bases constituting the base pairs include adenine A, guanine G, thymine T, cytosine C, uracil U. Strictly speaking, a base pair is a pair of bases that match each other (i.e., A-T, G-C, A-U interactions) and are joined by hydrogen bonds. It is often used to measure the length of DNA and RNA (although RNA is single stranded).
The following describes embodiments of the present invention in detail. FIG. 1 is a schematic diagram of gene alignment achieved by an optical system according to an embodiment of the present invention. As shown in FIG. 1, the gene comparison apparatus 100 may include a processor 102, a memory 104, and an optical computing chip 106. The processor 102 and the memory 104 may be considered part of the host 101. The optical computing chip 106 may be connected to the host 101 through a host interface. The host interface may include a standard host interface as well as a network interface. For example, the host interface may include a Peripheral Component Interconnect Express (PCIE) interface. Data may be sent to the optical computing chip 106 through the host interface, and the data processed by the optical computing chip 106 may also be sent to the processor 102 through the host interface. The processor 102 may also monitor the operating state of the optical computing chip 106 through the host interface. In practical applications, the processor 102 and the memory 104 need not be part of a host; the processor 102, the memory 104 and the optical computing chip 106 may instead be part of a System on a Chip (SOC).
The processor 102 is the arithmetic core and control core (Control Unit) of the gene comparison apparatus 100. The processor 102 may include multiple processor cores. The processor 102 may be an ultra-large-scale integrated circuit. An operating system and other software programs are installed in the processor 102 so that the processor 102 can access the memory 1042, caches, disks, and peripheral devices (e.g., the optical computing chip 106 in FIG. 1). It is understood that, in the embodiment of the present invention, the core in the processor 102 may be, for example, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), or another Application Specific Integrated Circuit (ASIC).
The memory 104 is used to store data. The storage 104 may include a memory 1042, a disk, or other storage device for storing data. The memory 1042 is the main memory of the host 101. Memory 1042 may be coupled to processor 102 via a Double Data Rate (DDR) bus. Memory 1042 is typically used to store various operating systems software, input and output data, and information exchanged with external memory. To increase the access speed of the processor 102, the memory 1042 needs to have an advantage of high access speed. In practical applications, a Dynamic Random Access Memory (DRAM) is usually used as the Memory 1042. The processor 102 can access the memory 1042 at a high speed through a memory controller (not shown in fig. 1), and perform read and write operations on any memory location in the memory 1042.
In an embodiment of the present invention, the memory 104 may be used to store a gene database 1044. The gene database 1044 may be a key-value database established based on reference sequences. Wherein the key value may be obtained from a partial base of the gene fragment. The value may include a position of the reference gene segment corresponding to the key value in the memory, and may further include a position of the reference gene segment corresponding to the key value in the reference gene sequence.
In the embodiment of the present invention, a part of the bases of a reference gene fragment may be used as the key value; for example, the first n bases and the last m bases of a reference gene fragment of a predetermined length may be used as the key value. m and n may be the same or different, which is not limited herein. All reference gene segments that satisfy a given key value are located by traversing the reference gene sequence, and the position information of all these reference gene segments is recorded as the value corresponding to the key value. FIG. 2A is a schematic diagram of a gene database according to an embodiment of the present invention. As shown in FIG. 2A, the gene database 1044 may include a key 1044_1 and a value 1044_2. The key 1044_1 is exemplified by 10 bases; specifically, the 5 bases at the head and the 5 bases at the tail of the reference gene fragment may be taken together as the key value. In the present embodiment, how to build the gene database 1044 is described by taking a reference gene fragment length of 150 bases as an example. Specifically, an empty index table (containing only key values) is constructed first. The number of rows is 4^(5+5), that is, 4^10 = 1,048,576, and the keys are arranged in alphabetical order from AAAAAAAAAA through TTTTTTTTTT. The mapping is shown in FIG. 2B. Specifically, the head bases occupy the high-order positions and the tail bases occupy the low-order positions. Each base position advances in the order A, C, G, T, and a position that has reached T carries over to the next higher position. When all the tail bases are TTTTT, the carry advances into the head portion by one base. In this manner, the keys are ordered as AAAAAAAAAA, AAAAAAAAAC, AAAAAAAAAG, AAAAAAAAAT, AAAAAAAACA, AAAAAAAACC, AAAAAAAACG, AAAAAAAACT, and so on. Thereby, the key 1044_1 shown in FIG. 2A can be obtained.
After the key value index table is established, a plurality of reference gene segments can be obtained by sliding a unit window of a preset base length along the reference gene sequence in steps of one base. As each reference gene segment is obtained, its key value can be derived from the 5 bases at its head and the 5 bases at its tail, and the position of the reference gene segment in the reference gene sequence can be recorded in the value 1044_2 corresponding to that key value. For example, the position of the first base of the reference gene segment can be recorded. In this manner, by sliding until the end of the reference gene sequence is reached, the values (i.e., the position information) of all reference gene segments of the reference gene sequence are obtained. Thus, the gene database 1044 shown in FIG. 2A can be established.
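The sliding-window construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_gene_database`, the use of a Python dictionary keyed by the concatenated head and tail bases (instead of a pre-allocated table of 4^10 rows), and the toy reference sequence are all assumptions made for the example.

```python
from collections import defaultdict

def build_gene_database(reference, fragment_len=150, head=5, tail=5):
    """Map the head+tail bases of every window to the window's start positions."""
    database = defaultdict(list)
    # Slide a window of fragment_len bases along the reference, one base per step.
    for start in range(len(reference) - fragment_len + 1):
        fragment = reference[start:start + fragment_len]
        key = fragment[:head] + fragment[-tail:]
        # Record the position of the first base of the reference gene segment.
        database[key].append(start)
    return database

# Hypothetical toy reference with short windows so the result is easy to inspect.
reference = "ACGTACGTACGT"
db = build_gene_database(reference, fragment_len=6, head=2, tail=2)
```

In this toy case the windows starting at positions 0 and 4 share the same head and tail bases, so both positions end up under one key, which is the situation the value 1044_2 is meant to capture.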
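In practical applications, the way in which the key values are mapped depends on the form of the permutation and combination. Let the first n bases and the last m bases of a sequence segment be Seq1 and Seq2 respectively, with index 0 denoting the last base of each part. The mapping of the key value is defined as:
$$\mathrm{Key} = \left(\sum_{i=0}^{n-1} \mathrm{Seq}_1[i] \times 4^{i}\right) \times 4^{m} + \sum_{j=0}^{m-1} \mathrm{Seq}_2[j] \times 4^{j}$$
For example, for a DNA sequence GTGGA…CGAGC, assigning A, C, G and T the values 0, 1, 2 and 3 respectively, the key value corresponding to the sequence is:
$$\begin{aligned}
\mathrm{Key}_{\mathrm{GTG \ldots AGC}} &= \left(\mathrm{Seq}_1[4]\times 4^4 + \mathrm{Seq}_1[3]\times 4^3 + \mathrm{Seq}_1[2]\times 4^2 + \mathrm{Seq}_1[1]\times 4^1 + \mathrm{Seq}_1[0]\times 4^0\right)\times 4^5 \\
&\quad + \mathrm{Seq}_2[4]\times 4^4 + \mathrm{Seq}_2[3]\times 4^3 + \mathrm{Seq}_2[2]\times 4^2 + \mathrm{Seq}_2[1]\times 4^1 + \mathrm{Seq}_2[0]\times 4^0 \\
&= 744 \times 4^5 + 393 = 762249
\end{aligned}$$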
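The key mapping can be checked numerically with a short sketch. It assumes the digit assignment A=0, C=1, G=2, T=3 given in the text and reads each part as a base-4 number whose first base is the most significant digit, which is the ordering implied by the key sequence AAAAAAAAAA, AAAAAAAAAC, and so on. The names `segment_value` and `key_value` are hypothetical. Under this reading the tail CGAGC evaluates to 393.

```python
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}

def segment_value(segment):
    """Read a base string as a base-4 number, first base most significant."""
    value = 0
    for base in segment:
        value = value * 4 + BASE_VALUE[base]
    return value

def key_value(head, tail):
    """Combine head and tail; the head occupies the high-order digits."""
    return segment_value(head) * 4 ** len(tail) + segment_value(tail)

head_value = segment_value("GTGGA")
tail_value = segment_value("CGAGC")
key = key_value("GTGGA", "CGAGC")
```

Because the key is an ordinary integer in the range 0 to 4^(n+m) - 1, it can be used directly as a row number in the index table.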
It will be appreciated that the choice of n and m directly affects the efficiency of the algorithm itself: increasing n and m reduces the number of values (i.e., pieces of position information) stored under each key value. If hardware factors are not considered, each additional base improves the addressing rate for each gene sequence to be tested by a factor of 4 on average. However, because of sequencing errors and genetic mutations, increasing n and m reduces the reliability of the key value, so n and m cannot be increased indefinitely. Therefore, the values of m and n can be chosen as needed, and the length of the reference gene segment can likewise be set according to actual needs. Generally, the values of m and n can be determined according to factors such as the length of the gene sequence to be detected and the length of the reference gene sequence. In practical applications, the length of the reference gene segment is usually the same as the base length of the gene sequence to be detected.
The light computing chip 106 may be an on-chip computing system. Fig. 3A is a schematic structural diagram of a light computing chip according to an embodiment of the present invention. As shown in fig. 3A, the light computing chip 106 may include a light source array 202, a modulator array 204, a detector array 206, a first concave mirror 208, and a second concave mirror 210. Where light source array 202 is located at the object plane focal plane of first concave mirror 208. The modulator array 204 is located at the image plane focal plane of the first concave mirror 208 and the modulator array 204 is also located at the object plane focal plane of the second concave mirror 210. The detector array 206 is located at the image plane focal plane of the second concave mirror 210.
The light source array 202 serves as the data input unit of the light computing chip 106 and is used for modulation and transmission of data. The light source array 202 may generate a plurality of light signals of different light intensities from the input data. The first concave mirror 208 is used to implement a standard Fourier transform on the data light signals transmitted by the light source array 202. The modulator array 204 has two modes of operation: a recording mode and a modulation mode. The recording mode is used to obtain an image of the spectral plane of the data optical signal transmitted by the light source array 202 after it passes through the first concave mirror 208. The modulation mode is used to modulate, on the modulator array 204, the spectral plane of the data optical signal transmitted by the light source array 202. The second concave mirror 210 is used to perform a standard inverse Fourier transform on the optical signal after it passes through the modulator array 204. The detector array 206 serves as the result output unit of the optical computing chip 106 and is used for light intensity signal detection.
Fig. 3B is a schematic structural diagram of another optical computing chip according to an embodiment of the present invention. Unlike the on-chip integrated optical computing chip provided in FIG. 3A, the optical computing chip shown in FIG. 3B has the light source array 202 and the detector array 206 disposed on the same side of the chip, so that the overall computing chip is more compact and the chip size can be reduced. As shown in fig. 3B, the positions of first concave mirror 208, second concave mirror 210, and modulator array 204 are not changed, and the focal positions of light source array 202, modulator array 204, and detector array 206 with respect to first concave mirror 208 and second concave mirror 210, respectively, are also not changed, as compared to the light calculation chip shown in fig. 3A. The implementation of the various devices shown in FIG. 3B may be referred to in the description of the various devices in the light computing chip shown in FIG. 3A. And will not be described in detail herein.
Fig. 3A and fig. 3B are only schematic structural diagrams of the optical computing chip according to the embodiment of the present invention, and in practical applications, the specific structure of the optical computing chip 106 is not limited, and optical computing chips with other structures may also be used. For example, the light computing chip 106 may also be a light computing chip of other configurations implemented using 4F light computing system principles. FIG. 3C is a schematic diagram of a 4F light computing system. As shown in fig. 3C, the first modulator 302 is located at the object plane focal point of the first convex lens 304. The second modulator 306 is located at the focal position of the image plane of the first convex lens 304 and at the focal position of the object plane of the second convex lens 308. The interval between the first convex lens 304 and the second convex lens 308 is the sum of the focal lengths of the two convex lenses (304 and 308). The detector 310 is at the focal position of the image plane of the second convex lens 308, and the length of the whole system is 4 times of the focal length. When data alignment is performed by using the 4F optical system shown in fig. 3C, first data to be aligned may be loaded on the first modulator 302, and spectrum data of the inverted second data may be loaded on the second modulator 306, so that an optical signal generated according to the first data passes through the first convex lens 304, fourier transform occurs, and the optical signal is converted into a spectrum optical signal at the position of the second modulator 306, and the spectrum data of the inverted second data on the second modulator 306 is multiplied in an optical space. The optical field energy distribution of the spectral optical signal of the first data in the optical space is essentially changed. 
The multiplied spectral optical signal is subjected to inverse fourier transform by the second convex lens 308 and is converted back to a time-domain optical signal. The detector 310 may obtain an autocorrelation result of the two data according to the intensity of the time-domain optical signal detected to pass through the second convex lens 308. It should be noted that the first data and the second data loaded on the optical computing chip may be both vectors.
It can be understood that the above-mentioned process of comparing data implemented by the optical computing chip of fig. 3A-3C is obtained by detecting the autocorrelation result of the optical signals of the two data in the optical space. As will be appreciated by those skilled in the art, autocorrelation, also known as sequence correlation, is the cross-correlation of a signal with itself at various points in time. Stated another way, autocorrelation is a function of the similarity between two observations versus the time difference between them. Autocorrelation is a mathematical tool to find a repeating pattern of a sequence of random variables. When sequence identification is actually carried out, an obvious maximum value position can be ensured to appear in the autocorrelation result by using autocorrelation operation when the sequence to be detected is the same as the target sequence, and the comparison of the sequences can be easily realized by monitoring the appearance of the maximum value.
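The transform, multiply, inverse-transform chain of the 4F system has a direct numerical analogue, sketched below for small vectors. This is only an illustration under stated assumptions: it uses a naive discrete Fourier transform and circular shifts, it models the "inverted" second data by taking the complex conjugate of its spectrum (valid for real-valued data), and the function names are hypothetical. It does not model the optics themselves.

```python
import cmath

def dft(signal, inverse=False):
    """Naive discrete Fourier transform (O(N^2)), enough for a small demo."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        out.append(total / n if inverse else total)
    return out

def circular_correlation(first, second):
    """Correlate two equal-length real vectors through the spectral domain."""
    spectrum_first = dft(first)
    # For real data, the conjugate spectrum equals the spectrum of the
    # circularly reversed vector, i.e. the "inverted" second data.
    spectrum_second = [value.conjugate() for value in dft(second)]
    product = [a * b for a, b in zip(spectrum_first, spectrum_second)]
    return [round(value.real, 6) for value in dft(product, inverse=True)]

pattern = [1, 0, 0, 1, 0, 1, 0, 0]
result = circular_correlation(pattern, pattern)
peak_shift = result.index(max(result))
```

When the two inputs are identical, the largest value appears at shift 0, which corresponds to the pronounced maximum that the detector is described as monitoring.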
The following describes in detail how the gene comparison apparatus shown in FIG. 1 performs gene comparison so as to increase the gene comparison speed. FIG. 4 is a flowchart of a gene comparison method according to an embodiment of the present invention. The method shown in FIG. 4 will be described in detail with reference to FIG. 1. For clarity and convenience of description, the embodiment of the present invention is described by taking the detection of one gene sequence to be detected as an example. It is understood that, although a plurality of gene sequences to be detected may be processed at one time in practical applications, each of them may be compared in the manner described in this embodiment. As shown in FIG. 4, the method includes the following steps.
In step 402, the processor 102 obtains a first set of gene segments from the database according to partial bases of the gene sequence to be tested. Specifically, the key value of the gene sequence to be detected can be obtained in the manner of obtaining the key 1044_1 of the gene database 1044. For example, 5 bases at the head and 5 bases at the tail of the gene sequence to be detected can be used as the key value of the gene sequence to be detected. And searching the gene database 1044 according to the key value of the gene sequence to be detected, and obtaining a plurality of value values matched with the key value, wherein the value values are used for indicating the possible positions of the gene sequence to be detected on the reference gene sequence. Since the value corresponding to a certain key value in the gene database 1044 indicates the position information of the corresponding reference gene segment in the reference gene sequence, a plurality of reference gene segments can be obtained according to the matched plurality of value values. In the embodiments of the present invention, a plurality of reference gene fragments that match the key values of the gene sequences to be tested are referred to as a first group of gene fragments.
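Step 402 amounts to a key lookup followed by fetching the reference gene segments at the returned positions. The sketch below assumes a dictionary-based database keyed by the concatenated head and tail bases; the function name `first_group`, the toy reference and the hand-written database contents are hypothetical.

```python
def first_group(reference, database, read, head=5, tail=5):
    """Return (position, segment) pairs whose head/tail key matches the read."""
    key = read[:head] + read[-tail:]
    positions = database.get(key, [])
    return [(pos, reference[pos:pos + len(read)]) for pos in positions]

# Hypothetical toy data: a 12-base reference and a database keyed by the
# first 2 and last 2 bases of every 6-base window.
reference = "ACGTACGTACGT"
database = {"ACAC": [0, 4], "CGCG": [1, 5], "GTGT": [2, 6], "TATA": [3]}

# The read differs from the reference in its middle bases but shares the key.
candidates = first_group(reference, database, "ACTTAC", head=2, tail=2)
```

Only the head and tail bases are used for screening, so a read with mismatches in the middle still retrieves its candidate segments; the later optical comparison decides how similar they actually are.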
In step 404, the optical computing chip 106 optically compares the gene sequence to be tested with a plurality of reference gene segments in the first set of gene segments. Specifically, the processor 102 may perform optical encoding on the gene sequence to be detected and the plurality of reference gene segments, and load the optical encoding of the gene sequence to be detected and the optical encoding of the plurality of reference gene segments to the optical computing chip for comparison. In the process of optically encoding the gene sequence to be detected and the reference gene segment, the base character strings in the gene sequence to be detected and the reference gene segment can be respectively encoded. For example, a unit cluster with 4 point light sources as a single base shows four different bases with different light levels (0 indicating that the light source is off and 1 indicating that the light source is on), and the coding schemes of A, C, G, T are 0001, 0010, 0100, and 1000, as shown in FIG. 5A. According to the coding mode of the single base A, C, G, T, the optical codes of the gene sequence to be tested and a plurality of reference gene segments in the first group of gene segments can be obtained. So that the obtained optical codes of the gene sequence to be tested and a plurality of reference gene segments in the first group of gene segments can be sent to the optical computing chip 106 for optical comparison.
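The four-light-source coding of single bases can be written out as a small sketch. The code table follows the values given in the text (A, C, G, T as 0001, 0010, 0100, 1000); the function name `optical_encode` and the flat list representation of light source states are assumptions for illustration.

```python
# On/off states of the 4 point light sources that form one base cluster.
ONE_HOT = {
    "A": (0, 0, 0, 1),
    "C": (0, 0, 1, 0),
    "G": (0, 1, 0, 0),
    "T": (1, 0, 0, 0),
}

def optical_encode(sequence):
    """Flatten a base string into the on/off states of point light sources."""
    states = []
    for base in sequence:
        states.extend(ONE_HOT[base])
    return states

encoded = optical_encode("GAT")
```

With this coding exactly one light source per cluster is on, so a sequence of k bases occupies 4k light sources and lights k of them.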
In practical application, different encoding modes directly influence the decoding difficulty and the reliability of autocorrelation result output. In yet another case, intensity information of the light and/or spatial information of the light may also be included in the encoding process. In the embodiment of the present invention, a method of encoding using light intensity information may be referred to as an intensity encoding method, and a method of encoding using light spatial information may be referred to as a spatial encoding method. In practical applications, the two encoding schemes may be combined, and such a combination scheme may be referred to as a hybrid encoding scheme. The intensity coding mode can modulate light intensity by using different voltage amplitudes, and four different bases are represented by light signals with different intensities. The intensity encoding may be as shown in fig. 5B. The spatial coding scheme can use a plurality of point light sources as a unit cluster of single bases, and represent four different bases with different light levels (for example, 0 represents that the light source is off, and 1 represents that the light source is on). The spatial coding scheme can be as shown in FIG. 5C, and can use a plurality of light signals with the same voltage and different light intensities to represent different bases. The hybrid coding method may be a combination intensity coding method and a spatial coding method, for example, as shown in fig. 5D, a plurality of optical signals with different voltages and different light intensities may be used to combine and represent different bases. The specific encoding method is not limited in the embodiment of the present invention.
In the process of comparing the genes of the light computing chip 106, the light source array 202 may first send a first light signal according to the code of the inverted gene sequence to be detected, the first light signal is reflected by the first concave mirror 208 and then undergoes fourier transform to become a spectrum light signal, and the modulator array 204 receives the reflection spectrum light signal of the first light signal and modulates the reflection spectrum light signal of the first light signal on the modulator array 204. Then, the light source array 202 transmits a plurality of optical signals according to the optical codes of the plurality of reference gene segments in the first group of reference gene segments, respectively, so that the optical signals transmitted according to the optical codes of the reference gene segments are converted into spectrum optical signals at the modulator array 204 position through the first concave mirror 208, and then are multiplied with the reflection signals of the first optical signals in the optical space. The spectrum optical signal output by the modulator array 204 is subjected to inverse fourier transform by the second concave mirror 210 to become a time-domain optical signal, and finally the detector array 206 can respectively obtain the matching results of the first optical signal and the optical signals of the plurality of reference gene segments by detecting the light intensity of the time-domain optical signal output by the second concave mirror 210. As known to those skilled in the art, the result of the autocorrelation of two data is obtained after the spectral data are multiplied and subjected to inverse Fourier transform.
In step 406, the processor 102 determines the similarity between the gene sequence to be tested and the reference gene segments according to the output result of the optical computing chip. In practice, after the detector array 206 obtains the matching result, the light computing chip 106 may send the matching result to the processor 102. For example, the acquired light intensity signal detected by the detector array 206 may be collected by some peripheral circuits, the collected light intensity signal is converted into an electrical signal, and the electrical signal is converted into a digital signal and then sent to the processor 102, so that the processor 102 can obtain the comparison result of the optical computing chip 106 on the gene sequence to be detected and the reference gene fragment. It is understood that in practical applications, the detector array 206 may generate feedback every time a comparison result is obtained, or may generate feedback when the similarity reaches a preset threshold. It should be noted that the similarity of the embodiments of the present invention is used to indicate the matching degree between the gene sequence to be tested and the reference gene fragment.
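The text does not state how the detected light intensity is converted into a similarity value. One possible reading, offered here purely as an assumption, is to take the zero-shift correlation of the two one-hot codes and divide it by the number of bases, which gives the fraction of positions where the two sequences carry the same base. The function names below are hypothetical.

```python
ONE_HOT = {"A": (0, 0, 0, 1), "C": (0, 0, 1, 0), "G": (0, 1, 0, 0), "T": (1, 0, 0, 0)}

def encode(sequence):
    """One-hot code of a base string as a flat list of 0/1 values."""
    return [bit for base in sequence for bit in ONE_HOT[base]]

def similarity(read, segment):
    """Assumed metric: zero-shift correlation of the codes divided by length."""
    if len(read) != len(segment):
        raise ValueError("sequences must have the same length")
    # Each matching base contributes exactly 1 to the inner product.
    peak = sum(a * b for a, b in zip(encode(read), encode(segment)))
    return peak / len(read)

score = similarity("ACGTACGT", "ACGTACGA")
```

Under this assumption a similarity of 100% means every base matches, which is consistent with the way the thresholds are described as percentages.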
In step 408, the processor 102 determines whether the similarity between the gene sequence to be tested and a first reference gene segment among the plurality of reference gene segments is greater than or equal to a first threshold. If so, the method proceeds to step 410; if the similarity is less than the first threshold, the method proceeds to step 412. In this step, the processor 102 may compare each comparison result with the set threshold after obtaining it, and the matching result of the gene sequence to be tested with any reference gene segment can be compared with the set threshold. The embodiment of the invention is described by taking the gene sequence to be tested and a first reference gene segment in the first group of reference gene segments as an example, where the first reference gene segment is any one reference gene segment in the first group of reference gene segments.
In step 410, the processor 102 records the position of the first reference gene segment in the reference gene sequence and ends the matching of the gene sequence to be tested. In the embodiment of the present invention, a matching result whose similarity is greater than or equal to the first threshold may be regarded as a successful match. When the processor 102 determines that the gene sequence to be tested successfully matches the first reference gene segment, it can record the position of the first reference gene segment in the reference gene sequence, and the matching process for the gene sequence to be tested ends. It will be appreciated that, in the embodiment of the present invention, the similarity indicates how well the gene sequence to be tested matches a reference gene segment, and the first threshold indicates whether the required matching criterion is met. In practical applications, the first threshold may be used to indicate a perfect match or a maximum similarity match. If the similarity is greater than or equal to the set first threshold, the gene sequence to be tested is considered to match the reference gene segment, either perfectly or with maximum similarity. For example, the first threshold may be 100% or 95%, which is not limited herein.
If the processor 102 determines in step 408 that the similarity between the gene sequence to be tested and the first reference gene segment is smaller than the first threshold, then in step 412 the processor 102 further determines whether that similarity is greater than a second threshold. When the similarity is greater than the second threshold, the method proceeds to step 414 and enters a maximum similarity matching process. Otherwise, the method proceeds to step 416, where it is determined that the gene sequence to be tested does not match the first reference gene segment, and the matching of the gene sequence to be tested against the first reference gene segment ends. In an embodiment of the present invention, the second threshold may be set to 50%. When the similarity between the gene sequence to be tested and the first reference gene segment is smaller than the first threshold and larger than the second threshold, it indicates that the gene sequence to be tested is likely to match the reference gene sequence, or that a part of the gene sequence to be tested can match the reference gene sequence. Therefore, the gene sequence to be tested needs to be further compared with the reference gene sequence, and the method enters the maximum similarity matching process.
It is understood that steps 408 to 416 of FIG. 4 are described by taking the matching of the gene sequence to be tested against the first reference gene segment as an example. In practical applications, after the similarities between the gene sequence to be tested and the plurality of reference gene segments are obtained through steps 404 and 406, steps 408 to 416 are performed separately for the similarity between the gene sequence to be tested and each reference gene segment. Of course, after the first group of reference gene segments is obtained, the operations from step 404 to step 416 may also be performed on the gene sequence to be tested and each reference gene segment in the first group of reference gene segments in turn. The specific implementation is not limited herein.
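The branching in steps 408 to 416 can be summarized in a few lines. The sketch below uses the example threshold values mentioned in the text (95% for the first threshold and 50% for the second); the function name `classify_match` and the returned labels are hypothetical.

```python
def classify_match(similarity, first_threshold=0.95, second_threshold=0.50):
    """Decide the next action for one reference gene segment (steps 408-416)."""
    if similarity >= first_threshold:
        return "record_position"          # step 410: successful match
    if similarity > second_threshold:
        return "max_similarity_matching"  # step 414: compare further
    return "no_match"                     # step 416: end for this segment

outcomes = [classify_match(s) for s in (1.0, 0.95, 0.70, 0.50, 0.20)]
```

Note that the first comparison is "greater than or equal to" while the second is strictly "greater than", as in the description of steps 408 and 412.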
According to the gene comparison method provided by the embodiment of the invention, the gene sequence to be detected is first matched against the constructed gene database, so that a first group of reference gene segments that may match the gene sequence to be detected is screened out. As known to those skilled in the art, taking the human reference gene sequence as an example, it has 3 billion bases, and directly aligning the gene sequence to be tested against the whole reference gene sequence takes a great deal of time. After screening with the gene database provided by the embodiment of the invention, the number of reference gene segments to be compared can be reduced from 3 billion to several hundred, which greatly reduces the comparison workload. In addition, in the embodiment of the present invention, after the first group of reference gene segments is obtained, the optical computing chip is further used to optically compare the gene sequence to be detected with the plurality of reference gene segments in the first group of reference gene segments. Because the comparison is performed optically, it is faster than gene comparison performed electrically. Therefore, the gene comparison method provided by the embodiment of the invention greatly improves the comparison efficiency.
It should be noted that, in the embodiment of the present invention, as long as the similarity between the gene sequence to be tested and any one of the reference gene segments in the first group of reference gene segments is smaller than the first threshold and larger than the second threshold, the gene sequence to be tested may be further compared according to the maximum similarity matching method shown in fig. 6. FIG. 6 is a flowchart of another gene alignment method provided in the embodiments of the present invention. The method shown in fig. 6 is still performed by the gene matching apparatus 100. As shown in fig. 6, the method may include the following steps.
In step 602, the processor 102 obtains a plurality of sub-reference gene sequences from the reference gene sequence. Specifically, the processor 102 obtains the plurality of sub-reference gene sequences from the reference gene sequence according to the length of the gene sequence to be detected. For example, the sub-reference gene sequences can be obtained from the reference gene sequence by using the length of the gene sequence to be detected as both the window and the sliding step length. The reference gene sequence can also be divided into a plurality of sub-reference gene sequences according to the base length of the gene sequence to be detected. For example, as shown in FIG. 7, a plurality of sub-reference gene sequences can be obtained from the reference gene sequence 700 according to the length of the gene sequence to be tested 702. Taking a reference gene sequence of 3 billion bases as an example, if the gene sequence to be tested has 150 bases, 20 million sub-reference gene sequences can be obtained.
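The division in step 602 can be sketched as cutting the reference into consecutive, non-overlapping windows whose length equals that of the gene sequence to be tested. The function name `split_reference` and the toy sequence are hypothetical, and dropping an incomplete trailing window is an assumption, since the text does not say how a remainder is handled.

```python
def split_reference(reference, read_len):
    """Cut the reference into consecutive windows of read_len bases."""
    # Window and step are both read_len; an incomplete tail is dropped here.
    return [(start, reference[start:start + read_len])
            for start in range(0, len(reference) - read_len + 1, read_len)]

sub_sequences = split_reference("ACGTACGTACGTAC", 4)

# Count for the figures quoted in the text: 3 billion bases, 150-base reads.
estimated_count = 3_000_000_000 // 150
```

Keeping the start position alongside each sub-reference gene sequence makes it straightforward to record the position in the reference gene sequence when a match is found later.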
In step 604, the gene sequence to be tested and the ith sub-reference gene sequence obtained in step 602 are input to the optical computing chip 106 for optical comparison. The initial value of i is 1, and the value of i is not greater than the number of sub-reference gene sequences obtained in step 602. Specifically, the processor 102 may optically encode the gene sequence to be detected and the ith sub-reference gene sequence, and load the optical codes of the gene sequence to be detected and the ith sub-reference gene sequence into the optical computing chip 106 for optical comparison, so as to obtain the similarity between the gene sequence to be detected and the ith sub-reference gene sequence. The optical computing chip 106 may send the comparison result to the processor 102. In the embodiment of the present invention, the similarity between the gene sequence to be tested and the first sub-reference gene sequence in the plurality of sub-reference gene sequences may be referred to as a first similarity.
In step 606, the processor 102 determines whether the similarity between the gene sequence to be detected and the ith sub-reference gene sequence is greater than a set third threshold. If the similarity is not greater than the third threshold, the gene sequence to be detected does not match the ith sub-reference gene sequence; the method proceeds to step 608, sets i = i + 1, and returns to step 604 to compare the gene sequence to be detected with the next sub-reference gene sequence, until the gene sequence to be detected has been optically compared by the optical computing chip 106 with all the sub-reference gene sequences obtained in step 602. If, in step 606, the processor 102 determines that the similarity between the gene sequence to be detected and the ith sub-reference gene sequence is greater than the third threshold, the method proceeds to step 610. In the embodiment of the present invention, in order to find, as far as possible, a reference gene fragment matching at least a partial fragment of the gene sequence to be detected, the third threshold may be set to a similarity of less than 50%; for example, the third threshold may be set to 20%. It is to be understood that, in practical applications, the third threshold may also be the same as the second threshold, which is not limited herein.
If the similarity between the gene sequence to be detected and the ith sub-reference gene sequence is greater than the third threshold, in step 610, the processor 102 further determines whether the similarity between the gene sequence to be detected and the ith sub-reference gene sequence is greater than a fourth threshold. If the similarity is greater than the fourth threshold, the method proceeds to step 612. In the embodiment of the present invention, the fourth threshold is not greater than the first threshold; the first threshold may be a threshold set to indicate a perfect match, and the fourth threshold is a threshold indicating a maximum similarity match. Typically, the first threshold may be set to 100% and the fourth threshold may be set to 95%. It will be appreciated that, in practical applications, the fourth threshold may also be the same as the first threshold; for example, both the first threshold and the fourth threshold may be set to 95% to indicate that the threshold of maximum similarity match is reached. This is not limited herein. In step 612, the processor 102 determines that the ith sub-reference gene sequence is the gene fragment having the maximum similarity with the gene sequence to be detected, records the position of the ith sub-reference gene sequence in the reference gene sequence, and ends the comparison process for the gene sequence to be detected. If the similarity between the gene sequence to be detected and the ith sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 614.
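The two-threshold decision of steps 606 to 614 can be summarized by the following sketch. The function name `classify` and the returned labels are hypothetical; the default values 20% and 95% are the example values given in the text, expressed as fractions.

```python
def classify(similarity: float,
             third_threshold: float = 0.20,
             fourth_threshold: float = 0.95) -> str:
    """Map one similarity value to the next action in the flow of FIG. 6.

    The labels are illustrative names, not terms from the embodiment.
    """
    if similarity <= third_threshold:
        # Step 608: no match, move on to the next sub-reference (i = i + 1).
        return "next_sub_reference"
    if similarity > fourth_threshold:
        # Step 612: maximum similarity match, record the position and stop.
        return "record_position"
    # Step 614: partial match, split the sequence to be detected.
    return "split_query"
```

Both comparisons are strict "greater than" tests, as in the text, so a similarity exactly equal to a threshold falls into the lower branch.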
In step 614, the processor 102 obtains a first to-be-detected sub-gene sequence and a second to-be-detected sub-gene sequence according to the gene sequence to be detected. With continued reference to FIG. 7, in this step the processor 102 can obtain a first to-be-detected sub-gene sequence 7022 and a second to-be-detected sub-gene sequence 7024 from the gene sequence to be detected 702, where some bases of the first to-be-detected sub-gene sequence 7022 and the second to-be-detected sub-gene sequence 7024 are the same. For example, the first to-be-detected sub-gene sequence 7022 may include bases of a first predetermined length taken from the head of the gene sequence to be detected 702 toward the tail, and the second to-be-detected sub-gene sequence 7024 may include bases of the first predetermined length taken from the tail of the gene sequence to be detected 702 toward the head, so that some bases of the two sub-gene sequences are the same. The method proceeds to step 616.
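The head and tail split of step 614 can be illustrated as below. This is a sketch under the assumption that the predetermined length is more than half of the sequence length, which is what makes the two pieces share bases; the function name `head_tail_split` is hypothetical.

```python
from typing import Tuple


def head_tail_split(sequence: str, length: int) -> Tuple[str, str]:
    """Return (head_piece, tail_piece), each `length` bases long.

    The head piece is taken from the head toward the tail and the tail piece
    from the tail toward the head. Requiring length > len(sequence) / 2
    guarantees that the two pieces overlap in the middle of the sequence.
    """
    if not (len(sequence) / 2 < length < len(sequence)):
        raise ValueError("length must be more than half and less than all "
                         "of the sequence length")
    return sequence[:length], sequence[-length:]


head, tail = head_tail_split("AACCGGTTAC", 6)
# head is "AACCGG", tail is "GGTTAC"; the shared bases are "GG".
```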
In step 616, the jth to-be-detected sub-gene sequence and the ith sub-reference gene sequence are optically compared by the optical computing chip 106. The initial value of j is 1, and the value of j is not greater than the number of to-be-detected sub-gene sequences. Since two to-be-detected sub-gene sequences are obtained from the gene sequence to be detected in this example, the value of j is not greater than 2 in this example. It can be understood that if p (p is greater than 2) to-be-detected sub-gene sequences need to be obtained in practical applications, the value of j is not greater than p. In this step, the processor 102 also needs to optically encode the jth to-be-detected sub-gene sequence, and then load the optical code of the jth to-be-detected sub-gene sequence and the optical code of the ith sub-reference gene sequence into the optical computing chip 106 for optical comparison, so as to obtain the similarity between the jth to-be-detected sub-gene sequence and the ith sub-reference gene sequence. The method proceeds to step 618. In step 618, the processor 102 determines whether the similarity between the jth to-be-detected sub-gene sequence and the ith sub-reference gene sequence is greater than the third threshold. If not, the method proceeds to step 620, sets j = j + 1, and returns to step 616, in which the (j+1)th to-be-detected sub-gene sequence and the ith sub-reference gene sequence are optically compared to obtain the similarity between them. If, in step 618, the processor 102 determines that the similarity between the jth to-be-detected sub-gene sequence and the ith sub-reference gene sequence is greater than the third threshold, the method proceeds to step 622, in which the processor 102 further determines whether that similarity is greater than the fourth threshold.
In the embodiment of the present invention, for clarity and convenience of description, the comparison result obtained by the optical computing chip for the first to-be-detected sub-gene sequence and the first sub-reference gene sequence may be referred to as a second similarity, and the comparison result obtained by the optical computing chip for the second to-be-detected sub-gene sequence and the first sub-reference gene sequence may be referred to as a third similarity.
If in step 622, the processor 102 determines that the similarity between the jth test-target sub-gene sequence and the ith sub-reference gene sequence is greater than the fourth threshold, the method proceeds to step 624, records the position of a reference gene fragment in the ith sub-reference gene sequence, which is matched with the jth test-target sub-gene sequence, in the reference gene sequence, and ends the matching of the test-target gene sequence. It should be noted that, in practical applications, if it is determined that the similarity of partial fragments of the jth test-target gene sequence and the ith sub-reference gene sequence is greater than the fourth threshold, in order to increase the matching speed, the comparison process of the test-target gene sequence may be directly ended without continuously matching the jth +1 test-target gene sequence and the ith sub-reference gene sequence. Of course, it is understood that, in practical applications, the j +1 th test-target gene sequence and the ith sub-reference gene sequence may be optically aligned as needed.
If, in step 622, the processor 102 determines that the similarity between the jth to-be-detected sub-gene sequence and the ith sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 626. In step 626, the processor 102 obtains a first to-be-detected gene sequence unit and a second to-be-detected gene sequence unit from the jth to-be-detected sub-gene sequence, where some bases of the first to-be-detected gene sequence unit and the second to-be-detected gene sequence unit are the same. Specifically, reference may be made to the method in step 614 for obtaining the two to-be-detected sub-gene sequences from the gene sequence to be detected. For example, the first to-be-detected gene sequence unit may include bases of a second predetermined length taken from the head of the jth to-be-detected sub-gene sequence toward the tail, and the second to-be-detected gene sequence unit may include bases of the second predetermined length taken from the tail of the jth to-be-detected sub-gene sequence toward the head.
In step 628, the kth to-be-detected gene sequence unit is optically compared with the ith sub-reference gene sequence by the optical computing chip 106. The initial value of k is 1, and the value of k is not greater than the number of to-be-detected gene sequence units. In the embodiment of the present invention, since two to-be-detected gene sequence units are obtained from the jth to-be-detected sub-gene sequence, the value of k is not greater than 2. Specifically, in step 628, the processor 102 may optically encode the kth to-be-detected gene sequence unit, and load the optical code of the kth to-be-detected gene sequence unit and the optical code of the ith sub-reference gene sequence into the optical computing chip 106 for optical comparison. The method proceeds to step 630. In step 630, the processor 102 determines whether the similarity between the kth to-be-detected gene sequence unit and the ith sub-reference gene sequence is greater than the third threshold. If the similarity is not greater than the third threshold, the method proceeds to step 632, sets k = k + 1, and returns to step 628, in which the optical computing chip 106 optically compares the next (for example, the second) to-be-detected gene sequence unit with the ith sub-reference gene sequence.
If in step 630, the processor 102 determines that the similarity between the kth gene sequence unit to be tested and the ith sub-reference gene sequence is greater than the third threshold, the method proceeds to step 634, determines whether the similarity between the kth gene sequence unit to be tested and the ith sub-reference gene sequence is greater than a fourth threshold, and if so, the method proceeds to step 636, records the position of the gene fragment in the ith sub-reference gene sequence, which is matched with the kth gene sequence unit to be tested, in the reference gene sequence, and ends the matching. Specifically, in one case, in order to increase the matching speed, after the gene segment with the maximum similarity is obtained, the matching of the gene sequence to be tested may be ended. In another case, the matching of the jth gene sequence to be detected may be ended, or the matching of the kth gene sequence unit to be detected may be ended, and the matching of the k +1 gene sequence unit to be detected may be continued, or the matching of the jth +1 gene sequence to be detected may be continued.
If in step 634, the processor 102 determines that the similarity between the kth gene sequence unit to be detected and the ith sub-reference gene sequence is not greater than the fourth threshold, the method proceeds to step 638, continues to split the kth gene sequence unit to be detected in a recursive manner, and optically compares the sub-units of the kth gene sequence unit to be detected and the ith sub-reference gene sequence until a gene fragment to be detected whose similarity with the ith sub-reference gene sequence is greater than the fourth threshold is found. In the embodiment of the present invention, a reference gene fragment having a similarity greater than the fourth threshold with respect to a part of the test gene fragments in the test gene sequence among the sub-reference gene fragments may be referred to as a maximum similar gene fragment.
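The recursive splitting of steps 614 to 638 can be sketched in software as follows. Everything here is an illustrative assumption rather than the embodiment itself: the `similarity` function is a simple software stand-in for the optical comparison (the best fraction of matching bases over all ungapped placements of the fragment inside the sub-reference), and the names `find_fragment`, `keep_ratio` and `min_len` are hypothetical.

```python
from typing import Optional


def similarity(fragment: str, sub_reference: str) -> float:
    """Software stand-in for the optical comparison (an assumption).

    Returns the best fraction of matching bases over all ungapped
    placements of `fragment` inside `sub_reference`.
    """
    n = len(fragment)
    if n == 0 or n > len(sub_reference):
        return 0.0
    best = 0
    for offset in range(len(sub_reference) - n + 1):
        window = sub_reference[offset:offset + n]
        best = max(best, sum(a == b for a, b in zip(fragment, window)))
    return best / n


def find_fragment(fragment: str, sub_reference: str,
                  third: float = 0.20, fourth: float = 0.95,
                  keep_ratio: float = 0.6, min_len: int = 4) -> Optional[str]:
    """Recursively split `fragment` until a piece exceeds `fourth`.

    Returns the first piece whose similarity to `sub_reference` is greater
    than the fourth threshold, or None if no such piece is found.
    """
    score = similarity(fragment, sub_reference)
    if score <= third:
        return None          # no match for this piece
    if score > fourth:
        return fragment      # maximum similarity match found
    if len(fragment) <= min_len:
        return None          # too short to split further
    length = max(min_len, int(len(fragment) * keep_ratio))
    if length >= len(fragment):
        return None          # guard against non-shrinking splits
    # Overlapping head piece and tail piece, as in steps 614 and 626.
    for piece in (fragment[:length], fragment[-length:]):
        hit = find_fragment(piece, sub_reference, third, fourth,
                            keep_ratio, min_len)
        if hit is not None:
            return hit
    return None
```

For instance, with the sub-reference `"ACGTTGCAAGCT"` and the fragment `"ACGTTGCATTTT"`, the whole fragment scores 75% under this stand-in, so it is split, and the 7-base head piece `"ACGTTGC"` matches completely and is returned. The minimum length stops the recursion from descending to pieces so short that they would match almost anywhere.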
For a gene sequence to be detected that cannot be exactly matched by the gene comparison method shown in FIG. 4, the gene comparison method provided by the embodiment of the present invention can continue with maximum similarity matching according to the gene comparison method shown in FIG. 6. Because the method shown in FIG. 6 allows the gene sequence to be detected to be not completely identical to the obtained maximum similar gene fragment, the gene sequence to be detected may have some bases deleted or may differ from the reference gene fragment, so that a deleted gene or variant gene in the gene sequence to be detected can be accurately located.
In another case, the gene alignment method provided in the embodiment of the present invention may further include the process shown in fig. 8. The method shown in fig. 8 may follow step 604 shown in fig. 6. As shown in fig. 8, the method may include the following steps. In step 802, processor 102 determines that a first similarity between the test gene sequence and the ith sub-reference gene sequence is less than a third threshold. And, in step 804, when the processor 102 further determines that the second similarity between the gene sequence to be tested and the i +1 th sub-reference gene sequence is greater than the third threshold, the method proceeds to step 806. It should be noted that the description of step 802 and step 804 may refer to the description of step 606 in fig. 6, and the third threshold may be the same as the third threshold set in step 606, and may be 50%, for example.
In step 806, the processor further determines whether the sum of the first similarity and the second similarity is greater than 100%. If the sum of the first similarity and the second similarity is not greater than 100%, the method proceeds to step 808, and the optical calculation chip 106 optically compares the gene sequence to be tested with the i +2 th sub-reference gene sequence. If the sum of the first similarity and the second similarity is greater than 100%, the method proceeds to step 810. In step 810, the processor 102 obtains a new sub-reference gene sequence according to the ith sub-reference gene sequence and the (i + 1) th sub-reference gene sequence. In step 810, a new sub-reference gene sequence may be composed by obtaining a part of the reference gene segments from the i-th sub-reference gene sequence and the i + 1-th sub-reference gene sequence according to the ratio of the first similarity to the second similarity. For example, if the first similarity is 40%, the second similarity is 80%, and the length of a reference gene sequence is 150 base pairs, then the 50 base pairs at the tail of the i-th sub-reference sequence and the 100 base pairs at the head of the i + 1-th sub-reference sequence can be combined to form a continuous new sub-reference sequence with a length of 150 base pairs. After obtaining the new sub-reference sequence, the method proceeds to step 812, and the optical computing chip 106 optically compares the gene sequence to be tested with the obtained new sub-reference sequence, where the specific optical comparison method can be described with reference to step 604 in fig. 6. In addition, in the process of comparing the gene sequence to be tested with the obtained new sub-reference sequence, reference may be made to the process of comparing the gene sequence to be tested with the ith sub-reference sequence in fig. 6. 
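The construction of the new sub-reference gene sequence in step 810 can be sketched as follows. The function name `merge_adjacent` and the rounding rule are assumptions for illustration; the text only states that the parts are taken according to the ratio of the two similarities.

```python
from typing import Optional


def merge_adjacent(sub_i: str, sub_next: str,
                   sim_i: float, sim_next: float) -> Optional[str]:
    """Build a new sub-reference from two consecutive sub-references.

    `sim_i` and `sim_next` are similarities expressed as fractions. If their
    sum is not greater than 100%, no new sub-reference is built (step 808
    applies instead) and None is returned. Otherwise the tail of `sub_i` and
    the head of `sub_next` are joined, with lengths in the ratio
    sim_i : sim_next, so that the result has the original window length.
    """
    if len(sub_i) != len(sub_next):
        raise ValueError("sub-references must have the same length")
    if sim_i + sim_next <= 1.0:
        return None
    n = len(sub_i)
    tail_len = round(n * sim_i / (sim_i + sim_next))
    head_len = n - tail_len
    return sub_i[n - tail_len:] + sub_next[:head_len]


# Example from the text: 40% and 80% on 150-base sub-references
# gives 50 tail bases followed by 100 head bases.
merged = merge_adjacent("A" * 150, "C" * 150, 0.40, 0.80)
```

Because the two sub-references are consecutive in the reference gene sequence, the joined tail and head form one continuous stretch of the reference.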
In this way, if the similarity between the test gene sequence and the new sub-reference gene sequence is greater than the third threshold, the method of steps 610 to 638 in fig. 6 can be continued to search for a reference gene segment from the new sub-reference sequence whose similarity to the test gene sequence is greater than the fourth threshold. In the embodiment of the present invention, the reference gene segments found in the reference gene sequences according to the comparison methods shown in fig. 6 and 8 and having similarity with the to-be-detected gene sequence greater than the fourth threshold may be referred to as maximum similar gene segments.
The method shown in fig. 8 may be used in combination with the method shown in fig. 6. For example, when it is determined that the similarity between the test gene sequence and the ith sub-reference gene sequence is low and the similarity between the test gene sequence and the (i + 1) th sub-reference gene sequence is high, the method shown in fig. 8 may be performed instead, so that the new reference gene sequence obtained from the ith sub-reference gene sequence and the (i + 1) th sub-reference gene sequence can be aligned with the test gene sequence. The mode of timely adjusting the sub-reference gene sequence according to partial comparison results can improve the probability and speed of obtaining the maximum similar gene segments and reduce the comparison times. It is understood that, in practical applications, the method shown in fig. 8 may be executed after the to-be-detected gene sequence is aligned with the plurality of sub-reference gene sequences obtained in step 602 according to the method shown in fig. 6, and the sub-reference gene sequences are adjusted and then aligned. In the embodiment of the present invention, a specific execution manner is not limited.
It should be noted that FIG. 8 is described by taking the gene sequence to be detected and the ith sub-reference gene sequence as an example; in practical applications, the ith sub-reference gene sequence may be any one of the sub-reference gene sequences. For example, in step 802, the processor 102 may compare the gene sequence to be detected with a second sub-reference gene sequence of the plurality of sub-reference gene sequences, and determine that the similarity between them, referred to as a fourth similarity, is smaller than the third threshold. In step 804, the processor 102 determines that the similarity between the gene sequence to be detected and a third sub-reference gene sequence in the plurality of sub-reference gene sequences is a fifth similarity, and that the fifth similarity is greater than the third threshold. If the processor 102 further determines in step 806 that the sum of the fourth similarity and the fifth similarity is greater than 100%, the processor 102 may obtain a new sub-reference gene sequence from the second sub-reference gene sequence and the third sub-reference gene sequence according to the method shown in FIG. 8.
In the embodiment of the present invention, after the maximum similar gene segment of the gene sequence to be detected is found by the methods shown in fig. 6 and fig. 8, the maximum similar gene segment may be further expanded on the gene sequence to be detected and the reference sequence by using a Smith-Waterman local comparison algorithm, so as to obtain a longer maximum similar gene segment, thereby facilitating further gene analysis work on the gene segment to be detected.
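For reference, a basic Smith-Waterman local alignment can be sketched as below. This is the textbook algorithm with a linear gap penalty, not the specific extension procedure of the embodiment; the scoring values (match +2, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, Tuple[int, int], Tuple[int, int]]:
    """Return (best_score, (a_start, a_end), (b_start, b_end)).

    The ranges are half-open slices of `a` and `b` covering the best local
    alignment found by the standard dynamic-programming recurrence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end_a, end_b = i, j
    while i > 0 and j > 0 and h[i][j] > 0:
        diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if h[i][j] == diag:
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_a), (j, end_b)


score, a_range, b_range = smith_waterman("TTACGTAA", "GGACGTCC")
# The shared stretch "ACGT" is found at positions 2 to 6 in both strings.
```

In the setting described above, `a` would be the gene sequence to be detected and `b` the reference region around the recorded position, so that the aligned ranges give the extended similar fragment.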
It is understood that the method shown in the above examples is described by taking the alignment of the gene sequence to be tested and one of the sub-reference gene sequences as an example. In practical applications, the sequences may be aligned with a plurality of sub-reference gene sequences, which is not limited herein. The embodiments of the present application refer to ordinal numbers such as "first", "second", etc. for distinguishing a plurality of objects, and do not limit the order, sequence, priority, or importance of the plurality of objects.
It is understood that the alignment method of the present invention is only an example of gene alignment. In practical applications, the comparison method combining the electrical comparison method realized based on the database and the optical comparison method based on the optical computing chip provided by the embodiment of the present invention can be applied to various other application scenarios. Fig. 9 is a schematic diagram of a comparison apparatus according to an embodiment of the present invention. The alignment device can be used for realizing various data alignment scenes including gene alignment.
As shown in fig. 9, the alignment apparatus 900 may include a processor 902, a memory 904, and a light computing chip 906. The processor 902 is configured to obtain a first set of reference objects from a database stored in the memory 904 according to a first object to be matched, where the first set of reference objects includes a plurality of reference objects having a same partial feature as the first object. The light computing chip 906 is connected to the processor and is used for optically aligning the first object and the plurality of reference objects. The processor 902 may be further configured to determine similarity between the first object and the plurality of reference objects according to an output of the optical computing chip.
In yet another case, the processor 902 may be further configured to determine, according to the output result of the optical computing chip, that the similarity between the first object and a first reference object in the first set of reference objects is smaller than a first threshold and larger than a second threshold, and obtain a plurality of sub-reference objects according to a standard object, where each sub-reference object is a part of the reference object. The optical computing chip 906 may be further configured to optically compare the first object with a first sub-reference object of the plurality of sub-reference objects, so as to obtain a first similarity between the first object and the first sub-reference object.
In yet another case, the processor 902 may be further configured to determine that the first similarity is greater than a third threshold and less than a fourth threshold, and in response to the determination, obtain a first sub-object and a second sub-object according to the first object, where the fourth threshold is not greater than the first threshold, and partial data of the first sub-object and the second sub-object are the same. The optical computing chip 906 may be further configured to optically compare the first sub-object and the first sub-reference object to obtain a second similarity, and optically compare the second sub-object and the first sub-reference object to obtain a third similarity. In practical applications, the processor 902 may be further configured to record the position of the first sub-reference object in the standard object when the second similarity is greater than the fourth threshold.
It is understood that the comparison device shown in fig. 9 can be used to realize the functions of the comparison device shown in fig. 1, and the description of the comparison device shown in fig. 9 can be the description of fig. 1 to fig. 8 in the embodiment of the present invention. The alignment apparatus shown in fig. 9 can be applied to various scenes including gene alignment that require data alignment or feature alignment. It can be said that the gene alignment apparatus shown in FIG. 1 is a specific application of the alignment apparatus shown in FIG. 9. The alignment apparatus shown in fig. 9 and the alignment method provided in the embodiment of the present invention can also be used in scenes such as image alignment, image searching, sequence alignment, fuzzy matching, and the like, which are not limited herein.
Fig. 10 is a schematic diagram of another alignment apparatus provided in the embodiment of the present invention. As shown in fig. 10, the alignment apparatus 1000 may include an obtaining module 1002, an alignment module 1004, and a result processing module 1006. The obtaining module 1002 is configured to obtain a first group of gene segments from a database according to a gene sequence to be detected, where the database system includes a plurality of reference gene segments of a reference gene sequence, and the first group of gene segments includes a plurality of reference gene segments matched with a part of bases of the gene sequence to be detected. The alignment module 1004 is configured to optically align the gene sequence to be tested with a plurality of reference gene segments in the first set of gene segments. The result processing module 1006 is configured to determine similarities between the gene sequence to be detected and the reference gene segments in the first set of gene segments according to the output result of the comparing module 1004.
In yet another case, the comparing apparatus 1000 may further include a determining module 1008. The judging module 1008 is configured to determine, according to the output result of the comparing module 1004, that the similarity between the gene sequence to be detected and a first gene segment in the first group of gene segments is smaller than a first threshold and larger than a second threshold. The obtaining module 1002 is further configured to obtain a plurality of sub-reference gene sequences from the reference gene sequence when the determining module 1008 determines that the similarity between the to-be-detected gene sequence and a first gene segment in the first group of gene segments is smaller than a first threshold and larger than a second threshold, where each sub-reference gene sequence is a part of the reference gene sequence. The alignment module 1004 is further configured to optically align the gene sequence to be tested with a first sub-reference gene sequence of the plurality of sub-reference gene sequences. The result processing module 1006 is further configured to obtain a first similarity between the gene sequence to be detected and the first sub-reference gene sequence according to an output result of the optical computing chip.
In yet another case, the determining module 1008 is further configured to determine that the first similarity is greater than a third threshold and smaller than a fourth threshold, where the fourth threshold is not greater than the first threshold. The obtaining module 1002 is further configured to, in response to the judgment of the judging module 1008, obtain a first to-be-detected sub-gene sequence and a second to-be-detected sub-gene sequence according to the to-be-detected gene sequence, where partial bases of the first to-be-detected sub-gene sequence and the second to-be-detected sub-gene sequence are the same. The comparison module 1004 is further configured to optically compare the first sub-gene sequence to be detected with the first sub-reference gene sequence to obtain a second similarity, and optically compare the second sub-gene sequence to be detected with the first sub-reference gene sequence to obtain a third similarity.
In yet another case, the results processing module 1006 is further configured to record the location of the first sub-reference gene sequence in the reference gene sequence when the second degree of similarity is greater than the fourth threshold.
In another case, the obtaining module 1002 is further configured to, when the determining module 1008 determines that the third similarity is greater than the third threshold and smaller than the fourth threshold, obtain a first to-be-detected sub-gene sequence unit and a second to-be-detected sub-gene sequence unit according to the second to-be-detected sub-gene sequence, where partial bases of the first to-be-detected sub-gene sequence unit and the second to-be-detected sub-gene sequence unit are the same. The alignment module 1004 is further configured to optically align the first sub-gene sequence unit to be tested with the first sub-reference gene sequence, and optically align the second sub-gene sequence unit to be tested with the first sub-reference gene sequence.
In another case, the comparing module 1004 is further configured to optically compare the gene sequence to be tested with a second sub-reference gene sequence in the plurality of sub-reference gene sequences, so as to obtain a fourth similarity between the gene sequence to be tested and the second sub-reference gene sequence; and optically comparing the gene sequence to be detected with a third sub-reference gene sequence in the plurality of sub-reference gene sequences to obtain a fifth similarity between the gene sequence to be detected and the third sub-reference gene sequence, wherein the third sub-reference gene sequence is a sub-reference gene sequence continuous with the second sub-reference gene sequence. When the determining module 1008 determines that the sum of the fourth similarity and the fifth similarity is greater than the first threshold, the obtaining module 1002 is further configured to obtain a fourth sub-reference gene sequence according to the second sub-reference gene sequence and the third sub-reference gene sequence, where the fourth sub-reference gene sequence includes a part of bases of the second sub-reference gene sequence and a part of bases of the third sub-reference gene sequence. The comparison module 1004 is further configured to input the to-be-detected gene sequence and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.
In another case, the result processing module 1006 is further configured to determine that a second gene segment in the first group of gene segments matches the gene sequence to be detected according to the output result of the optical computing chip, and record the position of the second gene segment in the reference gene sequence.
It will be appreciated that the alignment apparatus shown in FIG. 10 may be used to perform the functions of the gene alignment apparatus shown in FIG. 1. Reference may be made in particular to the preceding description of the function of the relevant blocks of fig. 1. And will not be described in detail herein. It is understood that the above-described apparatus embodiments are merely illustrative, and that, for example, the division of the modules is only one logical functional division, and that in actual implementation, there may be other divisions. For example, multiple modules or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the connection between the modules discussed in the above embodiments may be electrical, mechanical or other. The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules. In addition, each functional module in each embodiment of the application embodiments may exist independently, or may be integrated into one processing module.
Embodiments of the present invention further provide a computer program product for implementing gene alignment, including a computer-readable storage medium storing program code, where the program code includes instructions for executing the method described in any of the foregoing method embodiments. It will be understood by those of ordinary skill in the art that the foregoing storage medium includes various non-transitory machine-readable media that can store program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a random-access memory (RAM), a solid state disk (SSD), or another non-volatile memory.
It should be noted that the examples provided in this application are only illustrative. It will be apparent to those skilled in the art that, for convenience and brevity of description, each embodiment is described with its own emphasis, and for parts of one embodiment that are not described in detail, reference may be made to the description of other embodiments. The features disclosed in the embodiments of the invention, in the claims and in the drawings may be present independently or in combination. Features described as hardware in the embodiments of the invention may be implemented by software, and vice versa. This is not limited herein.

Claims (24)

1. A method of gene alignment, the method performed by a computer system comprising an optical computing chip, the method comprising:
acquiring a first group of gene segments from a gene database according to a gene sequence to be detected, wherein the gene database comprises a plurality of reference gene segments of a reference gene sequence, and the first group of gene segments comprise a plurality of reference gene segments matched with partial bases of the gene sequence to be detected;
and inputting the gene sequence to be detected and a plurality of reference gene segments in the first group of gene segments into the optical computing chip for optical comparison.
2. The method of gene alignment according to claim 1, further comprising:
according to the output result of the optical computing chip, determining that the similarity between the gene sequence to be detected and a first gene fragment in the first group of gene fragments is smaller than a first threshold value and larger than a second threshold value;
obtaining a plurality of sub-reference gene sequences from the reference gene sequence, wherein each sub-reference gene sequence is a portion of the reference gene sequence;
inputting the gene sequence to be detected and a first sub-reference gene sequence in the sub-reference gene sequences into the optical computing chip for optical comparison to obtain a first similarity between the gene sequence to be detected and the first sub-reference gene sequence.
3. The method of gene alignment according to claim 2, further comprising:
determining that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold;
in response to the determination, obtaining a first to-be-detected sub-gene sequence and a second to-be-detected sub-gene sequence according to the gene sequence to be detected, wherein partial bases of the first to-be-detected sub-gene sequence and the second to-be-detected sub-gene sequence are the same;
inputting the first to-be-detected sub-gene sequence and the first sub-reference gene sequence into the optical computing chip for optical comparison to obtain a second similarity;
and inputting the second to-be-detected sub-gene sequence and the first sub-reference gene sequence into the optical computing chip for optical comparison to obtain a third similarity.
4. The method of claim 3, further comprising:
when the second similarity is greater than the fourth threshold, recording the position of the first sub-reference gene sequence in the reference gene sequence.
5. The method of claim 3 or 4, further comprising:
when the third similarity is greater than the third threshold and smaller than the fourth threshold, obtaining a first to-be-detected sub-gene sequence unit and a second to-be-detected sub-gene sequence unit according to the second to-be-detected sub-gene sequence, wherein partial bases of the first to-be-detected sub-gene sequence unit and the second to-be-detected sub-gene sequence unit are the same;
inputting the first to-be-detected sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison;
and inputting the second to-be-detected sub-gene sequence unit and the first sub-reference gene sequence into the optical computing chip for optical comparison.
6. The method of any one of claims 2 to 5, wherein the method further comprises:
inputting the gene sequence to be detected and a second sub-reference gene sequence in the plurality of sub-reference gene sequences into the optical computing chip for optical comparison to obtain a fourth similarity between the gene sequence to be detected and the second sub-reference gene sequence;
inputting the gene sequence to be detected and a third sub-reference gene sequence in the plurality of sub-reference gene sequences into the optical computing chip for optical comparison to obtain a fifth similarity between the gene sequence to be detected and the third sub-reference gene sequence, wherein the third sub-reference gene sequence is a sub-reference gene sequence continuous with the second sub-reference gene sequence;
determining that a sum of the fourth similarity and the fifth similarity is greater than the first threshold;
obtaining a fourth sub-reference gene sequence according to the second sub-reference gene sequence and the third sub-reference gene sequence, wherein the fourth sub-reference gene sequence comprises partial bases of the second sub-reference gene sequence and partial bases of the third sub-reference gene sequence;
and inputting the gene sequence to be detected and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.
7. The method of gene alignment according to claim 1, further comprising:
determining that a second gene segment in the first group of gene segments is matched with the gene sequence to be detected according to the output result of the optical computing chip;
recording the position of the second gene segment in the reference gene sequence.
8. The method of claim 1, wherein inputting the gene sequence to be detected and the plurality of reference gene segments in the first group of gene segments into the optical computing chip for optical comparison comprises:
optically encoding the gene sequence to be detected and the plurality of reference gene segments in the first group of gene segments respectively;
and respectively inputting the optical code of the gene sequence to be detected and the optical codes of the reference gene segments in the first group of gene segments into the optical computing chip for optical comparison.
9. The method of any one of claims 1 to 7, wherein the acquiring the first group of gene segments from the gene database according to the gene sequence to be detected comprises:
acquiring the first group of gene segments from the gene database according to the first m bases and the last n bases of the gene sequence to be detected, wherein the value of m and the value of n are both greater than 0, and the sum of m and n is less than the number of bases in the gene sequence to be detected.
10. A gene alignment apparatus comprising:
a processor, configured to acquire a first group of gene segments from a database according to a gene sequence to be detected, wherein the database comprises a plurality of reference gene segments of a reference gene sequence, and the first group of gene segments comprises a plurality of reference gene segments matched with partial bases of the gene sequence to be detected;
and an optical computing chip, connected to the processor and configured to optically compare the gene sequence to be detected with the plurality of reference gene segments in the first group of gene segments.
11. A gene alignment apparatus as claimed in claim 10 wherein the processor is further configured to:
according to the output result of the optical computing chip, determining that the similarity between the gene sequence to be detected and a first gene fragment in the first group of gene fragments is smaller than a first threshold value and larger than a second threshold value;
obtaining a plurality of sub-reference gene sequences from the reference gene sequence, wherein each sub-reference gene sequence is a portion of the reference gene sequence;
the optical computing chip is further configured to: and optically comparing the gene sequence to be detected with a first sub-reference gene sequence in the plurality of sub-reference gene sequences to obtain a first similarity between the gene sequence to be detected and the first sub-reference gene sequence.
12. The apparatus of claim 11, wherein the processor is further configured to:
determining that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold;
in response to the determination, obtaining a first to-be-detected sub-gene sequence and a second to-be-detected sub-gene sequence according to the gene sequence to be detected, wherein partial bases of the first to-be-detected sub-gene sequence and the second to-be-detected sub-gene sequence are the same;
the optical computing chip is further configured to:
optically comparing the first to-be-detected sub-gene sequence with the first sub-reference gene sequence to obtain a second similarity; and
optically comparing the second to-be-detected sub-gene sequence with the first sub-reference gene sequence to obtain a third similarity.
13. The apparatus of claim 12, wherein the processor is further configured to:
when the second similarity is greater than the fourth threshold, recording the position of the first sub-reference gene sequence in the reference gene sequence.
14. A gene alignment apparatus as claimed in claim 12 or 13 wherein the processor is further configured to:
when the third similarity is greater than the third threshold and smaller than the fourth threshold, obtaining a first to-be-detected sub-gene sequence unit and a second to-be-detected sub-gene sequence unit according to the second to-be-detected sub-gene sequence, wherein partial bases of the first to-be-detected sub-gene sequence unit and the second to-be-detected sub-gene sequence unit are the same;
the optical computing chip is further configured to:
optically comparing the first to-be-detected sub-gene sequence unit with the first sub-reference gene sequence;
and optically comparing the second to-be-detected sub-gene sequence unit with the first sub-reference gene sequence.
15. A gene alignment apparatus as claimed in any one of claims 11 to 14, wherein the optical computing chip is further configured to:
optically comparing the gene sequence to be detected with a second sub-reference gene sequence in the plurality of sub-reference gene sequences;
optically comparing the gene sequence to be detected with a third sub-reference gene sequence in the plurality of sub-reference gene sequences, wherein the third sub-reference gene sequence is a sub-reference gene sequence continuous with the second sub-reference gene sequence;
the processor is further configured to:
determining that the sum of a fourth similarity between the gene sequence to be detected and the second sub-reference gene sequence and a fifth similarity between the gene sequence to be detected and the third sub-reference gene sequence is greater than the first threshold;
obtaining a fourth sub-reference gene sequence according to the second sub-reference gene sequence and the third sub-reference gene sequence, wherein the fourth sub-reference gene sequence comprises partial bases of the second sub-reference gene sequence and partial bases of the third sub-reference gene sequence;
and inputting the gene sequence to be detected and the fourth sub-reference gene sequence into the optical computing chip for optical comparison.
16. The apparatus of claim 10, wherein the processor is further configured to:
determining that a second gene segment in the first group of gene segments is matched with the gene sequence to be detected according to the output result of the optical computing chip;
recording the position of the second gene segment in the reference gene sequence.
17. The apparatus of claim 11, wherein the processor is further configured to:
optically encoding the gene sequence to be detected and the plurality of reference gene segments in the first group of gene segments respectively;
and respectively inputting the optical code of the gene sequence to be detected and the optical codes of the reference gene segments in the first group of gene segments into the optical computing chip for optical comparison.
18. A gene alignment apparatus as claimed in any one of claims 11 to 17 wherein the processor is configured to:
and acquiring the first group of gene fragments from the database according to the first m bases and the last n bases of the gene sequence to be detected, wherein the value of m and the value of n are both greater than 0, and the sum of m and n is less than the number of bases in the gene sequence to be detected.
19. An alignment apparatus, comprising:
a processor, configured to acquire a first group of reference objects from a database according to a first object to be matched, wherein the first group of reference objects comprises a plurality of reference objects having the same partial characteristics as the first object;
and an optical computing chip, connected to the processor and configured to optically compare the first object with the plurality of reference objects.
20. The alignment device of claim 19, wherein the processor is further configured to:
according to the output result of the optical computing chip, determining that the similarity between the first object and a first reference object in the first group of reference objects is smaller than a first threshold value and larger than a second threshold value;
obtaining a plurality of sub-reference objects according to a standard object, wherein each sub-reference object is a part of the standard object;
the optical computing chip is further configured to: and optically comparing the first object with a first sub-reference object in the plurality of sub-reference objects to obtain a first similarity between the first object and the first sub-reference object.
21. The alignment device of claim 19, wherein the processor is further configured to:
determining that the first similarity is greater than a third threshold and less than a fourth threshold, wherein the fourth threshold is not greater than the first threshold;
in response to the determination, obtaining a first sub-object and a second sub-object according to the first object, wherein partial data of the first sub-object and the second sub-object are the same;
the optical computing chip is further configured to:
optically comparing the first sub-object with the first sub-reference object to obtain a second similarity; and
optically comparing the second sub-object with the first sub-reference object to obtain a third similarity.
22. The alignment device of claim 21, wherein the processor is further configured to:
when the second similarity is greater than the fourth threshold, recording the position of the first sub-reference object in the standard object.
23. A computer program product comprising program code comprising instructions for execution by a computer to perform a gene alignment method as claimed in any one of claims 1 to 9.
24. A computer readable storage medium comprising computer program instructions which, when run on a computer, cause the computer to perform a gene alignment method as claimed in any one of claims 1 to 9.
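
The control flow recited in claims 1 to 4 and 9 can be illustrated with a minimal software sketch. It is not the claimed implementation: the optical computing chip is replaced by a plain Python function (`similarity`) that returns the fraction of matching bases, and the exact-match lookup on the first m and last n bases, the fixed fragment length, and all function names and threshold values are assumptions made only for illustration.

```python
from collections import defaultdict


def similarity(a: str, b: str) -> float:
    """Software stand-in (assumption) for the optical comparison:
    fraction of aligned positions whose bases are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def build_index(reference: str, frag_len: int, m: int, n: int) -> dict:
    """Hypothetical gene database: every reference fragment of length
    frag_len, keyed by its first m and last n bases (m, n > 0)."""
    index = defaultdict(list)
    for pos in range(len(reference) - frag_len + 1):
        frag = reference[pos:pos + frag_len]
        index[(frag[:m], frag[-n:])].append(pos)
    return index


def candidate_positions(index: dict, query: str, m: int, n: int) -> list:
    """Claim 9 style pre-filter: keep fragments whose first m and last n
    bases equal those of the sequence to be detected."""
    return index.get((query[:m], query[-n:]), [])


def triage(score: float, low: float, high: float) -> str:
    """Threshold decision: above high is a match, between low and high
    calls for finer comparison, anything else is rejected."""
    if score > high:
        return "match"
    if score > low:
        return "refine"
    return "reject"


def split_with_overlap(seq: str, overlap: int) -> tuple:
    """Claim 3 style split: two sub-sequences that share `overlap` bases."""
    half = len(seq) // 2
    return seq[:half + overlap], seq[half:]


if __name__ == "__main__":
    reference = "TTTTACGTACGTGGGG"
    query = "ACGTACGT"
    index = build_index(reference, frag_len=len(query), m=2, n=2)
    for pos in candidate_positions(index, query, m=2, n=2):
        fragment = reference[pos:pos + len(query)]
        score = similarity(query, fragment)
        print(pos, score, triage(score, low=0.5, high=0.9))
```

Only the candidates that survive the prefix and suffix filter are passed to the comparison step, which mirrors the claims' approach of narrowing the reference fragments before optical comparison. A "refine" outcome is where the overlapping split of the sequence to be detected would be applied.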
CN201911046513.5A 2019-08-02 2019-10-30 Gene comparison technology Pending CN112309501A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/CN2020/106498 WO2021023142A1 (en) 2019-08-02 2020-08-03 Gene alignment technique
JP2022506634A JP7286872B2 (en) 2019-08-02 2020-08-03 Gene alignment technology
EP20849621.6A EP4006908A4 (en) 2019-08-02 2020-08-03 Gene alignment technique
US17/587,507 US20220238185A1 (en) 2019-08-02 2022-01-28 Gene Alignment Technology

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019107136895 2019-08-02
CN201910713689 2019-08-02

Publications (1)

Publication Number Publication Date
CN112309501A true CN112309501A (en) 2021-02-02

Family

ID=74486806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911046513.5A Pending CN112309501A (en) 2019-08-02 2019-10-30 Gene comparison technology

Country Status (5)

Country Link
US (1) US20220238185A1 (en)
EP (1) EP4006908A4 (en)
JP (1) JP7286872B2 (en)
CN (1) CN112309501A (en)
WO (1) WO2021023142A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4892408A (en) * 1988-03-03 1990-01-09 Grumman Aerospace Corporation Reference input patterns for evaluation and alignment of an optical matched filter correlator
US20130091121A1 (en) * 2011-08-09 2013-04-11 Vitaly L. GALINSKY Method for rapid assessment of similarity between sequences
CN104217134A (en) * 2013-05-29 2014-12-17 诺布里斯股份有限公司 Systems and methods for SNP analysis and genome sequencing
CN107533589A (en) * 2015-03-17 2018-01-02 新加坡科技研究局 Bioinformatic data processing system
CN108604260A (en) * 2016-01-11 2018-09-28 艾迪科基因组公司 For scene or the genomics architecture of DNA based on cloud and RNA processing and analysis
CN109690359A (en) * 2016-04-22 2019-04-26 伊鲁米那股份有限公司 Equipment and constituent used in the luminescence imaging in multiple sites in pixel based on photon structure and the method using it

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150353921A9 (en) * 2012-04-16 2015-12-10 Jingdong Tian Method of on-chip nucleic acid molecule synthesis
CN107653299A (en) * 2016-07-23 2018-02-02 成都十洲科技有限公司 A kind of acquisition methods of the gene chip probes sequence based on high-flux sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FERESHTE MOZAFARI等: "Speeding up DNA sequence alignment by optical correlator", OPTICS AND LASER TECHNOLOGY, vol. 108, pages 124 - 135, XP085443749, DOI: 10.1016/j.optlastec.2018.06.027 *
N. BROUSSEAU等: "Analysis of DNA sequences by an optical time-integrating correlator", APPLIED OPTICS, vol. 31, no. 23, pages 4802 - 4815, XP000294341, DOI: 10.1364/AO.31.004802 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268461A (en) * 2021-07-19 2021-08-17 广州嘉检医学检测有限公司 Method and device for gene sequencing data recombination packaging
CN113268461B (en) * 2021-07-19 2021-09-17 广州嘉检医学检测有限公司 Method and device for gene sequencing data recombination packaging

Also Published As

Publication number Publication date
JP7286872B2 (en) 2023-06-05
US20220238185A1 (en) 2022-07-28
EP4006908A1 (en) 2022-06-01
JP2022543094A (en) 2022-10-07
EP4006908A4 (en) 2022-08-31
WO2021023142A1 (en) 2021-02-11

Similar Documents

Publication Publication Date Title
CN107609350B (en) Data processing method of second-generation sequencing data analysis platform
US20070005556A1 (en) Probabilistic techniques for detecting duplicate tuples
EP2390810B1 (en) Taxonomic classification of metagenomic sequences
JP2007093582A (en) Automatic detection of quality spectrum
US11302419B2 (en) Method and system for DNA sequence alignment
CN105760706A (en) Compression method for next generation sequencing data
US11829376B2 (en) Technologies for refining stochastic similarity search candidates
US20200342531A1 (en) Cryptocurrency mining selection system and method
US20150248430A1 (en) Efficient encoding and storage and retrieval of genomic data
US10346073B2 (en) Storage control apparatus for selecting member disks to construct new raid group
US20220238185A1 (en) Gene Alignment Technology
JP2006317457A (en) Automatic detection of quality spectrum
JPWO2016114009A1 (en) Fusion gene analysis apparatus, fusion gene analysis method, and program
KR20200131733A (en) Parallelizable sequence alignment systems and methods
US20180239866A1 (en) Prediction of genetic trait expression using data analytics
Zhao et al. ANN softmax: Acceleration of extreme classification training
CN110648719B (en) Local structure gastric cancer drug-resistant lncRNA secondary structure prediction method based on energy and probability
Strzoda et al. A mapping-free NLP-based technique for sequence search in Nanopore long-reads
Jia et al. ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems
KR20200138821A (en) Determination of the frequency distribution of nucleotide sequence variations
KR100537636B1 (en) Apparatus for predicting transcription factor binding sites based on similar sequences and method thereof
Vetro et al. TIDE: Inter-chromosomal translocation and insertion detection using embeddings
CN115527612B (en) Genome second-fourth generation fusion assembly method and system based on numerical characteristic expression
US20190102515A1 (en) Method and device for decoding data segments derived from oligonucleotides and related sequencer
WO2024124453A1 (en) Base calling model training method and system, base calling model identification method and system, and device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination