CN114694753B - Nucleic acid sequence comparison method, device, equipment and readable storage medium

Info

Publication number: CN114694753B
Application number: CN202210270358.0A
Authority: CN
Inventors: 王泰福; 张优劲; 杨姣博; 郑淇文; 贺增泉
Original assignee: Shenzhen Huada Medical Laboratory
Current assignee: Shenzhen Huada Medical Laboratory
Priority date: 2022-03-18
Filing date: 2022-03-18
Publication date: 2023-04-07
Anticipated expiration: 2042-03-18
Also published as: CN114694753A

Abstract

The application discloses a nucleic acid sequence alignment method, apparatus, device and readable storage medium. The method comprises: obtaining a nucleic acid to be detected and determining the reference genome corresponding to the nucleic acid to be detected; judging whether the reference genome has an allocated video memory block used for loading the reference short sequences of the reference genome; if so, directly reading the reference short sequences from the video memory block and comparing the short sequences of the nucleic acid to be detected with the read reference short sequences; and determining the sequence position of each short sequence of the nucleic acid to be detected on the reference genome according to the comparison result. Because the method reads the reference short sequences directly from an existing video memory block, it avoids repeatedly allocating video memory blocks and reloading the reference short sequences. Compared with the prior art, in which a graphics processor must be allocated and the reference short sequences loaded again for each nucleic acid to be detected, the method shares video memory blocks and makes full use of the hardware resources of the graphics processor.

Description

Nucleic acid sequence comparison method, device, equipment and readable storage medium

Technical Field

The present application relates to the field of gene detection technology, and more particularly, to a method, an apparatus, a device and a readable storage medium for nucleic acid sequence alignment.

Background

With the large-scale growth of biological gene sequence databases, gene sequencing data are also growing explosively, and nucleic acid sequence alignment demands ever more computing power. The computing performance of a traditional CPU can hardly meet this demand, so the Graphics Processing Unit (GPU for short) has been applied to nucleic acid sequence alignment.

A gene sequence alignment program needs to run on a GPU. In the prior art, one GPU is generally allocated to one alignment program, so when there are multiple alignment programs, multiple GPUs must be allocated to meet their operation requirements. However, a single alignment program usually does not need all the resources of a GPU, so part of the GPU's resources sit idle and the surplus hardware resources cannot be fully utilized.

Therefore, how to fully utilize the hardware resources of the GPU when running a nucleic acid sequence alignment program is a problem worth addressing.

Disclosure of Invention

In view of the above, the present application provides a method, an apparatus, a device and a readable storage medium for nucleic acid sequence alignment, which are used to fully utilize hardware resources of a GPU during running a nucleic acid sequence alignment program.

In order to achieve the above object, the proposed solution is as follows:

a method of nucleic acid sequence alignment comprising:

obtaining a nucleic acid to be detected, wherein the nucleic acid to be detected comprises a plurality of short sequences to be detected;

determining a reference genome corresponding to the nucleic acid to be detected, wherein the reference genome comprises a plurality of reference short sequences;

judging whether the reference genome has an allocated video memory block, wherein the video memory block is a part of video memory of a graphic processor and is used for loading each reference short sequence of the reference genome;

if so, reading each reference short sequence of the reference genome loaded in the video memory block;

comparing each short sequence to be detected with each reference short sequence read from the video memory block to obtain a comparison result;

and determining the sequence position of each short sequence to be detected on the reference genome according to the comparison result.

Preferably, if the reference genome has no allocated video memory block, the method further includes:

searching for a free video memory block sufficient to load each reference short sequence of the reference genome;

loading each reference short sequence of the reference genome into the free video memory block.

Preferably, the nucleic acid to be tested is contained in a first containerization service;

the reference genome is included in a second containerization service.

Preferably, there are a plurality of said reference genomes, each of said reference genomes being different;

the judging whether the reference genome has an allocated video memory block includes:

judging, for each reference genome, whether the reference genome has an allocated video memory block.

Preferably, the comparing each short sequence to be detected with each reference short sequence read from the video memory block includes:

comparing each short sequence to be detected with the reference short sequences of the different reference genomes read from the different video memory blocks, respectively.

Preferably, the method further comprises the following steps:

receiving a deletion instruction sent by the first containerization service program, and deleting the video memory block allocated to the reference genome corresponding to the nucleic acid to be detected contained in the first containerization service program.

An apparatus for nucleic acid sequence alignment comprising:

a to-be-detected nucleic acid obtaining unit, configured to obtain a nucleic acid to be detected, where the nucleic acid to be detected includes a plurality of short sequences to be detected;

a reference genome acquisition unit, configured to determine a reference genome corresponding to the test nucleic acid, where the reference genome includes a plurality of reference short sequences;

a video memory block allocation judging unit, configured to judge whether the reference genome has an allocated video memory block, where the video memory block is a part of a video memory of a graphics processor and is used to load each reference short sequence of the reference genome;

a short sequence reading unit, configured to read, if the allocated video memory block exists in the reference genome, each reference short sequence of the reference genome loaded in the video memory block;

a short sequence comparison unit, configured to compare each short sequence to be detected with each reference short sequence read from the video memory block to obtain a comparison result;

and the short sequence position determining unit is used for determining the sequence position of each short sequence to be detected on the reference genome according to the comparison result.

Preferably, the apparatus further comprises:

a video memory block searching unit, configured to search, if the reference genome has no allocated video memory block, for a free video memory block sufficient to load each reference short sequence of the reference genome;

a short sequence loading unit, configured to load each reference short sequence of the reference genome into the free video memory block.

A nucleic acid sequence alignment apparatus comprising a memory and a processor;

the memory is used for storing programs;

the processor is used for executing the program to realize the steps of the nucleic acid sequence alignment method.

A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-described nucleic acid sequence alignment method.

According to the scheme, the nucleic acid sequence comparison method provided by the application comprises the steps of firstly obtaining nucleic acid to be detected and determining a reference genome corresponding to the nucleic acid to be detected, further judging whether the reference genome has a video memory block which is allocated and used for loading a reference short sequence of the reference genome, if so, directly reading the reference short sequence of the video memory block, comparing the short sequence of the nucleic acid to be detected with the read reference short sequence, and obtaining a comparison result, so that the sequence position of each short sequence of the nucleic acid to be detected on the reference genome can be determined according to the comparison result.

Thus, if a video memory block loaded with the reference short sequences already exists, the method can read the reference short sequences directly from that video memory block for nucleic acid sequence alignment. Compared with the prior art, in which a graphics processor must be allocated and the reference short sequences loaded again for each nucleic acid to be detected, the method avoids repeatedly allocating video memory blocks and repeatedly loading reference short sequences, shares video memory blocks, and makes full use of the hardware resources of the graphics processor.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for aligning nucleic acid sequences according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart of another method for aligning nucleic acid sequences according to the present disclosure;

FIG. 3 is a diagram illustrating an example of an alignment scenario for nucleic acid sequences provided in an embodiment of the present application;

FIG. 4 is a schematic diagram of a nucleic acid sequence alignment apparatus disclosed in the examples of the present application;

FIG. 5 is a block diagram of a hardware configuration of a nucleic acid sequence alignment apparatus disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.

Referring to fig. 1, fig. 1 is a schematic flow chart of a method for aligning nucleic acid sequences provided in an embodiment of the present application, the method comprising:

step S100: obtaining a nucleic acid to be detected, wherein the nucleic acid to be detected comprises a plurality of short sequences to be detected.

Specifically, the short sequences to be detected of the nucleic acid to be detected may come from the same ordered genome. Each short sequence to be detected may be drawn at random from some position in the original genome, so the relative order of the short sequences to be detected may have been lost.

Step S110: and determining a reference genome corresponding to the nucleic acid to be detected, wherein the reference genome comprises a plurality of reference short sequences.

Specifically, the reference genome corresponding to the nucleic acid to be detected may be specified as needed by gene sequencing personnel, and the relative order of the reference short sequences of the reference genome is clear and definite.

Step S120: judging whether the reference genome has an allocated video memory block; if so, performing step S130.

In particular, the video memory block may be a partial video memory of the graphics processor, which may be used to load the respective reference short sequences of the reference genome.

In addition, the size of the video memory space of the allocated video memory block may be just enough to load each reference short sequence of the reference genome, and if a certain video memory region in the graphics processor is loaded with each reference short sequence of the reference genome, the video memory region may be considered as the video memory block allocated to the reference genome.

Step S130: reading each reference short sequence of the reference genome loaded in the video memory block.

Specifically, the reference short sequences and their relative order can be read from the video memory block allocated to the reference genome.

Step S140: comparing each short sequence to be detected with each reference short sequence read from the video memory block to obtain a comparison result.

Specifically, each short sequence to be detected may be compared with each reference short sequence of the reference genome corresponding to the nucleic acid to be detected read from the video memory block, so as to obtain a comparison result of each short sequence to be detected.

Step S150: and determining the sequence position of each short sequence to be detected on the reference genome according to the comparison result.

Specifically, the sequence position of each short sequence to be detected on the reference genome can be determined from its comparison result. The comparison results of all short sequences to be detected can then be combined to determine the relative order of the short sequences to be detected.
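Steps S140 and S150 can be illustrated with a minimal Python sketch. It is not the alignment algorithm of the application, which the text does not specify: exact substring matching stands in for a real aligner, plain Python lists stand in for video memory, and the names `align_reads` and `order_reads` are hypothetical. A position is represented as a pair (index of the reference short sequence, offset within it), which is also an assumption.

```python
def align_reads(reads, reference_short_sequences):
    """Step S140 (simplified): find each read by exact match in the reference.

    Returns a dict mapping each read to (segment index, offset), or None
    when the read does not occur in any reference short sequence.
    """
    positions = {}
    for read in reads:
        positions[read] = None
        for index, segment in enumerate(reference_short_sequences):
            offset = segment.find(read)
            if offset != -1:
                positions[read] = (index, offset)
                break
    return positions


def order_reads(positions):
    """Step S150 (simplified): recover the relative order of the aligned reads."""
    aligned = [(pos, read) for read, pos in positions.items() if pos is not None]
    return [read for _, read in sorted(aligned)]


reference = ["ACGTACGT", "TTGACCAG"]
positions = align_reads(["GACC", "GTAC", "AAAA", "ACGT"], reference)
ordered = order_reads(positions)
```

Here the reads arrive in arbitrary order, and sorting by aligned position restores their order on the reference; the read `AAAA` does not match and is left unplaced.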

As can be seen from the above solution, the nucleic acid sequence alignment method provided in the embodiments of the present application can directly read the reference short sequences from an existing video memory block and compare them with the short sequences to be detected. The reference short sequences loaded into a video memory block can be compared with the short sequences to be detected of multiple different nucleic acids to be detected; that is, multiple different nucleic acids to be detected can share the reference genome held in one video memory block. This avoids repeatedly allocating video memory blocks and repeatedly loading reference short sequences, and makes full use of the hardware resources of the graphics processor.
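The sharing effect can be sketched as follows. This is only a host-side illustration under stated assumptions: the class `SharedBlockRegistry` and its methods are hypothetical names, and an in-process dictionary stands in for the bookkeeping of video memory blocks on a real graphics processor.

```python
class SharedBlockRegistry:
    """Hypothetical record of which reference genome has an allocated block."""

    def __init__(self):
        self._blocks = {}     # genome id -> loaded reference short sequences
        self.load_count = 0   # how many times a genome was loaded

    def has_block(self, genome_id):
        # Corresponds to the judgment in step S120.
        return genome_id in self._blocks

    def load(self, genome_id, reference_short_sequences):
        self._blocks[genome_id] = tuple(reference_short_sequences)
        self.load_count += 1

    def read(self, genome_id):
        # Corresponds to step S130: read from the existing block.
        return self._blocks[genome_id]


registry = SharedBlockRegistry()
registry.load("x", ["ACGTACGT", "TTGACCAG"])

# Two different nucleic acids to be detected use the same reference genome.
views = {}
for sample in ("a", "b"):
    if registry.has_block("x"):
        views[sample] = registry.read("x")
```

Both samples receive the same loaded object, and the reference genome is loaded only once regardless of how many samples are aligned against it.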

The foregoing embodiment describes step S120, judging whether the reference genome has an allocated video memory block; if it does, step S130 may be executed. However, since not every reference genome has an allocated video memory block, the present embodiment may further include a process of allocating a video memory block to the reference genome when the judgment in step S120 finds that the reference genome has no allocated video memory block.

Specifically, the process may include the steps of:

S1, searching for a free video memory block sufficient to load each reference short sequence of the reference genome.

Specifically, a video memory block sufficient to load each reference short sequence of the reference genome can be searched for in the unallocated video memory space.

S2, loading each reference short sequence of the reference genome into the free video memory block.

Specifically, the video memory block found in step S1 may be allocated to the reference genome, and each reference short sequence of the reference genome may be loaded into the video memory block.

As can be seen from the above solution, for a reference genome that has no allocated video memory block, the embodiment of the present application allocates a video memory block sufficient to load its reference short sequences, instead of allocating a whole graphics processor to each reference genome as in the prior art. Even if a graphics processor already holds video memory blocks loaded with the reference short sequences of other reference genomes, a video memory block can still be allocated on that graphics processor to the reference genome in this step, as long as the remaining video memory space of the graphics processor is sufficient to load its reference short sequences. One graphics processor can therefore load multiple reference genomes, which makes full use of the hardware resources of the graphics processor.
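A minimal sketch of steps S1 and S2 is given below. The application does not state a search strategy, so a first-fit search over graphics processors is assumed here; the `Gpu` class, the `allocate_block` function, and the cost of one byte per base are likewise assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """Hypothetical model of one graphics processor's video memory, in bytes."""
    capacity: int
    blocks: dict = field(default_factory=dict)  # genome id -> block size

    def free_bytes(self):
        return self.capacity - sum(self.blocks.values())


def allocate_block(gpus, genome_id, reference_short_sequences):
    """Find free video memory that fits the genome and assign a block to it.

    Returns the index of the graphics processor holding the block, or None
    when no graphics processor has enough free video memory.
    """
    needed = sum(len(seq) for seq in reference_short_sequences)  # assumed 1 byte per base
    for index, gpu in enumerate(gpus):
        if genome_id in gpu.blocks:
            return index  # a block is already allocated to this genome
    for index, gpu in enumerate(gpus):
        if gpu.free_bytes() >= needed:  # S1: first graphics processor that fits
            gpu.blocks[genome_id] = needed  # S2: the block now holds the genome
            return index
    return None


gpus = [Gpu(capacity=20), Gpu(capacity=20)]
first = allocate_block(gpus, "x", ["ACGTACGT", "TTGACCAG"])  # needs 16 bytes
second = allocate_block(gpus, "y", ["ACGTAC"])               # needs 6 bytes
third = allocate_block(gpus, "z", ["ACG"])                   # needs 3 bytes
too_big = allocate_block(gpus, "w", ["A" * 30])              # fits nowhere
```

In this example genomes x and z end up on the same graphics processor, which is the situation the paragraph above describes: a graphics processor that already holds one reference genome still accepts another as long as its remaining video memory is sufficient.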

In other embodiments of the present application, there may be multiple different reference genomes corresponding to a single test nucleic acid, which will be described with reference to FIG. 2.

Specifically, for each reference genome, it can be determined whether there is an allocated video memory block.

For a reference genome that has an allocated video memory block, the alignment process with the nucleic acid to be detected can refer to steps S100 to S150 described in the previous embodiment, and will not be described herein again.

For a reference genome that has no allocated video memory block, step S160 may be performed after step S120: searching for a free video memory block sufficient to load each reference short sequence of the reference genome, and loading each reference short sequence of the reference genome into the free video memory block. Step S160 may refer to the process of allocating a video memory block to the reference genome described in the foregoing embodiment.

After a video memory block has been allocated to the reference genome, the reference genome has an allocated video memory block, so step S130 and the subsequent steps can be performed as for a reference genome whose video memory block was already allocated.

Considering that the nucleic acid to be detected may need to be aligned with a plurality of different reference genomes, the process in step S140 of comparing each short sequence to be detected with each reference short sequence read from the video memory block is further described below for the case in which the nucleic acid to be detected corresponds to a plurality of reference genomes.

Specifically, each short sequence to be detected can be compared with the reference short sequences of the different reference genomes read from the different video memory blocks. A comparison result between the nucleic acid to be detected and each reference genome is thereby obtained, and the sequence position of each short sequence to be detected on each of the different reference genomes can be determined from the corresponding comparison result.

As can be seen from the above scheme, when the nucleic acid to be detected needs to be aligned with multiple reference genomes, video memory blocks can be allocated to the different reference genomes, so that the nucleic acid to be detected can be aligned with each reference genome independently.
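The multi-genome case can be sketched in the same simplified style. The function names `find_read` and `align_against_genomes` are hypothetical, exact matching again stands in for a real aligner, and the `blocks` dictionary of plain lists stands in for separate video memory blocks.

```python
def find_read(read, reference_short_sequences):
    """Exact-match stand-in for a real aligner: (segment index, offset) or None."""
    for index, segment in enumerate(reference_short_sequences):
        offset = segment.find(read)
        if offset != -1:
            return (index, offset)
    return None


def align_against_genomes(reads, blocks, genome_ids):
    """Align every read against each requested reference genome independently.

    `blocks` maps a genome id to the reference short sequences held in that
    genome's video memory block (plain lists here, not real video memory).
    """
    results = {}
    for genome_id in genome_ids:
        reference = blocks[genome_id]  # a different block per reference genome
        results[genome_id] = {read: find_read(read, reference) for read in reads}
    return results


blocks = {"x": ["ACGTACGT", "TTGACCAG"], "y": ["GGGTACAA"]}
results = align_against_genomes(["GTAC", "CCAG"], blocks, ["x", "y"])
```

Each reference genome yields its own comparison result, so a read such as `CCAG` can have a position on genome x while remaining unplaced on genome y.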

In order to rapidly deploy the alignment program for the test nucleic acids, in other embodiments of the present application, each test short sequence of the test nucleic acid can be contained in a first containerization service and each reference short sequence of the reference genome can be contained in a second containerization service.

Specifically, referring to fig. 3, fig. 3 shows an exemplary scenario in which both the test nucleic acid and the reference genome are contained in the containerization service.

The reference genome corresponding to both nucleic acid a to be detected and nucleic acid b to be detected is reference genome x. Video memory block 1 is the video memory block allocated to reference genome x and is used for loading each reference short sequence of reference genome x.

A first containerization service program A containing nucleic acid a to be detected, a first containerization service program B containing nucleic acid b to be detected, and a second containerization service program M containing reference genome x each bundle the dependencies they require. The containerization service programs can therefore be deployed without being affected by the operating system or computing environment, which achieves rapid deployment.

After deployment is completed, the reference short sequences of reference genome x loaded in video memory block 1 can be read and compared with the short sequences to be detected of nucleic acid a to be detected and of nucleic acid b to be detected, respectively. The sequence position of each short sequence to be detected of nucleic acid a on reference genome x and the sequence position of each short sequence to be detected of nucleic acid b on reference genome x can then be determined from the obtained comparison results.

In addition, some reference genomes may not be used as the comparison object of any nucleic acid to be detected for a long time, so the video memory blocks allocated to such reference genomes may stay idle for a long time and waste video memory space. Alternatively, a large number of different nucleic acids to be detected may need to be aligned with different reference genomes, resulting in a large number of video memory block allocation requests. Therefore, in order to release video memory space, the nucleic acid sequence alignment method provided in the embodiment of the present application may further include receiving a deletion instruction sent by the first containerization service program and deleting the video memory block allocated to the reference genome corresponding to the nucleic acid to be detected contained in the first containerization service program. After the video memory block is deleted, the corresponding video memory space is released, becomes idle, and can be reallocated.
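The effect of a deletion instruction on the video memory bookkeeping can be sketched as follows. The `VideoMemory` class and the `handle_delete_instruction` function are hypothetical, and the sketch only tracks block sizes; how the instruction is transmitted between containerization service programs is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class VideoMemory:
    """Hypothetical bookkeeping for one graphics processor's video memory."""
    capacity: int
    blocks: dict = field(default_factory=dict)  # genome id -> block size in bytes

    def free_bytes(self):
        return self.capacity - sum(self.blocks.values())


def handle_delete_instruction(memory, genome_id):
    """Release the block allocated to `genome_id` and return the bytes freed.

    Returns 0 when the genome has no allocated block.
    """
    return memory.blocks.pop(genome_id, 0)


memory = VideoMemory(capacity=32, blocks={"x": 16, "y": 8})
free_before = memory.free_bytes()
freed = handle_delete_instruction(memory, "x")
free_after = memory.free_bytes()
freed_again = handle_delete_instruction(memory, "x")
```

After the deletion, the space that held reference genome x counts as free again and is available for allocation to another reference genome.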

As can be seen from the above scheme, the service program containing the nucleic acid to be detected and the service program containing the reference genome are both deployed as containerized service programs. The nucleic acid to be detected in the first containerization service program can therefore be obtained quickly, the reference genome in the second containerization service program can be loaded quickly into the video memory block, and the efficiency of comparing the short sequences to be detected of the nucleic acid to be detected with the reference short sequences of the reference genome using the video memory block is improved.

In addition, the first containerized service program and the second containerized service program which are containerized and deployed comprise computing environments required by the respective service programs, are comprehensive in functions, and can be rapidly deployed or transplanted to various operating systems or computing environments.

The following describes the nucleic acid sequence alignment apparatus provided in the embodiments of the present application, and the nucleic acid sequence alignment apparatus described below and the nucleic acid sequence alignment method described above can be referred to correspondingly.

First, referring to fig. 4, the nucleic acid sequence alignment apparatus will be described, and as shown in fig. 4, the nucleic acid sequence alignment apparatus may include:

a nucleic acid to be detected acquisition unit 100 configured to acquire a nucleic acid to be detected, where the nucleic acid to be detected includes a plurality of short sequences to be detected;

a reference genome obtaining unit 110, configured to determine a reference genome corresponding to the nucleic acid to be detected, where the reference genome includes a plurality of reference short sequences;

a video memory block allocation determining unit 120, configured to determine whether the reference genome has an allocated video memory block, where the video memory block is a part of the video memory of a graphics processor and is used to load each reference short sequence of the reference genome;

a short sequence reading unit 130, configured to read each reference short sequence of the reference genome loaded in the video memory block if the allocated video memory block exists in the reference genome;

a short sequence comparison unit 140, configured to compare each short sequence to be detected with each reference short sequence read from the video memory block, so as to obtain a comparison result;

a short sequence position determining unit 150, configured to determine, according to the alignment result, a sequence position of each short sequence to be detected on the reference genome.

Optionally, the nucleic acid sequence alignment apparatus may further comprise:

a video memory block searching unit, configured to search, if there is no allocated video memory block in the reference genome, an idle video memory block that is sufficient to load each reference short sequence of the reference genome;

Optionally, the nucleic acid to be tested is contained in a first containerization service program;

the reference genome is included in a second containerization service.

Optionally, there are a plurality of the reference genomes, and each of the reference genomes is not the same;

the video memory block allocation determining unit may be configured to:

judge, for each reference genome, whether the reference genome has an allocated video memory block.

Optionally, the short sequence comparison unit may include:

a short sequence comparison subunit, configured to compare each short sequence to be detected with the reference short sequences of the different reference genomes read from the different video memory blocks, respectively.

Optionally, the nucleic acid sequence alignment apparatus may further comprise:

a video memory block deleting unit, configured to receive a deletion instruction sent by the first containerization service program and delete the video memory block allocated to the reference genome corresponding to the nucleic acid to be detected contained in the first containerization service program.

The nucleic acid sequence alignment device provided by the embodiment of the application can be applied to nucleic acid sequence alignment equipment. Fig. 5 shows a block diagram of a hardware configuration of a nucleic acid sequence alignment apparatus, and referring to fig. 5, the hardware configuration of the nucleic acid sequence alignment apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;

the processor 1 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;

the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;

wherein the memory stores a program and the processor can call the program stored in the memory, the program for:

judging whether the reference genome has an allocated video memory block, wherein the video memory block is a part of the video memory of a graphics processor and is used for loading each reference short sequence of the reference genome;

Optionally, the detailed function and the extended function of the program may refer to the above description.

An embodiment of the present application further provides a storage medium, where the storage medium may store a program adapted to be executed by a processor, where the program is configured to:

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of nucleic acid sequence alignment comprising:

judging whether the reference genome has an allocated video memory block, wherein the video memory block is a part of the video memory of a graphics processor and is used for loading each reference short sequence of the reference genome; the size of the video memory space of the video memory block is sufficient to load each reference short sequence of the reference genome;

2. The method of claim 1, wherein if the reference genome does not have an allocated video memory block, further comprising:

3. The method of claim 1, wherein the test nucleic acid is contained in a first containerization service;

the reference genome is included in a second containerization service.

4. The method of claim 1, wherein there are a plurality of said reference genomes, each of said reference genomes being different;

and judging whether each reference genome has an allocated video memory block.

5. The method according to claim 4, wherein the comparing each short sequence to be detected with each reference short sequence read from the video memory block comprises:

6. The method of claim 3, further comprising:

7. An apparatus for aligning nucleic acid sequences, comprising:

a reference genome obtaining unit, configured to determine a reference genome corresponding to the nucleic acid to be detected, where the reference genome includes a plurality of reference short sequences;

a video memory block allocation judging unit, configured to judge whether the reference genome has an allocated video memory block, where the video memory block is a part of the video memory of a graphics processor and is used to load each reference short sequence of the reference genome; the size of the video memory space of the video memory block is sufficient to load each reference short sequence of the reference genome;

a short sequence reading unit, configured to read each reference short sequence of the reference genome loaded in the video memory block if the allocated video memory block exists in the reference genome;

8. The apparatus of claim 7, further comprising:

9. A nucleic acid sequence alignment apparatus comprising a memory and a processor;

the memory is used for storing programs;

the processor, for executing the program, to carry out the steps of the nucleic acid sequence alignment method according to any one of claims 1 to 6.

10. A readable storage medium having stored thereon a computer program for carrying out the steps of the method of nucleic acid sequence alignment according to any one of claims 1 to 6 when executed by a processor.