AU2024201174A1

AU2024201174A1 - Shared memory based gene analysis method, apparatus and computer device

Info

Publication number: AU2024201174A1
Application number: AU2024201174A
Authority: AU
Inventors: Zengquan HE; Chao SONG; Jin’an WANG; Jiaobo YANG; Chuang YU; Youjin Zhang
Original assignee: BGI Genomics Co Ltd; Bgi Health HK Co Ltd
Current assignee: BGI Genomics Co Ltd; Bgi Health HK Co Ltd
Priority date: 2020-10-22
Filing date: 2024-02-22
Publication date: 2024-03-14
Also published as: EP4235679A1; JP2023512610A; EP4235679A8; JP7344996B2; IL289071A; AU2020457044A1

Abstract

A shared memory based gene analysis method, apparatus and computer device. The method comprises: reading sample data and preprocessing the sample data; performing a gene analysis on the sample data preprocessed, and determining whether a required library file in the gene analysis is in a gene shared memory; if yes, obtaining the required library file from the gene shared memory, mapping the required library file to a process of the gene analysis of the sample data preprocessed, and completing a corresponding analysis. In this method, a shared memory mechanism is adopted to establish indexes for the gene analysis. Whether a library file that is frequently used in the gene analysis process is in the gene shared memory is determined; if yes, the library file can be obtained from the gene shared memory and conveniently mapped to the analysis process performed on the sample data. The method can greatly reduce the time and I/O occupation for loading the library file from a hard disk. Therefore, the efficiency of analysis can be improved.

Description

SHARED MEMORY BASED GENE ANALYSIS METHOD, APPARATUS AND COMPUTER DEVICE

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This is a divisional of Australian Patent Application No.

2020457044, the originally-filed specification of which is

incorporated herein by reference in its entirety.

TECHNICAL FIELD

[0002] The present disclosure relates to the technical field of data

processing, in particular to a shared memory based gene analysis

method, apparatus, computer device, and a computer-readable storage

medium.

BACKGROUND

[0003] With the smooth implementation of the Human Genome Project

and the rapid development of sequencing technology, the cost of

sequencing has been significantly reduced, and the speed of sequencing

has been significantly improved. The cost of the sequencing of human

whole genome has been reduced to less than $1000, and the amount of

DNA sequence data has increased exponentially. How to utilize and express the data quickly, then analyze and explain potential problems in gene sequences, and discover information beneficial to human beings from massive data has become an urgent problem to be solved. The growing number of applications of sequence data generated by human whole genome sequencing (WGS) and the continuous demand for rapid analysis and processing of massive sequence data have formed a new technical bottleneck for data analysis, which restricts the clinical application of second-generation sequencing technology.

[0004] At present, there are many kinds of methods and tools for data

analysis of the second-generation sequencing in the field of

bioinformatics internationally. The most commonly used process mainly

comprises an input of data, a preprocessing operation, a sequence

comparison, an annotation, a variant calling and a pathway analysis.

However, it is very time-consuming to apply the whole process in WGS.

In addition, samples input need customized processes such as merging

the samples, splitting the samples and so on which need to be performed

separately, so that the operation efficiency is low and the I/O

consumption is increased. In addition, in the process of data

analysis, index files should be loaded separately for each step of

analysis and processing. If multiple tasks load the same index file,

the tasks will consume more memory and take more time.

-2- IEE210846PAU

SUMMARY

[0005] In view of this, the disclosure provides a shared memory based

gene analysis method, apparatus, computer device, and a

computer-readable storage medium to solve a technical problem of the

low operation efficiency caused by the requirement of the processes

such as merging the input samples in some pipelines, the high memory

consumption and the high time consumption caused by loading of index

files repeatedly in the data analysis process in the prior art.

[0006] Some embodiments of this disclosure provide a shared memory

based gene analysis method, comprising: reading sample data and

preprocessing the sample data; performing a gene analysis on the

sample data preprocessed, and determining whether a required library

file in the gene analysis is in a gene shared memory; if yes, obtaining

the required library file from the gene shared memory, mapping the

required library file to a process of the gene analysis of the sample

data preprocessed, and completing a corresponding analysis.

[0007] Optionally, the method further comprises: determining whether

the required library file meets a load condition, in a case where

the required library file in the gene analysis is not in the gene

shared memory; and loading the required library file into the gene

shared memory, in a case where the loading condition is met.

[0008] Optionally, determining whether the required library file

meets a load condition, in a case where the required library file

in the gene analysis is not in the gene shared memory, and loading

the required library file into the gene shared memory, in a case where the loading condition is met comprises: acquiring information of the required library file and information of the gene shared memory, wherein the information of the required library file comprises a space required by the required library file and the number of historical load requests, and the information of the gene shared memory comprises a remaining space of the gene shared memory; and if the number of historical load requests is greater than a first preset number, and the space required by the required library file is less than the remaining space of the gene shared memory, loading the required library file into the gene shared memory.

[0009] Optionally, the information of the required library file

further comprises a load request frequency of the required library

file, the information of the gene shared memory further comprises

load request frequencies of all library files; determining whether

the required library file meets a load condition, and loading the

required library file into the gene shared memory, in a case where

the loading condition is met further comprises: if the number of

historical load requests is greater than the first preset number,

and the space required by the required library file is greater than

the remaining space of the gene shared memory, ranking the required

library file and the all library files in an order of priority

according to the load request frequency of the required library file

and the load request frequencies of the all library files to obtain

a load request frequency priority of each library file; if the load request frequency priority of the required library file is higher than that of a library file in the gene shared memory, and if the remaining space of the gene shared memory after deleting the library file with a lower load request frequency priority in the gene shared memory is greater than or equal to the space required by the required library file, deleting the library file with the lower load request frequency priority in the gene shared memory; and loading the required library file into the gene shared memory.
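
The load conditions of paragraphs [0008] and [0009] can be read as a small admission policy. The following Python sketch is illustrative only and is not the disclosed implementation: the names `LibraryInfo`, `admit` and `first_preset`, the use of a plain dictionary to stand in for the gene shared memory, and the choice to evict the lowest-frequency resident files first are all assumptions. The sketch also treats a file that exactly fills the remaining space as fitting, a case the text leaves open.

```python
from dataclasses import dataclass


@dataclass
class LibraryInfo:
    name: str
    size: int              # space the file needs in the shared memory
    history_requests: int  # number of historical load requests
    frequency: float       # load request frequency (higher means higher priority)


def admit(required, resident, capacity, first_preset):
    """Decide whether `required` is loaded; return (loaded, evicted_names).

    `resident` maps name -> LibraryInfo for files already in the shared
    memory and is updated in place only when the file is admitted.
    """
    # Condition 1: the file must have been requested often enough.
    if required.history_requests <= first_preset:
        return False, []

    remaining = capacity - sum(lib.size for lib in resident.values())

    # Resident files with a lower frequency priority, lowest priority first.
    candidates = sorted(
        (lib for lib in resident.values() if lib.frequency < required.frequency),
        key=lambda lib: lib.frequency,
    )

    evicted, freed = [], 0
    for lib in candidates:
        if remaining + freed >= required.size:
            break  # enough room already, stop evicting
        evicted.append(lib.name)
        freed += lib.size

    # Condition 2: the space after deletion must cover the required space.
    if remaining + freed < required.size:
        return False, []

    for name in evicted:
        del resident[name]
    resident[required.name] = required
    return True, evicted
```

When the required file already fits, the loop exits immediately and nothing is evicted, which corresponds to the direct-load case of paragraph [0008].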

[0010] Optionally, the method further comprises: setting the gene

shared memory for library files used in gene analysis, setting a size

of the gene shared memory, the number of library files that can be

accommodated, a name of each library file and a size offset of the

each library file; and loading library files commonly used in gene

analysis into the gene shared memory according to the size of the

gene shared memory, the number of library files that can be

accommodated, the name of the each library file and the size offset

of the each library file.

[0011] Optionally, the gene analysis comprises an alignment

analysis, a variation analysis and an annotation analysis, the method

further comprises: performing the alignment analysis, the variation

analysis, and the annotation analysis on the sample data preprocessed

in sequence, wherein in a case where the sample data preprocessed

comprises multiple groups of sample data, the multiple groups of

sample data are in a same step or different steps of the gene analysis at a time.

[0012] Optionally, the gene analysis further comprises a sorting

analysis and a marking-duplicate analysis, wherein after performing

the alignment analysis, the variation analysis, and the annotation

analysis on the sample data preprocessed in sequence, the method

further comprises: labeling the sample data after the alignment

analysis with a position tag; and performing the sorting analysis

and the marking-duplicate analysis by module on the sample data

labeled.

[0013] Optionally, the method further comprises: connecting some or

all steps of the gene analysis by a use of memory.

[0014] Optionally, preprocessing the sample data comprises:

performing a quality control, a filtering operation and a statistical

process on the sample data.

[0015] Some embodiments of the disclosure also provide a gene shared

memory based gene analysis apparatus, comprising: a data reading

module configured to read sample data; a data preprocessing module

configured to preprocess the sample data; and a gene analysis module

configured to perform a gene analysis on the sample data preprocessed,

and determine whether a required library file in the gene analysis

is in a gene shared memory; if yes, obtain the required library file

from the gene shared memory, map the required library file to a process

of the gene analysis of the sample data preprocessed, and complete

a corresponding analysis.

[0016] Some embodiments of the disclosure further provide a computer

device, comprising a memory, a processor and a computer program stored

on the memory and executable on the processor. The processor executes

the following steps: reading sample data and preprocessing the sample

data; performing a gene analysis on the sample data preprocessed,

and determining whether a required library file in the gene analysis

is in a gene shared memory; if yes, obtaining the required library

file from the gene shared memory, mapping the required library file

to a process of the gene analysis of the sample data preprocessed,

and completing a corresponding analysis.

[0017] Some embodiments of the disclosure further provide a

computer-readable storage medium on which a computer program is

stored, wherein the computer program when executed by a processor

implements the following steps: reading sample data and preprocessing

the sample data; performing a gene analysis on the sample data

preprocessed, and determining whether a required library file in the

gene analysis is in a gene shared memory; if yes, obtaining the

required library file from the gene shared memory, mapping the

required library file to a process of the gene analysis of the sample

data preprocessed, and completing a corresponding analysis.

[0018] The gene shared memory based gene analysis method, apparatus,

computer device and computer readable medium are provided in the

embodiments of the disclosure. Sample data is read first, and then

the sample data is preprocessed, and then a gene analysis is performed on the sample data preprocessed. In the gene analysis, it is necessary to determine whether a required library file is in a gene shared memory of library files in gene analysis; if yes, the required library file is obtained from the gene shared memory, and mapped to the gene analysis corresponding to the sample data to complete the corresponding analysis. In the gene shared memory based gene analysis method, the gene shared memory mechanism is used to establish indexes for gene analysis (comprising, for example, alignment analysis, variant calling analysis and annotation analysis), and the database files (i.e. library files) required in the gene analysis are then stored in the gene shared memory. A library file can be conveniently mapped from the gene shared memory to a process of the gene analysis performed on the sample data. On one hand, the time and the I/O occupation for loading the library file from a hard disk are greatly reduced. On the other hand, the communications among multiple processes in the process of the gene analysis are facilitated and repeated loading of the library file is avoided.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] In order to more clearly explain the embodiments of the present

disclosure or the technical solutions in the prior art, a brief

introduction will be given below for the drawings required to be used

in the description of the embodiments or the prior art. It is obvious

that, the drawings illustrated as follows are merely embodiments of

the present disclosure. For a person skilled in the art, he or she

may also acquire other drawings according to such drawings on the

premise that no inventive effort is involved.

[0020] Fig. 1 is a schematic diagram of an application environment

of a shared memory based gene analysis method according to some

embodiments of the present disclosure;

[0021] Fig. 2 is a flow diagram of a shared memory based gene analysis

method according to some embodiments of the present disclosure;

[0022] Fig. 3 is a schematic diagram showing the communication principle

of a shared memory according to some embodiments of the present

disclosure;

[0023] Fig. 4 is a flow diagram of constructing a shared memory in

some embodiments of the present disclosure;

[0024] Fig. 5 is a structure diagram of a shared memory in some

embodiments of the present disclosure;

[0025] Fig. 6 is a flow diagram of a shared memory based gene analysis

method according to some embodiments of the present disclosure;

[0026] Fig. 7 is a diagram showing a CPU utilization and an I/O

utilization when a gene analysis is performed using a method A according to some embodiments of the present disclosure;

[0027] Fig. 8 is a diagram showing a CPU utilization and an I/O

utilization when a gene analysis is performed using a method B

according to some embodiments of the present disclosure;

[0028] Fig. 9 is a diagram showing a CPU utilization and an I/O

utilization when a gene analysis is performed using a method C

according to some embodiments of the present disclosure;

[0029] Fig. 10 is a structure diagram of a shared memory based gene

analysis apparatus according to some embodiments of the present

disclosure;

[0030] Fig. 11 is a structure diagram of a computer device according

to some embodiments of the present disclosure.

DETAILED DESCRIPTION

[0031] The technical solutions in the embodiments of the present

disclosure will be clearly and completely described below. Obviously,

the described embodiments are only a part of the embodiments of the

present disclosure, but not all of the embodiments. All other

embodiments obtained by those of ordinary skill in the art based on

the embodiments of the present disclosure without creative efforts

shall fall within the protection scope of the present disclosure.

[0032] Glossary:

[0033] Gene (Mendelian factor) refers to a DNA or RNA sequence that

carries genetic information (that is, a gene is a DNA or RNA fragment

with genetic effects), also known as genetic factor, which is a basic

genetic unit that controls biological traits. A gene expresses genetic

information it carries by directing a synthesis of proteins, thereby

controlling the traits of individual organisms. Gene sequencing is

a new type of gene detection technology that analyzes and determines

the whole sequence of genes from blood or saliva, so as to predict

the possibility of suffering from a variety of diseases, individual

behavior characteristics and reasonable behaviors.

[0034] Read: A short sequencing fragment, which is sequencing data

generated by a high-throughput sequencer. Tens of millions of reads

will be generated by sequencing an entire genome. Then, by splicing

these reads together, the full sequence of the genome can be obtained.

[0035] Alignment analysis: Reads sequenced by NGS are stored in FASTQ files. Although they originally came from an ordered genome, the sequential relationship between different reads in the files has been lost after DNA library building and sequencing. Therefore, there is no positional relationship between two reads next to each other in the FASTQ files. They are all short sequences randomly derived from certain positions in the original genome. Therefore, we need to straighten out a lot of short sequences first, compare them with a reference genome of the species one by one, find the position of each read on the reference genome, and then arrange them in order. This process is called the comparison (alignment) of sequencing data.

[0036] Sorting analysis: Why are BAM files output out of order after

a BWA comparison? The reason is that these sequenced reads in the

FASTQ files are randomly distributed on the genome. The first step

of the comparison is to locate the reads one by one on the reference

genome according to their order in the FASTQ files, and then output

them directly. It is impossible in this step to automatically

recognize the sequence of their comparison positions and rearrange

the comparison results. Therefore, in the result file obtained after

the comparison, the positional order of the records is chaotic. We

need to sort the records in order for a subsequent step such as

marking-duplicate, which is the reason for the need to sort.

[0037] Marking-duplicate: After the sorting is completed,

deduplication is performed (i.e., removing PCR duplicated sequences).

What is a duplicated sequence? How is it produced and why does it

need to be removed? It is related to the library construction and sequencing in the experimental process. Before NGS sequencing, a sequencing library needs to be constructed: the original DNA sequence is fragmented by physical (ultrasonic) shearing or by a chemical reagent (enzyme digestion), and sequences in a specific length range are then selected for PCR amplification and sequencing on the instrument. Therefore, the duplicated sequence here is actually introduced during the PCR process.

[0038] Base quality score correction: This step corrects, as far as possible, systematic errors in the sequencing process, because variant calling relies heavily on the sequencing base quality scores. The quality score is an important (even the only) indicator of how likely a sequenced base is to be correct. The true error rate cannot be measured directly, but a very close estimate of its distribution can be obtained through statistical techniques. A known variation found in a population is likely to be present in a given individual as well. Therefore, we can analyze the comparison result directly, exclude all known variation sites, and then count, for each (reported) quality score, how many bases differ from those on the reference genome after comparison. These differing bases are considered wrong bases, and their proportion reflects the real base error rate, which is converted into a Phred score. This information is output into a calibration table file, and is used to re-adjust the base quality scores in the original BAM file. A new BAM file is output using these new quality scores.
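
The conversion from an observed error proportion to a Phred score follows the standard definition Q = -10 * log10(p). The Python sketch below illustrates how one row of such a calibration table might be computed. It is a simplified illustration under stated assumptions: the function name `empirical_phred`, the cap `max_q`, and the example counts are hypothetical and are not taken from the disclosure or from any particular tool.

```python
import math


def empirical_phred(mismatches: int, total: int, max_q: float = 60.0) -> float:
    """Phred-scaled quality Q = -10 * log10(p), with p = mismatches / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if mismatches == 0:
        return max_q  # no observed errors: cap instead of returning infinity
    p = mismatches / total
    return min(max_q, -10.0 * math.log10(p))


# Hypothetical counts per reported quality score, taken at sites that are
# not known variation sites: reported Q -> (mismatching bases, total bases).
counts = {30: (10, 1000), 20: (25, 1000)}

# Calibration table: reported quality score -> empirically observed score.
calibration = {q: empirical_phred(m, n) for q, (m, n) in counts.items()}
```

In this made-up example, bases reported at Q30 mismatch the reference at a rate of 1 in 100, so their recalibrated score is about Q20.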

[0039] Variant calling and analysis: the purpose of variant calling

and analysis is to accurately detect a variation set in the genome

of each sample (such as human), that is, those DNA sequences that

are different for different people.

[0040] In order to make the object, technical solution and advantages

of the present application more clear and explicit, the present

application will be further described in detail in combination with

the drawings and the embodiments. It should be understood that the

detailed embodiments that will be described herein are only used for

explaining the present application, but not used for limiting the

present application.

[0041] This method can be applied to the terminal 102 in Fig. 1. The

terminal can be a personal computer, laptop, etc. The terminal 102

is connected with a gene sequencing device 104, which can be a gene

sequencer, etc.

[0042] When the terminal 102 is connected with the gene sequencing

device 104 through a local interface, the gene sequencing device 104

can send sample data after sequencing to the terminal 102. In addition,

the terminal 102 can obtain the sample data after sequencing in the

gene sequencing device 104 through instructions.

[0043] In some embodiments, as shown in Fig. 2, a shared memory based

gene analysis method is provided. As an illustration, this method

is applied to the terminal in Fig. 1 as an example, and comprises

the following steps:

[0044] In step S202, sample data is read and the sample data is

preprocessed.

[0045] The sample data is data generated or formed after gene

sequencing of samples. The number of the samples can be one or more

groups.

[0046] In an optional embodiment, preprocessing the sample data

comprises: performing a quality control, a filtering operation and

a statistical process on the sample data.

[0047] The data obtained from gene sequencing is called raw data (i.e.

raw reads). The raw data may contain low-quality sequences

and splice sequences, which will affect the analysis result.

Therefore, a series of data processing shall be carried out on the

raw data, such as a quality control, a filtering operation and a

statistical process, to remove impurities in the raw data, so as to

determine whether the sequencing data is suitable for subsequent

analysis.
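
As an illustration of the quality control, filtering and statistics described above, the following Python sketch filters reads by length and mean base quality and reports simple counts. It is a minimal sketch under assumptions: the function names, the thresholds `min_mean_q` and `min_length`, and the Phred+33 quality encoding are chosen for illustration and are not specified by the disclosure.

```python
def mean_quality(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    if not quality_line:
        return 0.0
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)


def filter_reads(records, min_mean_q=20.0, min_length=30):
    """Keep reads that pass the length and quality thresholds.

    `records` is an iterable of (name, sequence, quality) tuples.
    Returns the kept records and a dictionary of simple statistics.
    """
    kept = []
    total = 0
    for name, seq, qual in records:
        total += 1
        if len(seq) >= min_length and mean_quality(qual) >= min_mean_q:
            kept.append((name, seq, qual))
    stats = {"total": total, "kept": len(kept), "dropped": total - len(kept)}
    return kept, stats
```

The returned statistics can then be used to judge whether the sequencing data is suitable for the subsequent analysis.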

[0048] In step S204, a gene analysis on the sample data preprocessed

is performed, and whether a required library file in the gene analysis

is in a gene shared memory is determined.

[0049] Generally, after preprocessing the sample data, it is

necessary to carry out a relevant gene analysis on the sample data.

A common analysis mainly comprises a sequence alignment (i.e.

alignment analysis), a variant calling (i.e. variation analysis),

an annotation statistics (i.e. annotation analysis) and a subsequent pathway analysis (such as a GO analysis, a KEGG analysis and a protein pathway analysis). However, no matter which analysis is carried out, an analysis database is needed. For example, a reference genome database is required for the alignment analysis, a species genome database (such as a human genome database) is required for the variant calling, an annotation database is required for the annotation analysis, a pathway database is required for the pathway analysis, etc. Each database has a large amount of data. These databases need to be loaded when the analysis is carried out on the sample data.

[0050] Shared memory is one of the interprocess communication (IPC)

mechanisms in System V. Shared memory, as its name implies, allows two unrelated

processes to access a same logical memory, and is a very effective

way to share and transfer data between two running processes. The

memory shared between different processes is usually a same piece

of physical memory. Processes can connect the same piece of physical

memory to their own address space, and all processes can access

addresses in the shared memory. If a process writes data to the shared

memory, this change will immediately affect any other process that

can access the same piece of shared memory.

[0051] Fig. 3 is a schematic diagram showing the communication

principle of a shared memory. In Linux, each process has its own

process control block (PCB) and address space (Addr Space), and has

a corresponding page table, which is used for mapping virtual

addresses of the process to physical addresses and is managed through a memory management unit (MMU). Two different virtual addresses may be mapped to a same area in a physical space by using the page table, and this area they point to is a shared memory. Referring to Fig. 3, there are two processes ProcA and ProcB in the Figure. When virtual

addresses are mapped to a physical address through page tables of

these two processes, there is a common memory area of the physical

address, that is, a shared memory, which can be seen by the two

processes at a same time. In this way, when one process writes and

another process reads, an inter process communication can be realized

between the two processes. For the shared memory, its implementation

adopts a principle of reference counting. When a process detaches

the shared memory area, a counter decreases by one. When a process

successfully attaches to the shared memory area, the counter increases

by one. The shared memory area can be deleted only if the counter

becomes zero. When the process terminates, the shared memory area

attached to it will automatically detach from it.
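
The attach, read and detach cycle described above can be sketched with Python's standard `multiprocessing.shared_memory` module. This is an illustration under assumptions, not the disclosed implementation: the module is built on POSIX shared memory on Unix-like systems rather than on the System V calls discussed in this paragraph, it does not expose the attach counter, and the helper names `publish` and `read_by_name` are hypothetical.

```python
from multiprocessing import shared_memory


def publish(data: bytes) -> shared_memory.SharedMemory:
    """Create a shared segment and copy `data` into it (the writer side)."""
    segment = shared_memory.SharedMemory(create=True, size=len(data))
    segment.buf[:len(data)] = data
    return segment


def read_by_name(name: str, length: int) -> bytes:
    """Attach to an existing segment by name, copy out `length` bytes, detach."""
    view = shared_memory.SharedMemory(name=name)
    try:
        return bytes(view.buf[:length])
    finally:
        view.close()  # detach this handle; the segment itself stays alive


if __name__ == "__main__":
    payload = b"ACGTACGT"
    owner = publish(payload)
    try:
        # Any process that knows the segment name could run this call.
        print(read_by_name(owner.name, len(payload)))
    finally:
        owner.close()
        owner.unlink()  # remove the segment once it is no longer needed
```

Here `close()` plays the role of detaching and `unlink()` the role of deleting the area. The length is passed explicitly because some platforms round the segment size up to a multiple of the page size.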

[0052] In the embodiments, a gene shared memory is constructed for

library files in gene analysis, in which the most commonly used

databases in gene analysis processing can be stored. When a database

is needed in an analysis of sample data, it can be obtained directly

from the gene shared memory, which greatly reduces the time spent loading

the database from a disk. In addition, when

multiple groups of sample data are analyzed at a same time, the

database can be shared among the multiple groups of sample data, which reduces repeated loading and I/O occupation.

[0053] In an optional embodiment, as shown in Fig. 4, there is also

provided a method of constructing a shared memory, comprising: steps

S402-S404.

[0054] In step S402, the gene shared memory for library files used

in gene analysis is set, a size of the gene shared memory, the number

of library files that can be accommodated, a name of each library

file and a size offset of the each library file are set.

[0055] In step S404, library files commonly used in gene analysis

are loaded into the gene shared memory according to the size of the

gene shared memory, the number of library files that can be

accommodated, the name of the each library file and the size offset

of the each library file.

[0056] Referring to Fig. 5, a certain area is selected in a terminal

system (i.e. a hardware device used for a gene analysis of sample

data) as the gene shared memory of library files in the gene analysis.

An appropriate size of the gene shared memory is determined according

to a storage space, a data processing ability and other performances

of the terminal system. The gene shared memory area begins with a table header held in the physical memory of the node, and its contents are organized as follows: 1) first, the number (n) of determined shared libraries and the total length (Len) of the shared area are stored; 2) the name (e.g. Lib1, Lib2) and the length offset (offset1, offset2) of each specified library file in the gene shared memory are stored; 3) the data of each specified library file is stored in the selected area in turn.
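
The layout just described (the count n, the total length Len, a name and offset per library, then the library data in turn) can be sketched as a packed byte region. The Python code below is one possible encoding and not the layout used in the disclosure: the field widths, the 32-byte name field (longer names would be truncated), and the function names `pack_region` and `find_library` are assumptions. The length of each library is derived from the next offset, or from Len for the last library.

```python
import struct

HEADER = struct.Struct("<IQ")   # n (number of libraries), Len (total length)
ENTRY = struct.Struct("<32sQ")  # library name (null padded), byte offset of its data


def pack_region(libs: dict) -> bytes:
    """Lay out {name: data} as header, name/offset table, then data in order."""
    table_size = HEADER.size + ENTRY.size * len(libs)
    total = table_size + sum(len(data) for data in libs.values())
    out = bytearray(total)
    HEADER.pack_into(out, 0, len(libs), total)
    offset = table_size
    for i, (name, data) in enumerate(libs.items()):
        ENTRY.pack_into(out, HEADER.size + i * ENTRY.size, name.encode(), offset)
        out[offset:offset + len(data)] = data
        offset += len(data)
    return bytes(out)


def find_library(region, name: str):
    """Return the data stored under `name`, or None if it is not in the region."""
    n, total = HEADER.unpack_from(region, 0)
    entries = [
        ENTRY.unpack_from(region, HEADER.size + i * ENTRY.size) for i in range(n)
    ]
    for i, (raw_name, start) in enumerate(entries):
        if raw_name.rstrip(b"\0").decode() == name:
            end = entries[i + 1][1] if i + 1 < n else total
            return bytes(region[start:end])
    return None
```

Because `find_library` only needs a bytes-like object, the same lookup would work on the buffer of a shared memory segment, so a process can locate a library by name without copying the whole region.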

[0057] Its working principle is as follows: the sample data can

comprise multiple groups of data; each group of data has a

corresponding sample process. From sample process P1 to sample process

PN, each process has its own process control block (PCB) and address

space (Addr space), and has a corresponding page table, which is used

for mapping virtual addresses of the process to physical addresses

and is managed through a memory management unit (MMU). Two different

virtual addresses may be mapped to a same area in a physical space

by using the page table, and this area they point to is a shared memory.

Through the above method, each sample process can enter the shared

memory area, so as to obtain a required library file in the shared

memory area.

[0058] In step S206, if yes, the required library file from the gene

shared memory is obtained, the required library file is mapped to

a process of the gene analysis of the sample data preprocessed, and

a corresponding analysis is completed.
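
The decision made in steps S204 and S206 can be summarized as a lookup with a fallback. The Python sketch below is illustrative only: a plain dictionary stands in for the gene shared memory, and the class name `GeneLibraryLookup`, the `load_from_disk` callback and the request counter are hypothetical names introduced here, not elements named in the disclosure.

```python
class GeneLibraryLookup:
    """Stand-in for the check in S204/S206; a dict plays the gene shared memory."""

    def __init__(self, shared, load_from_disk):
        self.shared = shared                  # name -> library data already resident
        self.load_from_disk = load_from_disk  # callable used only on a miss
        self.request_counts = {}              # name -> number of load requests so far

    def get(self, name):
        """Return (data, source), where source is 'shared' or 'disk'."""
        self.request_counts[name] = self.request_counts.get(name, 0) + 1
        if name in self.shared:
            # Hit: the analysis process uses the resident copy directly.
            return self.shared[name], "shared"
        # Miss: fall back to loading the library file from disk.
        return self.load_from_disk(name), "disk"
```

The per-file request counts kept here are the kind of information a later load decision could use as the number of historical load requests.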

[0059] In the gene shared memory based gene analysis method provided

in the embodiments of the disclosure, sample data is read first, and

then the sample data is preprocessed, and then a gene analysis is

performed on the sample data preprocessed. In the gene analysis, it

is necessary to determine whether a required library file is in a

gene shared memory of library files in gene analysis; if yes, the required library file is obtained from the gene shared memory, and mapped to the gene analysis corresponding to the sample data to complete the corresponding analysis. In the gene shared memory based gene analysis method, the gene shared memory mechanism is used to establish indexes for gene analysis (comprising, for example, alignment analysis, variant calling analysis and annotation analysis), and the database files (i.e. library files) required in the gene analysis are then stored in the gene shared memory. A library file can be conveniently mapped from the gene shared memory to a process of the gene analysis performed on the sample data. On one hand, the time and the I/O occupation for loading the library file from a hard disk are greatly reduced. On the other hand, the communications among multiple processes in the process of the gene analysis are facilitated and repeated loading of the library file is avoided.

[0060] In some embodiments, the method further comprises: determining

whether the required library file meets a load condition, in a case

where the required library file in the gene analysis is not in the

gene shared memory; and loading the required library file into the

gene shared memory, in a case where the loading condition is met.

[0061] Specifically, if the required library file in the gene

analysis is not in the gene shared memory, it is determined whether

the required library file meets the load condition. The required

library file can be loaded into the gene shared memory if the loading

condition is met. On the one hand, it is faster and more efficient to load the required library file into the gene shared memory and then obtain the required library file from the gene shared memory; on the other hand, it can also facilitate other sample data processes to use the required library file, which avoids a repeated loading.

[0062] In some embodiments, determining whether the required library

file meets a load condition, in a case where the required library

file in the gene analysis is not in the gene shared memory, and loading

the required library file into the gene shared memory, in a case where

the loading condition is met comprises:

[0063] acquiring information of the required library file and

information of the gene shared memory, wherein the information of

the required library file comprises a space required by the required

library file and the number of historical load requests, and the

information of the gene shared memory comprises a remaining space

of the gene shared memory; and if the number of historical load

requests is greater than a first preset number, and the space required

by the required library file is less than the remaining space of the

gene shared memory, loading the required library file into the gene

shared memory.
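The load condition of this embodiment can be written as a single predicate. This is a minimal sketch; the function and parameter names are hypothetical, and the value of the first preset number is left to the caller.

```python
def meets_load_condition(history_requests: int, required_space: int,
                         remaining_space: int, first_preset_number: int) -> bool:
    """True when the library is requested often enough and fits in the remaining space."""
    return (history_requests > first_preset_number
            and required_space < remaining_space)
```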

[0064] The information of the required library file refers to

information related to the required library file, which can comprise

a type of the required library file, a size of the required library

file, a space required by the required library file, the number of

historical load requests and a load request frequency of the required library file, etc. Information of the gene shared memory refers to information related to the gene shared memory, mainly comprising a size of the gene shared memory, a remaining space of the gene shared memory, etc.

[0065] A first preset number is a preset value, which can be used

to reflect an importance of a library file to a certain extent. That

is, if the number of historical load requests is greater than the

first preset number, it indicates that the required library file is

needed or used frequently, i.e., the required library file is

important in the gene analysis, and can be loaded into the gene shared

memory, so as to facilitate the use for other sample data. After

determining the importance of the required library file, it is further

necessary to determine whether the remaining space of gene shared

memory is enough to store the required library file, that is, determine

whether the space required by the required library file is less than

the remaining space of gene shared memory. If so, the required library

file can be directly loaded into gene shared memory.

[0066] In some embodiments, the information of the required library

file further comprises a load request frequency of the required

library file, the information of the gene shared memory further

comprises load request frequencies of all library files; determining

whether the required library file meets a load condition, and loading

the required library file into the gene shared memory, in a case where

the loading condition is met further comprises: if the number of historical load requests is greater than the first preset number, and the space required by the required library file is greater than the remaining space of the gene shared memory, ranking the required library file and all the library files in an order of priority according to the load request frequency of the required library file and the load request frequencies of all the library files to obtain a load request frequency priority of each library file; if the load request frequency priority of the required library file is higher than that of a library file in the gene shared memory, and if the remaining space of the gene shared memory after deleting the library file with a lower load request frequency priority in the gene shared memory is greater than or equal to the space required by the required library file, deleting the library file with the lower load request frequency priority in the gene shared memory; and loading the required library file into the gene shared memory.

[0067] Specifically, if it is determined that the space required for

the required library file is greater than the remaining space of the

gene shared memory, it indicates that the remaining space of the gene

shared memory is not enough to store the required library file; in

this case, it is necessary to compare the required library file with

the library files already stored in the gene shared memory, delete

a library file with a low load request frequency according to the

load request frequency priorities of the library files, and then load

the required library file into the gene shared memory.

[0068] In the embodiments, the required library file and the library

files stored in the gene shared memory are ranked in an order of

priority mainly according to the load request frequency of each

library file. If the load request frequency priority of the required

library file is higher than that of a library file in the gene shared

memory, the library file in the gene shared memory is deleted to load

the required library file into the gene shared memory. The sizes of

all the library files are taken into comprehensive consideration in

the above process. It is only necessary to ensure that the memory

occupied by the deleted library file is sufficient to store the

required library file.
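The deletion step described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the disclosure: the `Lib` record and `plan_eviction` are hypothetical names, and the sketch assumes that lower-priority libraries are deleted lowest frequency first until the required library fits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lib:
    name: str
    size: int
    freq: float  # load request frequency


def plan_eviction(required: Lib, loaded: List[Lib], remaining: int) -> Optional[List[str]]:
    """Return the names to delete so that `required` fits, or None if it cannot fit."""
    # Only loaded libraries with a lower frequency priority are candidates.
    candidates = sorted((lib for lib in loaded if lib.freq < required.freq),
                        key=lambda lib: lib.freq)
    evicted = []
    for lib in candidates:
        if remaining >= required.size:
            break
        remaining += lib.size  # space freed by deleting this library
        evicted.append(lib.name)
    return evicted if remaining >= required.size else None
```

A `None` result means that deleting every lower-priority library would still not free enough space, so nothing should be deleted.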

[0069] In this way, when the required library file in the process

of the gene analysis is not in the gene shared memory, the library

file can be loaded into the gene shared memory first, so as to improve

the efficiency of a subsequent calculation.

[0070] In some embodiments, the gene analysis comprises an alignment

analysis, a variation analysis and an annotation analysis; the method

further comprises: performing the alignment analysis, the variation

analysis, and the annotation analysis on the sample data preprocessed

in sequence, wherein in a case where the sample data preprocessed

comprises multiple groups of sample data, the multiple groups of

sample data are in a same step or different steps of the gene analysis

at a time.

[0071] In the embodiments, the method of the gene analysis comprises

the alignment analysis, the variation analysis and the annotation

analysis. There is usually a sequence requirement in the

process of the gene analysis, that is, the alignment analysis is

generally carried out first, followed by the variation analysis, and

then the annotation analysis. However, when there are multiple groups

of sample data, each group of sample data can be in a same step or

different steps of the gene analysis. For example, sample data 1 can

be in an alignment analysis, sample data 2 can be in a variation

analysis, and sample data 3 can be in an annotation analysis. It is

also possible for sample data 1, sample data 2 and sample data 3 to

be in an alignment analysis, a variation analysis or an annotation

analysis at the same time. Multiple groups of sample data can be

processed at the same time by using the method, which can further

improve the data processing speed.
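As a rough sketch of how several groups of sample data can progress through the ordered steps at the same time, the example below runs each sample through the three steps in sequence while a thread pool handles the samples concurrently. The step bodies are placeholders, and `analyze` and `analyze_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# The order within one sample is fixed: alignment, then variation, then annotation.
STEPS = ("alignment", "variation", "annotation")


def analyze(sample: str) -> list:
    """Run the steps in order for one sample and record which steps were done."""
    done = []
    for step in STEPS:
        done.append(f"{sample}:{step}")  # placeholder for the real analysis step
    return done


def analyze_all(samples):
    """Process several samples concurrently; each may be at a different step."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(analyze, samples))
```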

[0072] In some embodiments, the gene analysis further comprises a

sorting analysis and a marking-duplicate analysis, wherein after

performing the alignment analysis, the variation analysis, and the

annotation analysis on the sample data preprocessed in sequence, the

method further comprises: labeling the sample data after the alignment

analysis with a position tag; and performing the sorting analysis

and the marking-duplicate analysis by module on the sample data

labeled.

[0073] Specifically, the gene analysis further comprises the sorting analysis and the marking-duplicate analysis. Labeling the sample data after the alignment analysis with a position tag means adding a position-related tag to the file produced by the alignment, so that the sorting analysis and the marking-duplicate analysis can be performed by module, and more efficient multi-threaded sorting can be applied to the sorting analysis and the marking-duplicate analysis.

[0074] In some embodiments, the method further comprises: connecting

some or all steps of the gene analysis by a use of memory.

[0075] Specifically, several or all of the alignment, sorting, marking-duplicate and variant calling steps in the process of the gene analysis can be connected by the use of memory. Connecting the steps by the use of memory reduces the number of intermediate SAM/BAM files that are output, which reduces the I/O occupation.
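The idea of connecting steps through memory rather than through intermediate files can be illustrated with a toy pipeline in which each step hands records directly to the next. The record layout and the functions below are hypothetical simplifications; a real aligner, sorter and duplicate marker are far more involved.

```python
def align(reads):
    """Toy aligner: the caller supplies (position, sequence) pairs."""
    for pos, seq in reads:
        yield {"pos": pos, "seq": seq}


def sort_records(records):
    """Sort in memory by the position tag instead of writing a file first."""
    return sorted(records, key=lambda rec: rec["pos"])


def mark_duplicates(records):
    """Flag a record as a duplicate when the same position and sequence were seen before."""
    seen = set()
    for rec in records:
        key = (rec["pos"], rec["seq"])
        rec["dup"] = key in seen
        seen.add(key)
        yield rec


def pipeline(reads):
    # No intermediate SAM/BAM file is written between the steps.
    return list(mark_duplicates(sort_records(align(reads))))
```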

[0076] For ease of understanding, a detailed embodiment is given

below. Fig. 6 shows the whole process of a gene analysis and a process

in the gene shared memory area. The process of the gene analysis is

as follows: after samples are input, the data of each sample is

preprocessed, and then whether a library file required for an

alignment analysis is loaded into the gene shared memory area is

determined; if yes, the alignment analysis is started, or if not,

the library file is loaded from a hard disk to perform the alignment

analysis; the process of the alignment analysis is synthesized as

a flexible step by a memory connection and an algorithm optimization;

then the variant calling is performed, and whether a library file of annotation information has been loaded into the gene shared memory is determined; if yes, the annotation statistics is started, or if not, the library file is loaded from a hard disk for the annotation statistics; the analysis process is then ended.

[0077] A process in the gene shared memory area is as follows: if

there is a request for information of library lib-x (i.e. a required

library file), whether the required library file is in the gene shared

memory area is determined; if yes, library data is fed back, and the

process is ended; if the required library file is not in the gene

shared memory area, whether to load the required library file through

a load method Q is determined; if yes, the required library file is

loaded into the gene shared memory area, the library data is returned,

and the process is ended; if the required library file is not to be

loaded through the load method Q, no information is returned and the

process is ended.
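The request flow in the gene shared memory area can be sketched as a small function. Here a dictionary stands in for the shared memory area, and `should_load` and `load_from_disk` are hypothetical callables representing the decision of the load method Q and the disk read.

```python
def handle_request(name, shared_area, should_load, load_from_disk):
    """Return library data, or None when no information is returned."""
    if name in shared_area:          # already in the gene shared memory area
        return shared_area[name]
    if should_load(name):            # decision taken by the load method Q
        shared_area[name] = load_from_disk(name)
        return shared_area[name]
    return None                      # not loaded: no information is returned
```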

[0078] The specific steps of the load method Q are as follows:
1. a type and a size of the required library file are determined;
2. a record file is obtained;
3. a total memory size of the node, a size of the shared memory area, the number of historical load requests of this type of library and the total number of historical load requests of all types of libraries are read from the record file;
4. the memory size of the node is updated from a hard disk to prevent the memory size of the node from changing;
5. the number of historical load requests of this type of library is increased by 1 (ftype + 1);
6. the total number of historical load requests of all types of libraries is increased by 1 (ftotal + 1);
7. whether the remaining space is enough to load the library is determined;
8. request frequencies (ftype/ftotal) of all types of libraries in the record file are ranked in descending order, and a ranked linked list is returned;
9. whether the required library file has been loaded is determined; if the required library file has been loaded, a library index is returned; if the required library file has not been loaded and the number of historical load requests of this type of library is more than 10, its priority and rank position in all unloaded libraries are determined;
10. if the priority of this type of library exceeds that of a loaded library, the system predicts whether a sum of the sizes of the loaded libraries ranked after this type of library in the typelist meets a condition W of a size of memory for loading this type of library; if yes, these loaded libraries are unloaded in reverse order until the condition W is met; if not, no process is performed;
11. if the load condition is met, the record of the size of the shared memory area is updated;
12. otherwise, the library is marked as not loaded because there is insufficient memory to load it, and the mark is written to the record file.

[0079] The format of the record file is given below:

[0080] M: 63492649171200

[0081] Len: 13492649171200

[0082] ftotal: 100

Type    Size              Loaded    Number of historical load requests (ftype)    typeflag
Libx    10000000000000    Yes       75                                            0
Liby    3492649171200     Yes       12                                            0
Libw    40000000000000    No        10                                            1
Libz    5000000           No        3                                             0

[0083] typeflag indicates the reason for not being loaded, wherein

"1" indicates that the load priority of this type of library was ranked

first and it was not loaded because of insufficient memory, and the

typeflag of a loaded library is 0.
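The example record can be checked for internal consistency with a short parser. The on-disk layout assumed here (three `key: value` header lines followed by whitespace-separated rows) is a guess made for illustration; the disclosure shows the fields but not the exact file syntax, and `parse_record` is a hypothetical helper.

```python
RECORD = """\
M: 63492649171200
Len: 13492649171200
ftotal: 100
Libx 10000000000000 Yes 75 0
Liby 3492649171200 Yes 12 0
Libw 40000000000000 No 10 1
Libz 5000000 No 3 0
"""


def parse_record(text):
    """Split the record into header values and one dict per library row."""
    header, rows = {}, []
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            header[key.strip()] = int(value)
        elif line.strip():
            name, size, loaded, ftype, flag = line.split()
            rows.append({"type": name, "size": int(size), "loaded": loaded == "Yes",
                         "ftype": int(ftype), "typeflag": int(flag)})
    return header, rows
```

With these values, Len equals the summed size of the two loaded libraries and ftotal equals the sum of the four ftype counts. Applying the condition W given in the pseudo code (M*0.5 - Len - size > 0) to Libw yields a negative value, which agrees with its typeflag of 1.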

[0084] In addition, the pseudo code of the load method Q is as follows.

[0085] RequestShareMem(type, size) // type: the type of the library for sharing; size: the size of the library for sharing

[0086] File = RecordFile // the record file

[0087] ReadFromFile(M, Len, ftype, ftotal) // read from the record file (M: total memory size of the node; Len: current size of the shared memory area; ftype: the number of historical load requests of this type of library; ftotal: the total number of historical load requests of all types of libraries)

[0088] Update(M) // update the memory size of the node from a hard disk to prevent the memory size of the node from changing

[0089] ftype = ftype + 1 // update ftype

[0090] ftotal = ftotal + 1 // update ftotal

[0091] W = M*0.5 - Len - size > 0 // the condition W: determine whether there is remaining space for loading; 0.5 is an adjustable factor, currently 50% of the total memory is used

[0092] typelist = SortAllTypeInFile() // rank request frequencies (ftype/ftotal) of all types of libraries in the record file in descending order, and return a ranked linked list

[0093] if AlreadyLoaded(type) then

[0094] id = GetShareMemId(type) // if the required library file has been loaded, return a library index

[0095] else if ftype > 10 // the number of historical load requests of this type of library is more than 10

[0096] if IsPrior(typelist, type) // determine whether it is the first priority: ranked in front of all other unloaded libraries whose typeflag is 0

[0097] if typeflag = 1

[0098] UnloadShareMem(typelist, type) // if a sum of the sizes of the loaded libraries ranked after this type of library in the typelist meets a condition W, unload these loaded libraries in reverse order until the condition W is met; otherwise, no process is performed

[0099] if W

[00100] id = LoadShareMem(type, size)

[00101] Len = Len + size // update the size of the shared memory area

[00102] typeflag = 0

[00103] else

[00104] typeflag = 1 // mark that there is no sufficient memory, update the record

[00105] id = 0

[00106] else

[00107] id = 0 // return no information

[00108] UpdateFile(M, Len, ftype, ftotal, typeflag) // update the record file

[00109] return id // return an index of the shared memory area, "0" represents no information

[00110] end
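For readers who prefer executable code, a simplified Python sketch of the load method Q follows. It is not a line-by-line translation: the record is an in-memory dictionary rather than a file, the function returns True or False instead of a shared memory index, the IsPrior and typeflag gates before unloading are omitted, and the condition W is re-evaluated after unloading. All of these are assumptions made for illustration, as are the names `request_share_mem`, `factor` and `min_requests`.

```python
def request_share_mem(record, lib_type, size, factor=0.5, min_requests=10):
    """Return True if lib_type is, or becomes, loaded; update the record in place."""
    libs = record["libs"]
    lib = libs.setdefault(
        lib_type, {"size": size, "loaded": False, "ftype": 0, "typeflag": 0})
    lib["ftype"] += 1        # update ftype
    record["ftotal"] += 1    # update ftotal

    def fits():              # condition W: M*factor - Len - size > 0
        return record["M"] * factor - record["Len"] - size > 0

    if lib["loaded"]:
        return True
    if lib["ftype"] <= min_requests:
        return False         # not requested often enough: no information
    # Rank all types by request frequency (ftype/ftotal), highest first.
    typelist = sorted(libs, key=lambda t: libs[t]["ftype"] / record["ftotal"],
                      reverse=True)
    rank = typelist.index(lib_type)
    lower_loaded = [t for t in typelist[rank + 1:] if libs[t]["loaded"]]
    freed = sum(libs[t]["size"] for t in lower_loaded)
    # Unload only if removing the lower-ranked loaded libraries would satisfy W.
    if not fits() and record["M"] * factor - (record["Len"] - freed) - size > 0:
        for t in reversed(lower_loaded):   # lowest priority first
            libs[t]["loaded"] = False
            record["Len"] -= libs[t]["size"]
            if fits():
                break
    if fits():
        lib["loaded"], lib["typeflag"] = True, 0
        record["Len"] += size
        return True
    lib["typeflag"] = 1      # mark: not loaded because of insufficient memory
    return False
```

With the example record above, a request for Libw still fails after its count rises to 11, because half of M minus Len is smaller than the size of Libw and no lower-ranked library is loaded.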

[00111] Some embodiments for showing effects:

[00112] In order to verify the effectiveness of the shared memory based gene analysis method in the embodiments of the disclosure, three gene analysis methods are compared in terms of CPU utilization and I/O time: method A (software without optimization, i.e. the steps of the gene analysis are not connected by a use of memory and are independent from each other, and without a use of the gene shared memory), method B (software with optimization, i.e. all steps of the gene analysis are connected by a use of memory, but without a use of the gene shared memory) and method C (software with optimization, i.e. all steps of the gene analysis are connected by a use of memory, and with a use of the gene shared memory). The results are shown in Figs. 7 to 9, wherein Fig. 7 shows an analysis result of the method A, Fig. 8 shows an analysis result of the method B, and Fig. 9 shows an analysis result of the method C.

[00113] It can be seen from Figs. 7 to 9 that the running time of an analysis

portion of the method A before acceleration (i.e. each of a step of

reading sample data and a step of preprocessing before the alignment

analysis runs independently and the comparison is processed directly

without the use of the gene shared memory) is 2.83 hours, and the

CPU utilization fluctuates greatly. Running time of the comparison

portion and an annotation portion before acceleration (i.e. the

comparison and the annotation are processed directly without the use

of the gene shared memory) is 2.61 hours, the CPU utilization is high,

and the I/O sec (i.e. the number of transfers output to a physical

disk per second) is high, indicating that the I/O utilization is high

and the probability of blocking is high.

[00114] Running time of an analysis portion of the method B after

acceleration (i.e. a step of reading sample data and a step of

preprocessing before the alignment analysis are connected by the use

of memory and the comparison is processed by using the gene shared

memory) is 1.75 hours, and the CPU utilization fluctuates less

than that of method A. Running time of a library comparison portion

before the use of gene shared memory (i.e. the comparison is processed

directly without the use of the gene shared memory) is 2.38 hours,

the CPU utilization is high, and the I/O sec (i.e. the number of

transfers output to a physical disk per second) is high, indicating that the I/O utilization is high and the probability of blocking is high.

[00115] Running time of an analysis portion of the method C after

acceleration (i.e. a step of reading sample data and a step of

preprocessing before the alignment analysis are connected by the use

of memory and the comparison is processed by using the gene shared

memory) is 1.75 hours, and the CPU utilization fluctuates less

than that of method A (this portion is the same as method B). Running

time of a library comparison portion after the use of gene shared

memory (i.e. the comparison is processed with the use of the gene

shared memory) is 0.82 hours, the CPU utilization is high, and the

I/O sec (i.e. the number of transfers output to a physical disk per

second) is low, indicating that the I/O utilization is low and the

probability of blocking is low.

[00116] Therefore, the method C is used for the gene analysis, that

is, the gene analysis steps are connected by the use of memory. The

method of adopting the gene shared memory in comparison, annotation

and other processes can greatly reduce the time used for the gene

analysis and reduce the I/O utilization rate, that is, reduce I/O

blocking.
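The size of the reported improvements can be worked out from the running times stated above. The helper names are hypothetical; the figures are the reported hours, and only like-for-like portions are compared (the analysis portion of method A against method B, and the library comparison portion of method B against method C).

```python
def reduction_percent(before: float, after: float) -> float:
    """Percentage of running time saved."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster the later run is."""
    return before / after


# Analysis portion: 2.83 h (method A) versus 1.75 h (methods B and C).
analysis_saving = reduction_percent(2.83, 1.75)
# Library comparison portion: 2.38 h (method B) versus 0.82 h (method C).
comparison_saving = reduction_percent(2.38, 0.82)
comparison_speedup = speedup(2.38, 0.82)
```

On these figures, connecting the steps by memory saves roughly 38% of the analysis portion, and the gene shared memory saves roughly 66% of the library comparison portion, a speedup of about 2.9 times.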

[00117] It should be understood that although the steps in the

flowcharts of FIGS. 2, 4 and 6 are shown in order as indicated by

the arrows, these steps are not necessarily performed in order as

indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and these steps can be performed in other orders. Moreover, at least some steps in FIGS. 2, 4 and 6 may comprise multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be performed alternately with other steps or at least some sub-steps or stages of other steps.

[00118] In some embodiments, as shown in FIG. 10, there is provided

a shared memory based gene analysis apparatus, comprising:

[00119] The data reading module 102 is configured to read sample data.

[00120] The data preprocessing module 104 is configured to preprocess

the sample data.

[00121] The gene analysis module 106 is configured to perform a gene

analysis on the sample data preprocessed, and determine whether a

required library file in the gene analysis is in a gene shared memory;

if yes, obtain the required library file from the gene shared memory,

map the required library file to a process of the gene analysis of

the sample data preprocessed, and complete a corresponding analysis.

[00122] In some embodiments, the apparatus further comprises a library file loading module configured

to determine whether the required library file meets a load condition,

in a case where the required library file in the gene analysis is

not in the gene shared memory; and load the required library file

into the gene shared memory, in a case where the loading condition

is met.

[00123] In some embodiments, the library file loading module comprises

a library information and memory information acquisition module.

[00124] The library information and memory information acquisition

module is configured to acquire information of the required library

file and information of the gene shared memory, wherein the

information of the required library file comprises a space required

by the required library file and the number of historical load

requests, and the information of the gene shared memory comprises

a remaining space of the gene shared memory.

[00125] The library file loading module is configured to, if the number

of historical load requests is greater than a first preset number,

and the space required by the required library file is less than the

remaining space of the gene shared memory, load the required library

file into the gene shared memory.

[00126] In some embodiments, the information of the required library

file further comprises a load request frequency of the required

library file, the information of the gene shared memory further

comprises load request frequencies of all library files; and the

library file loading module further comprises a priority ranking

module and a library file deleting module.

[00127] The priority ranking module is configured to, if the number

of historical load requests is greater than the first preset number,

and the space required by the required library file is greater than

the remaining space of the gene shared memory, rank the required

library file and all the library files in an order of priority according to the load request frequency of the required library file and the load request frequencies of all the library files to obtain a load request frequency priority of each library file.

[00128] The library file deleting module is configured to, if the load

request frequency priority of the required library file is higher

than that of a library file in the gene shared memory, and if the

remaining space of the gene shared memory after deleting the library

file with a lower load request frequency priority in the gene shared

memory is greater than or equal to the space required by the required

library file, delete the library file with the lower load request

frequency priority in the gene shared memory.

[00129] The library file loading module is further configured to load

the required library file into the gene shared memory.

[00130] In some embodiments, the apparatus further comprises: a gene

shared memory setting module configured to set the gene shared memory

for library files used in gene analysis, set a size of the gene shared

memory, the number of library files that can be accommodated, a name

of each library file and a size offset of each library file.

[00131] The library file loading module is further configured to load

library files commonly used in gene analysis into the gene shared

memory according to the size of the gene shared memory, the number

of library files that can be accommodated, the name of each library

file and the size offset of each library file.
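One possible reading of the settings named above (a size for the gene shared memory, the number of library files, and a name and size offset per library file) is an index that records where each library starts inside one contiguous area. The sketch below packs libraries back to back; this layout is an assumption, and `build_index` is a hypothetical helper.

```python
def build_index(capacity, libraries):
    """Map each library name to (offset, size) within an area of `capacity` bytes.

    `libraries` is a list of (name, size) pairs in loading order. Packing stops
    at the first library that no longer fits.
    """
    index, offset = {}, 0
    for name, size in libraries:
        if offset + size > capacity:
            break
        index[name] = (offset, size)
        offset += size
    return index
```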

[00132] In some embodiments, the gene analysis comprises an alignment

analysis, a variation analysis and an annotation analysis.

[00133] The gene analysis module is configured to perform the

alignment analysis, the variation analysis, and the annotation

analysis on the sample data preprocessed in sequence, wherein in a

case where the sample data preprocessed comprises multiple groups

of sample data, the multiple groups of sample data are in a same step

or different steps of the gene analysis at a time.

[00134] In some embodiments, the gene analysis further comprises a

sorting analysis and a marking-duplicate analysis, and the apparatus

further comprises: a sorting and marking-duplicate module configured

to label the sample data after the alignment analysis with a position

tag; and perform the sorting analysis and the marking-duplicate

analysis by module on the sample data labeled.

[00135] In some embodiments, the apparatus further comprises: a memory

connection module configured to connect some or all steps of the gene

analysis by a use of memory.

[00136] In some embodiments, the data preprocessing module is further configured to perform a quality control, a filtering operation and a statistical process on the sample data.

[00137] For the specific definition of the shared memory based gene

analysis apparatus, please refer to the definition of the shared

memory based gene analysis method described above, which will not

be repeated here. All or some of the modules in the shared memory based gene analysis apparatus can be realized by software, hardware, or a combination thereof. The above modules can be embedded in or independent of a processor of a computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the corresponding operations of the above modules.

[00138] In some embodiments, a computer device is provided, which may

be a server, and its internal structure may be as shown in FIG. 11.

The computer device comprises a processor, a memory, a network

interface and a database connected through a system bus. The processor

of the computer device is used to provide computing and control

capabilities. The memory of the computer device comprises a

nonvolatile storage medium and a memory device. The nonvolatile

storage medium stores an operating system, a computer program, and

a database. The memory device provides an environment for the

operation of the operating system and the computer program in

nonvolatile storage medium. The database of the computer device is

used to store the data of a resistance equivalent model and equivalent

sub models, as well as the equivalent resistance, working resistance

and contact resistance obtained during calculation. The network

interface of the computer device is used to communicate with external

terminals through network connection. The computer program is

executed by the processor to implement a shared memory based gene

analysis method.

[00139] Those skilled in the art can understand that the structure

shown in FIG. 11 is only a block diagram of some structures related

to the scheme of this application, and does not constitute a limitation

on the computer device to which the scheme of this application is

applied. The specific computer device may comprise more or fewer

components than those shown in the Figure, or combine some components,

or have different component arrangements.

[00140] In some embodiments, a computer device is provided, comprising

a processor, a memory, and a computer program stored in the memory

and executable by the processor, wherein the processor when executing the computer

program implements the following steps: reading sample data and

preprocessing the sample data; performing a gene analysis on the

sample data preprocessed, and determining whether a required library

file in the gene analysis is in a gene shared memory; if yes, obtaining

the required library file from the gene shared memory, mapping the

required library file to a process of the gene analysis of the sample

data preprocessed, and completing a corresponding analysis.

[00141] In some embodiments, the processor when executing the computer

program further implements the following steps: determining whether

the required library file meets a load condition, in a case where

the required library file in the gene analysis is not in the gene

shared memory; and loading the required library file into the gene

shared memory, in a case where the loading condition is met.

[00142] In some embodiments, the processor when executing the computer

program further implements a step of: determining whether the required

library file meets a load condition, in a case where the required

library file in the gene analysis is not in the gene shared memory,

and loading the required library file into the gene shared memory,

in a case where the loading condition is met comprises: acquiring

information of the required library file and information of the gene

shared memory, wherein the information of the required library file

comprises a space required by the required library file and the number

of historical load requests, and the information of the gene shared

memory comprises a remaining space of the gene shared memory; and

if the number of historical load requests is greater than a first

preset number, and the space required by the required library file

is less than the remaining space of the gene shared memory, loading

the required library file into the gene shared memory.

[00143] In some embodiments, the processor when executing the computer

program further implements the following step: the information of

the required library file further comprises a load request frequency

of the required library file, the information of the gene shared memory

further comprises load request frequencies of all library files in

the gene shared memory; determining whether the required library file

meets a load condition, and loading the required library file into

the gene shared memory, in a case where the loading condition is met

further comprises: if the number of historical load requests is

greater than the first preset number, and the space required by the required library file is greater than the remaining space of the gene shared memory, ranking the required library file and all the library files in an order of priority according to the load request frequency of the required library file and the load request frequencies of all the library files to obtain a load request frequency priority of each library file; if the load request frequency priority of the required library file is higher than that of a library file in the gene shared memory, and if the remaining space of the gene shared memory after deleting the library file with a lower load request frequency priority in the gene shared memory is greater than or equal to the space required by the required library file, deleting the library file with the lower load request frequency priority in the gene shared memory; and if the number of historical load requests is greater than the first preset number, and the space required by the required library file is less than the remaining space of the gene shared memory, loading the required library file into the gene shared memory.

[00144] In some embodiments, the processor when executing the computer

program further implements the following step: the information of

the required library file further comprises a load request frequency

of the required library file, the information of the gene shared memory

further comprises load request frequencies of all library files;

determining whether the required library file meets a load condition,

and loading the required library file into the gene shared memory,

in a case where the loading condition is met further comprises: if the number of historical load requests is greater than the first preset number, and the space required by the required library file is greater than the remaining space of the gene shared memory, ranking the required library file and all the library files in an order of priority according to the load request frequency of the required library file and the load request frequencies of all the library files to obtain a load request frequency priority of each library file; if the load request frequency priority of the required library file is higher than that of a library file in the gene shared memory, and if the remaining space of the gene shared memory after deleting the library file with a lower load request frequency priority in the gene shared memory is greater than or equal to the space required by the required library file, deleting the library file with the lower load request frequency priority in the gene shared memory; and loading the required library file into the gene shared memory.

[00145] In some embodiments, the processor when executing the computer

program further implements the following steps: setting the gene

shared memory for library files used in gene analysis, setting a size

of the gene shared memory, the number of library files that can be

accommodated, a name of each library file and a size offset of the

each library file; and loading library files commonly used in gene

analysis into the gene shared memory according to the size of the

gene shared memory, the number of library files that can be

accommodated, the name of the each library file and the size offset of the each library file.
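
As an illustration only, one possible layout for such a memory area is a fixed header (number of library files and total length) followed by one entry per library file (name, offset, size) and then the raw library data. The sketch below uses a plain `bytearray` in place of real shared memory; the field widths, the 64-byte name field, and the helper names `build_region` and `find_lib` are assumptions of this sketch.

```python
import struct

HEADER = struct.Struct("<QQ")    # number of library files n, total length Len
ENTRY = struct.Struct("<64sQQ")  # name (NUL-padded), offset, size


def build_region(libs):
    """Pack {name: bytes} into one contiguous region with an index in front."""
    n = len(libs)
    data_start = HEADER.size + n * ENTRY.size
    total = data_start + sum(len(blob) for blob in libs.values())
    region = bytearray(total)
    HEADER.pack_into(region, 0, n, total)
    offset = data_start
    for i, (name, blob) in enumerate(libs.items()):
        ENTRY.pack_into(region, HEADER.size + i * ENTRY.size,
                        name.encode(), offset, len(blob))
        region[offset:offset + len(blob)] = blob
        offset += len(blob)
    return region


def find_lib(region, name):
    """Return a zero-copy view of the named library, or None if absent."""
    n, _total = HEADER.unpack_from(region, 0)
    for i in range(n):
        raw, offset, size = ENTRY.unpack_from(region, HEADER.size + i * ENTRY.size)
        if raw.rstrip(b"\0").decode() == name:
            return memoryview(region)[offset:offset + size]
    return None
```

Because each entry records an offset into the same region, a process that has the region mapped can locate a library file by name without copying its data.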

[00146] In some embodiments, the processor when executing the computer

program further implements the following step: the gene analysis

comprises an alignment analysis, a variation analysis and an

annotation analysis, and the processor when executing the computer

program further implements the following step: performing the

alignment analysis, the variation analysis, and the annotation

analysis on the sample data preprocessed in sequence, wherein in a

case where the sample data preprocessed comprises multiple groups

of sample data, the multiple groups of sample data are in a same step

or different steps of the gene analysis at a time.
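
The statement that multiple groups of sample data may be in the same step or in different steps at a given time can be illustrated with a small scheduling sketch. The tick-based model, in which each group advances one step per tick, and the function name `schedule` are assumptions made for illustration and are not part of the embodiments.

```python
STEPS = ["alignment", "variation", "annotation"]


def schedule(arrivals):
    """Map each tick to the step every active sample group is in.

    arrivals: {sample_group: tick at which the group enters the analysis}
    """
    timeline = {}
    for sample, start in arrivals.items():
        for i, step in enumerate(STEPS):
            timeline.setdefault(start + i, {})[sample] = step
    return dict(sorted(timeline.items()))


timeline = schedule({"S1": 0, "S2": 0, "S3": 1})
# At tick 1, S1 and S2 share a step while S3 is one step behind.
print(timeline[1])
```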

[00147] In some embodiments, the processor when executing the computer

program further implements the following step: the gene analysis

further comprises a sorting analysis and a marking-duplicate

analysis, wherein after performing the alignment analysis, the

variation analysis, and the annotation analysis on the sample data

preprocessed in sequence, the processor when executing the computer

program further implements the following steps: labeling the sample

data after the alignment analysis with a position tag; and performing

the sorting analysis and the marking-duplicate analysis by module

on the sample data labeled.
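
A minimal sketch of labeling aligned reads with a position tag, sorting them, and marking duplicates is given below. It assumes the position tag is the tuple (chromosome, position, strand) and that a read is a duplicate when an earlier read carries the same tag; production duplicate marking typically considers more information, and the names `Read`, `position_tag` and `sort_and_mark_duplicates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    pos: int
    strand: str
    duplicate: bool = False


def position_tag(read):
    """Assumed position tag: chromosome, start position and strand."""
    return (read.chrom, read.pos, read.strand)


def sort_and_mark_duplicates(reads):
    """Sort reads by position tag, then flag every repeat of a tag."""
    ordered = sorted(reads, key=position_tag)  # stable sort keeps input order on ties
    seen = set()
    for read in ordered:
        tag = position_tag(read)
        if tag in seen:
            read.duplicate = True
        else:
            seen.add(tag)
    return ordered
```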

[00148] In some embodiments, the processor when executing the computer program further implements the following step: connecting some or all steps of the gene analysis by use of memory.
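
One possible reading of connecting steps by memory is that each step consumes the output of the preceding step directly from memory rather than from an intermediate file. The toy pipeline below chains Python generators to show that idea; the brute-force aligner, the mismatch-based variant step, and all function names are assumptions of this sketch and do not represent the analysis tools of the embodiments.

```python
def align(reads, reference):
    """Toy aligner: place each read where it has the fewest mismatches."""
    for read in reads:
        best = min(
            range(len(reference) - len(read) + 1),
            key=lambda p: sum(a != b for a, b in zip(read, reference[p:])),
        )
        yield read, best


def call_variants(alignments, reference):
    """Emit (position, reference base, read base) for every mismatch."""
    for read, pos in alignments:
        for i, base in enumerate(read):
            if base != reference[pos + i]:
                yield pos + i, reference[pos + i], base


def annotate(variants, known):
    """Attach a label from a lookup table; unknown variants are 'novel'."""
    for pos, ref, alt in variants:
        yield pos, ref, alt, known.get((pos, alt), "novel")


def run(reads, reference, known):
    # Records flow from step to step in memory; nothing is written to disk.
    return list(annotate(call_variants(align(reads, reference), reference), known))
```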

[00149] In some embodiments, the processor when executing the computer

program further implements the following step: preprocessing the

sample data comprises: performing a quality control, a filtering

operation and a statistical process on the sample data.
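
For illustration, the quality control, filtering and statistical steps could look like the following sketch. The thresholds, the representation of a read as a (sequence, quality list) pair, the chosen statistics, and the name `preprocess` are assumptions made here; the embodiments do not specify them.

```python
def preprocess(reads, min_mean_quality=20, min_length=30):
    """Filter reads by length and mean base quality, then summarize what was kept.

    reads: list of (sequence, list of per-base quality scores)
    """
    kept = []
    for seq, quals in reads:
        mean_q = sum(quals) / len(quals) if quals else 0.0
        if len(seq) >= min_length and mean_q >= min_mean_quality:
            kept.append((seq, quals))
    total_bases = sum(len(seq) for seq, _ in kept)
    gc_bases = sum(seq.count("G") + seq.count("C") for seq, _ in kept)
    stats = {
        "input_reads": len(reads),
        "kept_reads": len(kept),
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
    }
    return kept, stats
```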

[00150] Some embodiments provide a computer-readable storage medium

on which a computer program is stored, which when executed by a

processor implements the following steps: reading sample data and

preprocessing the sample data; performing a gene analysis on the

sample data preprocessed, and determining whether a required library

file in the gene analysis is in a gene shared memory; if yes, obtaining

the required library file from the gene shared memory, mapping the

required library file to a process of the gene analysis of the sample

data preprocessed, and completing a corresponding analysis.
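
The check-then-map step can be sketched with Python's standard `multiprocessing.shared_memory` module, as one possible mechanism among others. The helper names `publish` and `map_library` and the one-segment-per-library naming scheme are assumptions of this sketch.

```python
from multiprocessing import shared_memory


def publish(name, data):
    """Create a named shared memory segment and copy a library file into it."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(data))
    shm.buf[:len(data)] = data
    return shm


def map_library(name):
    """Attach to the named segment if it exists; return None on a miss.

    On a hit the data is mapped into the calling process without a disk read.
    """
    try:
        return shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return None
```

A caller should track the length of the library data separately, because some platforms round the segment size up to a page boundary, and should call `close()` when finished; the owner additionally calls `unlink()` to release the segment.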

[00151] In some embodiments, the processor when executing the computer

program further implements the following steps: determining whether

the required library file meets a load condition, in a case where

the required library file in the gene analysis is not in the gene

shared memory; and loading the required library file into the gene

shared memory, in a case where the loading condition is met.

[00152] In some embodiments, the computer program when executed by

a processor implements the following steps: determining whether the

required library file meets a load condition, in a case where the

required library file in the gene analysis is not in the gene shared

memory, and loading the required library file into the gene shared memory, in a case where the loading condition is met comprises: acquiring information of the required library file and information of the gene shared memory, wherein the information of the required library file comprises a space required by the required library file and the number of historical load requests, and the information of the gene shared memory comprises a remaining space of the gene shared memory; and if the number of historical load requests is greater than a first preset number, and the space required by the required library file is less than the remaining space of the gene shared memory, loading the required library file into the gene shared memory.
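
The basic load condition can be illustrated by a small bookkeeping class that counts historical load requests on each miss and admits a library file only when both tests pass. The class name `LoadGate`, the per-miss counting, and the byte-based accounting are assumptions of this sketch.

```python
from collections import Counter


class LoadGate:
    """Decide on each miss whether a library file should be loaded."""

    def __init__(self, capacity, first_preset_number):
        self.remaining = capacity                 # remaining space, in bytes
        self.first_preset_number = first_preset_number
        self.requests = Counter()                 # historical load requests per file
        self.resident = set()

    def on_miss(self, name, size):
        self.requests[name] += 1
        requested_enough = self.requests[name] > self.first_preset_number
        fits = size < self.remaining
        if requested_enough and fits:
            self.resident.add(name)
            self.remaining -= size
            return True
        return False
```

With a first preset number of 2, for example, a library file is admitted on its third request provided it fits in the remaining space.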

[00153] In some embodiments, the computer program when executed by

a processor implements the following steps: the information of the

required library file further comprises a load request frequency of

the required library file, the information of the gene shared memory

further comprises load request frequencies of all library files in

the gene shared memory; determining whether the required library file

meets a load condition, and loading the required library file into

the gene shared memory, in a case where the loading condition is met

further comprises: if the number of historical load requests of the

required library file is greater than the first preset number, and

the space required by the required library file is greater than the

remaining space of the gene shared memory, ranking the required

library file and the all library files in an order of priority

according to the load request frequency of the required library file and the load request frequencies of the all library files to obtain a load request frequency priority of each library file; if the load request frequency priority of the required library file is higher than that of a library file in the gene shared memory, and if the remaining space of the gene shared memory after deleting the library file with a lower load request frequency priority in the gene shared memory is greater than or equal to the space required by the required library file, deleting the library file with the lower load request frequency priority in the gene shared memory; and if the number of historical load requests of the required library file is greater than a first preset number, and the space required by the required library file is less than the remaining space of the gene shared memory, loading the required library file into the gene shared memory.

[00154] In some embodiments, the computer program when executed by

a processor implements the following steps: the information of the

required library file further comprises a load request frequency of

the required library file, the information of the gene shared memory

further comprises load request frequencies of all library files;

determining whether the required library file meets a load condition,

and loading the required library file into the gene shared memory,

in a case where the loading condition is met further comprises: if

the number of historical load requests is greater than the first preset

number, and the space required by the required library file is greater

than the remaining space of the gene shared memory, ranking the required library file and the all library files in an order of priority according to the load request frequency of the required library file and the load request frequencies of the all library files to obtain a load request frequency priority of each library file; if the load request frequency priority of the required library file is higher than that of a library file in the gene shared memory, and if the remaining space of the gene shared memory after deleting the library file with a lower load request frequency priority in the gene shared memory is greater than or equal to the space required by the required library file, deleting the library file with the lower load request frequency priority in the gene shared memory; and loading the required library file into the gene shared memory.

[00155] In some embodiments, the computer program when executed by

a processor further implements the following steps: setting the gene

shared memory for library files used in gene analysis, setting a size

of the gene shared memory, the number of library files that can be

accommodated, a name of each library file and a size offset of the

each library file; and loading library files commonly used in gene

analysis into the gene shared memory according to the size of the

gene shared memory, the number of library files that can be

accommodated, the name of the each library file and the size offset

of the each library file.

[00156] In some embodiments, the computer program when executed by

a processor further implements the following steps: the gene analysis comprises an alignment analysis, a variation analysis and an annotation analysis, and the computer program when executed by a processor further implements the following step: performing the alignment analysis, the variation analysis, and the annotation analysis on the sample data preprocessed in sequence, wherein in a case where the sample data preprocessed comprises multiple groups of sample data, the multiple groups of sample data are in a same step or different steps of the gene analysis at a time.

[00157] In some embodiments, the computer program when executed by

a processor further implements the following steps: the gene analysis

further comprises a sorting analysis and a marking-duplicate

analysis, wherein after performing the alignment analysis, the

variation analysis, and the annotation analysis on the sample data

preprocessed in sequence, the computer program when executed by a

processor further implements the following steps: labeling the sample

data after the alignment analysis with a position tag; and performing

the sorting analysis and the marking-duplicate analysis by module

on the sample data labeled.

[00158] In some embodiments, the computer program when executed by

a processor further implements the following step: connecting some

or all steps of the gene analysis by use of memory.

[00159] In some embodiments, the computer program when executed by

a processor further implements the following step: preprocessing the

sample data comprises: performing a quality control, a filtering operation and a statistical process on the sample data.

[00160] As understood by those skilled in the art, all or part of the steps for carrying out the method in the above embodiments can be completed by hardware or by a program instructing the related hardware, wherein the program can be stored in a computer-readable nonvolatile storage medium, and the program when executed can carry out the steps of the embodiments of the above methods. Any reference to memory, storage, database or other media used in the embodiments provided by the present application may comprise nonvolatile and/or volatile memory. The nonvolatile memory may comprise read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may comprise random access memory (RAM) or external cache memory. As an illustration rather than a limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), etc.

[00161] The technical features of the above embodiments can be

combined arbitrarily. In order to make the description concise, all

possible combinations of the various technical features in the

embodiments are not described, but should be regarded as within the

scope of this description, as long as there is no contradiction in the combinations of these technical features.

[00162] The aforesaid embodiments merely present several embodiments

of the present application. However, the relatively specific and

detailed descriptions thereof cannot therefore be construed as

limiting the scope of the present application. It shall be pointed

out that a person skilled in the art is capable of making various

modifications and improvements without departing from the concept

of the present application. Such modifications and improvements shall

be regarded as within the protection scope of the present application.

Therefore, the protection scope of the present application shall be

determined by the terms of the claims.

Claims

What is claimed is:

1. A shared memory based gene analysis method, characterized by,

comprising:

reading sample data and preprocessing the sample data;

performing a gene analysis on the sample data preprocessed, and

determining whether a required library file in the gene analysis is

in a gene shared memory;

if yes, obtaining the required library file from the gene shared

memory, mapping the required library file to a process of the gene

analysis of the sample data preprocessed, and completing a

corresponding analysis.

2. The shared memory based gene analysis method according to claim

1, characterized by, further comprising:

determining whether the required library file meets a load

condition, in a case where the required library file in the gene

analysis is not in the gene shared memory; and

loading the required library file into the gene shared memory,

in a case where the loading condition is met.

3. The shared memory based gene analysis method according to claim

2, characterized in that determining whether the required library

file meets a load condition, in a case where the required library

file in the gene analysis is not in the gene shared memory, and loading the required library file into the gene shared memory, in a case where the loading condition is met comprises:

acquiring information of the required library file and information of the gene shared memory, wherein the information of the required library file comprises a space required by the required library file and the number of historical load requests, and the information of the gene shared memory comprises a remaining space of the gene shared memory; and

if the number of historical load requests is greater than a first preset number, and the space required by the required library file is less than the remaining space of the gene shared memory, loading the required library file into the gene shared memory.

4. The shared memory based gene analysis method according to claim

3, characterized in that the information of the required library file

further comprises a load request frequency of the required library

file, the information of the gene shared memory further comprises

load request frequencies of all library files; determining whether

the required library file meets a load condition, and loading the

required library file into the gene shared memory, in a case where

the loading condition is met further comprises:

if the number of historical load requests is greater than the first

preset number, and the space required by the required library file

is greater than the remaining space of the gene shared memory, ranking the required library file and the all library files in an order of priority according to the load request frequency of the required library file and the load request frequencies of the all library files to obtain a load request frequency priority of each library file;

if the load request frequency priority of the required library file is higher than that of a library file in the gene shared memory, and if the remaining space of the gene shared memory after deleting the library file with a lower load request frequency priority in the gene shared memory is greater than or equal to the space required by the required library file, deleting the library file with the lower load request frequency priority in the gene shared memory; and

loading the required library file into the gene shared memory.

5. The shared memory based gene analysis method according to any

one of claims 1 to 4, characterized by, further comprising:

setting the gene shared memory for library files used in gene

analysis, setting a size of the gene shared memory, the number of

library files that can be accommodated, a name of each library file

and a size offset of the each library file; and

loading library files commonly used in gene analysis into the gene

shared memory according to the size of the gene shared memory, the

number of library files that can be accommodated, the name of the

each library file and the size offset of the each library file.

6. The shared memory based gene analysis method according to claim

1, characterized in that the gene analysis comprises an alignment

analysis, a variation analysis and an annotation analysis, and the

method further comprises:

performing the alignment analysis, the variation analysis, and

the annotation analysis on the sample data preprocessed in sequence,

wherein in a case where the sample data preprocessed comprises

multiple groups of sample data, the multiple groups of sample data

are in a same step or different steps of the gene analysis at a time.

7. The shared memory based gene analysis method according to claim

6, characterized in that the gene analysis further comprises a sorting

analysis and a marking-duplicate analysis, wherein after performing

the alignment analysis, the variation analysis, and the annotation

analysis on the sample data preprocessed in sequence, the method

further comprises:

labeling the sample data after the alignment analysis with a

position tag; and performing the sorting analysis and the

marking-duplicate analysis by module on the sample data labeled.

8. The shared memory based gene analysis method according to claim

7, characterized by, further comprising:

connecting some or all steps of the gene analysis by use of memory.

9. The shared memory based gene analysis method according to any

one of claims 6 to 8, characterized in that preprocessing the sample

data comprises:

performing a quality control, a filtering operation and a

statistical process on the sample data.

10. A shared memory based gene analysis apparatus, characterized

by, comprising:

a data reading module configured to read sample data;

a data preprocessing module configured to preprocess the sample

data; and

a gene analysis module configured to perform a gene analysis on

the sample data preprocessed, and determine whether a required library

file in the gene analysis is in a gene shared memory; if yes, obtain

the required library file from the gene shared memory, map the required

library file to a process of the gene analysis of the sample data

preprocessed, and complete a corresponding analysis.

11. A computer device comprising: a memory, a processor, and a

computer program stored on the memory and executable on the processor,

characterized in that the processor when executing the computer

program implements the steps of the method according to any one of

claims 1 to 9.

12. A computer-readable storage medium on which a computer program

is stored, characterized in that the computer program when executed

by a processor implements the steps of the method according to any

one of claims 1 to 9.

Fig. 1 [drawing; labels: 102, 104, Network]

Fig. 2 [flowchart]

S202: Read sample data and preprocess the sample data.

S204: Perform a gene analysis on the sample data preprocessed, and determine whether a required library file in the gene analysis is in a gene shared memory.

S206: If yes, obtain the required library file from the gene shared memory, map the required library file to a process of the gene analysis of the sample data preprocessed, and complete the gene analysis.

Fig. 3 [diagram; labels: Process A, Process B, Address space, Page table, Physical address, Shared memory]

Fig. 4 [flowchart]

S402: Set the gene shared memory for library files used in gene analysis, set a size of the gene shared memory, the number of library files that can be accommodated, a name of each library file and a size offset of the each library file.

S404: Load library files commonly used in gene analysis into the gene shared memory according to the size of the gene shared memory, the number of library files that can be accommodated, the name of the each library file and the size offset of the each library file.

Fig. 5 [diagram]

Gene shared memory area M (in physical memory of a node). Offset markers: 0, 72, offset1, offset2, ..., offset n, Len. Contents: Total information (Number of shared libraries n; Total length of shared memory area Len); Lib1 (Name1, Offset1); Lib2 (Name2, Offset2); ...; Raw data (Data of database).

Logic address space of each sample process (P1, P2, P3, ..., Pn): System kernel area; User stack; Dynamic link library area; Heap; Data segment (.data, .bss); Code segment (.text, .rodata); Reserved area.

Fig. 6 [flowchart]

Gene analysis process: Start; Sample input; Data preprocessing (quality control, filtering and statistical processing); Use comparison library Lib-x; Obtain Lib-x from gene shared memory area? (No: Load Lib-x from hard disk; Yes: Map lib data to this process); Alignment analysis; Variation analysis; Use annotation library Lib-y; Obtain Lib-y from gene shared memory area? (No: Load Lib-y from hard disk; Yes: Map lib data to this process); Annotation statistics; Output; End.

Library request process (boxes as labeled): Start; Request for Lib-x information; Lib-x in the shared area? (Yes / No); Load method Q? (Yes / No); Load Lib-x to gene shared memory area; Return lib data information; Return no information; End.

Fig. 7 [chart; labels: An analysis portion before acceleration 2.83h; A comparison portion and an annotation portion before acceleration 2.61h]

Fig. 8 [chart; labels: A library comparison portion before the use of gene shared memory 2.38h; An analysis portion after acceleration 1.75h]

Fig. 9 [chart; screenshot header: System Summary, ubuntu, 2020/1/16; labels: An analysis portion after acceleration 1.75h; A library comparison portion after the use of gene shared memory 0.82h]

Fig. 10 [block diagram; 102: Data reading module; 104: Data preprocessing module; 106: Gene analysis module]

Fig. 11 [block diagram; labels: Computer device, Processor, System bus, Memory, Network interface, Nonvolatile storage medium, OS, Computer program, Database]