CN114882956A

CN114882956A - Pan-genome data organization method based on graph and system thereof

Info

Publication number: CN114882956A
Application number: CN202210412619.8A
Authority: CN
Inventors: 郭金旦; 陈禹保; 刘江宁; 秦川
Original assignee: Institute of Laboratory Animal Science of CAMS
Current assignee: Institute of Laboratory Animal Science of CAMS
Priority date: 2022-04-19
Filing date: 2022-04-19
Publication date: 2022-08-09

Abstract

The invention discloses a graph-based method, system, device and computer-readable storage medium for organizing pan-genome data. The method comprises the following steps: obtaining a set of pan-genome sequence data; constructing a graph from the pan-genome sequence data to obtain a colored graph of the pan-genome; marking and acquiring the access state of each single node of the colored graph, and traversing the colored graph to obtain the cSupB data models into which the colored graph is decomposed, together with the data information of the cSupB data models; determining the inclusion relation between the cSupB data models based on their data information, and constructing a cSupB structure tree model according to the inclusion relation. The invention addresses the disordered data organization and the poor readability, validity and integrity of sequence information that currently arise when handling large amounts of genome data.

Description

Pan-genome data organization method based on graph and system thereof

Technical Field

The invention belongs to the technical field of medical treatment, and particularly relates to a pan-genomic data organization method based on a graph and a system thereof.

Background

The development of life science, medicine and related fields is closely tied to the application of sequencing technology. However, because of limitations in sequencing technology, sequencing cost and even computational cost, much genome research suffers from problems such as over-dependence on a reference genome. At present the reference genome is important in many fields: in almost all genome-related research, a reference genome is first constructed for the species under study, and subsequent studies are then based on it. For example, data from newly sequenced individuals of the species are compared with the reference genome to find differences, which is the basis for tracing the genetic origin of disease in human genomics. The biggest disadvantage of reference-based approaches is omission, because a single genome clearly cannot contain all the genomic information of a species. Now that large numbers of species and individuals are being sequenced, continuing to use the traditional reference-genome approach would, taking humans as an example, omit at least 10% of human genome sequence information from the reference genome.

In recent years, advances in sequencing technology have improved the quality of individual genome assemblies, and falling sequencing costs have steadily increased the amount of sequencing. Taking humans as an example again, the assembly quality of sequenced sample genomes can now exceed that of GRCh38, so many usable genome assemblies already exist, and their number is expected to keep growing. The same holds for other species, and the field is gradually moving from the genome era into the population genome era. While the population genome era brings large amounts of genome data and unprecedented research opportunities, it also places new requirements and challenges on bioinformatics analysis methods. For example, how to effectively organize large-scale population genome data and carry out subsequent analysis (such as phylogenetic analysis) is a problem that researchers urgently need to solve.

In the face of large amounts of genome data, the genome graph is widely used as an effective way of organizing data. For subsequent research, however, the validity and simplicity of the data structure must be guaranteed while preserving the integrity of sequence information as far as possible. Many related studies exist, but most of their data organizations are relatively disordered, with poor readability and information integrity.

To address the disordered data organization and the poor readability, validity and integrity of sequence information when handling large amounts of genome data, a graph-based method and system for organizing pan-genome data are provided.

Disclosure of Invention

In order to overcome the problems presented in the background art, the present invention provides a graph-based pan-genomic data organization method and a system thereof.

A graph-based pan-genomic data organization method, comprising:

obtaining a set of pan-genomic sequence data;

constructing a graph from the pan-genome sequence data to obtain a colored graph of the pan-genome;

marking and acquiring the access state characteristics of a single node of the colored graph, and traversing the colored graph to obtain a cSupB data model after the colored graph is decomposed and data information of the cSupB data model;

determining the inclusion relation between the cSupB data models based on the data information of the cSupB data models, and constructing a cSupB structure tree model according to the inclusion relation.

The access state of a single node of the colored graph is one of an unaccessed state, a semi-accessed state, an accessible state and an accessed state;

optionally, the unaccessed state means that none of the node's entry points has been accessed; the semi-accessed state means that at least one entry point has been accessed and at least one has not; the accessible state means that all entry points have been accessed, so that the node itself can now be accessed; the accessed state means that all entry points of the node have been accessed and the node itself has also been accessed;

optionally, the colored graph is traversed using a post-order-like traversal method;

optionally, the cSupB data model is defined as follows: in the colored graph G = (V, E, C), V(G), E(G) and C(G) are the point set, edge set and color set of the graph G, respectively. For any color set C₁ (C₁ ⊆ C), G₁ = (V₁, E₁, C₁) is a subgraph of graph G that satisfies, for any node u_i ∈ V₁, [formula not reproduced]. For two different points s and t, <s, t, C₁> is called a colored Superbubble (cSupB); s is called the source node and t the sink node;

optionally, the data information of the cSupB data model includes, but is not limited to: the source, the sink, the color of the cSupB data model, and the order of the cSupB data model.

The containment relationship is determined based on the color of the cSupB data model and the order of the cSupB data model;

optionally, cSupB1, cSupB2 and cSupB3 are arbitrary cSupB data models; let G₁ = (V₁, E₁, C₁), G₂ = (V₂, E₂, C₂) and G₃ = (V₃, E₃, C₃) be the subgraphs induced by the nodes contained in cSupB1, cSupB2 and cSupB3, respectively; when the criteria are met simultaneously, cSupB1 is a child cSupB of cSupB2 and cSupB2 is the parent cSupB of cSupB1;

the data organization method further comprises: constructing a pan-genome coordinate system based on the cSupB structure tree model;

optionally, the pan-genome coordinate system represents the position of a single site on the colored graph by means of a triplet;

optionally, the sub-features of the triplet include: numerical information, topological information, and color information.

The pan-genome coordinate system represents the position of a single sequence on the colored graph by means of a six-tuple;

optionally, the sub-features of the six-tuple include: the offset value of the path starting point, the minimum cSupB in which the path starting node is located, the offset value of the path ending point, the minimum cSupB in which the path ending node is located, the minimum cSupB containing both the path starting node and the path ending node, and the color of the path; the six-tuple is denoted as (startpos, startbub, endpos, endbub, pathbub, pathcolor).

A correlation between at least two single sequences is determined based on the positions of the single sequences on the colored graph. Let the positions of two single sequences be path1 (startpos1, startbub1, endpos1, endbub1, pathbub1, pathcolor1) and path2 (startpos2, startbub2, endpos2, endbub2, pathbub2, pathcolor2). If (startpos1, endpos1) and (startpos2, endpos2) do not intersect, path1 and path2 are output as separated; if one of (startpos1, endpos1) and (startpos2, endpos2) contains the other, and pathcolor1 and pathcolor2 have an inclusion relationship whose direction is opposite to that of the interval inclusion, path1 and path2 are output as contained; in all other cases, path1 and path2 are output as intersecting.

After the pan-genome sequence data are acquired, they are preprocessed by a method comprising base substitution and addition of sequence fragments;

optionally, the base substitution means that, when a degenerate base is present in the pan-genome sequence data, it is replaced by the base that occurs most frequently at that position in the other sequences;

optionally, the addition of sequence fragments means that the same sequence fragments are added to the head and to the tail, respectively, of the pan-genome sequence data;

optionally, as a result of the preprocessing, the colored graph of the pan-genome is directed, acyclic and free of degenerate bases, and has a unique start point and a unique end point.

An analysis device of a graph-based pan-genomic data organization method, the device comprising: a memory and a processor;

the memory is to store program instructions;

the processor is configured to invoke the program instructions that, when executed, perform the graph-based pan-genomic data organization method described above.

An analysis system for a graph-based pan-genomic data organization method, comprising:

the first processing unit is used for acquiring a set of pan-genome sequence data and constructing a graph from the pan-genome sequence data to obtain a colored graph of the pan-genome;

the second processing unit is used for marking and acquiring the characteristics of the access state of a single node of the colored graph, and traversing the colored graph to obtain a cSupB data model after the colored graph is decomposed and data information of the cSupB data model;

the third processing unit is used for determining the inclusion relationship between the cSupB data models based on the data information of the cSupB data models and constructing a cSupB structure tree model according to the inclusion relationship;

and a fourth processing unit, which is used for constructing a pan-genome coordinate system based on the cSupB structure tree model.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned graph-based pan-genomic data organization method.

The application has the following beneficial effects:

1. The invention provides a new way of organizing data. Following the approach of genome research, a large amount of complex data is depicted by constructing a colored graph, which gives a clear data structure, and the structural information of the colored graph is analyzed effectively by decomposing and recombining it. A set of pan-genome sequence data can be constructed as a directed acyclic graph with unique start and end nodes. Based on this graph, on the one hand, the cSupB data structure is proposed; a cSupB inherits the characteristics of a superbubble and additionally carries sample information such as node origin and linkage. With the proposed post-order-like traversal strategy, the whole colored graph can be decomposed into cSupBs of various sizes in a single traversal. The inclusion relation between the cSupBs is then obtained from the information gathered during traversal, so that the cSupB structure tree is obtained quickly. On the other hand, an offset value is introduced to describe the position information of a node.

2. The invention also provides a three-dimensional coordinate system that completely depicts the features and information of the colored graph.

The technical scheme of the application effectively and simply solves the problems that current data organization is disordered and that the readability, validity and integrity of sequence information are relatively poor.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic flow chart of a graph-based pan-genomic data organization method provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of an analysis apparatus for a graph-based pan-genomic data organization method provided by an embodiment of the present invention;

FIG. 3 is a schematic flow chart of an analysis system of a graph-based pan-genomic data organization method provided by an embodiment of the invention;

FIG. 4 is a schematic diagram of the construction, decomposition and recombination of a colored graph provided in accordance with an embodiment of the present invention;

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.

In some of the flows described in the present specification and claims and in the above figures, a number of operations are included that occur in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel as they occur herein, with the order of the operations being indicated as 101, 102, etc. merely to distinguish between the various operations, and the order of the operations by themselves does not represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of a graph-based pan-genomic data organization method according to an embodiment of the present invention. Specifically, the method includes the following steps:

101: acquiring a set of pan-genome sequence data, and constructing a graph from the pan-genome sequence data to obtain a colored graph of the pan-genome;

in one embodiment, after the pan-genome sequence data are acquired, they are preprocessed by a method comprising base substitution and addition of sequence fragments;

the base substitution means that, when a degenerate base is present in the pan-genome sequence data, it is replaced by the base that occurs most frequently at that site in the other sequences;

the addition of sequence fragments means that the same sequence fragments are added to the head and to the tail, respectively, of each sequence in the pan-genome sequence data. The fragments serve to anchor the heads and the tails of all sequences together and should not affect the topology elsewhere in the graph, i.e. they must not form a loop structure with the genome sequences. A randomly generated fragment therefore has to be checked for suitability for the given group of samples; the fragment is not fixed and can be adjusted as the samples change.

As a result of the preprocessing, the colored graph of the pan-genome is directed, acyclic and free of degenerate bases, and has a unique start point and a unique end point. Each sequence corresponds to a unique path from the start point to the end point on the colored graph.
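The preprocessing described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, sequences are assumed to be comparable position by position, and the anchor check (rejecting a fragment if any of its k-mers already occurs in a sequence) is only one assumed way of judging whether a fragment is suitable for a group of samples.

```python
from collections import Counter

BASES = set("ACGT")


def substitute_degenerate(seqs):
    """Replace each non-ACGT symbol with the base that is most frequent
    at the same position in the other sequences (assumption: positions
    are directly comparable across sequences)."""
    fixed = []
    for i, seq in enumerate(seqs):
        chars = list(seq)
        for pos, ch in enumerate(chars):
            if ch in BASES:
                continue
            others = [s[pos] for j, s in enumerate(seqs)
                      if j != i and pos < len(s) and s[pos] in BASES]
            if not others:
                raise ValueError(f"no reference base at position {pos}")
            chars[pos] = Counter(others).most_common(1)[0][0]
        fixed.append("".join(chars))
    return fixed


def anchor_is_usable(anchor, seqs, k):
    """Hypothetical suitability check: reject an anchor fragment if any of
    its k-mers already occurs in a sequence, since a shared k-mer could
    change the topology of a k-mer graph."""
    anchor_kmers = {anchor[i:i + k] for i in range(len(anchor) - k + 1)}
    return not any(kmer in s for s in seqs for kmer in anchor_kmers)


def add_anchors(seqs, head, tail):
    """Attach the same head and tail fragments to every sequence."""
    return [head + s + tail for s in seqs]
```

For example, `substitute_degenerate(["ACNT", "ACGT", "ACGA"])` replaces the `N` by `G`, the base found at that position in both other sequences.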

A pan-genome is a collection of multiple genomes. It includes the core genome (core genes) and the dispensable genome (variable genes).

Path (path): a path from v₀ to v_k is a sequence v₀, e₁, v₁, e₂, ..., e_k, v_k, where e_i is the edge connecting node v_(i-1) to node v_i, and the path length is k. If the graph contains a path whose start point and end point are the same, the path is "closed", indicating that the graph has a loop.

102: marking and acquiring the access state characteristics of a single node of the colored graph, and traversing the colored graph to obtain a cSupB data model after the colored graph is decomposed and data information of the cSupB data model;

in one embodiment, the access state of a single node of the colored graph is one of an unaccessed state, a semi-accessed state, an accessible state and an accessed state;

the unaccessed state means that none of the node's entry points has been accessed, and is recorded as -1; the semi-accessed state means that at least one entry point has been accessed and at least one has not, and is recorded as 0; the accessible state means that all entry points have been accessed, so that the node itself can now be accessed, and is recorded as 1; the accessed state means that all entry points of the node have been accessed and the node itself has also been accessed, and is recorded as 2;

the colored graph is traversed using a post-order-like traversal method. The basic requirement of a post-order traversal of a graph is that a parent node is visited only after all of its child nodes have been visited; the basic requirement of the post-order-like traversal is the reverse, namely that a child node is visited only after all of its parent nodes have been visited. In other words, a node in the graph, and subsequently its exit points, can be accessed only once all entry points of that node have been accessed.
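The traversal rule and the four state codes can be sketched as follows. This is a minimal illustration, essentially a Kahn-style topological ordering, and not necessarily the exact procedure of the embodiment. It assumes the graph is a directed acyclic graph given as a dictionary of direct successors, and it counts only directly adjacent entry points; the function and variable names are hypothetical.

```python
from collections import deque

# State codes as recorded in the description.
UNACCESSED, SEMI_ACCESSED, ACCESSIBLE, ACCESSED = -1, 0, 1, 2


def post_order_like_traversal(succ):
    """succ maps each node to the list of its direct successors (a DAG).
    Returns the access order and the final state of every node."""
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    indeg = {n: 0 for n in nodes}
    for u in succ:
        for v in succ[u]:
            indeg[v] += 1
    accessed_in = {n: 0 for n in nodes}  # entry points accessed so far
    state = {n: ACCESSIBLE if indeg[n] == 0 else UNACCESSED for n in nodes}
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while queue:
        u = queue.popleft()
        state[u] = ACCESSED
        order.append(u)
        for v in succ.get(u, []):
            accessed_in[v] += 1
            if accessed_in[v] == indeg[v]:
                state[v] = ACCESSIBLE   # all entry points done
                queue.append(v)
            else:
                state[v] = SEMI_ACCESSED  # some entry points still pending
    return order, state


order, state = post_order_like_traversal(
    {"S": ["A", "B"], "A": ["T"], "B": ["T"], "T": []}
)
print(order)  # ['S', 'A', 'B', 'T']
```

In this small example the node T is semi-accessed after A has been accessed and becomes accessible only after B has been accessed as well.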

The cSupB data model is as follows: in the colored graph G = (V, E, C), V(G), E(G) and C(G) are the point set, edge set and color set of the graph G, respectively. For any color set C₁ (C₁ ⊆ C), G₁ = (V₁, E₁, C₁) is a subgraph of graph G satisfying, for any node u_i ∈ V₁, [formula not reproduced], where the function color gives the color information of a node or edge.

For two different points s and t, <s, t, C₁> is called a colored Superbubble, also written cSupB, if it satisfies in G₁ the four criteria of the traditional superbubble: reachability, matching, acyclicity and minimality. s is called the source node and t the sink node.

Given any two cSupBs, cSupB1 and cSupB2, let G₁ = (V₁, E₁, C₁) and G₂ = (V₂, E₂, C₂) be the subgraphs induced by the nodes contained in cSupB1 and cSupB2, respectively, and set V₀ = V₁ ∩ V₂ and C₀ = C₁ ∩ C₂. When [formula not reproduced] or [formula not reproduced] and [formula not reproduced], cSupB1 and cSupB2 are output as separated; when [formula not reproduced] and [formula not reproduced] and V₀ ≠ V₁ (V₂), cSupB1 and cSupB2 are output as intersecting; when V₀ = V₁ and C₀ = C₁, or V₀ = V₂ and C₀ = C₂, cSupB1 and cSupB2 are output as contained. If two cSupBs are in a containment relation, for example V₀ = V₂ and C₀ = C₂, then cSupB1 is the parent of cSupB2 and cSupB2 is the child of cSupB1.
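The three relations between two cSupBs can be sketched in Python as below. Only the containment condition (V₀ = V₁ and C₀ = C₁, or V₀ = V₂ and C₀ = C₂) is taken from the text; the conditions for separation are given as formulas that are not reproduced here, so the sketch assumes, purely for illustration, that two cSupBs are separated when they share no node or no color. Node sets and color sets are modelled as Python sets, and the function name is hypothetical.

```python
def classify_csupb(v1, c1, v2, c2):
    """Classify two cSupBs given their node sets (v1, v2) and color sets
    (c1, c2). Returns 'contained', 'separated' or 'intersecting'."""
    v0 = v1 & v2  # shared nodes
    c0 = c1 & c2  # shared colors
    # Containment condition as stated in the description.
    if (v0 == v1 and c0 == c1) or (v0 == v2 and c0 == c2):
        return "contained"
    # ASSUMPTION: no shared node or no shared color means separated.
    if not v0 or not c0:
        return "separated"
    return "intersecting"


outer = ({1, 2, 3, 4}, {"s1", "s2", "s3"})
inner = ({2, 3}, {"s1", "s2"})
apart = ({7, 8}, {"s1"})
partial = ({3, 4, 5}, {"s2", "s3"})

print(classify_csupb(*outer, *inner))    # contained
print(classify_csupb(*outer, *apart))    # separated
print(classify_csupb(*outer, *partial))  # intersecting
```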

The data model of the SupB is: in the assembly graph G = (V, E), V(G) and E(G) are the point set and the edge set of the graph G, respectively. For any two different points s and t, <s, t> is called a SuperBubble if it satisfies the following four criteria: reachability, matching, acyclicity and minimality. Reachability: there is a path from point s to point t. Matching: the set of points reachable from s without passing through t is the same as the set of points from which t can be reached without passing through s; this point set is denoted U. Acyclicity: the subgraph induced by the point set U satisfying the matching criterion is acyclic. Minimality: in the point set U, no point other than t forms a pair with s that satisfies the three criteria above. s is called the source node and t the sink node. Here, only superbubbles containing at least two supernodes are considered.
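The four criteria can be checked directly on a small graph, as in the following Python sketch. It is a straightforward, unoptimized reading of the definition and not the traversal-based detection used by the method. The graph is assumed to be a dictionary of direct successors, the function names are hypothetical, and the additional restriction to superbubbles with at least two supernodes is not applied.

```python
def _reach(adj, start, blocked):
    """Nodes reachable from start along adj; `blocked` may be reached
    but is never passed through."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        if u == blocked:
            continue
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def _is_acyclic(succ, nodes):
    """Kahn-style check that the subgraph induced by `nodes` has no cycle."""
    indeg = {n: 0 for n in nodes}
    for u in nodes:
        for v in succ.get(u, []):
            if v in nodes:
                indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in succ.get(u, []):
            if v in nodes:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return removed == len(nodes)


def _first_three(succ, pred, s, t):
    """Return the point set U if (s, t) satisfies reachability, matching
    and acyclicity, otherwise None."""
    fwd = _reach(succ, s, t)
    if t not in fwd:                       # reachability
        return None
    if fwd != _reach(pred, t, s):          # matching
        return None
    return fwd if _is_acyclic(succ, fwd) else None  # acyclicity


def is_superbubble(succ, s, t):
    if s == t:
        return False
    pred = {}
    for u, vs in succ.items():
        for v in vs:
            pred.setdefault(v, []).append(u)
    u_set = _first_three(succ, pred, s, t)
    if u_set is None:
        return False
    # Minimality: no other point of U pairs with s under the same criteria.
    return all(_first_three(succ, pred, s, x) is None for x in u_set - {s, t})


g = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": ["c"], "c": []}
print(is_superbubble(g, "s", "t"))  # True
print(is_superbubble(g, "s", "c"))  # False: (s, t) already forms a pair
```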

As shown in FIG. 4, FIG. 4A shows the haplotype samples, which are the three initial sequences. Based on the SupB data model, there are only two superbubbles in FIG. 4B, <TAT, ACC> and <TCA, GTA>. FIG. 4B is the colored graph with k = 3, in which each circle represents a node and black arrows represent edges. The characters above each node give, in order, the bases, the access order and the color of the node, and the number below gives the theoretical offset value of the node. FIG. 4C is the result of accessing the graph along the reversed edge directions; the numbers below the nodes represent the final and initial offset values, respectively. FIG. 4D is the cSupB tree. Here, 5 cSupBs are found, in order: bub1 <TAT, ACC, 111>; bub2 <GGG, GTA, 110>; bub3 <CAC, GTA, 011>; bub4 <TCA, GGG, 110>; and bub5 <TCA, GTA, 111>. FIG. 4E is the final decomposed and represented colored graph.

The data information of the cSupB data model includes, but is not limited to: the source, the sink, the color of the cSupB data model, and the order of the cSupB data model.

Matching principle of the cSupB data model: the colors of the adjacent out (in) edges of a source (sink) node do not intersect one another; the intersection of the colors of the adjacent in (out) edges of a sink (source) node with the color of the out (in) edge of each adjacent source (sink) node is not empty; the cSupB color is the intersection of the source color and the sink color.

Matching process: during the post-order-like traversal, if a source node s is encountered, it is put into the queue Q of source nodes to be accessed; if a sink node t is encountered, a source node s matching t is searched for backward in Q according to the matching principle. In particular, if [formula not reproduced], s is deleted from Q and the matching continues; if [formula not reproduced], the matching stops.

Entry point (incoming node) / entry edge (incoming edge): given any two nodes u and v, u is called an entry point of v if and only if there is at least one path from u to v; all edges on such a path are called entry edges of v, and the number of entry edges adjacent to v is called the in-degree (indegree) of v.

Exit point (outgoing node) / exit edge (outgoing edge): correspondingly, v is called an exit point of u, all edges on the path are called exit edges of u, and the number of exit edges adjacent to u is called the out-degree (outdegree) of u.

Degree (degree): the sum of the in-degree and the out-degree of a node is called the degree of the node.

Supernode (super node): a node is called a supernode if at least one of its out-degree and in-degree is greater than 1.

Branch (branch): a path v₀, e₁, v₁, e₂, ..., e_k, v_k is called a branch if the out-degree and the in-degree of every node on the path are equal to 1 and both the node preceding v₀ and the node following v_k are supernodes.

Bridge (bridge): a branch is called a bridge if deleting any point or edge on the branch increases the number of connected components of the graph.

Bubble (bubble): a bubble-like structure formed in the graph where paths first diverge and then converge because of sequence differences is called a bubble; see the SupB data model for details.
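The degree and supernode definitions above translate directly into code. The following minimal Python sketch assumes the graph is given as a list of directed edges (u, v); the function names are hypothetical.

```python
def degrees(edges):
    """Return (in-degree, out-degree) dictionaries for a list of
    directed edges (u, v)."""
    indeg, outdeg = {}, {}
    for u, v in edges:
        outdeg[u] = outdeg.get(u, 0) + 1
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)   # make sure every node has both entries
        outdeg.setdefault(v, 0)
    return indeg, outdeg


def supernodes(edges):
    """Nodes whose in-degree or out-degree is greater than 1."""
    indeg, outdeg = degrees(edges)
    return {n for n in indeg if indeg[n] > 1 or outdeg[n] > 1}


edges = [("S", "A"), ("S", "B"), ("A", "T"), ("B", "T")]
print(sorted(supernodes(edges)))  # ['S', 'T']
```

In this example S diverges into two exit edges and T collects two entry edges, so both are supernodes, while A and B lie on simple branches.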

103: determining the inclusion relation between the cSupB data models based on the data information of the cSupB data models, and constructing a cSupB structure tree model according to the inclusion relation.

In one embodiment, the post-order-like traversal yields all cSupB data models together with their known information, including source/sink, cSupB color and cSupB order, where the order indicates the order in which the cSupBs were obtained. To study these cSupB data models further, the inclusion relation between them is determined first, and a hierarchical cSupB structure tree is then obtained; there may be more than one structure tree. First, the root cSupBs are determined: each cSupB containing all samples is set as a root cSupB, with a level of 1. A level value is then assigned to every other cSupB according to its nesting level in the structure tree. In this embodiment, a cSupB refers to a cSupB data model.

The inclusion relation is determined based on the color of the cSupB data model and the order of the cSupB data model. The basic criteria are as follows. Let cSupB1, cSupB2 and cSupB3 be arbitrary cSupB data models, and let G₁ = (V₁, E₁, C₁), G₂ = (V₂, E₂, C₂) and G₃ = (V₃, E₃, C₃) be the subgraphs induced by the nodes contained in cSupB1, cSupB2 and cSupB3, respectively. If the following three conditions are met at the same time: a. order(G₁) < order(G₂); b. [formula not reproduced] and C₁ ≠ C₀ (C₀ is the color set that includes all samples); c. there is no cSupB3 such that order(G₁) < order(G₃) < order(G₂) and [formula not reproduced]; then cSupB1 is called a child cSupB of cSupB2, and cSupB2 is called the nearest parent cSupB of cSupB1.
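The nearest-parent rule and the level assignment can be sketched as follows. The containment formulas in conditions b and c are not reproduced in this text, so the sketch assumes, for illustration only, that cSupB1 lies inside cSupB2 when V₁ ⊆ V₂ and C₁ ⊆ C₂. The input records, node sets and function names are hypothetical and are not taken from FIG. 4.

```python
def nearest_parents(bubbles, all_colors):
    """bubbles maps a name to (order, node set, color set).
    Returns a dict name -> nearest parent name (None for roots)."""
    parent = {}
    for name, (order, nodes, colors) in bubbles.items():
        if colors == all_colors:
            parent[name] = None  # root cSupB: contains all samples
            continue
        # ASSUMPTION: containment means node and color sets are subsets.
        candidates = [
            (o2, n2) for n2, (o2, v2, c2) in bubbles.items()
            if o2 > order and nodes <= v2 and colors <= c2
        ]
        # Condition c: no other containing cSupB with an order in between,
        # i.e. take the containing cSupB with the smallest larger order.
        parent[name] = min(candidates)[1] if candidates else None
    return parent


def levels(parent):
    """Level 1 for roots, parent level + 1 otherwise."""
    def level(name):
        return 1 if parent[name] is None else 1 + level(parent[name])
    return {name: level(name) for name in parent}


all_colors = {"s1", "s2", "s3"}
bubbles = {
    "b1": (1, {3, 4}, {"s1"}),
    "b2": (2, {2, 3, 4, 5}, {"s1", "s2"}),
    "b3": (3, {1, 2, 3, 4, 5, 6}, {"s1", "s2", "s3"}),
}
parent = nearest_parents(bubbles, all_colors)
print(parent)          # {'b1': 'b2', 'b2': 'b3', 'b3': None}
print(levels(parent))  # {'b1': 3, 'b2': 2, 'b3': 1}
```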

104: and constructing a pan-genome coordinate system based on the cSupB structure tree model.

In a linear reference genome coordinate system, the position of a site can be uniquely represented by a single positive integer a, and the position of a sequence can be uniquely represented by a pair (a, b); the biological relationship between sequences can then be discussed by analyzing the positional relationship between them. This representation, however, is clearly not applicable to genome graphs. Here, a haplotype pan-genome coordinate system is constructed based on the previously constructed cSupB structure tree model.

Specifically, the pan-genome coordinate system represents the position (Base Location; BL) of a single site on the colored graph by means of a triplet;

optionally, the sub-features of the triplet include numerical information, topological information and color information, written as (position, bubid, basecolor). Since every sample contained in any cSupB has one and only one path through it, this representation is in one-to-one correspondence with the sites in the graph. position is the offset value of the node corresponding to the site, bubid is the minimum cSupB in which the site is located, and basecolor is the color of the site. basecolor can be expressed in two ways: either, like bubcolor, as a character string of 0s and 1s, or as the id of a randomly chosen sample that contains the site. The former is more comprehensive and the latter simpler, and the choice can be made according to the purpose. In this triplet, position, bubid and basecolor represent numerical information, topological information and color information, respectively.

Optionally, the pan-genome coordinate system represents the position (Base Location; BL) of a single sequence on the colored graph by means of a six-tuple;

optionally, the sub-features of the six-tuple include: the offset value of the path starting point, the minimum cSupB in which the path starting node is located, the offset value of the path ending point, the minimum cSupB in which the path ending node is located, the minimum cSupB containing both the path starting node and the path ending node, and the color of the path. The six-tuple is written as (startpos, startbub, endpos, endbub, pathbub, pathcolor). startpos and endpos are the offset values of the path starting point and the path ending point, respectively, and startbub and endbub are the minimum cSupBs in which the path starting node and the path ending node are located. pathbub is the minimum cSupB containing both the starting node and the ending node of the path, so startbub and endbub are sub-cSupBs of pathbub. If pathbub does not exist, that is, the path crosses the root node, and the number of root cSupBs it crosses is n, pathbub is recorded as -n. pathcolor is the color of the path, expressed as a string of 0s and 1s whose length equals the number of samples. In particular, when the path length is equal to 1, the path is a site: startpos = endpos, startbub = pathbub and pathcolor = basecolor, and the six-tuple of the path reduces to the triplet of the site.
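The two coordinate records can be modelled as small immutable data classes, as in the following Python sketch. The class names `BaseLocation` and `PathLocation` and the helper methods are hypothetical; the field names follow the triplet and six-tuple described above, colors are stored as 0/1 strings, and cSupBs are referred to by integer ids (negative values for -n).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseLocation:
    """Triplet for a single site: numerical, topological, color information."""
    position: int
    bubid: int
    basecolor: str


@dataclass(frozen=True)
class PathLocation:
    """Six-tuple for a single sequence (path) on the colored graph."""
    startpos: int
    startbub: int
    endpos: int
    endbub: int
    pathbub: int
    pathcolor: str

    def is_site(self):
        # A path of length 1 starts and ends at the same offset and cSupB.
        return (self.startpos == self.endpos
                and self.startbub == self.endbub == self.pathbub)

    def as_site(self):
        """Reduce the six-tuple of a single-site path to the site triplet."""
        if not self.is_site():
            raise ValueError("path spans more than one site")
        return BaseLocation(self.startpos, self.startbub, self.pathcolor)


# Node GGG from FIG. 4 has the base location (7, 2, 110).
ggg_as_path = PathLocation(7, 2, 7, 2, 2, "110")
print(ggg_as_path.as_site())
# BaseLocation(position=7, bubid=2, basecolor='110')
```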

A correlation between at least two single sequences is determined based on the positions of the single sequences on the colored graph. In a linear coordinate system, two given intervals can stand in one of three relations: separation, intersection and inclusion; similarly, a relation between two paths can be given on a genome graph. Let the positions of two single sequences be path1 (a1, bub1, b1, bub2, bub3, color1) and path2 (a2, bub4, b2, bub5, bub6, color2), where a denotes startpos and b denotes endpos. If [a1, b1] and [a2, b2] do not intersect, path1 and path2 are output as separated. If one of [a1, b1] and [a2, b2] contains the other, and color1 and color2 have an inclusion relationship whose direction is opposite to that of the interval inclusion, path1 and path2 are output as contained. In all other cases, path1 and path2 are output as intersecting.
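The three-way decision can be sketched in Python as follows. The sketch uses only the offsets and the color of each path, treats [a, b] as a closed interval, and reads "opposite" as: the path with the smaller interval must carry the larger color set. Colors are 0/1 strings of equal length, and the function names are hypothetical.

```python
def color_subset(c1, c2):
    """True if every sample flagged in bit-string c1 is also flagged in c2."""
    return len(c1) == len(c2) and all(x <= y for x, y in zip(c1, c2))


def interval_contains(outer, inner):
    """True if the closed interval `inner` lies within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def path_relation(iv1, color1, iv2, color2):
    """iv1 and iv2 are (startpos, endpos) pairs; returns 'separated',
    'contained' or 'intersecting'."""
    (a1, b1), (a2, b2) = iv1, iv2
    if b1 < a2 or b2 < a1:
        return "separated"
    # Interval inclusion and color inclusion must run in opposite directions.
    if interval_contains(iv2, iv1) and color_subset(color2, color1):
        return "contained"
    if interval_contains(iv1, iv2) and color_subset(color1, color2):
        return "contained"
    return "intersecting"


# Offsets and colors of the paths a, b and c in the FIG. 4 example.
a = ((5, 11), "100")
b = ((7, 11), "010")
c = ((13, 17), "011")
print(path_relation(*a, *b))  # intersecting
print(path_relation(*a, *c))  # separated
```

These two results agree with the relations stated for the example paths of FIG. 4.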

In particular, if [a1, b1] and [a2, b2] intersect while color1 and color2 do not, path1 and path2 are found to intersect according to the above relations, and sequence similarity analysis can then be performed. Using the cSupB structure tree model, the nearest common parent cSupB of bub3 and bub6, denoted bub7, and its color, color3, are found. If bub7 does not exist, the paths cross the root node: the region containing both path1 and path2 consists of one or more root cSupBs and one or more bridges, and color3 contains all samples. If bub7 exists, the similarity analysis can be performed within bub7, in which case the union of color1 and color2 is a subset of color3. As shown in FIG. 4, some nodes are chosen at random and assigned base positions, for example: TCA (3,4,111), ATG (13,1,100), CCC (17,-1,111) and GGG (7,2,110). Three paths are chosen at random: a. CAGGGTGTGTA -> (5,4,11,2,5,100); b. GGGAGTA -> (7,2,11,2,2,010); c. TAACCC -> (13,1,17,-1,011). Here path a intersects path b, and paths a (b) and c are separated.
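Finding the nearest common parent cSupB in the structure tree amounts to a lowest-common-ancestor lookup. The following Python sketch assumes the tree is stored as a dictionary from each cSupB to its nearest parent (None for a root) and, as a design choice of this sketch, treats a cSupB as an ancestor of itself. The tree shown is hypothetical and is not the tree of FIG. 4D.

```python
def ancestors(parent, bub):
    """Chain from bub up to its root, starting with bub itself."""
    chain = []
    while bub is not None:
        chain.append(bub)
        bub = parent.get(bub)
    return chain


def nearest_common_parent(parent, bub_a, bub_b):
    """Return the nearest cSupB containing both, or None if the two
    cSupBs lie under different roots."""
    seen = set(ancestors(parent, bub_a))
    for bub in ancestors(parent, bub_b):
        if bub in seen:
            return bub
    return None


# Hypothetical tree: two roots r1 and r2; p and q under r1; k under p.
parent = {"r1": None, "r2": None, "p": "r1", "q": "r1", "k": "p"}
print(nearest_common_parent(parent, "k", "q"))   # r1
print(nearest_common_parent(parent, "k", "r2"))  # None
```

A result of `None` corresponds to the case in which bub7 does not exist and the region containing both paths spans one or more root cSupBs.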

More functions can be realized on the basis of the genome graph coordinate system. For example, given a genome annotation file (e.g., gtf, gff), the relationship between annotation information and topology can be derived and possible variants predicted; given a variant file (e.g., vcf), not only can topology information be located from the variants, but variant information can also be predicted from the graph and compared with the variant file for further exploration.

FIG. 2 is a schematic diagram of an analysis apparatus for a graph-based pan-genomic data organization method according to an embodiment of the present invention, the apparatus comprising: a memory and a processor;

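the memory is configured to store program instructions;

the processor is configured to invoke the program instructions and, when the program instructions are executed, to perform the graph-based pan-genomic data organization method described above.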

FIG. 3 is a schematic flow chart of an analysis system of a graph-based pan-genomic data organization method provided by an embodiment of the invention, the system comprising: 301: a first processing unit, configured to acquire a set of pan-genome sequence data and to convert the pan-genome sequence data into a graph to obtain a colored graph of the pan-genome;

302: a second processing unit, configured to mark and acquire the access-state features of individual nodes of the colored graph, and to traverse the colored graph to obtain the cSupB data models into which the colored graph is decomposed, together with the data information of the cSupB data models;

303: a third processing unit, configured to determine the inclusion relation between the cSupB data models based on the data information of the cSupB data models, and to construct a cSupB structure tree model according to the inclusion relation;

304: a fourth processing unit, configured to construct a pan-genome coordinate system based on the cSupB structure tree model.

The validation results of this validation example show that assigning an intrinsic weight to an indication can moderately improve the performance of the method relative to the default settings.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in an actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical, or of another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic or optical disks, and the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be carried out by hardware under the instruction of a program, and the program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.

While the invention has been described in detail with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A graph-based pan-genomic data organization method, comprising:

obtaining a set of pan-genomic sequence data;

converting the pan-genome sequence data into a graph to obtain a colored graph of the pan-genome;

2. The graph-based pan-genomic data organization method of claim 1, wherein the access-state features of individual nodes of the colored graph comprise an unaccessed state, a semi-accessed state, an accessible state, and an accessed state;

optionally, the unaccessed state means that none of the node's access points has been accessed; the semi-accessed state means that at least one access point has been accessed and at least one access point has not been accessed; the accessible state means that all access points have been accessed and the node itself is available to be accessed; the accessed state means that all access points of the node have been accessed and the node itself has also been accessed;

optionally, the cSupB data model is defined as follows: in the colored graph G = (V, E, C), V(G), E(G), and C(G) are the vertex set, edge set, and color set of the graph G, respectively. For any color set

G₁ = (V₁, E₁, C₁) is a subgraph of the graph G, satisfying, for any node u_i ∈ V₁,

For two distinct vertices s and t, <s, t, C₁> is called a colored superbubble (cSupB); s is called the source node and t the sink node;
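The four node access states can be illustrated with a small Python sketch. It is an interpretation rather than the claimed procedure: it assumes that the "access points" of a node are its incoming neighbors (predecessors) in the colored graph, and it treats a node without predecessors as immediately accessible. The names `AccessState`, `access_state`, `predecessors`, and `accessed` are hypothetical.

```python
from enum import Enum
from typing import Dict, Hashable, List, Set


class AccessState(Enum):
    UNACCESSED = "unaccessed"        # no access point accessed yet
    SEMI_ACCESSED = "semi-accessed"  # some, but not all, access points accessed
    ACCESSIBLE = "accessible"        # all access points accessed, node not yet
    ACCESSED = "accessed"            # all access points and the node accessed


def access_state(node: Hashable,
                 predecessors: Dict[Hashable, List[Hashable]],
                 accessed: Set[Hashable]) -> AccessState:
    """Derive the access state of one node during a traversal.

    Assumption: the access points of a node are its predecessors.
    Assumption: a node with no predecessors counts as accessible.
    """
    preds = predecessors.get(node, [])
    n_done = sum(1 for p in preds if p in accessed)
    all_done = n_done == len(preds)

    if all_done and node in accessed:
        return AccessState.ACCESSED
    if all_done:
        return AccessState.ACCESSIBLE
    if n_done == 0:
        return AccessState.UNACCESSED
    return AccessState.SEMI_ACCESSED


if __name__ == "__main__":
    # Hypothetical graph: s -> u, s -> v, u -> v.
    preds = {"s": [], "u": ["s"], "v": ["s", "u"]}
    for name in ("s", "u", "v"):
        print(name, access_state(name, preds, accessed={"s"}).value)
```

Under these assumptions, a traversal would only step onto nodes in the accessible state, which is one way the four states could drive the decomposition into cSupB data models.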

3. The graph-based pan-genomic data organization method according to claim 2, wherein the inclusion relation is determined based on the color of the cSupB data model and the order of the cSupB data model; optionally, cSupB1, cSupB2, and cSupB3 are arbitrary cSupB data models; let G₁ = (V₁, E₁, C₁), G₂ = (V₂, E₂, C₂), and G₃ = (V₃, E₃, C₃) be the subgraphs induced by the nodes contained in cSupB1, cSupB2, and cSupB3, respectively; when the criteria are satisfied, cSupB1 is the child cSupB of cSupB2 and cSupB2 is the parent cSupB of cSupB1.

4. The graph-based pan-genomic data organization method according to any one of claims 1-3, wherein the data organization method further comprises: constructing a pan-genome coordinate system based on the cSupB structure tree model;

5. The graph-based pan-genomic data organization method according to claim 4, wherein the pan-genome coordinate system represents the positional features of a single sequence on the colored graph as a six-tuple;

optionally, the elements of the six-tuple comprise: an offset value of the path starting point, the minimum cSupB in which the path starting node is located, an offset value of the path ending point, the minimum cSupB in which the path ending node is located, the minimum cSupB containing both the path starting node and the path ending node, and the color of the path;

optionally, the elements of the six-tuple are recorded as: startpos, startbu, endpos, endbu, pathbu, pathcolor.

6. The graph-based pan-genomic data organization method according to claim 5, wherein the correlation between at least two single sequences is determined based on the positional features of the single sequences on the colored graph;

optionally, the positional features of two single sequences are, respectively, path1 (startpos1, startbu1, endpos1, endbu1, pathbu1, pathcolor1) and path2 (startpos2, startbu2, endpos2, endbu2, pathbu2, pathcolor2); if (startpos1, endpos1) and (startpos2, endpos2) do not intersect, output that path1 and path2 are separated; if one of (startpos1, endpos1) and (startpos2, endpos2) contains the other and pathcolor1 and pathcolor2 have an inclusion relation, but the color inclusion relation and the interval inclusion relation run in opposite directions, output that path1 and path2 are in a containment relation; otherwise, output that path1 and path2 intersect.

7. The graph-based pan-genomic data organization method according to claim 1, wherein after the pan-genome sequence data is acquired, preprocessing is performed, wherein the preprocessing comprises base substitution and sequence fragment addition;

8. An analysis device of a graph-based pan-genomic data organization method, the device comprising: a memory and a processor;

the memory is configured to store program instructions;

the processor is configured to invoke the program instructions and, when the program instructions are executed, to perform the graph-based pan-genomic data organization method of any one of claims 1-7.

9. An analysis system for a graph-based pan-genomic data organization method, comprising:

and a fourth processing unit, configured to construct a pan-genome coordinate system based on the cSupB structure tree model.

10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the graph-based pan-genomic data organization method of any one of claims 1 to 7.