CN113743609B

CN113743609B - Multi-signal-oriented rapid breakpoint detection method, system, equipment and storage medium

Info

Publication number: CN113743609B
Application number: CN202110997289.9A
Authority: CN
Inventors: 段君博; 王青; 刘轩宇
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2024-04-02
Anticipated expiration: 2041-08-27
Also published as: CN113743609A

Abstract

The invention discloses a multi-signal-oriented rapid breakpoint detection method, system, device and storage medium. The method rapidly and accurately detects breakpoints that multiple signals share at the same position, and thereby provides reliable start and end positions for subsequent segmentation, fitting and parameter estimation of the signals. The technology is broadly applicable: it can be used in clinical applications such as reproductive health diagnosis, prenatal screening, genetic diagnosis of hereditary disease in newborns and health monitoring with wearable devices, as well as in research fields such as archaeology, biology, medicine and engineering, and is of significance for improving public health and advancing scientific research.

Description

Multi-signal-oriented rapid breakpoint detection method, system, equipment and storage medium

Technical Field

The invention belongs to the technical field of signal breakpoint detection, and in particular relates to a multi-signal-oriented rapid breakpoint detection method, system, device and storage medium.

Background

Conventional breakpoint detection methods mainly include circular binary segmentation (CBS), optimal partitioning (OP) and pruned exact linear time (PELT).

CBS is a copy number variation (CNV) detection method originally developed for gene chip data and is now widely applied to CNV detection based on high-throughput sequencing (HTS) data; the DNAcopy package in the R language implements the CBS algorithm. Assuming that a chromosome contains only one breakpoint and that the data are normally distributed, the existence of the breakpoint can be judged by a two-sample t-test on the data on either side of it, and its location can be determined by traversing all candidate loci on the chromosome. If the chromosome contains several breakpoints, the position of the first breakpoint is determined first, and the same procedure is then applied to the portions of the chromosome on each side of it to determine the second and third breakpoints; repeating this process recursively yields the so-called binary segmentation. If the signal is not normally distributed, another suitable test can be used in place of the t-test. CBS thus repeatedly computes the maximum log-likelihood ratio and uses hypothesis testing to merge regions with similar values, thereby completing the binary segmentation of the signal.
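The single-breakpoint scan described above can be sketched as follows. This is a simplified illustration, not the CBS algorithm itself: it assumes Welch's form of the two-sample t statistic, omits the significance threshold and the recursion, and the names `welch_t`, `best_single_breakpoint` and `min_size` are hypothetical.

```python
import math

def welch_t(left, right):
    """Absolute two-sample t statistic (Welch's form, an assumption of this sketch)."""
    n1, n2 = len(left), len(right)
    m1, m2 = sum(left) / n1, sum(right) / n2
    v1 = sum((a - m1) ** 2 for a in left) / (n1 - 1)
    v2 = sum((b - m2) ** 2 for b in right) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)
    if se == 0.0:
        # Both sides are constant: infinitely significant if the means differ.
        return math.inf if m1 != m2 else 0.0
    return abs(m1 - m2) / se

def best_single_breakpoint(x, min_size=2):
    """Try every split position and keep the one with the largest statistic."""
    best_pos, best_t = None, 0.0
    for s in range(min_size, len(x) - min_size + 1):
        t = welch_t(x[:s], x[s:])
        if t > best_t:
            best_pos, best_t = s, t
    return best_pos, best_t

signal = [0.1, -0.1, 0.0, 0.1, -0.1, 2.1, 1.9, 2.0, 2.1, 1.9]
position, statistic = best_single_breakpoint(signal)
```

Whether the returned position is accepted as a breakpoint would still depend on comparing the statistic with a significance threshold, which this sketch leaves to the caller.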

The CBS method cannot guarantee that its segmentation is globally optimal, whereas the OP method can. OP uses dynamic programming (DP) to decompose the segmentation of a long signal into segmentation sub-problems on shorter signals, and obtains the solution of the original problem by combining the solutions of the sub-problems. Starting from the segmentation of a signal of length 1, OP solves the segmentation of a signal of any length by induction over these sub-problems.
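The dynamic programming recursion can be illustrated for a single signal as below. This is a minimal sketch that assumes a squared-error segment cost and a fixed penalty `lam` per segment; the function name `optimal_partitioning` and these choices are assumptions made for illustration only.

```python
def optimal_partitioning(x, lam):
    """Unpruned dynamic programming over all last-segment starts (quadratic cost)."""
    n = len(x)
    s1, s2 = [0.0], [0.0]            # prefix sums of x and of x squared
    for value in x:
        s1.append(s1[-1] + value)
        s2.append(s2[-1] + value * value)

    def cost(a, b):
        """Squared error of x[a:b] around its own mean."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    F = [0.0] * (n + 1)              # F[t]: best cost of the first t samples
    last = [0] * (n + 1)             # last[t]: start of the final segment
    for t in range(1, n + 1):
        F[t], last[t] = min((F[a] + cost(a, t) + lam, a) for a in range(t))

    breakpoints, t = [], n           # backtrack through the stored starts
    while t > 0:
        a = last[t]
        if a > 0:
            breakpoints.append(a)
        t = a
    return sorted(breakpoints)

breakpoints = optimal_partitioning([0.0, 0.0, 0.0, 1.0, 1.0, 1.0], 0.5)
```

The inner `min` over all earlier positions is what makes the cost grow quadratically with the signal length.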

Although the OP method is globally optimal, it must traverse all possible cases when applying dynamic programming, so its computational cost grows with the square of the signal length. For large-scale problems (such as signals tens of thousands of points long) the cost is substantial and hardly meets practical requirements. It can be shown, however, that under certain conditions some cases can never be optimal, so they can be discarded during traversal and unnecessary computation avoided. By doing so, the PELT method breaks through the computational bottleneck of OP and reduces the cost to linear growth in the signal length, which greatly widens its range of application; however, it can still only detect breakpoints in a single signal.

Many scientific and engineering applications require breakpoints in signals to be detected. A breakpoint here means a position at which the signal changes from one pattern to another (strictly speaking, from one distribution to another). For example, in FIG. 1 the signal consists of a high step and a low step; within each step the signal keeps the same pattern (the mean is unchanged), and the pattern changes between the steps. Once the breakpoints (thick lines) have been detected, the mean of the signal between adjacent breakpoints is easily obtained.

The CBS, OP and PELT methods described above can only detect breakpoints in a single signal, but practical applications often require the detection of breakpoints shared by multiple signals (as shown in FIG. 2). For example, in copy number variation detection based on high-throughput sequencing, the read depth signal contains noise, so the true positive rate (TPR) obtained from an individual signal is low and the false positive rate (FPR) is high. A straightforward way to improve detection performance is to increase the sequencing coverage, but this raises the experimental cost. An alternative is to sequence the sample several times at medium or low coverage, or to sequence it on several platforms, i.e. multiple sequencing. Multiple sequencing reduces the systematic errors introduced by a single sample or platform and can improve detection performance, but it requires a technique that can detect breakpoints across multiple signals. In addition, many individuals in a population may share a CNV, and many complex diseases may also share CNVs, so it is also necessary to detect common CNVs in multiple signals from a multi-sample perspective.

In summary, conventional methods such as CBS, OP and PELT detect breakpoints only in a single signal and cannot detect breakpoints shared by multiple signals; in addition, their computational cost is large, which leads to high experimental cost.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a multi-signal-oriented rapid breakpoint detection method, system, device and storage medium, so as to solve the technical problems that prior-art breakpoint detection methods can only detect breakpoints in a single signal and have low detection efficiency.

The invention provides a multi-signal-oriented rapid breakpoint detection method, which comprises the following steps:

S1, preprocessing the original signals to obtain a preprocessed signal matrix Y of size N×M;

S2, determining the number of breakpoints k or the penalty parameter λ according to the actual situation;

S3, when the penalty parameter λ is selected as the input parameter, solving a minimization problem to obtain a processed signal matrix X of size N×M; when the number of breakpoints k is selected as the input parameter, calculating the maximum possible value λ_max from the preprocessed signal matrix Y, searching the interval [0, λ_max] for the penalty parameter λ, and solving the minimization problem for each candidate λ until the number of breakpoints equals k, thereby obtaining a processed signal matrix X of size N×M;

S4, calculating the signal matrix X from the obtained breakpoint positions, and de-normalizing the signal matrix X to obtain the processed signal X₀, thereby achieving rapid breakpoint detection and segmentation of the signals.

Preferably, the step S1 specifically includes the following steps:

S1.1, storing the acquired original signals in an N×M original matrix Y₀;

S1.2, calculating the maximum absolute value c of the elements of the original matrix Y₀;

S1.3, preprocessing the original signals using the maximum absolute value c.

Preferably, in S1.3, the original signals are preprocessed by normalization, and the preprocessing result is given by formula (1):

$$Y = Y_0 / c \tag{1}$$

Preferably, in S4, the processed signal X₀ is calculated by formula (2):

$$X_0 = cX \tag{2}$$
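Formulas (1) and (2) amount to a scale and its inverse. A minimal sketch, assuming the matrix is held as a list of rows; the function names `normalize` and `denormalize` and the zero-signal guard are assumptions of this example, not part of the described method.

```python
def normalize(Y0):
    """Formula (1): divide every element by the maximum absolute value c."""
    c = max(abs(value) for row in Y0 for value in row)
    if c == 0:
        # Assumption of this sketch: an all-zero input cannot be scaled.
        raise ValueError("all-zero signal matrix cannot be normalized")
    Y = [[value / c for value in row] for row in Y0]
    return Y, c

def denormalize(X, c):
    """Formula (2): restore the original scale of the processed matrix."""
    return [[c * value for value in row] for row in X]

Y, c = normalize([[2.0, -4.0], [1.0, 0.0]])
restored = denormalize(Y, c)
```

After normalization every element of Y lies in [-1, 1], so a given penalty parameter has a comparable effect on signals recorded at different scales.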

Preferably, in S3, the penalty parameter λ is searched for in the interval [0, λ_max] by bisection.

Preferably, in S3, given the preprocessed signal matrix Y and the penalty parameter λ, the minimization problem shown in formula (3) is solved to obtain the signal matrix X:

where Y is the N×M preprocessed signal matrix, X is the processed signal matrix of size N×M, N is the number of sampling points of each signal, M is the number of signals, λ is the penalty applied for each breakpoint, and P(X) is the number of breakpoints in X.
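Formula (3) itself is not reproduced in this text. A form consistent with the symbol definitions above, and with the solver steps of the Detailed Description (which accumulate a squared segment error and add λ for each segment), would be the following. The squared Frobenius norm as the data-fit term, the absence of a constant scaling factor, and the row-wise definition of P(X) are assumptions.

```latex
\min_{X \in \mathbb{R}^{N \times M}} \; \lVert Y - X \rVert_F^2 + \lambda \, P(X) \tag{3}
\qquad
P(X) = \bigl|\{\, i \in \{2, \dots, N\} : X_{i,:} \neq X_{i-1,:} \,\}\bigr|
```

Under this reading, a larger λ yields fewer breakpoints, which is the behaviour that the search over [0, λ_max] relies on.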

Preferably, the specific operation steps of S3 are as follows:

when the number of breakpoints k is selected as the input parameter:

1) calculating the maximum possible value λ_max from the preprocessed signal matrix Y, setting the minimum possible value λ_min = 0, and calculating the penalty parameter λ from λ_min and λ_max (for bisection, their midpoint);

2) given the preprocessed signal matrix Y and the penalty parameter λ, solving the minimization problem shown in formula (3) to obtain the signal matrix X and the number of breakpoints P(X);

3) if the target number of breakpoints k is smaller than the detected number P(X), setting λ_min = λ and repeating the above steps;

if the target number of breakpoints k is greater than the detected number P(X), setting λ_max = λ and repeating the above steps;

if the target number of breakpoints k is equal to the detected number P(X), outputting the signal matrix X;

when the penalty parameter λ is selected as the input parameter: solving the minimization problem shown in formula (3) directly to obtain the signal matrix X.
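The search over λ in steps 1) to 3) can be sketched as below. The solver is abstracted as a hypothetical callable `count_breakpoints(lam)` that returns P(X) for a given λ; the midpoint update, the iteration cap `max_iter` and the `None` return value when no λ produces exactly k breakpoints are assumptions of this sketch.

```python
def search_penalty(count_breakpoints, k, lam_max, max_iter=60):
    """Bisection on the penalty so that the detected breakpoint count equals k."""
    lam_min = 0.0
    for _ in range(max_iter):
        lam = (lam_min + lam_max) / 2.0
        p = count_breakpoints(lam)
        if p == k:
            return lam
        if p > k:
            lam_min = lam      # too many breakpoints: the penalty is too small
        else:
            lam_max = lam      # too few breakpoints: the penalty is too large
    return None                # no penalty in the interval gave exactly k

def toy_count(lam):
    """Stand-in for the real solver: the count falls as the penalty grows."""
    if lam < 1.0:
        return 5
    if lam < 2.0:
        return 3
    if lam < 4.0:
        return 1
    return 0

lam = search_penalty(toy_count, k=3, lam_max=8.0)
```

The iteration cap matters because the breakpoint count changes in jumps as λ varies, so some values of k may not be reachable for a given input.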

The invention also discloses a system implementing the multi-signal-oriented rapid breakpoint detection method, which comprises:

a signal preprocessing module, used for preprocessing the acquired original signals;

a penalty parameter estimation module, used for obtaining the processed signal matrix given either the number of breakpoints or the penalty parameter;

and a signal processing module, used for determining the breakpoint positions of the signal matrix and the processed signal, thereby achieving rapid breakpoint detection and segmentation of the signals.

A computer device comprises a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the multi-signal-oriented rapid breakpoint detection method when executing the computer program.

A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the multi-signal-oriented rapid breakpoint detection method.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a multi-signal-oriented processing method in which the collected signals are preprocessed, the number of breakpoints or the penalty parameter is determined according to the actual situation, and the final breakpoint positions and the processed signals are determined from the preprocessing result together with the number of breakpoints or the penalty parameter. The method rapidly and accurately detects breakpoints that the signals share at the same position, and thereby provides reliable start and end positions for subsequent segmentation, fitting and parameter estimation of the signals.

Further, the penalty parameter is searched for by bisection: the target is first compared with the element in the middle of the sequence; if it is larger than that element, the search continues in the second half of the current sequence, and if it is smaller, in the first half, until a matching element is found or the remaining search range is empty. Bisection requires few comparisons, searches quickly and has good average performance.

Furthermore, the original signals are preprocessed by normalization, which improves the convergence speed; standardizing the original signals in this way also ensures the accuracy of the preprocessed signal.

The system implementing the multi-signal-oriented rapid breakpoint detection method divides breakpoint detection into mutually independent modules according to the relatedness of their functions. With this modular design, when a problem occurs in one module it can be handled on its own without affecting the other modules.

Drawings

FIG. 1 is a schematic diagram of breakpoint detection in a signal containing a high step and a low step;

FIG. 2 is a schematic diagram of the detection of breakpoints shared by multiple signals;

FIG. 3 is a flow chart of the rapid breakpoint detection method of the present invention;

FIG. 4 shows the invention applied to the detection of copy number variation shared within a family ((a) the child inherits a variation common to both parents; (b) the child inherits a variation from the father only);

FIG. 5 shows the invention applied to gait analysis with a wearable device ((a) gait detected using acceleration sensors in the x, y and z directions; (b) gait detected using only the z-direction acceleration sensor);

FIG. 6 shows the reduction in computation time relative to the conventional method ((a) computation time versus signal length N; (b) computation time versus signal dimension M).

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention is described in further detail below with reference to the accompanying drawings:

The invention provides a multi-signal-oriented rapid breakpoint detection method; the signal processing flow is shown in FIG. 3 and comprises the following steps:

S1, normalizing the original signals as preprocessing to obtain a preprocessed signal matrix Y of size N×M;

S1.2, calculating the maximum absolute value c of the elements of the original matrix Y₀;

S1.3, preprocessing the original signals using the maximum absolute value c.

S3, if λ is selected as the input parameter, directly solving the minimization problem to obtain a processed signal matrix X of size N×M; if k is selected as the input parameter, calculating the maximum possible value λ_max from the preprocessed signal matrix Y, then searching the interval [0, λ_max] for the penalty parameter λ and solving the minimization problem for each candidate λ until the number of breakpoints equals k, thereby obtaining a processed signal matrix X of size N×M;

the penalty parameter λ is searched for in the interval [0, λ_max] by bisection so that the detected number of breakpoints P(X) equals k;

S4, calculating the signal matrix X from the obtained breakpoint positions; the signal matrix X is multiplied by the maximum absolute value c as de-normalization post-processing to obtain the processed signal X₀, thereby achieving rapid breakpoint detection and segmentation of the signals.

The preprocessing result is given by formula (1):

$$Y = Y_0 / c \tag{1}$$

The processed signal X₀ is calculated by formula (2):

$$X_0 = cX \tag{2}$$

Given the preprocessed signal matrix Y and the penalty parameter λ, the minimization problem shown in formula (3) is solved to obtain the signal matrix X:

where Y is the N×M signal matrix to be processed, X is the processed signal matrix of the same size, N is the number of sampling points (the signal length) of each signal, M is the number of signals, λ is the penalty applied for each breakpoint, and P(X) is the number of breakpoints in X. Once X has been obtained, the positions and number of the breakpoints follow directly from the rows at which X changes.
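Reading the breakpoints off a piecewise-constant X can be sketched as follows, assuming X is stored as a list of N rows of M values and that a shared breakpoint is any row that differs from the previous row in at least one signal. The function name `shared_breakpoints` and the tolerance `tol` are assumptions of this example.

```python
def shared_breakpoints(X, tol=0.0):
    """Return the 0-based row indices at which X changes in at least one column."""
    return [
        i
        for i in range(1, len(X))
        if any(abs(a - b) > tol for a, b in zip(X[i], X[i - 1]))
    ]

X = [[0.0, 5.0], [0.0, 5.0], [1.0, 5.0], [1.0, 7.0]]
positions = shared_breakpoints(X)
count = len(positions)          # this is P(X) for the example matrix
```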

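S3 is the core step of the technical scheme. When the number of breakpoints k is selected as the input parameter, the penalty parameter λ is first found by the bisection search over [0, λ_max] described above. For a given preprocessed signal matrix Y and penalty parameter λ, the specific operation steps of S3 are as follows: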

a) Input: the preprocessed signal matrix Y and the penalty parameter λ;

b) Initialization: the objective function storage vector F is an all-zero vector of length N+1; the breakpoint storage array bp is a cell array whose first cell is an empty vector; the valid index list R = 1; the segment energy E = 0; and the mean Z is set to the first sample of the preprocessed signal matrix Y (one value per signal);

c) Starting from i = 1, repeat steps d) to x) for each sampling point i up to N;

d) Add the elements of F indexed by the valid index list R to the segment energy E, and store the result in a temporary vector v;

e) Find the minimum of v and its position, and store them in a and i₁ respectively;

f) Calculate a + λ and store it in the (i+1)-th element of F;

g) Read the i₁-th element of the valid index list R and store it in the index variable c (in steps g) and h), c denotes this index, not the maximum absolute value of S1.2);

h) Read the c-th cell of the breakpoint storage array bp, append c to it, and store the result in the (i+1)-th cell of bp;

i) Find the positions in v whose values are smaller than the (i+1)-th element of F, and store them in i₂;

j) If i is smaller than the number of sampling points N, execute steps k) to w) in order; otherwise jump directly to step x);

k) Store the (i+1)-th sample of the preprocessed signal matrix Y (the values of all M signals at sampling point i+1) in y;

l) Retain only the i₂-th elements of the valid index list R and delete the rest;

m) Calculate i − R + 1 and store it in the length vector l;

n) Replicate the length vector l M times column-wise and store the result in the matrix L;

o) Replicate y n times row-wise (n being the length of the length vector l) and store the result in the matrix B;

p) Read rows i₂ of the mean Z and store them in the matrix T;

q) Calculate B − T, sum the squares of the elements of each row, and store the result in the vector e;

r) Multiply e element-wise by l, divide element-wise by l + 1, and store the result back in e;

s) Add e to the i₂-th elements of the segment energy E and store the result in E;

t) Append 0 to the end of the segment energy E;

u) Calculate (T ∘ L + B) divided element-wise by (L + 1), where ∘ denotes element-wise multiplication, and store the result in the mean Z;

v) Append y to the mean Z as a new last row;

w) Append i + 1 to the end of the valid index list R;

x) Increase i by 1 and return to step c);

y) Output: the breakpoint positions are the 2nd through last elements of the last cell of the breakpoint storage array bp; the processed data matrix X is built column by column: for each column (signal), the mean of that column of Y between two adjacent breakpoints is used as the value of X over that segment.
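Steps a) to y) can be sketched in plain Python as follows. This is an illustrative reading rather than a reference implementation: it uses 0-based indices, stores Y as a list of N rows (samples) of M values (signals), replaces the matrix replications of steps n) to u) with per-candidate loops, and the function name `detect_shared_breakpoints` is hypothetical.

```python
def detect_shared_breakpoints(Y, lam):
    """Return (breakpoints, X); breakpoints are 0-based rows where a segment starts."""
    N, M = len(Y), len(Y[0])
    F = [0.0] * (N + 1)                  # F[t]: best cost of the first t samples
    bp = [[] for _ in range(N + 1)]      # bp[t]: segment starts for the first t samples
    R = [0]                              # candidate starts of the last segment
    E = [0.0]                            # E[j]: squared error of segment R[j]..i
    Z = [list(Y[0])]                     # Z[j]: per-signal mean of segment R[j]..i
    for i in range(N):
        v = [F[r] + e for r, e in zip(R, E)]                    # step d)
        a = min(v)                                              # step e)
        c = R[v.index(a)]                                       # step g)
        F[i + 1] = a + lam                                      # step f)
        bp[i + 1] = bp[c] + [c]                                 # step h)
        if i + 1 == N:                                          # step j)
            break
        keep = [j for j, vj in enumerate(v) if vj < F[i + 1]]   # step i)
        y = Y[i + 1]                                            # step k)
        new_R, new_E, new_Z = [], [], []
        for j in keep:                                          # steps l) to u)
            n = i - R[j] + 1                                    # current segment length
            d2 = sum((yk - zk) ** 2 for yk, zk in zip(y, Z[j]))
            new_R.append(R[j])
            new_E.append(E[j] + d2 * n / (n + 1))               # running squared error
            new_Z.append([(zk * n + yk) / (n + 1) for yk, zk in zip(y, Z[j])])
        R = new_R + [i + 1]                                     # step w)
        E = new_E + [0.0]                                       # step t)
        Z = new_Z + [list(y)]                                   # step v)
    starts = bp[N]                                              # always begins with 0
    X = []
    for s, e in zip(starts, starts[1:] + [N]):                  # step y)
        mean = [sum(Y[r][m] for r in range(s, e)) / (e - s) for m in range(M)]
        X.extend([list(mean) for _ in range(e - s)])
    return starts[1:], X

Y = [[0.0, 0.0]] * 3 + [[1.0, 1.0]] * 3
breakpoints, X = detect_shared_breakpoints(Y, 0.5)
```

Because the squared error and the mean of every surviving candidate are updated incrementally from the newest sample, each iteration touches only the candidates kept by step i) instead of recomputing segment costs from scratch.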

FIG. 4 demonstrates the benefit of the invention in detecting population-shared copy number variation from high-throughput sequencing data. The sequencing data come from a trio family (father, mother and child; M=3). Preprocessing of the data includes: (i) aligning the raw HTS output files (FASTQ) to the reference genome hg19 with the alignment tool Bowtie; (ii) computing the read depth (RD) signal, i.e. the sequencing coverage depth of the base sites within fixed-width windows along the genome; (iii) correcting the RD signal for GC content and mappability. FIG. 4 (a) shows the copy number variation detected with the invention in the 32.8 to 33.4 Mb interval of chromosome 22; the child inherits a variation common to both parents. FIG. 4 (b) shows the copy number variation detected with the invention in the 39.2 to 39.5 Mb interval of chromosome 22; here the child inherits the variation from the father, while the mother carries no variation. This shows that breakpoint detection using multiple signals has important application value.
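Step (ii) can be illustrated with a simple windowed count. This sketch assumes that each aligned read is represented only by its 0-based start position and that counting starts per window is an acceptable proxy for coverage depth; the function name `read_depth` and its parameters are hypothetical, and the GC and mappability corrections of step (iii) are not shown.

```python
def read_depth(read_starts, chrom_length, window):
    """Count aligned read start positions in consecutive fixed-width windows."""
    n_windows = (chrom_length + window - 1) // window   # ceiling division
    depth = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:                     # ignore out-of-range positions
            depth[pos // window] += 1
    return depth

rd = read_depth([0, 5, 10, 11, 25], chrom_length=30, window=10)
```

Computing one such signal per family member and stacking the three signals as columns gives the N×M matrix (here M=3) that the method takes as input.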

FIG. 5 shows the invention applied to the analysis of health data from a wearable device. The data come from a subject running for 12 seconds with a sensor mounted on the right ankle joint that recorded acceleration in the x, y and z directions (M=3). FIG. 5 (a) shows the detection result of the invention on the three acceleration signals; each step is clearly separated from the next. In contrast, FIG. 5 (b) shows the detection result using only one acceleration signal (z direction); the gait at around 8 seconds is not well distinguished. Using multiple signals therefore improves the accuracy of breakpoint detection.

FIG. 6 shows the reduction in computation time achieved by the invention relative to the conventional method, using simulated data. FIG. 6 (a) shows the relationship between computation time and signal length N when the number of signals M=10. The computation time increases with N, and the invention uses only about one percent of the computation time of the conventional method. For a very long signal of N=100000 points, the invention takes only about 10 seconds, whereas the conventional method requires about 1000 seconds. FIG. 6 (b) shows the relationship between computation time and signal dimension M when the signal length N=3000. The computation time required by the invention remains almost unchanged as M increases, whereas that of the conventional method increases linearly. For a large number of signals, M=1000, the computation time of the invention is less than 1 second, whereas the conventional method requires about 400 seconds. The invention therefore greatly reduces computation time and is particularly suitable for large numbers of signals.

The invention provides a rapid signal processing method. The method rapidly and accurately detects the breakpoint positions shared by multi-dimensional signals, and thereby provides reliable start and end positions for subsequent segmentation, fitting and parameter estimation of the multi-dimensional signals. The technology is broadly applicable and can be used in fields such as biology, medicine and engineering, for example population copy number variation detection based on high-throughput sequencing and movement state detection based on wearable devices.

Abbreviations and key terms appearing and used in the present invention are defined as follows:

CNV: Copy Number Variation

HTS: High-Throughput Sequencing

DP: Dynamic Programming

RD: Read Depth

CBS: Circular Binary Segmentation

OP: Optimal Partitioning

PELT: Pruned Exact Linear Time

TPR: True Positive Rate

FPR: False Positive Rate

GC: Guanine-Cytosine content

The above merely illustrates the technical idea of the present invention and does not limit its scope of protection; any modification made on the basis of the technical scheme in accordance with the technical idea of the present invention falls within the scope of protection of the claims of the present invention.

Claims

1. A multi-signal-oriented rapid breakpoint detection method, characterized by comprising the following steps:

S3, when the penalty parameter λ is selected as the input parameter, solving a minimization problem to obtain a processed signal matrix X of size N×M; when the number of breakpoints k is selected as the input parameter, calculating the maximum possible value λ_max from the preprocessed signal matrix Y, searching the interval [0, λ_max] for the penalty parameter λ, and solving the minimization problem for each candidate λ until the number of breakpoints equals k, thereby obtaining a processed signal matrix X of size N×M;

in S3, given the preprocessed signal matrix Y and the penalty parameter λ, the minimization problem shown in formula (3) is solved to obtain the signal matrix X:

where Y is the N×M preprocessed signal matrix, X is the processed signal matrix of size N×M, N is the number of sampling points of each signal, M is the number of signals, λ is the penalty applied for each breakpoint, and P(X) is the number of breakpoints in X;

the specific operation steps are as follows:

when the number of breakpoints k is selected as the input parameter:

when the penalty parameter λ is selected as the input parameter: solving the minimization problem shown in formula (3) to obtain the signal matrix X;

2. The multi-signal-oriented rapid breakpoint detection method according to claim 1, wherein S1 specifically comprises the following steps:

S1.2, calculating the maximum absolute value c of the elements of the original matrix Y₀;

3. The multi-signal-oriented rapid breakpoint detection method according to claim 2, wherein in S1.3 the original signals are preprocessed by normalization, and the preprocessing result is given by formula (1):

$$Y = Y_0 / c \tag{1}$$

4. The multi-signal-oriented rapid breakpoint detection method according to claim 2, wherein in S4 the processed signal X₀ is calculated by formula (2):

$$X_0 = cX \tag{2}$$

5. The multi-signal-oriented rapid breakpoint detection method according to claim 1, wherein in S3 the penalty parameter λ is searched for in the interval [0, λ_max] by bisection.

6. A system for implementing the multi-signal oriented rapid breakpoint detection method according to any one of claims 1-5, comprising:

7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the multi-signal oriented rapid breakpoint detection method according to any of claims 1 to 5 when the computer program is executed.

8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the multi-signal oriented rapid breakpoint detection method according to any of claims 1 to 5.