CN113743609A - Multi-signal-oriented rapid breakpoint detection method, system, equipment and storage medium

Info

Publication number: CN113743609A
Application number: CN202110997289.9A
Authority: CN
Inventors: 段君博; 王青; 刘轩宇
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-08-27
Filing date: 2021-08-27
Publication date: 2021-12-03
Anticipated expiration: 2041-08-27
Also published as: CN113743609B

Abstract

The invention discloses a multi-signal-oriented rapid breakpoint detection method, system, device and storage medium. The method quickly and accurately detects breakpoints shared at the same position across multiple signals, thereby providing reliable start and end position information for subsequent signal segmentation, fitting and parameter estimation. The technique is broadly applicable: it can be used in clinical settings such as reproductive health diagnosis, prenatal screening, genetic diagnosis of hereditary diseases in newborns and health monitoring with wearable devices, as well as in research fields such as archaeology, biology, medicine and engineering, and is of significance for improving public health and advancing scientific research.

Description

Multi-signal-oriented rapid breakpoint detection method, system, equipment and storage medium

Technical Field

The invention belongs to the technical field of signal breakpoint detection, and particularly relates to a multi-signal-oriented rapid breakpoint detection method, system, device and storage medium.

Background

Conventional breakpoint detection methods mainly include Circular Binary Segmentation (CBS), Optimal Partitioning (OP) and Pruned Exact Linear Time (PELT).

CBS is a copy number variation (CNV) detection method originally developed for gene chip data and is now also widely used for CNV detection based on high-throughput sequencing (HTS) data; the DNAcopy package in R is an implementation of the CBS algorithm. If it is assumed that the chromosome contains only one breakpoint and that the data follow a normal distribution, a two-sample t-test on the data on either side of a candidate position decides whether a breakpoint is present there. The location of the breakpoint is found by traversing all possible loci on the chromosome. If the chromosome contains several breakpoints, the position of the first breakpoint is determined first, and the chromosome segments on either side of it are then processed in the same way, which yields the second and third breakpoints; repeating this process gives the so-called binary segmentation. If the signal does not follow a normal distribution, another suitable test may replace the t-test. In short, CBS repeatedly computes the maximum log-likelihood ratio and uses hypothesis testing to join regions with similar values, thereby completing the binary segmentation of the signal.
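The single-breakpoint scan described above can be sketched as follows. This is only an illustration of the "traverse every locus and run a two-sample t-test" idea, not the DNAcopy implementation; the function name `best_single_breakpoint`, the pooled-variance form of the t statistic and the `min_size` guard are assumptions made for the example.

```python
import numpy as np

def best_single_breakpoint(x, min_size=2):
    """Scan every split of a 1-D signal and return (position, |t|) of the
    split whose two sides differ most by a pooled two-sample t statistic.
    Illustrative only: real CBS adds circular segments and a reference
    distribution for significance."""
    x = np.asarray(x, dtype=float)
    n = x.size
    best_pos, best_t = None, 0.0
    for s in range(min_size, n - min_size + 1):
        left, right = x[:s], x[s:]
        diff = left.mean() - right.mean()
        # Pooled within-segment sum of squares with n - 2 degrees of freedom.
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        se = np.sqrt(ss / (n - 2) * (1.0 / s + 1.0 / (n - s)))
        if se > 0:
            t = abs(diff) / se
        else:
            t = np.inf if diff != 0 else 0.0
        if t > best_t:
            best_pos, best_t = s, t
    return best_pos, best_t
```

Applying the same scan recursively to the two sides of the returned position gives the binary segmentation described in the text; whether a split is accepted would depend on a significance threshold that is not specified here.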

The CBS method cannot guarantee that its segmentation is optimal from a global perspective, whereas the OP method is provably globally optimal. Following the idea of dynamic programming (DP), OP decomposes the segmentation of a long signal into several sub-problems of short-signal segmentation and obtains the solution of the original problem by merging the solutions of the sub-problems. Starting from the segmentation of a signal of length 1 and proceeding recursively, OP can solve the segmentation problem for signals of arbitrary length.
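A minimal sketch of the dynamic-programming recursion behind OP, for a single signal, may make the decomposition concrete. The squared-error segment cost, the per-breakpoint penalty `lam` and the name `optimal_partitioning` are assumptions chosen for illustration; the text does not fix a particular cost function for OP.

```python
import numpy as np

def optimal_partitioning(y, lam):
    """Return the breakpoints (0-based start indices of new segments) that
    minimise total squared error plus lam per breakpoint. Cost is O(n^2)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Prefix sums give the squared error of any segment y[a:b] in O(1).
    s1 = np.concatenate([[0.0], np.cumsum(y)])
    s2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(a, b):
        return s2[b] - s2[a] - (s1[b] - s1[a]) ** 2 / (b - a)

    F = np.zeros(n + 1)          # F[t]: best penalised cost of y[:t]
    F[0] = -lam                  # so that the first segment is not penalised
    last = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        # Sub-problem: best split of y[:t] given the last segment starts at a.
        cands = [F[a] + cost(a, t) + lam for a in range(t)]
        last[t] = int(np.argmin(cands))
        F[t] = cands[last[t]]
    # Backtrack through the stored segment starts.
    bps, t = [], n
    while t > 0:
        a = int(last[t])
        if a > 0:
            bps.append(a)
        t = a
    return sorted(bps)
```

The inner loop over all earlier positions `a` is what makes the cost grow quadratically with the signal length.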

Although OP is globally optimal, it must traverse all possible cases while applying dynamic programming, so its computational cost grows quadratically with the signal length. For large-scale problems (such as signals tens of thousands of points long) the cost is enormous and practical requirements are hard to meet. It can be shown, however, that under certain conditions some cases can never be optimal, so they can be eliminated during the traversal and unnecessary computation avoided. On this basis the PELT method overcomes the computational bottleneck of OP: its cost grows linearly with the signal length, which greatly widens its range of application. Even so, it can only detect the breakpoints of a single signal.

Many scientific and engineering applications require breakpoints in signals to be detected. A breakpoint is a position in a signal on either side of which the signal exhibits different patterns (strictly speaking, different distributions). For example, FIG. 1 shows a signal with a high step and a low step: within each step the signal keeps the same pattern (its mean is constant), but the pattern changes between steps. Once the breakpoints (thick lines) are detected, the mean of the signal between breakpoints is easily obtained.

Existing methods such as CBS, OP and PELT can only detect breakpoints in a single signal, but practical applications often require breakpoints shared by multiple signals to be detected (as shown in FIG. 2). In copy number variation detection based on high-throughput sequencing, for example, the read depth signal is noisy, so detection on a single signal has a low true positive rate (TPR) and a high false positive rate (FPR). A straightforward way to improve detection performance is to increase sequencing coverage, but this raises experimental cost. An alternative is to sequence the sample several times at medium or low coverage, or to sequence it on several platforms, i.e. multiplex sequencing. Multiplex sequencing reduces the systematic errors introduced by a single sample or platform and improves detection performance, but it requires a technique that can detect breakpoints across multiple signals. In addition, several individuals in a population may share a CNV, and many complex diseases may share CNVs as well, so common CNVs also need to be detected across multiple signals from multiple samples.

Therefore, conventional methods such as CBS, OP and PELT can only detect breakpoints in a single signal and cannot detect breakpoints common to several signals; their computational cost is also high, which leads to high experimental cost.

Disclosure of Invention

In order to overcome the disadvantages of the prior art, the present invention provides a multi-signal-oriented rapid breakpoint detection method, system, device and storage medium, and aims to solve the technical problem that prior-art breakpoint detection methods can only detect breakpoints in a single signal and have low detection efficiency.

The invention provides a multi-signal-oriented rapid breakpoint detection method, which comprises the following steps:

S1, preprocessing the original signal to obtain a preprocessed signal matrix Y of size N×M;

S2, determining the number of breakpoints k or the penalty parameter λ according to actual conditions;

S3, when the penalty parameter λ is selected as the input parameter, solving the minimization optimization problem to obtain a processed signal matrix X of size N×M; when the number of breakpoints k is selected as the input parameter, calculating the maximum possible penalty λ_max from the preprocessed signal matrix Y, searching the interval [0, λ_max] for a penalty parameter λ such that solving the minimization optimization problem under that λ gives k breakpoints, and obtaining the processed signal matrix X of size N×M;

S4, calculating the signal matrix X from the obtained breakpoint positions; denormalizing the signal matrix X to obtain the processed signal X₀, thereby rapidly detecting the breakpoints of the piecewise signal.

Preferably, S1 specifically includes the following steps:

S1.1, storing the acquired original signals in an N×M original matrix Y₀;

S1.2, calculating the maximum absolute value c of the original matrix Y₀;

S1.3, preprocessing the original signal using the maximum absolute value c.

Preferably, in S1.3, the original signal is preprocessed by normalization, and the preprocessing result is given by formula (1):

Y = Y₀/c (1)

Preferably, in S4, the processed signal X₀ is given by formula (2):

X₀ = cX (2)
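Formulas (1) and (2) amount to a scale and its inverse, which can be sketched as below. The function names and the explicit check for an all-zero matrix are assumptions added for the example; the text does not say how that case is handled.

```python
import numpy as np

def normalize(Y0):
    """Formula (1): divide the raw N x M matrix by its maximum absolute value c."""
    Y0 = np.asarray(Y0, dtype=float)
    c = np.abs(Y0).max()
    if c == 0:
        # Assumption: an all-zero input cannot be scaled, so reject it.
        raise ValueError("all-zero signal matrix cannot be normalized")
    return Y0 / c, c

def denormalize(X, c):
    """Formula (2): map the processed matrix back to the original scale."""
    return c * np.asarray(X, dtype=float)
```

After normalization every entry of Y lies in [-1, 1], so a penalty λ chosen for one data set has a comparable meaning for another.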

Preferably, in S3, the penalty parameter λ is searched for in the interval [0, λ_max] by bisection.

Preferably, in S3, given the preprocessed signal matrix Y and the penalty parameter λ, the minimization optimization problem shown in formula (3) is solved, i.e. the signal matrix X is solved for:

where Y is the N×M preprocessed signal matrix, X is the processed signal matrix of size N×M, N is the number of sampling points of each signal, M is the number of signals, λ is the penalty parameter for each breakpoint, and P(X) is the number of breakpoints in X.
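Formula (3) itself is not reproduced in this text. A form consistent with the symbol definitions above, and with the squared-error updates used in the step-by-step procedure of the detailed description, would be the following; the squared Frobenius norm, the absence of a scaling constant on the data term and the row-wise definition of P(X) are assumptions.

```latex
\min_{X \in \mathbb{R}^{N \times M}} \; \lVert Y - X \rVert_F^{2} + \lambda \, P(X),
\qquad
P(X) = \#\bigl\{\, i \in \{1,\dots,N-1\} : X_{i+1,\cdot} \neq X_{i,\cdot} \,\bigr\}
\tag{3}
```

Under this reading, a larger λ makes each breakpoint more expensive and therefore yields fewer breakpoints, which is the monotone behaviour the search over [0, λ_max] relies on.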

Preferably, the specific operation steps of S3 are as follows:

when the number of breakpoints k is selected as the input parameter:

1) calculating the maximum possible penalty λ_max from the preprocessed signal matrix Y, setting the minimum possible penalty λ_min = 0, and calculating the penalty parameter λ within the interval [λ_min, λ_max];

2) given the preprocessed signal matrix Y and the penalty parameter λ, solving the minimization optimization problem shown in formula (3), i.e. solving for the signal matrix X and the number of breakpoints P(X);

3) if the number of breakpoints k is less than P(X), setting λ_min = λ and repeating the above steps;

if the number of breakpoints k is greater than P(X), setting λ_max = λ and repeating the above steps;

if the number of breakpoints k is equal to P(X), outputting the signal matrix X;

when the penalty parameter λ is selected as the input parameter: solving the minimization optimization problem shown in formula (3), i.e. solving for the signal matrix X.
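The search over λ in steps 1) to 3) can be sketched as a bisection loop. The solver is passed in as a hypothetical callable `solve(Y, lam)` returning `(X, breakpoints)`; `lam_max` is taken as an input because its formula is not given in this text, the midpoint rule follows the preferred bisection search, and the iteration cap `max_iter` is an assumption.

```python
def search_penalty(solve, Y, k, lam_max, max_iter=60):
    """Bisection on the penalty so that the solver returns exactly k breakpoints.

    solve   : hypothetical callable, solve(Y, lam) -> (X, breakpoints)
    lam_max : upper end of the search interval (assumed to be supplied)
    """
    lam_min = 0.0
    for _ in range(max_iter):
        lam = 0.5 * (lam_min + lam_max)   # midpoint of the current interval
        X, bps = solve(Y, lam)
        p = len(bps)
        if k < p:          # too many breakpoints: the penalty must grow
            lam_min = lam
        elif k > p:        # too few breakpoints: the penalty must shrink
            lam_max = lam
        else:
            return lam, X, bps
    raise RuntimeError("no penalty in the interval gave exactly k breakpoints")
```

The iteration cap matters because some values of k may not be attainable for any λ, for instance when two breakpoints always enter the solution together.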

The invention also discloses a system implementing the multi-signal-oriented rapid breakpoint detection method, comprising:

a signal preprocessing module, configured to preprocess the acquired original signal;

a penalty parameter estimation module, configured to obtain the processed signal matrix given the number of breakpoints or the penalty parameter;

and a signal processing module, configured to determine the breakpoint positions of the signal matrix and the processed signal, thereby rapidly detecting the breakpoints of the piecewise signal.

A computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the multi-signal-oriented rapid breakpoint detection method when executing the computer program.

A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the multi-signal oriented fast breakpoint detection method.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a multi-signal-oriented processing method in which the acquired signals are preprocessed and the number of breakpoints or the penalty parameter is determined according to actual conditions, so that the final breakpoint positions and the processed signals can be determined from the preprocessing result together with the number of breakpoints or the penalty parameter. The method quickly and accurately detects breakpoints shared at the same position across the signals, thereby providing reliable start and end position information for subsequent signal segmentation, fitting and parameter estimation.

Further, the penalty parameter is searched for by bisection. Bisection first compares the target with the element in the middle of the sequence; if the target is larger than that element, the search continues in the second half of the current sequence, and if it is smaller, the search continues in the first half, until a matching element is found or the search range is empty. Bisection needs few comparisons, searches quickly and has good average performance.

Furthermore, preprocessing the original signal by normalization improves the convergence rate while preserving the accuracy of the preprocessed signal.

In the system implementing the multi-signal-oriented rapid breakpoint detection method, breakpoint detection is decomposed into distinct, mutually independent modules according to their function, so multi-signal breakpoint detection is realized in a modular way: when a problem occurs, the affected module can be handled on its own, and the modules do not influence one another.

Drawings

FIG. 1 is a schematic diagram of breakpoint detection for a signal containing a high step and a low step;

FIG. 2 is a schematic diagram of the detection of breakpoints common to multiple signals;

FIG. 3 is a flowchart of the rapid breakpoint detection method of the present invention;

FIG. 4 shows the present invention applied to the detection of copy number variation shared within a family ((a) the child inherits a variation common to both parents, (b) the child inherits a variation from one parent);

FIG. 5 shows the present invention applied to gait analysis with a wearable device ((a) gait detected using the acceleration signals in the x, y and z directions, (b) gait detected using only the acceleration signal in the z direction);

FIG. 6 shows the improvement in computation time of the present invention over the conventional method ((a) computation time as a function of the signal length N, (b) computation time as a function of the signal dimension M).

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention is described in further detail below with reference to the accompanying drawings:

The invention provides a multi-signal-oriented rapid breakpoint detection method. The signal processing flow is shown in FIG. 3, and the method comprises the following steps:

S1, normalizing the original signal as preprocessing to obtain a preprocessed signal matrix Y of size N×M;

S1.1, storing the acquired original signals in an N×M original matrix Y₀;

S1.2, calculating the maximum absolute value c of the original matrix Y₀;

S1.3, preprocessing the original signal using the maximum absolute value c;

S2, determining the number of breakpoints k or the penalty parameter λ according to actual conditions;

S3, if the selected input parameter is λ, directly solving the minimization optimization problem to obtain a processed signal matrix X of size N×M; if the selected input parameter is k, first calculating the maximum possible penalty λ_max from the preprocessed signal matrix Y, then searching the interval [0, λ_max] for a penalty parameter λ such that solving the minimization optimization problem under that λ gives k breakpoints, and obtaining the processed signal matrix X of size N×M;

the penalty parameter λ is searched for in the interval [0, λ_max] by bisection so that the number of detected breakpoints P(X) equals k;

S4, calculating the signal matrix X from the obtained breakpoint positions; multiplying the signal matrix X by the maximum absolute value c to denormalize it and obtain the processed signal X₀, whereby the breakpoints of the piecewise signal are rapidly detected.

The preprocessing result is given by formula (1):

Y = Y₀/c (1)

The processed signal X₀ is given by formula (2):

X₀ = cX (2)

Given the preprocessed signal matrix Y and the penalty parameter λ, the minimization optimization problem shown in formula (3) is solved, i.e. the signal matrix X is solved for:

where Y is the N×M signal matrix to be processed, X is the processed signal matrix of the same size, N is the number of sampling points (signal length) of each signal, M is the number of signals, λ is the penalty parameter for each breakpoint, and P(X) is the number of breakpoints in X. Once X is obtained, the positions and the number of the breakpoints follow directly.
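Reading the breakpoints off a piecewise-constant X can be sketched as follows. The sketch assumes that rows of X are sampling points and columns are signals, and that a shared breakpoint is any row at which at least one column changes value; the function name and the tolerance argument are illustrative.

```python
import numpy as np

def breakpoints_from_X(X, tol=0.0):
    """Return the 0-based row indices at which a new segment starts in X."""
    X = np.asarray(X, dtype=float)
    # A row starts a new segment if it differs from the previous row
    # in at least one signal (column).
    changed = np.any(np.abs(np.diff(X, axis=0)) > tol, axis=1)
    return (np.flatnonzero(changed) + 1).tolist()
```

The number of breakpoints P(X) is then simply the length of the returned list.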

When the number of break points k is selected as an input parameter:

The core step of the above technical scheme is S3, and its specific operation steps are as follows:

a) input: the preprocessed signal matrix Y and the penalty parameter λ;

b) initialization: the objective function storage vector F is an all-zero vector of length N+1; the breakpoint storage array bp is a cell array whose first cell is an empty vector; the valid index list R = 1; the segmentation energy E = 0; and the mean value Z is set to the first column of the preprocessed signal matrix Y;

c) enter a loop starting at i = 1 and running through step x);

d) add the elements of the objective function storage vector F indexed by the valid index list R to the segmentation energy E, and store the result in a temporary vector v;

e) find the minimum of the temporary vector v and its position, and store them in a and i₁ respectively;

f) calculate a + λ and store it in the (i+1)-th element of the objective function storage vector F;

g) read the i₁-th element of the valid index list R and store it in c (here c is a temporary index, not the maximum absolute value used for normalization);

h) read the c-th cell of the breakpoint storage array bp, append c to it, and store the result in the (i+1)-th cell of bp;

i) find the positions of the elements of the temporary vector v that are smaller than the (i+1)-th element of F, and store them in i₂;

j) if i is less than the number of sampling points N, execute the following steps in order up to step x); otherwise jump directly to step x);

k) store the (i+1)-th column of the preprocessed signal matrix Y in y;

l) keep only the elements of the valid index list R at positions i₂ and delete the remaining elements;

m) calculate i minus the valid index list R plus 1, and store the result in the length vector l;

n) replicate the length vector l M times, row-wise, and store the result in the matrix L;

o) replicate y n times, row-wise (n being the length of the length vector l), and store the result in the matrix B;

p) read rows i₂ of the mean value Z and store them in the matrix T;

q) calculate the matrix B minus the matrix T, sum the squares of the elements in each row, and store the result in the vector e;

r) multiply the vector e element-wise by the length vector l, divide element-wise by l + 1, and store the result in e;

s) add the vector e to the elements of the segmentation energy E at positions i₂, and store the result in E;

t) append 0 to the end of the segmentation energy E;

u) multiply the matrix T element-wise by the matrix L, add the matrix B, divide element-wise by (L + 1), and store the result in the mean value Z;

v) append y to the end of the mean value Z;

w) append i + 1 to the end of the valid index list R;

x) add 1 to i and return to step c);

y) output: the breakpoint positions are the 2nd through last elements of the last cell of the breakpoint storage array bp; the output data matrix X is built column by column: for each column (signal), the mean of that column of the data matrix Y between two adjacent breakpoints is taken as the value of that segment in the corresponding column of X.
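Steps a) to y) can be sketched in NumPy as below. This is a hedged reading rather than the reference implementation: it uses 0-based indices, treats rows of Y as sampling points and columns as signals, and replaces the replicated matrices L and B with broadcasting. The function and variable names (`detect_shared_breakpoints`, `start`, `keep`) are illustrative.

```python
import numpy as np

def detect_shared_breakpoints(Y, lam):
    """Return (breakpoints, X) for an N x M matrix Y and penalty lam.
    breakpoints are 0-based rows where a new shared segment starts."""
    Y = np.asarray(Y, dtype=float)
    N, M = Y.shape
    F = np.zeros(N + 1)                # F[i]: best penalised cost of the first i rows
    bp = [[] for _ in range(N + 1)]    # bp[i]: segment starts for the first i rows
    R = np.array([0])                  # valid candidate starts of the last segment
    E = np.zeros(1)                    # E[j]: squared error of rows R[j]..i-1
    Z = Y[:1].copy()                   # Z[j]: per-signal mean of rows R[j]..i-1
    for i in range(1, N + 1):
        v = F[R] + E                   # step d
        j = int(np.argmin(v))          # step e
        F[i] = v[j] + lam              # step f
        start = int(R[j])              # step g
        bp[i] = bp[start] + [start]    # step h
        if i == N:                     # step j: no further row to absorb
            break
        keep = v < F[i]                # step i: candidates that can still win
        y = Y[i]                       # step k: next sampling point
        R = R[keep]                    # step l
        l = (i - R).astype(float)      # step m: lengths of the open segments
        T = Z[keep]                    # step p
        # Steps q, r: extra squared error from appending y to each open segment.
        e = ((y - T) ** 2).sum(axis=1) * l / (l + 1)
        E = np.append(E[keep] + e, 0.0)                                # steps s, t
        Z = np.vstack([(T * l[:, None] + y) / (l[:, None] + 1), y])    # steps u, v
        R = np.append(R, i)            # step w
    starts = bp[N]
    edges = starts + [N]
    X = np.empty_like(Y)
    for a, b in zip(edges[:-1], edges[1:]):
        X[a:b] = Y[a:b].mean(axis=0)   # step y: segment mean per signal
    return starts[1:], X
```

The running updates in steps q) to v) are what keep the per-iteration work proportional to the number of surviving candidates instead of recomputing each segment cost from scratch.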

As shown in FIG. 4, the benefit of applying the invention to population shared copy number variation detection based on high-throughput sequencing is demonstrated. The sequencing data come from a family trio (father, mother and son, M = 3). Preprocessing of the data included: (i) aligning the raw HTS output files (fastq) to the reference genome hg19 with the mapping tool bowtie; (ii) calculating the read depth (RD) signal, i.e. the sequencing coverage depth of the base sites within fixed-width windows along the genome; (iii) correcting the RD signal for GC content and mappability. FIG. 4(a) shows the copy number variation detected with the invention in the 32.8-33.4 Mb interval of chromosome 22; the child inherits a variation common to both parents. FIG. 4(b) shows the copy number variation detected in the 39.2-39.5 Mb region of chromosome 22; here the child inherits the variation from the father, while the mother carries no variation. Using multiple signals to detect breakpoints therefore has important practical value.

As shown in FIG. 5, the invention is applied to health data analysis for wearable devices. The data come from a 12-second run performed by the subject; the sensor was attached to the subject's right ankle and measured acceleration in the x, y and z directions (M = 3). FIG. 5(a) shows the result of detection using all three acceleration signals: every step is detected well. In contrast, FIG. 5(b) shows the result using only one acceleration signal (z direction): the steps at around 8 seconds are not well separated. Using multiple signals thus improves breakpoint detection accuracy.

As shown in FIG. 6, the improvement in computation time over the conventional method is demonstrated, using simulated data. FIG. 6(a) shows computation time against signal length N with the number of signals M = 10. Computation time increases with N, and the invention needs only about one percent of the time of the conventional method: for a very long signal with N = 100000 points, the invention takes about 10 seconds, whereas the conventional method takes about 1000 seconds. FIG. 6(b) shows computation time against the signal dimension M with signal length N = 3000. The computation time of the invention stays almost constant as M increases, whereas that of the conventional method increases linearly: for M = 1000 signals, the invention takes less than 1 second, while the conventional method takes about 400 seconds. The invention thus greatly reduces computation time and is particularly suited to large numbers of signals.

The invention provides a rapid signal processing method. It quickly and accurately detects breakpoint positions common to multi-dimensional signals, and thereby provides reliable start and end position information for segmentation, fitting and parameter estimation of those signals. The method is broadly applicable in fields such as biology, medicine and engineering, for example population copy number variation detection based on high-throughput sequencing and motion state detection based on wearable devices.

Abbreviations and key terms appearing and used in the present invention are defined as follows:

CNV Copy Number Variation

HTS High-Throughput Sequencing

DP Dynamic Programming

RD Read Depth

CBS Circular Binary Segmentation

OP Optimal Partitioning

PELT Pruned Exact Linear Time

TPR True Positive Rate

FPR False Positive Rate

GC Guanine-cytosine content

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A multi-signal-oriented rapid breakpoint detection method is characterized by comprising the following steps:

2. The multi-signal-oriented fast breakpoint detection method according to claim 1, wherein S1 specifically includes the following steps:

S1.2, calculating the maximum absolute value c of the original matrix Y₀;

3. The multi-signal-oriented fast breakpoint detection method according to claim 2, wherein in S1.3, the original signal is preprocessed by a normalization method, and the preprocessing result is shown in formula (1):

Y = Y₀/c (1)

4. The multi-signal-oriented fast breakpoint detection method according to claim 2, wherein in S4, the processed signal X₀ is given by formula (2):

X₀ = cX (2)

5. The multi-signal-oriented fast breakpoint detection method according to claim 1, wherein in S3, the penalty parameter λ is searched for in the interval [0, λ_max] by bisection.

6. The multi-signal-oriented fast breakpoint detection method according to claim 1, wherein in S3, given the preprocessed signal matrix Y and the penalty parameter λ, a minimization optimization problem is solved as shown in formula (3), that is, a signal matrix X is solved:

7. The multi-signal-oriented fast breakpoint detection method according to claim 6, wherein the specific operation steps of S3 are as follows:

when the number of break points k is selected as an input parameter:

8. The system for realizing the multi-signal-oriented rapid breakpoint detection method according to any one of claims 1 to 7 comprises:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the multi-signal oriented fast breakpoint detection method according to any one of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the steps of the multi-signal-oriented fast breakpoint detection method according to any one of claims 1 to 7.