CN113990393A

CN113990393A - Data processing method and device for gene detection and electronic equipment

Info

Publication number: CN113990393A
Application number: CN202111616503.8A
Authority: CN
Inventors: 姬晓勇; 高司航; 单光宇; 伍启熹; 王建伟
Original assignee: Beijing Youxun Medical Devices Co ltd
Current assignee: Beijing Youxun Medical Devices Co ltd
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2022-01-28
Anticipated expiration: 2041-12-28
Also published as: CN113990393B

Abstract

The invention provides a data processing method and device for gene detection and electronic equipment, wherein the method comprises the following steps: acquiring a sequence to be detected; and inputting the sequence to be detected into a gene detection model to obtain a gene detection result output by the gene detection model. The sequence to be detected is obtained by the following method: obtaining an original sequence, wherein the original sequence comprises at least one read sequence; determining the sequence number corresponding to each read sequence; preprocessing the corresponding read sequence according to the sequence number; and determining the sequence to be detected according to each preprocessed read sequence. Because the corresponding read sequence is preprocessed according to its sequence number, multiple software tools are not needed, the preprocessing steps are simplified, and the efficiency of the whole gene detection process can be improved.

Description

Data processing method and device for gene detection and electronic equipment

Technical Field

The invention relates to the technical field of data processing, in particular to a data processing method and device for gene detection and electronic equipment.

Background

Before the raw FASTQ (sequence format) files produced by second-generation sequencing are analyzed, the data must be preprocessed so that the input to downstream analysis is clean and reliable. Preprocessing is particularly important in the field of liquid biopsy, where the data are used to detect low-frequency genetic variation. Conventional data preprocessing typically involves trimming linker (adapter) sequences, trimming bases of low sequencing quality and bases called as N, removing repeated sequences, and the like.

In the prior art, data preprocessing is usually split into several steps: first, Cutadapt (adapter-trimming software) is used to cut the linker sequences; then Trimmomatic (data quality control software) is used to remove, with a sliding window, N bases and bases of poor sequencing quality from the sequencing reads; finally, Picard (a command-line tool) is used to remove repeated sequences according to the alignment coordinates of the reads.

The above process involves multiple steps and multiple software tools, which makes it inconvenient to operate and lowers the efficiency of the gene detection process.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present invention provide a data processing method and apparatus for gene detection, and an electronic device.

The present invention provides a data processing method for gene detection, comprising:

acquiring a sequence to be detected;

inputting the sequence to be detected into a gene detection model to obtain a gene detection result output by the gene detection model;

wherein the sequence to be detected is obtained by the following method:

obtaining an original sequence; wherein the original sequence comprises at least one read sequence;

determining the sequence number corresponding to each read sequence;

preprocessing the corresponding read segment sequence according to the sequence number;

and determining the sequence to be detected according to each read sequence after pretreatment.

According to the data processing method for gene detection provided by the invention, the determining of the sequence number corresponding to the read sequence comprises the following steps:

cutting the read sequence from a preset position according to a first preset step length and the length of a preset linker sequence to obtain a plurality of read subsequences;

and determining the sequence number corresponding to each read segment subsequence.

According to the data processing method for gene detection provided by the invention, the determining of the sequence number corresponding to each read subsequence comprises the following steps:

converting the read subsequence into a target digital sequence according to a preset number of bases; wherein the length of the target digital sequence is the preset number times the length of the read subsequence;

converting the target digital sequence into a joint multiplication matrix;

multiplying the joint multiplication matrix by a preset weight matrix to obtain a target matrix;

determining a trace of the target matrix as the sequence number.

According to the data processing method for gene detection provided by the invention, the step of converting the read subsequence into the target digital sequence according to the preset number of bases comprises the following steps:

for each target base, replacing a base which is the same as the target base in the read subsequences with a first target number, and replacing a base which is not the same as the target base in the read subsequences with a second target number to obtain the target digital subsequences with the preset number;

and connecting the target digital subsequences of the preset number according to the sequence of the preset number of bases to obtain the target digital sequence.

According to the data processing method for gene detection provided by the invention, the converting the target digital sequence into the joint multiplication matrix comprises the following steps:

replacing each first target number in the target digital sequence with a first second-order matrix, and replacing each second target number in the target digital sequence with a second second-order matrix;

and multiplying all the first second-order matrices and all the second second-order matrices in the order of the numbers in the target digital sequence to obtain the joint multiplication matrix.

According to the data processing method for gene detection provided by the invention, the preprocessing of the corresponding read sequence according to the sequence number comprises the following steps:

determining a linker sequence in the read segment sequence according to the sequence number corresponding to each read segment subsequence;

cleaving the linker sequence in the read sequence;

removing bases meeting preset conditions in the read sequence after the linker sequence is cut to obtain a target residual sequence;

determining the sequence to be detected according to each read sequence after pretreatment comprises:

and determining the sequence to be detected according to each target residual sequence.

According to the data processing method for gene detection provided by the invention, the determining of the linker sequence in the read sequence according to the sequence number corresponding to each read subsequence comprises:

for each read segment subsequence, determining the read segment subsequence corresponding to the sequence number as the linker sequence when determining that the sequence number corresponding to the read segment subsequence is contained in a preset array;

the preset array is the set of sequence numbers corresponding to the preset linker subsequences; and the preset linker subsequences are obtained by cutting the preset linker sequence according to a second preset step length.

According to the data processing method for gene detection provided by the invention, the removing of the bases meeting the preset conditions in the read sequence after the cutting of the adaptor sequence comprises the following steps:

intercepting, at each of the two ends of the read sequence from which the linker sequence has been cut, the bases and the corresponding base qualities according to a preset window length;

determining a first proportion of N bases in the intercepted current window to all bases in the current window, and determining a second proportion of bases in the current window whose base quality is less than a first preset base quality to all bases in the current window;

and when the first proportion and/or the second proportion is determined to be greater than or equal to a preset threshold, cutting off the current window and returning to the step of intercepting the bases and the corresponding base qualities at the two ends of the read sequence according to the preset window length, and stopping when the first proportion and the second proportion are both determined to be less than the preset threshold.

According to the data processing method for gene detection provided by the invention, the determining the sequence to be detected according to each target residual sequence comprises the following steps:

for each target residual sequence, when the length of the target residual sequence is determined to be greater than the preset length, determining a sequence number corresponding to the target residual sequence;

determining the target residual sequence as a sequence to be detected under the condition that the sequence number and the length corresponding to the target residual sequence are not contained in a preset dictionary; the preset dictionary is used for storing the sequence number and the length corresponding to the residual sequence of each read sequence in the original sequence;

the method further comprises the following steps:

and under the condition that the sequence number and the length corresponding to the target residual sequence are both contained in the preset dictionary, determining the target residual sequence as a repeated sequence and removing it.

According to the data processing method for gene detection provided by the present invention, further comprising:

and counting a first quality control index of the original sequence and a second quality control index of the sequence to be detected.

According to the data processing method for gene detection provided by the invention, the statistics of the first quality control index of the original sequence comprises the following steps:

reading the target base quality corresponding to all bases in each original read sequence in the original sequence;

for each original read sequence, converting the coding symbols of the quality of each target base into corresponding numbers and then sequencing;

counting at least one of the following total numbers according to the sorting result: a first total number of bases whose quality is less than a first preset base quality, a second total number of bases whose quality is greater than a second preset base quality, a third total number of bases whose quality is greater than a third preset base quality, and a fourth total number of all bases in the original read sequence;

wherein the third preset base quality is greater than the second preset base quality, and the second preset base quality is greater than the first preset base quality.

The present invention also provides a data processing apparatus for gene detection, comprising:

an acquisition unit, configured to acquire a sequence to be detected;

the detection unit is used for inputting the sequence to be detected into a gene detection model and obtaining a gene detection result output by the gene detection model;

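the acquisition unit specifically acquires the sequence to be detected by the following method:

obtaining an original sequence, wherein the original sequence comprises at least one read sequence; determining the sequence number corresponding to each read sequence; preprocessing the corresponding read sequence according to the sequence number; and determining the sequence to be detected according to each preprocessed read sequence.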

The invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor, when executing the program, implements the steps of the data processing method for gene detection as described in any of the above.

The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the data processing method for gene detection as described in any of the above.

The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the data processing method for gene detection as described in any of the above.

According to the data processing method and device for gene detection and the electronic equipment provided by the invention, the sequence number corresponding to each read sequence in the original sequence is determined, and the corresponding read sequence is preprocessed according to the sequence number, so that multiple software tools are not needed and the preprocessing steps are simplified; each preprocessed read sequence is then determined as a sequence to be detected and input into a gene detection model for gene detection, which improves the efficiency of the whole gene detection process.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a data processing method for gene detection according to the present invention;

FIG. 2 is a second schematic flow chart of the data processing method for gene detection according to the present invention;

FIG. 3 is a third schematic flow chart of the data processing method for gene detection according to the present invention;

FIG. 4 is a fourth schematic flow chart of the data processing method for gene detection according to the present invention;

FIG. 5 is a fifth schematic flow chart of the data processing method for gene detection according to the present invention;

FIG. 6 is a schematic view of the structure of a data processing apparatus for gene detection according to the present invention;

FIG. 7 is a schematic physical structure diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The data processing method for gene detection of the present invention is described below with reference to FIGS. 1 to 5.

FIG. 1 is a schematic flow chart of a data processing method for gene detection according to the present invention, which comprises, as shown in FIG. 1:

Step 110, acquiring a sequence to be detected.

Wherein the sequence to be detected is obtained by the following method:

obtaining an original sequence, wherein the original sequence comprises at least one read sequence; determining the sequence number corresponding to each read sequence; preprocessing the corresponding read sequence according to the sequence number; and determining the sequence to be detected according to each preprocessed read sequence.

Step 120, inputting the sequence to be detected into a gene detection model to obtain a gene detection result output by the gene detection model.

The gene detection model is a model deployed in the system and used for gene detection.

The data processing method for gene detection provided by the invention determines the sequence number corresponding to each read sequence in the original sequence and preprocesses the corresponding read sequence according to the sequence number, so that multiple software tools are not needed and the preprocessing steps are simplified; each preprocessed read sequence is then determined as a sequence to be detected and input into the gene detection model for gene detection, which improves the efficiency of the whole gene detection process.

FIG. 2 is a second schematic flow chart of the data processing method for gene detection provided by the present invention, and as shown in FIG. 2, the steps of obtaining the sequence to be detected include the following steps:

Step 210, acquiring an original sequence.

Wherein the original sequence comprises at least one read sequence.

For example, the original sequence may be raw off-machine data from second-generation sequencing. The original sequence includes a plurality of read sequences that need to be preprocessed, and the read sequences may be single-end sequencing (SE) data or paired-end sequencing (PE) data.

Step 220, determining the sequence number corresponding to the read segment sequence.

Optionally, cutting the read segment sequence from a preset position according to a first preset step length and the length of a preset adaptor sequence to obtain a plurality of read segment subsequences; and determining the sequence number corresponding to each read segment subsequence.

Specifically, FIG. 3 is a third schematic flow chart of the data processing method for gene detection provided by the present invention. As shown in FIG. 3, determining the sequence number corresponding to each read subsequence can be achieved by the following steps:

Step 221, converting the read subsequence into a target digital sequence according to a preset number of bases.

Wherein the length of the target number sequence is a preset number times the length of the read subsequences.

Optionally, for each target base, replacing bases in the read subsequences that are the same as the target base with a first target number, and replacing bases in the read subsequences that are not the same as the target base with a second target number, so as to obtain the preset number of target number subsequences; and connecting the target digital subsequences of the preset number according to the sequence of the preset number of bases to obtain the target digital sequence.

Preferably, the first target number is a first binary number, the second target number is a second binary number, and correspondingly, the target number sequence is a binary sequence, and the target number subsequence is a binary subsequence. The following description will be made by taking the first target number as the first binary number, the second target number as the second binary number, the target number sequence as the binary sequence, and the target number subsequence as the binary subsequence.

For example, the read sequence is cut from a preset position according to the first preset step length and the length of the preset linker sequence to obtain a plurality of read subsequences. Specifically, the read sequence may be intercepted successively from the preset position, extending the intercepted length by the first preset step length each time, until the length of the preset linker sequence is reached. Assuming that the first preset step length is 1 base, the length of the preset linker sequence is 4, and the read sequence is ACAADD, the four read subsequences obtained by cutting are A, AC, ACA, and ACAA.
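The cutting step described above can be sketched in a few lines. This is only an illustrative sketch: the function name `cut_read_subsequences` and its parameter names are assumptions, and the preset position is taken to be a zero-based index into the read.

```python
from typing import List


def cut_read_subsequences(read: str, linker_length: int,
                          step: int = 1, start: int = 0) -> List[str]:
    """Return the subsequences of `read` that begin at `start` and grow by
    `step` bases at a time until `linker_length` is reached.

    All names here are hypothetical; the source only describes the idea.
    """
    # Never ask for more bases than the read actually contains.
    max_length = min(linker_length, len(read) - start)
    return [read[start:start + length]
            for length in range(step, max_length + 1, step)]


# Example from the text: step 1, linker length 4, read ACAADD.
subsequences = cut_read_subsequences("ACAADD", linker_length=4)
```

With the values from the example, `subsequences` holds A, AC, ACA, and ACAA.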

For example, the preset number of bases are the four bases A (adenine), T (thymine), C (cytosine), and G (guanine). Preferably, the first binary number is 1 and the second binary number is 0. For each read subsequence, the four bases A, T, C, and G are used in turn to construct four binary subsequences, each with the same length as the read subsequence. First, the A base is used to construct the first binary subsequence: each base in the read subsequence that is the same as the A base is replaced by 1, each base that is different from the A base is replaced by 0, and the resulting 0s and 1s are arranged in the order of the bases in the read subsequence to obtain the first binary subsequence. For example, if the read subsequence is A, the first binary subsequence is 1.

Then, the T base is used to construct the second binary subsequence: each base in the read subsequence that is the same as the T base is replaced by 1, each base that is different from the T base is replaced by 0, and the resulting 0s and 1s are arranged in the order of the bases in the read subsequence to obtain the second binary subsequence. For example, if the read subsequence is A, the second binary subsequence is 0.

Then, the C base is used to construct the third binary subsequence: each base in the read subsequence that is the same as the C base is replaced by 1, each base that is different from the C base is replaced by 0, and the resulting 0s and 1s are arranged in the order of the bases in the read subsequence to obtain the third binary subsequence. For example, if the read subsequence is A, the third binary subsequence is 0.

Finally, the G base is used to construct the fourth binary subsequence: each base in the read subsequence that is the same as the G base is replaced by 1, each base that is different from the G base is replaced by 0, and the resulting 0s and 1s are arranged in the order of the bases in the read subsequence to obtain the fourth binary subsequence. For example, if the read subsequence is A, the fourth binary subsequence is 0.

The first, second, third, and fourth binary subsequences are then connected in the order of the four bases A, T, C, G to obtain the binary sequence corresponding to the read subsequence. For example, the binary sequence corresponding to the read subsequence A is 1000.

It should be noted that the first binary number may alternatively be 0 and the second binary number may be 1, that is, a base identical to the target base is replaced by 0 and a base different from the target base is replaced by 1; this is not limited herein.
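As a sketch of step 221, the conversion can be written as one indicator string per target base, concatenated in the order A, T, C, G. The function name `to_binary_sequence` and the `match`/`mismatch` parameters are assumptions introduced for illustration; swapping them gives the inverted encoding mentioned above.

```python
TARGET_BASES = "ATCG"  # order in which the binary subsequences are joined


def to_binary_sequence(read_subsequence: str,
                       match: str = "1", mismatch: str = "0") -> str:
    """Build one binary subsequence per target base and concatenate them.

    The result is len(TARGET_BASES) times as long as the input.
    """
    return "".join(
        # Binary subsequence for one target base, in read order.
        "".join(match if base == target else mismatch
                for base in read_subsequence)
        for target in TARGET_BASES
    )


binary_for_a = to_binary_sequence("A")    # example from the text
binary_for_ac = to_binary_sequence("AC")  # A: 10, T: 00, C: 01, G: 00
```

For the read subsequence A this yields 1000, matching the example in the text.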

Step 222, converting the target number sequence into a joint multiplication matrix.

Optionally, each first target number in the target number sequence is replaced by a first second-order matrix, and each second target number in the target number sequence is replaced by a second second-order matrix; all the first second-order matrices and all the second second-order matrices are then multiplied in the order of the numbers in the target number sequence to obtain the joint multiplication matrix.

For example, the first second-order matrix and the second second-order matrix may each be a specific matrix [matrix formulas not reproduced in this text]. Each first binary number in the binary sequence is replaced with the first second-order matrix, each second binary number in the binary sequence is replaced with the second second-order matrix, and finally the second-order matrices obtained by replacement are multiplied in the order of the binary numbers in the binary sequence to obtain the joint multiplication matrix. The joint multiplication matrix corresponding to the read subsequence A is obtained in this way [matrix formula not reproduced in this text].

It should be noted that selecting the first second-order matrix and the second second-order matrix as in the example [matrix formulas not reproduced in this text] reduces the memory space occupied by data processing. If memory space is not a concern, the first second-order matrix and the second second-order matrix may also be other matrices, as long as the selected second-order matrices ensure that each row of elements of the resulting joint multiplication matrix conforms to the Fibonacci sequence rule; this is not limited herein.
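The replacement-and-multiplication of step 222 can be sketched as follows. The concrete matrices used in the source are not reproduced in this text, so `M_FIRST` and `M_SECOND` below are hypothetical stand-ins chosen only for illustration, and `joint_matrix` is an assumed name. The sketch shows only that the product is taken in digit order, so that binary sequences with the same digits in a different order generally give different matrices.

```python
from typing import List

Matrix = List[List[int]]

# Hypothetical stand-ins: NOT the matrices from the source.
M_FIRST: Matrix = [[1, 1], [0, 1]]   # substituted for each first binary number (1)
M_SECOND: Matrix = [[1, 0], [1, 1]]  # substituted for each second binary number (0)


def matmul2(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 2x2 matrices."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]


def joint_matrix(binary_sequence: str) -> Matrix:
    """Multiply the substituted matrices from left to right in digit order."""
    result: Matrix = [[1, 0], [0, 1]]  # identity
    for digit in binary_sequence:
        result = matmul2(result, M_FIRST if digit == "1" else M_SECOND)
    return result


joint_1000 = joint_matrix("1000")
joint_0001 = joint_matrix("0001")  # same digits, different order
```

Because matrix multiplication is not commutative, `joint_1000` and `joint_0001` differ under these stand-in matrices, which is the property that lets the product distinguish sequences by base order.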

Step 223, multiplying the joint multiplication matrix by a preset weight matrix to obtain a target matrix.

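For example, the preset weight matrix may be a specific matrix [matrix formula not reproduced in this text]. The joint multiplication matrix is left-multiplied by the preset weight matrix to obtain the target matrix; for example, the target matrix for the read subsequence A is obtained by left-multiplying the joint multiplication matrix corresponding to the read subsequence A by the preset weight matrix [matrix formulas not reproduced in this text].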

It should be noted that the preset weight matrix is set according to the needs and experience, and is not limited thereto.

Step 224, determining the trace of the target matrix as the sequence number.

For example, the trace of the target matrix for the read subsequence A [formula not reproduced in this text] is approximately 12.85, so 12.85 is determined as the sequence number corresponding to the read subsequence A.
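Steps 223 and 224 together reduce a joint multiplication matrix to a single number. The sketch below assumes 2x2 matrices and uses a hypothetical weight matrix `WEIGHT`, since the preset weight matrix of the source is not reproduced in this text; for that reason it does not attempt to reproduce the value 12.85. The function name `sequence_number` and the sample joint matrix are likewise assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]

# Hypothetical weight matrix: NOT the preset weight matrix from the source.
WEIGHT: Matrix = [[math.sqrt(2), math.sqrt(3)],
                  [math.sqrt(5), math.sqrt(7)]]


def sequence_number(joint: Matrix, weight: Matrix = WEIGHT) -> float:
    """Return trace(weight x joint) for 2x2 matrices.

    Only the two diagonal entries of the product are computed, because
    the trace does not depend on the off-diagonal entries.
    """
    top_left = weight[0][0] * joint[0][0] + weight[0][1] * joint[1][0]
    bottom_right = weight[1][0] * joint[0][1] + weight[1][1] * joint[1][1]
    return top_left + bottom_right


# A made-up joint multiplication matrix used purely as sample input.
example_number = sequence_number([[4, 1], [3, 1]])
```

One plausible reason for weighting before taking the trace is that the trace of the unweighted product alone discards the off-diagonal entries; a weight matrix with distinct entries lets all four entries of the joint matrix contribute to the number.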

Step 230, preprocessing the corresponding read sequence according to the sequence number.

Optionally, FIG. 4 is a fourth schematic flow chart of the data processing method for gene detection provided by the present invention. As shown in FIG. 4, the preprocessing of the corresponding read sequence according to the sequence number can be realized by the following steps:

Step 231, determining a linker sequence in the read sequence according to the sequence number corresponding to each read subsequence.

Specifically, for each read subsequence, when it is determined that the sequence number corresponding to the read subsequence is included in a preset array, the read subsequence corresponding to the sequence number is determined to be the linker sequence.

For example, the preset linker sequence is a known linker sequence. The preset linker sequence is cut in advance according to a second preset step length, each preset linker subsequence obtained by cutting is converted into its corresponding sequence number by the method of steps 221 to 224 above, and the sequence number corresponding to each preset linker subsequence is then stored in the preset array.

For example, for each read segment subsequence, it is required to determine whether a sequence number corresponding to the read segment subsequence is included in a preset array, and if the sequence number corresponding to the read segment subsequence is included in the preset array, it indicates that the read segment subsequence is a linker sequence and needs to be cut; and if the sequence number corresponding to the read segment subsequence is not contained in the preset array, the read segment subsequence is not a linker sequence and does not need to be cut.

If the read sequence is single-end sequencing data, only the linker sequence at the single end is detected; if the read sequence is paired-end sequencing data, the 5' linker sequence is detected starting from the 5' end, and the 3' linker sequence is detected starting from the 3' end.

Step 232, cutting the adaptor sequence in the read sequence.

For example, once the linker sequence in the read sequence has been determined, the determined linker sequence is cut from the read sequence.
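Steps 231 and 232 can be sketched as a set lookup followed by a cut. The sketch makes several assumptions that the text does not state: the numbering function is passed in as a parameter, and `stand_in_number` is a hypothetical injective stand-in for the trace-based sequence number; the subsequences are taken from the start of the read; and when several read subsequences match, the longest one is cut. All function names are illustrative.

```python
from typing import Callable, Set


def stand_in_number(seq: str) -> int:
    """Hypothetical stand-in for the trace-based sequence number."""
    return int.from_bytes(seq.encode("ascii"), "big")


def build_preset_array(linker: str, step: int,
                       number_of: Callable[[str], int]) -> Set[int]:
    """Sequence numbers of the linker subsequences cut with `step`."""
    return {number_of(linker[:length])
            for length in range(step, len(linker) + 1, step)}


def cut_linker(read: str, linker_length: int, preset: Set[int],
               number_of: Callable[[str], int], step: int = 1) -> str:
    """Cut the longest leading subsequence whose number is in `preset`."""
    cut = 0
    for length in range(step, min(linker_length, len(read)) + 1, step):
        if number_of(read[:length]) in preset:
            cut = length  # assumption: keep the longest match
    return read[cut:]


preset_array = build_preset_array("ACGT", step=1, number_of=stand_in_number)
trimmed = cut_linker("ACGTTTGA", 4, preset_array, stand_in_number)
```

Note that this literal reading also cuts very short matches (a single leading A here); a practical tool would probably require a minimum match length, a refinement the text does not describe.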

Step 233, removing bases meeting preset conditions from the read sequence after the linker sequence is cut, so as to obtain a target residual sequence.

Specifically, a sliding window method is used to remove the bases meeting the preset conditions from the read sequence after the linker sequence is cut, as follows. At each of the two ends of the read sequence from which the linker sequence has been cut, the bases and the corresponding base qualities are intercepted according to a preset window length. A first proportion of N bases in the intercepted current window to all bases in the current window is determined, and a second proportion of bases in the current window whose base quality is less than a first preset base quality to all bases in the current window is determined. When the first proportion and/or the second proportion is determined to be greater than or equal to a preset threshold, the current window is cut off and the interception step is repeated on the remaining read sequence; the process stops when the first proportion and the second proportion are both determined to be less than the preset threshold.

An N base is a base call reported by the sequencer alongside the four bases A, T, C, and G; it denotes a base that could not be determined and is a candidate for removal. Whether N bases are removed depends on the proportion of N bases in the corresponding window.

For example, assume that the preset window length is 2, the first preset base quality is 20, and the preset threshold is 0.5. For the read sequence AAATACCTTCCAGCAC from which the linker sequence has been cut, intercepting the bases at both ends means taking a window of the preset length, together with the corresponding base qualities, from the left end and from the right end respectively. The first window from the left is "AA": the base quality symbol D of the first A is converted from its ASCII code to the number 35 (the ASCII code minus 33), and the base quality symbol C of the second A is converted to the number 34. Both 35 and 34 are greater than 20 and the window contains no N base, so the first proportion and the second proportion are both 0, which is less than the preset threshold of 0.5, and sliding from the left stops. The first window from the right is "AC": the base quality symbol D of A is converted to the number 35, and the base quality symbol H of C is converted to the number 39. Both 35 and 39 are greater than 20 and the window contains no N base, so the first proportion and the second proportion are both 0, which is less than the preset threshold of 0.5, and sliding from the right stops. The final residual sequence is therefore AAATACCTTCCAGCAC. In this way, windows in which the proportion of low-quality bases or of N bases reaches the threshold are removed.
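The sliding-window trimming of step 233 can be sketched as below. The function names are assumptions, quality symbols are decoded as the ASCII code minus 33 (consistent with D, C, and H mapping to 35, 34, and 39 in the example), and trimming is assumed to stop once fewer bases than one window remain. Only the first two and last two quality symbols of the example read are given in the text; the middle of `example_qual` is made up.

```python
from typing import Tuple


def window_fails(bases: str, quals: str, min_quality: int,
                 threshold: float, offset: int = 33) -> bool:
    """True if the N proportion or the low-quality proportion reaches the threshold."""
    n_ratio = bases.count("N") / len(bases)
    low_ratio = sum(ord(q) - offset < min_quality for q in quals) / len(quals)
    return n_ratio >= threshold or low_ratio >= threshold


def trim_both_ends(seq: str, qual: str, window: int = 2,
                   min_quality: int = 20, threshold: float = 0.5) -> Tuple[str, str]:
    """Cut failing windows from the left end, then from the right end."""
    start, end = 0, len(seq)
    while end - start >= window and window_fails(
            seq[start:start + window], qual[start:start + window],
            min_quality, threshold):
        start += window
    while end - start >= window and window_fails(
            seq[end - window:end], qual[end - window:end],
            min_quality, threshold):
        end -= window
    return seq[start:end], qual[start:end]


example_seq = "AAATACCTTCCAGCAC"
example_qual = "DC" + "F" * 12 + "DH"  # middle symbols are hypothetical
kept_seq, kept_qual = trim_both_ends(example_seq, example_qual)
```

With these inputs neither end window fails, so the read is returned unchanged, as in the worked example.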

Step 240, determining the sequence to be detected according to each preprocessed read sequence.

Optionally, the sequence to be detected is determined according to each of the target remaining sequences.

For example, for each target remaining sequence, when the length of the target remaining sequence is determined to be greater than a preset length, the sequence number corresponding to the target remaining sequence is determined. If the sequence number and the length corresponding to the target remaining sequence are not contained in a preset dictionary, the target remaining sequence is determined to be a sequence to be detected; if the sequence number and the length corresponding to the target remaining sequence are both contained in the preset dictionary, the target remaining sequence is determined to be a repeated sequence and is removed.

The preset dictionary stores the sequence numbers and lengths corresponding to the remaining sequences of the read sequences in the original sequence; that is, it stores the length and the sequence number of the remaining sequence obtained from each read sequence in the original sequence after the adaptor sequence has been excised and the bases satisfying the preset condition have been removed. The preset length may be 20.

For example, when both the sequence number and the length corresponding to the target remaining sequence are contained in the preset dictionary, the target remaining sequence is a repeated sequence and is not output to the processed file. When the sequence number and the length corresponding to the target remaining sequence are not contained in the preset dictionary, the target remaining sequence is not a repeated sequence and is taken as a sequence to be detected. Repeated sequences among the read sequences are thereby detected, which improves the accuracy of the gene detection result output by the gene detection model.
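The dictionary lookup described above can be sketched as follows. The sketch rests on several assumptions that are not taken from the text: the dictionary is filled incrementally as sequences are processed, remaining sequences that are not longer than the preset length are simply dropped, and `sequence_number` is a caller-supplied stand-in for the sequence number computation of the earlier embodiments (the usage example passes the string itself).

```python
from typing import Callable, Hashable, Iterable


def deduplicate(
    sequences: Iterable[str],
    sequence_number: Callable[[str], Hashable],
    min_length: int = 20,
) -> list[str]:
    """Keep the first occurrence of each (sequence number, length) pair."""
    seen: dict[Hashable, set[int]] = {}  # the "preset dictionary": number -> lengths
    kept: list[str] = []
    for seq in sequences:
        if len(seq) <= min_length:
            continue  # assumption: sequences that are too short are discarded
        number = sequence_number(seq)
        lengths = seen.setdefault(number, set())
        if len(seq) in lengths:
            continue  # number and length already stored: repeated sequence
        lengths.add(len(seq))
        kept.append(seq)
    return kept


reads = ["AAATACCTTCCAGCAC", "AAAAATACCTTCCAG", "AAATACCTTCCAGCAC"]
# Identity function used as a placeholder sequence number.
print(deduplicate(reads, sequence_number=lambda s: s, min_length=5))
```

Keying on both the sequence number and the length means two sequences are treated as repeats only when both values coincide.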

Further, fig. 5 is a fifth schematic flow chart of the data processing method for gene detection provided by the present invention, and as shown in fig. 5, the method further includes the following steps:

Step 250, counting the first quality control index of the original sequence and the second quality control index of the sequence to be detected.

Optionally, counting the first quality control index of the original sequence may be implemented by:

reading the target base qualities corresponding to all bases in each original read sequence in the original sequence; for each original read sequence, converting the code symbol of each target base quality into the corresponding number and then sorting the numbers; and counting at least one of the following totals according to the sorting result: a first total number of bases whose quality is less than a first preset base quality, a second total number of bases whose quality is greater than a second preset base quality, a third total number of bases whose quality is greater than a third preset base quality, and a fourth total number of all bases in the original read sequence. That is, binary search (the dichotomy method) over the sorted qualities is used to count, in each original read sequence, the bases below the first preset base quality, above the second preset base quality, and above the third preset base quality.

The third preset base quality is greater than the second preset base quality, and the second preset base quality is greater than the first preset base quality. The code symbols of the base qualities comprise ASCII codes. The first total number is denoted D1, the second total number D2, the third total number D3, and the fourth total number D4.

Further, once D1, D2, D3, and D4 have been calculated for each original read sequence, the D1 values of all original read sequences may be added and recorded as S1, the D2 values added and recorded as S2, the D3 values added and recorded as S3, and the D4 values added and recorded as S4. This gives, for the whole original sequence, the number of bases below the first preset base quality, the number above the second preset base quality, the number above the third preset base quality, and the number of all bases. The ratio of S1 to S4 is determined as the proportion of bases in the original sequence whose quality is less than the first preset base quality; the ratio of S2 to S4 is determined as the proportion of bases in the original sequence whose quality is greater than the second preset base quality; and the ratio of S3 to S4 is determined as the proportion of bases in the original sequence whose quality is greater than the third preset base quality.

For example, assume that the original read sequence is CCAAAATTACCTTCCAGCAC, the first preset base quality is 15, the second preset base quality is 20, the third preset base quality is 30, and the corresponding base quality string is CIC8DDCCIFCCIID2ID8H. Converting the base qualities into numbers gives 34, 40, 34, 23, 35, 35, 34, 34, 40, 37, 34, 34, 40, 40, 35, 17, 40, 35, 23, 39. After these numbers are sorted, the statistics show that 17 bases have a quality greater than 30, 19 bases have a quality greater than 20, 0 bases have a quality less than 15, and the original read sequence contains 20 bases in total. If the original sequence comprises only this one original read sequence, the proportion of bases with quality less than 15 among all bases of the original sequence is 0, the proportion of bases with quality greater than 20 is 19/20, and the proportion of bases with quality greater than 30 is 17/20, so that the overall quality of the original sequence can be evaluated from these quality control indexes.
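The sort-then-search counting for a single read can be sketched as below. The function name `quality_counts` and its parameter names are illustrative, the offset of 33 is inferred from the conversions in the example, and Python's standard `bisect` module stands in for the binary search.

```python
from bisect import bisect_left, bisect_right


def quality_counts(qual: str, q1: int = 15, q2: int = 20, q3: int = 30) -> tuple[int, int, int, int]:
    """Return (D1, D2, D3, D4) for one read from its quality string."""
    scores = sorted(ord(ch) - 33 for ch in qual)
    total = len(scores)
    below_q1 = bisect_left(scores, q1)           # scores strictly less than q1
    above_q2 = total - bisect_right(scores, q2)  # scores strictly greater than q2
    above_q3 = total - bisect_right(scores, q3)  # scores strictly greater than q3
    return below_q1, above_q2, above_q3, total


print(quality_counts("CIC8DDCCIFCCIID2ID8H"))  # (0, 19, 17, 20)
```

Sorting once allows each threshold to be located in logarithmic time, which is useful when several thresholds are evaluated per read.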

Optionally, counting the second quality control index of the sequence to be detected can be realized by:

As described above, the sequence to be detected is obtained by combining a plurality of target remaining sequences, so the specific statistical method is as follows: reading the base qualities corresponding to all bases in each target remaining sequence in the sequence to be detected; for each target remaining sequence, converting the code symbol of each base quality into the corresponding number and then sorting the numbers; and counting at least one of the following totals according to the sorting result: the total number D5 of bases whose quality is less than the first preset base quality, the total number D6 of bases whose quality is greater than the second preset base quality, the total number D7 of bases whose quality is greater than the third preset base quality, and the total number D8 of all bases in the target remaining sequence.

Further, once D5, D6, D7, and D8 have been calculated for each target remaining sequence, the D5 values of all target remaining sequences may be added and recorded as S5, the D6 values added and recorded as S6, the D7 values added and recorded as S7, and the D8 values added and recorded as S8. This gives, for the whole sequence to be detected, the number of bases below the first preset base quality, the number above the second preset base quality, the number above the third preset base quality, and the number of all bases. The ratio of S5 to S8 is determined as the proportion of bases in the sequence to be detected whose quality is less than the first preset base quality; the ratio of S6 to S8 is determined as the proportion of bases in the sequence to be detected whose quality is greater than the second preset base quality; and the ratio of S7 to S8 is determined as the proportion of bases in the sequence to be detected whose quality is greater than the third preset base quality.

For example, assume that the target remaining sequence is AATTACCTTCCAGC, the first preset base quality is 15, the second preset base quality is 20, the third preset base quality is 30, and the corresponding base quality string is DDCCIFCCIID2ID. Converting the base qualities into numbers gives 35, 35, 34, 34, 40, 37, 34, 34, 40, 40, 35, 17, 40, 35. After these numbers are sorted, the statistics show that 13 bases have a quality greater than 30, 13 bases have a quality greater than 20, 0 bases have a quality less than 15, and the target remaining sequence contains 14 bases in total. If the sequence to be detected comprises only this one target remaining sequence, the proportion of bases with quality less than 15 among all bases of the sequence to be detected is 0, the proportion of bases with quality greater than 20 is 13/14, and the proportion of bases with quality greater than 30 is 13/14.
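The summation over several sequences and the final ratios can be sketched as follows. The function name `quality_proportions` is hypothetical, the offset of 33 is inferred from the worked examples, and plain counting is used here in place of sorting for brevity. Exact fractions are returned so that the result can be compared directly with values such as 13/14.

```python
from fractions import Fraction
from typing import Iterable


def quality_proportions(
    quals: Iterable[str], q1: int = 15, q2: int = 20, q3: int = 30
) -> tuple[Fraction, Fraction, Fraction]:
    """Sum the per-sequence counts and return the three proportions."""
    s_low = s_mid = s_high = s_total = 0
    for qual in quals:
        scores = [ord(ch) - 33 for ch in qual]
        s_low += sum(1 for s in scores if s < q1)   # per-sequence D5 (or D1)
        s_mid += sum(1 for s in scores if s > q2)   # per-sequence D6 (or D2)
        s_high += sum(1 for s in scores if s > q3)  # per-sequence D7 (or D3)
        s_total += len(scores)                      # per-sequence D8 (or D4)
    if s_total == 0:
        return Fraction(0), Fraction(0), Fraction(0)
    return Fraction(s_low, s_total), Fraction(s_mid, s_total), Fraction(s_high, s_total)


low, mid, high = quality_proportions(["DDCCIFCCIID2ID"])
print(low, mid, high)  # 0 13/14 13/14
```

The same function applies to the original sequence when it is given the quality strings of the original read sequences instead.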

It should be noted that, after each read subsequence in the original sequence has been preprocessed, the following totals may also be counted: the total number of read subsequences containing the linker sequence, recorded as L1; the total number of bases removed from the original sequence for satisfying the preset condition, recorded as L2; the total number of repeated sequences in the original sequence, recorded as L3; and the total number of read subsequences in the original sequence, recorded as L4. The ratio of L1 to L4 is determined as the proportion of linker sequences in the original sequence; the ratio of L2 to L4 is determined as the proportion of discarded sequences in the original sequence; and the ratio of L3 to L4 is determined as the proportion of repeated sequences in the original sequence. In addition, the total number of N bases in the original sequence and the total number of N bases in the sequence to be detected may be counted. The ratio of the total number of N bases in the original sequence to the total number of all bases in the original sequence is then determined as the proportion of N bases in the original sequence, and the ratio of the total number of N bases in the sequence to be detected to the total number of all bases in the sequence to be detected is determined as the proportion of N bases in the sequence to be detected.

With the above steps, the invention can, in a single pass, excise the linker sequence from the original sequence, remove the bases satisfying the preset condition, remove repeated sequences, and finally obtain the quality control indexes of both the original sequence and the sequence to be detected, which improves data processing efficiency and is convenient for users. The data does not need to be preprocessed by multiple software tools, so the data input and output operations between such tools are reduced. Moreover, because the read sequences are converted into sequence numbers for preprocessing the original sequence, the memory occupied by data processing is reduced and the running speed of the equipment is increased.

The data processing procedure of the present invention is explained below by taking two SE read subsequences as examples:

the first SE read subsequence is:

@id1

CCAAAAATACCTTCCAGCAC

+

CICDDCCCCIFCCIID2IDH

the second SE read subsequence is:

@id2

CAAAAATACCTTCCAGCACT

+

CICDDCCCCIFCCIID2ID2

wherein each SE read subsequence is represented by 4 lines of characters: the first line starts with "@", followed by a unique sequence ID identifier; the second line contains the sequence characters; the third line starts with "+", and if the "+" is followed by content, that content must be the same as the content after the "@" in the first line; the fourth line contains the base quality characters, each corresponding to the quality of the base or amino acid at the corresponding position in the second line.
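The 4-line layout can be sketched as a small parser. This is only an illustration: the names `FastqRecord` and `parse_fastq` are hypothetical, blank lines between rows are skipped because the listing above is printed with them, and the requirement that the sequence and quality lines have equal length is inferred from the one-to-one correspondence just described.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of 4 non-blank lines."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-blank lines")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at row {i}")
        if plus[1:] and plus[1:] != header[1:]:
            raise ValueError("content after '+' must repeat the ID")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


text = "@id2\nCAAAAATACCTTCCAGCACT\n+\nCICDDCCCCIFCCIID2ID2\n"
for record in parse_fastq(text.splitlines()):
    print(record)
```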

The specific data processing process comprises the following steps:

1. Excising the adaptor sequence

Assume that the preset adaptor sequence is CCAA. Following the method of the above embodiment, the preset adaptor sequence CCAA is divided with a step length of one base, giving the adaptor subsequences C, CC, CCA, and CCAA. The sequence number corresponding to the adaptor subsequence C is 18.78, that of CC is 66.06, that of CCA is 300.89, and that of CCAA is 814.18, so that the preset array [814.18, 300.89, 66.06, 18.78] can be formed.

Sequence interception is performed on the first SE read subsequence CCAAAAATACCTTCCAGCAC from its beginning, according to the length 4 of the preset adaptor sequence and the first preset step length 1, giving CCAA, CCA, CC, and C. The sequence number corresponding to CCAA is 814.18, and since 814.18 is included in the preset array, CCAA is determined to be an adaptor sequence; the sequence number corresponding to CCA is 300.89, and since 300.89 is included in the preset array, CCA is determined to be an adaptor sequence; the sequence number corresponding to CC is 66.06, and since 66.06 is included in the preset array, CC is determined to be an adaptor sequence; the sequence number corresponding to C is 18.78, and since 18.78 is included in the preset array, C is determined to be an adaptor sequence. CCAA is therefore excised from the read subsequence CCAAAAATACCTTCCAGCAC, and the first read sequence with the adaptor sequence removed is:

@id1

AAATACCTTCCAGCAC

+

DCCCCIFCCIID2IDH

Sequence interception is performed on the second SE read subsequence CAAAAATACCTTCCAGCACT from its beginning, according to the length 4 of the preset adaptor sequence and the first preset step length 1, giving CAAA, CAA, CA, and C. The sequence number corresponding to CAAA is 478.64, and since 478.64 is not included in the preset array, CAAA is determined not to be an adaptor sequence; the sequence number corresponding to CAA is 223.7, and since 223.7 is not included in the preset array, CAA is determined not to be an adaptor sequence; the sequence number corresponding to CA is 81.34, and since 81.34 is not included in the preset array, CA is determined not to be an adaptor sequence; the sequence number corresponding to C is 18.78, and since 18.78 is included in the preset array, C is determined to be an adaptor sequence. C is therefore excised from the read subsequence CAAAAATACCTTCCAGCACT, and the second read sequence with the adaptor sequence removed is:

@id2

AAAAATACCTTCCAGCACT

+

ICDDCCCCIFCCIID2ID2
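The prefix comparison in step 1 can be sketched as follows. The function name `strip_adapter_prefix` is hypothetical, and the `sequence_number` argument is a placeholder: by default it returns the string itself instead of the numeric sequence number of the earlier embodiments, which is an assumption made only so that the sketch is self-contained.

```python
from typing import Callable, Hashable


def strip_adapter_prefix(
    seq: str,
    qual: str,
    adapter: str,
    sequence_number: Callable[[str], Hashable] = lambda s: s,
) -> tuple[str, str]:
    """Excise the longest read prefix whose number is in the preset array."""
    # Preset array: numbers of the adapter subsequences C, CC, CCA, CCAA, ...
    preset = {sequence_number(adapter[:k]) for k in range(1, len(adapter) + 1)}
    # Intercept from the start of the read, longest candidate first, step 1.
    for k in range(min(len(adapter), len(seq)), 0, -1):
        if sequence_number(seq[:k]) in preset:
            return seq[k:], qual[k:]  # qualities are cut together with the bases
    return seq, qual


print(strip_adapter_prefix("CCAAAAATACCTTCCAGCAC", "CICDDCCCCIFCCIID2IDH", "CCAA"))
print(strip_adapter_prefix("CAAAAATACCTTCCAGCACT", "CICDDCCCCIFCCIID2ID2", "CCAA"))
```

With the two example reads this removes CCAA from the first read and the single base C from the second, matching the records shown above.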

2. Removing, by the sliding window method, bases for which the proportion of N bases or the proportion of bases with quality less than the first preset base quality exceeds the preset threshold:

Assume that the preset window length is 2, the first preset base quality is 20, the preset threshold is 0.5, and the preset length is 5. For the first read sequence AAATACCTTCCAGCAC with the linker sequence removed, sliding from the left, the 1st window contains the sequence "AA": the base quality D of the first A is converted from its ASCII code to the number 35, and the base quality C of the second A is converted to the number 34. Both 35 and 34 are greater than 20, so sliding from the left stops. Sliding from the right, the 1st window contains the sequence "AC": the base quality D of A is converted to the number 35, and the base quality H of C is converted to the number 39. Both 35 and 39 are greater than 20, so sliding from the right also stops, and the resulting remaining sequence is AAATACCTTCCAGCAC.

For the second read sequence AAAAATACCTTCCAGCACT with the linker sequence removed, sliding from the left, the 1st window contains the sequence "AA": the base quality I of the first A is converted from its ASCII code to the number 40, and the base quality C of the second A is converted to the number 34. Both 40 and 34 are greater than 20, so sliding from the left stops. Sliding from the right, the 1st window contains the sequence "CT": the base quality D of C is converted to the number 35, and the base quality 2 of T is converted to the number 17. Since 35 is greater than 20 and 17 is less than 20, bases with quality less than 20 make up half of the 2 bases in the current window, so the sequence "CT" of the current window is cut off. The window then continues to slide inwards from the right, giving the sequence "CA": the base quality 2 of C is converted to the number 17, and the base quality I of A is converted to the number 40. Since 40 is greater than 20 and 17 is less than 20, bases with quality less than 20 again make up half of the 2 bases in the current window, so the sequence "CA" of the current window is cut off. Sliding inwards once more gives the sequence "AG": the base quality I of A is converted to the number 40, and the base quality D of G is converted to the number 35. Since 40 and 35 are both greater than 20, sliding from the right stops, and the resulting remaining sequence is AAAAATACCTTCCAG.

The lengths of the first remaining sequence and the second remaining sequence are both greater than 5, so the sequence number corresponding to each is calculated: the sequence number corresponding to the first remaining sequence is 1564374433.18, and the sequence number corresponding to the second remaining sequence is 26426074326.7. Since the two sequence numbers are not equal, neither sequence is a repeated sequence, and the finally obtained sequence to be detected comprises the first remaining sequence AAATACCTTCCAGCAC and the second remaining sequence AAAAATACCTTCCAG.
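The two-ended trimming in step 2 can be sketched as follows. The function name `trim_ends` is hypothetical and the quality offset of 33 is inferred from the conversions in the text. One point is an inference rather than a stated rule: the worked example cuts a window whose low-quality proportion is exactly one half with a threshold of 0.5, so the sketch treats reaching the threshold as sufficient for cutting.

```python
def trim_ends(
    seq: str,
    qual: str,
    window: int = 2,
    min_quality: int = 20,
    threshold: float = 0.5,
) -> tuple[str, str]:
    """Cut failing windows from the left end, then from the right end."""

    def fails(lo: int, hi: int) -> bool:
        size = hi - lo
        low = sum(1 for ch in qual[lo:hi] if ord(ch) - 33 < min_quality)
        n_count = sum(1 for base in seq[lo:hi] if base in ("N", "n"))
        # Assumption: a ratio equal to the threshold also triggers a cut.
        return low / size >= threshold or n_count / size >= threshold

    start, end = 0, len(seq)
    while end - start >= window and fails(start, start + window):
        start += window  # drop the leftmost window and keep sliding inwards
    while end - start >= window and fails(end - window, end):
        end -= window    # drop the rightmost window and keep sliding inwards
    return seq[start:end], qual[start:end]


print(trim_ends("AAATACCTTCCAGCAC", "DCCCCIFCCIID2IDH"))
print(trim_ends("AAAAATACCTTCCAGCACT", "ICDDCCCCIFCCIID2ID2"))
```

The first read is returned unchanged, while the second loses the windows "CT" and "CA" from its right end.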

The present invention provides a data processing apparatus for gene detection, and the data processing apparatus for gene detection described below and the data processing method for gene detection described above can be referred to in correspondence with each other.

FIG. 6 is a schematic structural diagram of a data processing apparatus for gene detection provided by the present invention. As shown in FIG. 6, the data processing apparatus for gene detection includes an obtaining unit 610 and a detecting unit 620; wherein:

an obtaining unit 610, configured to obtain a sequence to be detected;

the detecting unit 620 is configured to input the sequence to be detected into a gene detection model, and obtain a gene detection result output by the gene detection model;

the obtaining unit 610 obtains the sequence to be detected specifically by:

determining sequence numbers corresponding to the reading segment sequences;

According to the data processing device for gene detection provided by the invention, the sequence number corresponding to each read sequence in the original sequence is determined, and the corresponding read sequence is preprocessed according to the sequence number without using multiple software tools, which simplifies the preprocessing steps; each preprocessed read sequence is determined as the sequence to be detected, and the sequence to be detected is input into the gene detection model for gene detection, which improves the efficiency of the whole gene detection process.

Based on any of the above embodiments, the obtaining unit 610 is specifically configured to:

Based on any of the above embodiments, the obtaining unit 610 is further specifically configured to:

converting the target digit sequence into a joint multiplication matrix;

determining a trace of the target matrix as the sequence number.

cleaving the linker sequence in the read sequence;

Based on any of the above embodiments, the apparatus further comprises a statistics unit;

and the statistical unit is used for counting the first quality control index of the original sequence and the second quality control index of the sequence to be detected.

Based on any of the above embodiments, the statistical unit is specifically configured to:

Fig. 7 is a schematic physical structure diagram of an electronic device provided in the present invention, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may call logic instructions in memory 730 to perform a data processing method for gene testing, the method comprising: acquiring a sequence to be detected;

wherein the sequence to be detected is obtained by the following method:

determining sequence numbers corresponding to the reading segment sequences;

In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product, the computer program product including a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the data processing method for gene detection provided by the above methods, the method including: acquiring a sequence to be detected;

wherein the sequence to be detected is obtained by the following method:

determining sequence numbers corresponding to the reading segment sequences;

In still another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method for gene detection provided by the above methods, the method including: acquiring a sequence to be detected;

wherein the sequence to be detected is obtained by the following method:

determining sequence numbers corresponding to the reading segment sequences;

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A data processing method for gene detection, comprising:

acquiring a sequence to be detected;

wherein the sequence to be detected is obtained by the following method:

determining sequence numbers corresponding to the reading segment sequences;

2. The data processing method for gene testing according to claim 1, wherein said determining the sequence number corresponding to the read sequence comprises:

3. The data processing method for gene detection according to claim 2, wherein the determining the sequence number corresponding to each of the subsequences of the reads comprises:

converting the target digit sequence into a joint multiplication matrix;

determining a trace of the target matrix as the sequence number.

4. The data processing method for gene detection according to claim 3, wherein the converting the read subsequence into a target digital sequence based on a predetermined number of bases comprises:

5. The data processing method for gene detection according to claim 3, wherein the converting the target number sequence into a joint multiplication matrix comprises:

6. The data processing method for gene detection according to claim 2, wherein the preprocessing the corresponding sequence of reads based on the sequence number comprises:

cleaving the linker sequence in the read sequence;

7. The data processing method for gene detection according to claim 6, wherein the determining the linker sequence in the read sequence based on the sequence number corresponding to each of the read subsequences comprises:

8. The data processing method for gene detection according to claim 6, wherein the removing bases satisfying a predetermined condition from the read sequence after cleavage of the linker sequence comprises:

9. The data processing method for gene detection according to claim 6, wherein the determining the sequence to be detected based on each of the target remaining sequences comprises:

the method further comprises the following steps:

10. The data processing method for gene detection according to any one of claims 1 to 9, further comprising:

11. The data processing method for gene testing according to claim 10, wherein the counting the first quality control indicators of the original sequences comprises:

12. A data processing device for gene detection, comprising:

an acquisition unit, configured to acquire a sequence to be detected;

determining sequence numbers corresponding to the reading segment sequences;

13. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the data processing method for gene testing according to any one of claims 1 to 11 when executing the program.

14. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the data processing method for gene testing according to any one of claims 1 to 11.

15. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the data processing method for gene testing according to any one of claims 1 to 11.