CN107516020B - Method, device, equipment and storage medium for determining importance of sequence sites

Info

Publication number: CN107516020B
Application number: CN201710708490.4A
Authority: CN
Inventors: 赵苗苗; 陈世雄; 林闯; 李光林
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2017-08-17
Filing date: 2017-08-17
Publication date: 2021-05-14
Anticipated expiration: 2037-08-17
Also published as: CN107516020A

Abstract

The invention discloses a method, device, equipment and storage medium for determining the importance of sequence sites. The method includes: determining the number of sequence sites of the sequence feature strings in a fixed-length sequence string set, and generating a set number of site weight vectors whose dimension equals the number of sequence sites; initializing each site weight vector to obtain the set number of initial site weight vectors with initial component values; iteratively processing each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector; and determining each target component value in the target site weight vector as the importance of the corresponding sequence site in the sequence feature string. With this method, the importance of each sequence site in a sequence feature string can be determined accurately and quickly, which provides effective information for the subsequent prediction of transcription factor binding sites on the sequence feature string, thereby ensuring the accuracy of that prediction.

Description

Method, device, equipment and storage medium for determining importance of sequence sites

Technical Field

The invention relates to the technical field of computer equipment, in particular to a method, a device, equipment and a storage medium for determining the importance of sequence sites.

Background

Transcription is the first stage of gene expression in an organism. Transcription of DNA is controlled by transcription factors, which must bind to DNA to regulate the transcription process. The site on DNA to which a transcription factor binds is called a transcription factor binding site; it is generally a sequence feature string made up of a plurality of sequence sites.

Predicting whether a sequence feature string is a transcription factor binding site helps in understanding transcription regulation mechanisms and cell growth processes, and is very important for identifying drug targets. Researchers therefore usually search for transcription factor binding sites either by biological experiments or by computational methods. However, biological experiments are both time-consuming and expensive, and examining hundreds or thousands of potential binding sites with such techniques is very costly for researchers.

Therefore, predicting transcription factor binding sites by computational methods has become a common approach; commonly used methods include the hidden Markov model method and the position-specific scoring matrix method. However, when existing computational methods predict transcription factor binding sites for a given sequence feature string, they assume by default that every sequence site in the string is equally important, which greatly affects the accuracy of the prediction.

Disclosure of Invention

Embodiments of the invention provide a method, a device, equipment and a storage medium for determining the importance of sequence sites, which are used to determine the importance of the sequence sites in a transcription factor sequence feature string.

In a first aspect, an embodiment of the present invention provides a method for determining importance of a sequence site, including:

determining the number of sequence sites of the sequence feature strings in a fixed-length sequence string set, and generating a set number of site weight vectors whose dimension equals the number of sequence sites;

initializing each site weight vector to obtain the set number of initial site weight vectors with initial component values;

iteratively processing each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector;

and determining each target component value in the target site weight vector as the importance of the corresponding sequence site in the sequence feature string.

In a second aspect, an embodiment of the present invention provides an apparatus for determining importance of sequence sites, including:

a vector generation module, configured to determine the number of sequence sites of the sequence feature strings in a fixed-length sequence string set and to generate a set number of site weight vectors whose dimension equals the number of sequence sites;

a vector initialization module, configured to initialize each site weight vector to obtain the set number of initial site weight vectors with initial component values;

a vector processing module, configured to iteratively process each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector;

and an importance determining module, configured to determine each target component value in the target site weight vector as the importance of the corresponding sequence site in the sequence feature string.

In a third aspect, an embodiment of the present invention provides a computer device, including:

one or more processors;

storage means for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the method for determining the importance of sequence sites provided by the embodiments of the invention.

In a fourth aspect, the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for determining the importance of sequence sites provided by the present invention.

In the method, device, equipment and storage medium for determining the importance of sequence sites described above, the method first determines the number of sequence sites of the sequence feature strings in a fixed-length sequence string set and generates a set number of site weight vectors whose dimension equals the number of sequence sites; it then initializes the site weight vectors to obtain the set number of initial site weight vectors with initial component values; next, it iteratively processes each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector; finally, it determines each target component value in the target site weight vector as the importance of the corresponding sequence site in the sequence feature string. This technical scheme can accurately and quickly determine the importance of each sequence site in the sequence feature string and provides effective information for the subsequent prediction of transcription factor binding sites on the sequence feature string, thereby ensuring the accuracy of that prediction.

Drawings

FIG. 1 is a schematic flowchart of a method for determining the importance of sequence sites according to a first embodiment of the present invention;

FIG. 2 is a schematic flowchart of a method for determining the importance of sequence sites according to a second embodiment of the present invention;

FIG. 3 is a block diagram of a device for determining the importance of sequence sites according to a third embodiment of the present invention;

FIG. 4 is a schematic diagram of the hardware structure of a computer device according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

FIG. 1 is a schematic flowchart of a method for determining the importance of sequence sites according to the first embodiment of the present invention. The method is suitable for determining the importance of the sequence sites in a transcription factor sequence feature string and can be executed by a device for determining the importance of sequence sites, where the device can be implemented in software and/or hardware and is generally integrated into a computer device.

As shown in FIG. 1, the method for determining the importance of sequence sites provided in the first embodiment of the present invention includes the following operations:

It should be noted that existing prediction methods do not take into account that the sequence sites of a sequence feature string differ in importance when predicting transcription factor binding sites, so the importance of sequence sites has received little study. This embodiment provides a method for determining the importance of sequence sites so that this importance can be taken into account in transcription factor binding site prediction, thereby improving the accuracy of the prediction result.

S101, determining the number of sequence sites of the sequence feature strings in a fixed-length sequence string set, and generating a set number of site weight vectors whose dimension equals the number of sequence sites.

The sequence feature string in this embodiment can be understood as a piece of transcription factor data used for transcription factor binding site prediction. Generally, the length of each piece of transcription factor data is given, and this length corresponds to the number of sequence sites to be determined in this embodiment. For example, the length of Ebox transcription factor data is 10 and the length of Myc transcription factor data is 7; that is, the number of sequence sites is 10 for Ebox transcription factor data and 7 for Myc transcription factor data. Before the operations of this embodiment are performed, a fixed-length sequence string set containing a plurality of sequence feature strings of the same length is obtained by a specific data processing method.

Specifically, to ensure the accuracy of the importance determined later, this embodiment selects transcription factor data that is commonly used in academia, together with some of the more important transcription factor data, as sequence feature strings. For example, this embodiment may download multiple sets of transcription factor data from the TRANSFAC database and select some of them as the source of sequence feature strings that meet the requirements of this embodiment. The TRANSFAC database is a database of information describing transcription factors, their binding sites on the genome, and their binding to DNA.

In this embodiment, the process of selecting any one set of transcription factor data from the TRANSFAC database and processing it into a fixed-length sequence string set can be described as follows: 1) extracting a plurality of sequence feature strings of the same length from the set of transcription factor data; 2) obtaining, from the Ensembl database, the target gene sequence and promoter region information corresponding to each sequence feature string in order to judge whether that sequence feature string has been confirmed as a transcription factor binding site, where the Ensembl database stores the transcription factor binding site information determined so far by biological experiments; 3) recording each sequence feature string determined in 2) to be a transcription factor binding site as a positive sequence feature string, and each sequence feature string determined in 2) not to be a transcription factor binding site as a negative sequence feature string; 4) selecting negative sequence feature strings amounting to 10 times the number of positive sequence feature strings, while ensuring that the positive and negative sequence feature strings do not overlap, i.e., that no negative sequence feature string is identical to a positive one at every sequence site; 5) forming the fixed-length sequence string set from the determined positive sequence feature strings and the selected negative sequence feature strings. It will be appreciated that one fixed-length sequence string set corresponds to one selected set of transcription factor data.

In this embodiment, the above data processing is preferably performed on 13 sets of transcription factor data in the TRANSFAC database. The numbers of positive and negative sequence feature strings in the fixed-length sequence string set corresponding to each set of transcription factor data are shown in Table 1 below:
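Steps 3) to 5) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_fixed_length_set`, the `known_binding_sites` argument (a stand-in for the Ensembl lookup of step 2), the random sampling of negatives and the `seed` parameter are all hypothetical.

```python
import random


def build_fixed_length_set(candidates, known_binding_sites, ratio=10, seed=0):
    """Label candidate strings and assemble a fixed-length sequence string set.

    candidates: equal-length strings extracted from one set of transcription
        factor data (step 1).
    known_binding_sites: strings confirmed as binding sites; a hypothetical
        stand-in for the Ensembl lookup of step 2.
    Returns a list of (sequence, label) pairs, label 1 = positive, 0 = negative.
    """
    lengths = {len(s) for s in candidates}
    if len(lengths) != 1:
        raise ValueError("all sequence feature strings must have the same length")

    known = set(known_binding_sites)
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order

    # Step 3: split into positive and negative strings (disjoint by construction).
    positives = [s for s in unique if s in known]
    negatives = [s for s in unique if s not in known]

    # Step 4: keep about `ratio` times as many negatives as positives.
    # Random sampling is an assumption; the text does not say how they are chosen.
    rng = random.Random(seed)
    wanted = min(ratio * len(positives), len(negatives))
    chosen = rng.sample(negatives, wanted)

    # Step 5: the fixed-length sequence string set.
    return [(s, 1) for s in positives] + [(s, 0) for s in chosen]
```

When fewer than `ratio` times as many negatives are available, the sketch simply keeps all of them, which would give ratios slightly below 10.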

TABLE 1 Number of sequence feature strings contained in different fixed-length sequence string sets

Set of sequence strings	Q6MAZ	Q601_MAZ	Q3-SREBP	Q6-SREBP2	Q5	Q6	P53
Positive sequence feature strings	12	27	8	19	48	61	46
Negative sequence feature strings	120	270	80	200	469	599	460

Set of sequence strings	Q3PHCR	Q601HNF	Q602HNF	Q603HNF	Ebox	Myc
Positive sequence feature strings	31	58	58	58	119	21
Negative sequence feature strings	310	580	580	580	1190	210

It can be seen that in each fixed-length sequence string set the number of negative sequence feature strings is approximately 10 times the number of positive sequence feature strings, and exactly 10 times in most sets.

After one or more fixed-length sequence string sets have been obtained in this way, this step determines the number of sequence sites of the sequence feature strings in a fixed-length sequence string set and uses that number as the dimension of the site weight vectors to be generated. Preferably, a set number of site weight vectors are generated in this step, each with a dimension equal to the determined number of sequence sites. It can be understood that the component values of each site weight vector are set to 0 by default in this step.
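As a minimal sketch of S101, the number of sequence sites can be read off the common string length and used as the vector dimension. The function name `generate_site_weight_vectors` and the list-of-lists representation are assumptions made for illustration.

```python
def generate_site_weight_vectors(sequence_strings, vector_count):
    """Create `vector_count` zero-valued site weight vectors.

    The dimension of every vector equals the number of sequence sites,
    i.e. the common length of the strings in the fixed-length set.
    """
    site_counts = {len(s) for s in sequence_strings}
    if len(site_counts) != 1:
        raise ValueError("the set must contain strings of one common length")
    n_sites = site_counts.pop()

    # One independent list per vector, every component defaulting to 0.
    return [[0.0] * n_sites for _ in range(vector_count)]
```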

S102, initializing each site weight vector, and obtaining the set number of initial site weight vectors with initial component values.

In this embodiment, each component of the set number of site weight vectors may be initialized under a set initial condition to obtain its initial component value, and the initialized site weight vectors are referred to as initial site weight vectors. This embodiment can then determine the importance of the sequence sites on the basis of the set number of initial site weight vectors.

In this embodiment, the initial condition for initializing the site weight vectors may be set according to the actual situation; for example, a selection range for the initial component values is set, and values within that range are then chosen as the initial component values.

S103, iteratively processing each initial site weight vector based on the selected optimal solution search algorithm to obtain a target site weight vector.

Generally, an optimal solution search algorithm iteratively searches for an optimal solution in a problem solution space made up of enumerated candidate solutions. Common optimal solution search algorithms include the genetic algorithm, the simulated annealing algorithm, the particle swarm algorithm, the ant colony algorithm, and the like.

In this embodiment, the set number of initial site weight vectors may be regarded as the initial problem solution space. Based on the selected optimal solution search algorithm, this solution space is processed through a finite number of iterations, so that a result meeting the conditions can finally be found in the solution space formed when the iteration ends. This embodiment finally obtains one target site weight vector, or a set number of them, meeting the conditions. The target component values of a target site weight vector are the component values of each dimension of the vector after multiple rounds of iterative processing.

S104, determining each target component value in the target site weight vector as the importance of the corresponding sequence site in the sequence feature string.

It can be understood that the target site weight vector obtained contains as many components as there are sequence sites, and its current component values are referred to as target component values. In this step, the target component values may be assigned one-to-one as the importance of each sequence site in the sequence feature string.

For example, assuming that the number of sequence sites is 7, the sequence sites in the sequence feature string may be denoted "abcdefg" and the target site weight vector may be denoted W = [w1, w2, w3, w4, w5, w6, w7]. In this case w1 is determined as the importance of sequence site a, and so on in order, up to w7 as the importance of sequence site g.

The method for determining the importance of sequence sites provided by this embodiment first determines the number of sequence sites of the sequence feature strings in a fixed-length sequence string set and generates a set number of site weight vectors whose dimension equals the number of sequence sites; it then initializes the site weight vectors to obtain the set number of initial site weight vectors with initial component values; next, it iteratively processes each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector; finally, it determines each target component value in the target site weight vector as the importance of the corresponding sequence site in the sequence feature string. This technical scheme can accurately and quickly determine the importance of each sequence site in the sequence feature string and provides effective information for the subsequent prediction of transcription factor binding sites on the sequence feature string, thereby ensuring the accuracy of that prediction.

Example two

FIG. 2 is a schematic flowchart of a method for determining the importance of sequence sites according to the second embodiment of the present invention, which is optimized on the basis of the foregoing embodiment. In this embodiment, initializing each site weight vector is further specified as: randomly selecting the initial component value of each component of each site weight vector within a set value range, where the set value range is (0, 1).

Further, in this embodiment, iteratively processing each initial site weight vector based on the selected optimal solution search algorithm to obtain the target site weight vector is specified as: taking each initial site weight vector as an individual of the current population in the selected genetic algorithm; determining the adaptive value of each individual in the current population with respect to the fixed-length sequence string set; if the iteration termination condition is currently met, determining the target adaptive values that meet the target selection condition and taking the individuals corresponding to the target adaptive values as target site weight vectors; otherwise, determining the next generation population according to the adaptive values, taking the next generation population as the new current population, and returning to the operation of determining the adaptive values.

In this embodiment, after each target component value in the target site weight vector has been determined as the importance of the corresponding sequence site in the sequence feature string, the method further includes: predicting transcription factor binding sites for the sequence feature strings in the fixed-length sequence string set with a set prediction strategy, according to the importance of each sequence site and a set similarity scoring formula.

As shown in FIG. 2, the method for determining the importance of sequence sites provided in the second embodiment of the present invention specifically includes the following operations:

S201, determining the number of sequence sites of the sequence feature strings in a fixed-length sequence string set, and generating a set number of site weight vectors whose dimension equals the number of sequence sites.

In this embodiment, the fixed-length sequence string set may be obtained from any set of transcription factor data downloaded from the TRANSFAC database by a preset data processing method. The resulting fixed-length sequence string set contains a plurality of sequence feature strings of the same length, and this length is the number of sequence sites in this embodiment. Some of the sequence feature strings are positive sequence feature strings, determined to be transcription factor binding sites, and the rest are negative sequence feature strings, determined not to be transcription factor binding sites.

In this step, the number of sequence sites is taken as the dimension of the vectors to be generated, so that a set number of site weight vectors with that dimension can be generated, where the set number is preferably 100.

S202, randomly selecting initial component values of components in the site weight vectors in a set value range, and obtaining the set number of initial site weight vectors with the initial component values.

Specifically, when the site weight vectors are initialized in this step, the condition of the initialization operation is that values selected at random within a set value range are used as the initial component values of the vectors, where the set value range is preferably (0, 1). This step yields the set number of initial site weight vectors. The set number is the same as in the preceding step and is preferably 100.

In this embodiment, a genetic algorithm is preferably used as the optimal solution search algorithm, and steps S203 to S208 below amount to determining the importance of the sequence sites by the genetic algorithm.
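A minimal sketch of S202, assuming Python's standard `random` module as the source of randomness. The helper names and the `seed` parameter are hypothetical; the defaults of 100 vectors and the open interval (0, 1) follow the preferred values stated above.

```python
import random


def random_open_unit(rng):
    """Draw a value strictly inside (0, 1)."""
    value = rng.random()  # random() returns a float in [0.0, 1.0)
    while value == 0.0:   # reject the closed lower bound
        value = rng.random()
    return value


def initialize_site_weight_vectors(n_sites, vector_count=100, seed=None):
    """Return `vector_count` initial site weight vectors of dimension `n_sites`."""
    rng = random.Random(seed)
    return [
        [random_open_unit(rng) for _ in range(n_sites)]
        for _ in range(vector_count)
    ]
```

For Myc transcription factor data, for instance, `initialize_site_weight_vectors(7)` would give 100 vectors of 7 components each.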

S203, using each initial site weight vector as an individual of the current population in the selected genetic algorithm.

Specifically, in this embodiment, the set number of initial site weight vectors is regarded as the initial population of the genetic algorithm (for convenience of description, the initial population is also regarded as the current population), which corresponds to the first iteration of the genetic algorithm, and each initial site weight vector corresponds to an individual in the current population. In this embodiment, each initial site weight vector is preferably written as W₀ = [w₀₁, w₀₂, w₀₃, w₀₄, w₀₅, w₀₆, w₀₇], shown here with 7 components, i.e., for a sequence feature string with 7 sequence sites. The specific value of each component is determined by S202 above.

S204, determining the adaptive value of each individual in the current population with respect to the fixed-length sequence string set.

It will be appreciated that in a genetic algorithm the adaptive value (fitness value) is the most basic reference data for selecting the next generation population. Generally, different conditions can be set according to the actual application purpose to determine the adaptive value of each individual in the current population.

The application purpose of this embodiment is to predict transcription factor binding sites for the sequence feature strings in the fixed-length sequence string set with a set prediction strategy, to compare each prediction result with the actual status of the corresponding sequence feature string, and to judge the validity of the prediction strategy with set evaluation criteria. Accuracy (Ac) is one of the set evaluation criteria and can be specifically expressed as:

where TP, TN, FP and FN are the numbers of true-positive, true-negative, false-positive and false-negative sequence feature strings, respectively, obtained by testing the sequence feature strings in the fixed-length sequence string set with the prediction strategy based on the currently given importance of the sequence sites (the component values of an individual in the current population).

Specifically, true positives (TP) are sequence feature strings that are predicted to be positive sequence feature strings and are in fact positive; false positives (FP) are sequence feature strings that are predicted to be positive but are in fact negative; false negatives (FN) are sequence feature strings that are predicted to be negative but are in fact positive; true negatives (TN) are sequence feature strings that are predicted to be negative and are in fact negative.

The prediction strategy requires the importance of each sequence site in the sequence feature string, and this embodiment preferably uses Ac as the adaptive value. Accordingly, this step takes the current component values of each individual as the importance of the sequence sites, predicts transcription factor binding sites for the fixed-length sequence string set with the set prediction strategy under each individual, and takes the resulting Ac value as that individual's adaptive value with respect to the fixed-length sequence string set.
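The expression for Ac is not reproduced in this text. Assuming Ac follows the standard definition of accuracy, which is consistent with the four counts defined above, it would be the fraction of sequence feature strings that are classified correctly:

```latex
\[
Ac = \frac{TP + TN}{TP + TN + FP + FN}
\]
```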
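The adaptive value computation can be sketched as below. Two assumptions are made and should be read as such: Ac is taken to be standard accuracy, (TP + TN) / (TP + TN + FP + FN), and the prediction strategy is passed in as a callable because its similarity scoring formula is not given in this section. `toy_predict`, its consensus string and its threshold are purely hypothetical stand-ins used to make the example runnable.

```python
def accuracy_fitness(weights, labeled_strings, predict):
    """Adaptive value of one individual: accuracy of `predict` under `weights`.

    labeled_strings: iterable of (sequence, is_binding_site) pairs.
    predict: callable(sequence, weights) -> bool; the prediction strategy.
    """
    tp = tn = fp = fn = 0
    for sequence, is_binding_site in labeled_strings:
        predicted = predict(sequence, weights)
        if predicted and is_binding_site:
            tp += 1
        elif predicted and not is_binding_site:
            fp += 1
        elif not predicted and is_binding_site:
            fn += 1
        else:
            tn += 1
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def toy_predict(sequence, weights, consensus="CACGTG", threshold=0.5):
    """Hypothetical strategy: weighted share of sites matching a consensus."""
    matched = sum(w for w, a, b in zip(weights, sequence, consensus) if a == b)
    return matched / sum(weights) >= threshold


labeled = [("CACGTG", True), ("CACGTT", True), ("TTTTTT", False), ("CACTTT", False)]
uniform = accuracy_fitness([1, 1, 1, 1, 1, 1], labeled, toy_predict)
weighted = accuracy_fitness([1, 1, 1, 5, 1, 1], labeled, toy_predict)
```

In this toy data, giving the fourth site a larger weight removes the one false positive, which illustrates why unequal site importance can change the adaptive value.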

S205, judging whether the iteration termination condition is currently met; if so, executing S206; if not, executing S207.

In general, each determination of a next generation population by the genetic algorithm can be regarded as one iteration. This embodiment preferably uses a set number of iterations as the iteration termination condition of the algorithm, and preferably sets that number to 10000. Specifically, if the current number of iterations has not reached 10000, the iteration termination condition is not met and S207 is executed; otherwise, S206 is executed.

S206, determining a target adaptive value meeting the target selection condition, taking an individual corresponding to the target adaptive value as a target site weight vector, and then executing S208.

Specifically, once the iteration termination condition is met, no further next generation population is determined from the individuals in the current population. At this point the component values of each individual in the current population can be obtained (each individual is a current site weight vector, which is no longer an initial site weight vector and is not yet a target site weight vector), and each individual also has an adaptive value with respect to the fixed-length sequence string set.

In this embodiment, the adaptive value is preferably used as the basis of the target selection condition for selecting target site weight vectors, where the target selection condition is preferably: the adaptive values are sorted by size and the 5 highest are selected. This embodiment determines the adaptive values that meet the target selection condition as target adaptive values. The individuals associated with the target adaptive values can then be identified and used as the target site weight vectors of this embodiment. After the target site weight vectors have been determined, the process jumps to S208 to perform the importance determination.

S207, determining a next generation population according to the adaptive value, taking the next generation population as a new current population, and returning to execute S204.

When the iteration termination condition is not met, a next generation population needs to be formed. Compared with the current population, the next generation population differs mainly in the component values of its individuals.

Further, in this embodiment, determining the next generation population according to the adaptive values preferably includes: selecting, according to the adaptive values, individuals from the current population that meet a set selection condition to form a next generation candidate population; and processing the individuals in the next generation candidate population with a set crossover operator and a set mutation operator to generate the next generation population.

Specifically, this embodiment may first sort the adaptive values of the individuals in the current population by size and directly take the individuals with the 2 highest adaptive values as individuals of the next generation candidate population, and then determine the remaining individuals of the next generation candidate population by roulette wheel selection among the remaining individuals of the current population. The total number of individuals in the next generation candidate population equals the total number of individuals in the current population, and also equals the total number of individuals in the next generation population.

This embodiment then applies crossover and mutation to the individuals in the next generation candidate population with the given crossover operator and mutation operator, and finally generates the next generation population from the candidate population through crossover and mutation.
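The selection step can be sketched as follows. The function name is hypothetical, and two details are assumptions not stated in the text: roulette wheel draws are made with replacement using `random.choices`, and the wheel falls back to uniform weights if all remaining adaptive values are zero.

```python
import random


def select_candidates(population, adaptive_values, elite_count=2, rng=None):
    """Build the next generation candidate population.

    The `elite_count` individuals with the highest adaptive values are kept
    directly; the rest are drawn by roulette wheel selection among the
    remaining individuals, so the population size is unchanged.
    """
    rng = rng or random.Random()
    order = sorted(range(len(population)),
                   key=lambda i: adaptive_values[i], reverse=True)
    elite_idx, rest_idx = order[:elite_count], order[elite_count:]
    candidates = [list(population[i]) for i in elite_idx]
    if not rest_idx:
        return candidates

    # Roulette wheel: selection probability proportional to adaptive value.
    wheel = [adaptive_values[i] for i in rest_idx]
    if sum(wheel) <= 0:
        wheel = [1.0] * len(rest_idx)  # assumption: uniform fallback
    picks = rng.choices(rest_idx, weights=wheel,
                        k=len(population) - len(candidates))
    return candidates + [list(population[i]) for i in picks]
```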

In this embodiment, the crossover operator is preferably arithmetic crossover, described as follows: the individuals in the next-generation candidate population are grouped in pairs, and the two individuals W_i and W_j in each group, which correspond to the parents in genetics, can be crossed by the following formula:

wherein W_i and W_y respectively represent the two child individuals generated by the crossover; λ₁ is a random number in (0,1), and λ₁ = 1 - λ₂. This embodiment preferably sets λ₁ to 0.8 and λ₂ to 0.2.
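Because the crossover formula itself is not reproduced in this text, the following sketch assumes the commonly used form of arithmetic crossover, in which each child is a convex combination of the two parents (child_a = λ₁·W_i + λ₂·W_j and child_b = λ₁·W_j + λ₂·W_i). The function name and the child naming are illustrative assumptions.

```python
def arithmetic_crossover(parent_i, parent_j, lam1=0.8):
    """Return two children as convex combinations of the two parents.

    Assumed form (not confirmed by the source): with lam2 = 1 - lam1,
    child_a = lam1 * parent_i + lam2 * parent_j
    child_b = lam1 * parent_j + lam2 * parent_i
    """
    lam2 = 1.0 - lam1
    child_a = [lam1 * a + lam2 * b for a, b in zip(parent_i, parent_j)]
    child_b = [lam1 * b + lam2 * a for a, b in zip(parent_i, parent_j)]
    return child_a, child_b
```

With λ₁ = 0.8 each child stays close to one parent, and every component remains within the range spanned by the two parent components.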

The individuals in the next-generation candidate population are crossed in pairs by the above crossover operator to form new individuals, and the mutation operator set in this embodiment is then applied to the individuals generated by the crossover. In this embodiment, the mutation operator is preferably a simple random-value mutation, that is, component values of the individuals generated by the crossover are randomly re-drawn within the set range [0,1] to mutate the individuals, and the next-generation population is finally formed from the mutated individuals.
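A random-value mutation of this kind might look like the sketch below. The per-component `mutation_rate` is a hypothetical parameter, since the text does not state how many components are re-drawn; note also that `random.Random.random()` samples from [0, 1), which only approximates the closed range [0,1].

```python
import random

def random_reset_mutation(individual, mutation_rate=0.1, rng=None):
    """Re-draw each component with probability `mutation_rate` (an assumption)."""
    if rng is None:
        rng = random.Random()
    mutated = []
    for value in individual:
        if rng.random() < mutation_rate:
            mutated.append(rng.random())  # new value drawn from [0, 1)
        else:
            mutated.append(value)         # component left unchanged
    return mutated
```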

To implement the iterative loop of the genetic algorithm, the present embodiment determines the newly generated next generation population as a new current population, and then jumps to S204 to perform the next iterative process.
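The loop structure (evaluate, test for termination, otherwise form the next generation and evaluate again) can be sketched as below. The termination condition is defined elsewhere in this document; here it is replaced by a simple generation cap, and the target is taken to be the individual with the highest adaptive value. Both choices, and the names `run_genetic_search`, `fitness_fn` and `next_generation_fn`, are assumptions made for illustration.

```python
def run_genetic_search(population, fitness_fn, next_generation_fn, max_generations=100):
    """Iterate until the (assumed) generation cap, then return the best individual."""
    current = [list(ind) for ind in population]
    generation = 0
    while True:
        # Determine the adaptive (fitness) value of every individual.
        fitness = [fitness_fn(ind) for ind in current]
        if generation >= max_generations:
            # Assumed termination condition and target selection condition.
            best = max(range(len(current)), key=lambda k: fitness[k])
            return current[best], fitness[best]
        # Otherwise the next generation becomes the new current population.
        current = next_generation_fn(current, fitness)
        generation += 1
```

Here `next_generation_fn` stands for the combination of selection, crossover and mutation, and `fitness_fn` for the adaptive-value calculation against the fixed-length sequence string set.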

S208, correspondingly determining each target component value in the target site weight vector as the importance of each sequence site in the sequence feature string.

For example, in this step, after the target site weight vector is determined in S206, each target component value in the target site weight vector is correspondingly determined as the importance of each sequence site. It can be understood that, in this embodiment, 5 target site weight vectors are determined in S206, so 5 importance values may be determined for each sequence site of the sequence feature string. Several importance values are determined mainly because different importance values of a sequence site correspond to different biological significance and can therefore have different research value in biology, particularly in research on transcription factor binding site prediction.

S209, according to the importance of each sequence site and a set similarity scoring formula, adopting a set prediction strategy to predict the transcription factor binding site of the sequence feature strings in the fixed-length sequence string set.

In this embodiment, the transcription factor binding site prediction can be performed on the sequence feature strings in the fixed-length sequence string set according to the determined importance of each sequence site and the set similarity scoring formula under the rule of the set prediction strategy.

Specifically, the set prediction strategy can be expressed as: 1) determining the positive sequence feature strings and the negative sequence feature strings contained in the fixed-length sequence string set; 2) dividing the positive sequence feature strings and the negative sequence feature strings each into a set number of equal parts; 3) selecting, in turn, one part of the positive sequence feature strings as a training data set, and taking the remaining parts of the positive sequence feature strings and the corresponding parts of the negative sequence feature strings as a data set to be tested; 4) determining similarity scores between the sequence feature strings in the data set to be tested and the sequence feature strings in the training data set based on the set similarity scoring formula and the importance of each sequence site, and returning to step 3) until every part has been selected; 5) after the loop ends, determining the highest similarity score corresponding to each sequence feature string in the fixed-length sequence string set; 6) predicting each sequence feature string in the fixed-length sequence string set whose highest similarity score is greater than or equal to a set threshold as a transcription factor binding site.

The prediction strategy described above corresponds to a K-fold cross-validation prediction method, and its accuracy can be verified by predicting sequence feature strings that are already known to be transcription factor binding sites. With the number of equal parts set to K, the positive sequence feature strings and the negative sequence feature strings are each divided into K parts, where K is not specifically limited and can be set manually. Each of the K parts of positive sequence feature strings is used in turn as the training data set; the positive sequence feature strings of the remaining parts and the negative sequence feature strings of the corresponding parts are randomly shuffled to form the data set to be tested. For each sequence feature string in the data set to be tested, as many similarity scores as there are strings in the training data set can then be calculated with the set similarity scoring formula, which involves the importance of each sequence site.

After each part of the positive sequence feature strings has been used as the training data set in turn, each sequence feature string in the fixed-length sequence string set corresponds to a plurality of similarity scores, and the highest similarity score of each sequence feature string can then be determined. Finally, the highest similarity scores are compared with the set threshold: every sequence feature string whose highest similarity score is greater than or equal to the threshold is predicted to be a transcription factor binding site.
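The six-step strategy can be sketched as follows, under stated assumptions: the similarity scoring formula is passed in as a hypothetical callable `score_fn(candidate, reference)`, the parts are formed by simple striding, and the random shuffling of the data set to be tested is omitted because it does not change the highest score of any string.

```python
def predict_binding_sites(positives, negatives, score_fn, k=4, threshold=1.6):
    """Return the strings whose highest similarity score reaches `threshold`."""
    positives, negatives = list(positives), list(negatives)
    # Steps 1-2: split positive and negative strings into k parts each.
    pos_parts = [positives[i::k] for i in range(k)]
    neg_parts = [negatives[i::k] for i in range(k)]

    best_score = {}
    for fold in range(k):
        # Step 3: one positive part is the training set; the remaining positive
        # parts and the corresponding negative parts form the set to be tested.
        training = pos_parts[fold]
        to_test = [s for j in range(k) if j != fold
                   for s in pos_parts[j] + neg_parts[j]]
        # Steps 4-5: score every tested string against every training string
        # and keep the highest score seen for each string.
        for candidate in to_test:
            for reference in training:
                score = score_fn(candidate, reference)
                if score > best_score.get(candidate, float("-inf")):
                    best_score[candidate] = score
    # Step 6: strings at or above the threshold are predicted binding sites.
    return {s for s, score in best_score.items() if score >= threshold}
```

The default values `k=4` and `threshold=1.6` mirror the settings reported for this embodiment; any other values can be passed in.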

Further, the similarity score formula is expressed as:

wherein A represents any sequence feature string in the fixed-length sequence string set; B represents a positive sequence feature string different from A in the fixed-length sequence string set; A[i] and B[i] represent the site data of the i-th sequence site in A and B, respectively; S_DNA(A[i], B[i]) represents the substitution score in the set DNA substitution matrix; W(i) represents the importance of the i-th sequence site; and L represents the number of sequence sites of the sequence feature string.

It can be seen that the key stage of transcription factor binding site prediction is similarity scoring, and determining the importance of the sequence sites by the method of this embodiment helps ensure the accuracy of the similarity scores. It should be noted that the substitution score in the similarity scoring formula is set according to the biological relationship between the characters that make up a transcription factor sequence. A transcription factor sequence is composed of the characters A, G, C and T, which denote adenine, guanine, cytosine and thymine, respectively. Chemically, the purines (A and G) are similar to each other and the pyrimidines (T and C) are similar to each other, whereas a purine and a pyrimidine are not similar. This embodiment therefore uses the substitution scores given by the following matrix:

	A	C	G	T
A	2	-1	1	-1
C	-1	2	-1	1
G	1	-1	2	-1
T	-1	1	-1	2

Based on the above substitution matrix, when the characters at the i-th site of A and B are both 'A', S_DNA(A[i], B[i]) = 2; similarly, when the characters at the i-th site of A and B are 'A' and 'C', respectively, S_DNA(A[i], B[i]) = -1.
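The substitution matrix can be stored as a lookup table, as in the sketch below. The similarity scoring formula itself is not reproduced in this text, so `similarity_score` assumes a plain importance-weighted sum of substitution scores over the L sites; the actual formula may include additional terms or a normalization. All function and variable names are illustrative.

```python
# Substitution scores from the matrix above: 2 for identical bases,
# 1 within purines (A, G) or within pyrimidines (C, T), -1 otherwise.
SUBSTITUTION = {
    "A": {"A": 2, "C": -1, "G": 1, "T": -1},
    "C": {"A": -1, "C": 2, "G": -1, "T": 1},
    "G": {"A": 1, "C": -1, "G": 2, "T": -1},
    "T": {"A": -1, "C": 1, "G": -1, "T": 2},
}

def substitution_score(a, b):
    """Look up S_DNA(a, b) for two single-base characters."""
    return SUBSTITUTION[a][b]

def similarity_score(seq_a, seq_b, weights):
    """Assumed form: sum over sites i of W(i) * S_DNA(A[i], B[i])."""
    if not (len(seq_a) == len(seq_b) == len(weights)):
        raise ValueError("sequences and weight vector must have equal length")
    return sum(w * substitution_score(a, b)
               for a, b, w in zip(seq_a, seq_b, weights))
```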

In order to verify the accuracy and effectiveness of the prediction result when the prediction strategy is adopted to predict the transcription factor binding site based on the determined importance of the sequence site, the embodiment further provides a verification operation for verifying the prediction result.

Specifically, in addition to Ac, this embodiment uses three further evaluation criteria: sensitivity (Sn), specificity (Sp), and the Matthews correlation coefficient (MCC), wherein Sn, Sp, and MCC are defined as:

wherein, the meanings of TP, TN, FP and FN in the above formula have been explained after S204.
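The defining formulas are not reproduced in this text. The sketch below uses the standard definitions of these three measures, Sn = TP/(TP+FN), Sp = TN/(TN+FP) and MCC = (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)), on the assumption that the patent follows them; returning 0.0 for an empty denominator is a convention chosen here.

```python
import math

def evaluation_metrics(tp, tn, fp, fn):
    """Return (Sn, Sp, MCC) using the standard definitions (assumed here)."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc
```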

In this embodiment, transcription factor binding sites are predicted for the fixed-length sequence string set named Q5 in Table 1 of the first embodiment, using the method provided in this embodiment and the site-specific scoring matrix method (the comparison method), respectively. Both methods use 4-fold cross-validation as the prediction strategy; the set threshold of this embodiment is 1.6, and the set thresholds of the comparison method are 0.73, 0.769 and 0.824, respectively.

TABLE 2 comparison of the evaluation criteria of the method provided in this example with the site-specific scoring matrix method

Table 2 above compares the evaluation criteria obtained after the sequence feature strings in the fixed-length sequence string set Q5 are processed by the method provided in this embodiment and by the site-specific scoring matrix method. It should be noted that the higher the set threshold, the stricter the requirement on the highest similarity score of a sequence feature string. The table shows that the set threshold used in the method of this embodiment is far higher than the set thresholds used in the comparison method, and yet every evaluation criterion of the method of this embodiment is better than the corresponding value of the comparison method. Therefore, once the importance of the sequence sites has been determined by the method provided in this embodiment, transcription factor binding site prediction is more accurate and effective than with the existing prediction method.

The method for determining the importance of sequence sites provided by this embodiment of the invention specifically describes the process of determining the initial site weight vectors and the process of obtaining the target site weight vector, and additionally adds an operation of predicting transcription factor binding sites for the sequence feature strings based on the importance of each sequence site. With this method, the site weight vector with the optimal component values can be determined from the initial site weight vectors by the selected genetic algorithm and used as the target site weight vector; the optimal component values in the target site weight vector are finally determined as the importance of each sequence site in the sequence feature string, and the determined importance provides effective prediction information for the subsequent transcription factor binding site prediction of the sequence feature strings, thereby helping ensure the accuracy of that prediction.

Example three

Fig. 3 is a block diagram of a device for determining the importance of sequence sites according to a third embodiment of the present invention. The device is suitable for determining the importance of the sequence sites in a transcription factor sequence feature string, can be implemented by software and/or hardware, and is generally integrated in computer equipment. As shown in fig. 3, the device includes: a vector generation module 31, a vector initialization module 32, a vector processing module 33, and an importance determination module 34.

The vector generation module 31 is configured to determine the number of sequence sites of a sequence feature string in a fixed-length sequence string set, and generate a set number of site weight vectors whose dimension equals the number of sequence sites;

the vector initialization module 32 is configured to initialize each site weight vector to obtain the set number of initial site weight vectors with initial component values;

the vector processing module 33 is configured to iteratively process each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector;

and the importance determination module 34 is configured to correspondingly determine each target component value in the target site weight vector as the importance of each sequence site in the sequence feature string.

In this embodiment, the device first determines, through the vector generation module 31, the number of sequence sites of a sequence feature string in a fixed-length sequence string set and generates a set number of site weight vectors whose dimension equals the number of sequence sites; then initializes each site weight vector through the vector initialization module 32 to obtain the set number of initial site weight vectors with initial component values; then iteratively processes each initial site weight vector through the vector processing module 33, based on the selected optimal solution search algorithm, to obtain a target site weight vector; and finally, through the importance determination module 34, correspondingly determines each target component value in the target site weight vector as the importance of each sequence site in the sequence feature string.

The determination device for the importance of the sequence sites provided by the third embodiment of the invention can accurately and quickly determine the importance of each sequence site in the sequence feature string, and provides effective prediction information for the subsequent prediction of the transcription factor binding sites of the sequence feature string, thereby ensuring the accuracy of the prediction processing of the transcription factor binding sites.

Further, the vector initialization module 32 is specifically configured to:

randomly select, within a set value range, the initial component value of each component of each site weight vector, to obtain the set number of initial site weight vectors with initial component values, wherein the set value range is (0, 1).
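The generation and initialization performed by modules 31 and 32 can be sketched as below. The function name `init_site_weight_vectors` and its parameters are hypothetical; a value of exactly 0.0 is re-drawn so that every component lies in the open range (0, 1).

```python
import random

def init_site_weight_vectors(sequences, n_vectors, rng=None):
    """Create `n_vectors` weight vectors, one component per sequence site."""
    if rng is None:
        rng = random.Random()
    # All sequence feature strings in a fixed-length set share one length.
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequence feature strings must have the same length")
    n_sites = lengths.pop()

    def draw():
        value = rng.random()          # sampled from [0, 1)
        while value == 0.0:           # exclude 0.0 to stay inside (0, 1)
            value = rng.random()
        return value

    return [[draw() for _ in range(n_sites)] for _ in range(n_vectors)]
```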

Further, the vector processing module 33 is specifically configured to:

take each initial site weight vector as an individual of the current population in the selected genetic algorithm; determine the adaptive value of each individual in the current population relative to the fixed-length sequence string set; if the current iteration termination condition is met, determine a target adaptive value meeting the target selection condition, and take the individual corresponding to the target adaptive value as the target site weight vector; otherwise, determine the next-generation population according to the adaptive value, take the next-generation population as the new current population, and return to the operation of determining the adaptive value.

On the basis of the optimization, the determining the next generation population according to the adaptive value comprises the following steps: selecting individuals meeting set selection conditions from the current population according to the adaptive value to serve as a next generation candidate population; and processing the individuals in the next generation candidate population according to the set crossover operator and mutation operator to generate a next generation population.

Further, this embodiment further includes:

a binding site prediction module 35, configured to, after each target component value in the target site weight vector has been correspondingly determined as the importance of each sequence site in the sequence feature string, perform transcription factor binding site prediction on the sequence feature strings in the fixed-length sequence string set by using a set prediction strategy according to the importance of each sequence site and a set similarity scoring formula.

On the basis of the optimization, the similarity scoring formula is expressed as:

wherein A represents any sequence feature string in the fixed-length sequence string set; B represents a sequence feature string labeled as positive data, different from A, in the fixed-length sequence string set; A[i] and B[i] represent the site data of the i-th sequence site in A and B, respectively; S_DNA(A[i], B[i]) represents the substitution score in the set DNA substitution matrix; W(i) represents the importance of the i-th sequence site; and L represents the number of sequence sites of the sequence feature string.

Example four

A fourth embodiment of the present invention provides a computer device, and fig. 4 is a schematic diagram of a hardware structure of the computer device provided in the fourth embodiment of the present invention, as shown in fig. 4, the computer device includes: a processor 41 and a storage device 42, where the processor in the computer device may be one or more, and fig. 4 illustrates one processor 41 as an example; further, the processor and the storage device may be connected by a bus or other means, and the bus connection is exemplified in fig. 4.

The storage device 42 in the computer device serves as a computer-readable storage medium for storing one or more programs, which may be software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the device for determining the importance of sequence sites provided by the embodiment of the present invention (for example, the modules shown in fig. 3, including the vector generation module 31, the vector initialization module 32, the vector processing module 33, and the importance determination module 34, and further including the binding site prediction module 35). The processor 41 executes the various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the storage device 42, that is, it implements the method for determining the importance of sequence sites in the above method embodiments.

The storage device 42 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the device, and the like. Further, the storage 42 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 42 may further include memory located remotely from processor 41, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Moreover, when the one or more programs included in the above computer device are executed by the one or more processors 41, the programs perform the following operations:

determining the number of sequence sites of a sequence feature string in a fixed-length sequence string set, and generating a set number of site weight vectors whose dimension equals the number of sequence sites; initializing each site weight vector to obtain the set number of initial site weight vectors with initial component values; iteratively processing each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector; and correspondingly determining each target component value in the target site weight vector as the importance of each sequence site in the sequence feature string.

In addition, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method for determining the importance of sequence sites provided in the first and second embodiments of the present invention, where the method provided in the first embodiment includes: determining the number of sequence sites of a sequence feature string in a fixed-length sequence string set, and generating a set number of site weight vectors whose dimension equals the number of sequence sites; initializing each site weight vector to obtain the set number of initial site weight vectors with initial component values; iteratively processing each initial site weight vector based on a selected optimal solution search algorithm to obtain a target site weight vector; and correspondingly determining each target component value in the target site weight vector as the importance of each sequence site in the sequence feature string.

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the apparatus for determining the importance of sequence sites, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be realized; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that different embodiments or examples described in this specification, and features of different embodiments or examples, may be combined by one skilled in the art provided they do not contradict each other.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for determining the importance of a sequence site, comprising:

correspondingly determining each target component value in the target site weight vector as the importance of each sequence site in a sequence feature string;

wherein, the acquisition process of the fixed-length sequence string set is as follows: extracting a plurality of sequence feature strings with the same data length from a set of transcription factor data; obtaining a target gene sequence and promoter region information corresponding to each sequence feature string from an Ensembl database to judge whether the corresponding sequence feature string is determined to be a transcription factor binding site or not, wherein the Ensembl database specifically stores the transcription factor binding site information determined by biological experiments at present; recording the sequence characteristic string determined as the transcription factor binding site as a positive sequence characteristic string, and recording the sequence characteristic string determined not to be the transcription factor binding site as a negative sequence characteristic string; selecting negative sequence feature strings with the quantity 10 times that of the positive sequence feature strings, and simultaneously ensuring that the data of each sequence site in the positive sequence feature strings and the negative sequence feature strings are different; and forming a fixed-length sequence string set based on the determined positive sequence feature string and the selected negative sequence feature string.

2. The method of claim 1, wherein initializing each of the site weight vectors comprises:

and randomly selecting initial component values of components in the site weight vectors within a set value range, wherein the set value range is (0, 1).

3. The method of claim 1, wherein iteratively processing each of the initial site weight vectors based on the selected optimal solution search algorithm to obtain a target site weight vector comprises:

taking each initial site weight vector as an individual of the current population in the selected genetic algorithm;

determining the adaptive value of each individual in the current population relative to the fixed-length sequence string set;

if the current iteration termination condition is met, determining a target adaptive value meeting the target selection condition, and taking an individual corresponding to the target adaptive value as a target site weight vector;

otherwise, determining the next generation population according to the adaptive value, and returning the next generation population as a new current population to execute the determination operation of the adaptive value.

4. The method of claim 3, wherein determining the next generation population according to the adaptive value comprises:

selecting individuals meeting set selection conditions from the current population according to the adaptive value to serve as a next generation candidate population;

and processing the individuals in the next generation candidate population according to the set crossover operator and mutation operator to generate a next generation population.

5. The method according to claim 1, further comprising, after correspondingly determining each target component value in the target site weight vector as the importance of each sequence site in a sequence feature string:

and predicting the transcription factor binding sites of the sequence feature strings in the fixed-length sequence string set by adopting a set prediction strategy according to the importance of each sequence site and a set similarity scoring formula.

6. The method of claim 5, wherein the similarity score formula is expressed as:

wherein A represents any sequence feature string in the fixed-length sequence string set; B represents a sequence feature string labeled as positive data, different from A, in the fixed-length sequence string set; A[i] and B[i] represent the site data of the i-th sequence site in A and B, respectively; S_DNA(A[i], B[i]) represents the substitution score in the set DNA substitution matrix; W(i) represents the importance of the i-th sequence site; and L represents the number of sequence sites of the sequence feature string.

7. A device for determining the importance of a sequence site, comprising:

the importance determination module is used for correspondingly determining each target component value in the target site weight vector as the importance of each sequence site in the sequence feature string.

8. The apparatus of claim 7, further comprising:

and the binding site prediction module is used for performing, after each target component value in the target site weight vector has been correspondingly determined as the importance of each sequence site in the sequence feature string, transcription factor binding site prediction on the sequence feature strings in the fixed-length sequence string set by adopting a set prediction strategy according to the importance of each sequence site and a set similarity scoring formula.

9. A computer device, comprising:

one or more processors;

storage means for storing one or more programs;

the one or more programs being executable by the one or more processors to cause the one or more processors to implement the method of determining sequence site importance of any of claims 1-6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of determining the importance of a sequence site according to any one of claims 1 to 6.