CN116403645A

CN116403645A - Method and device for predicting transcription factor binding site

Info

Publication number: CN116403645A
Application number: CN202310232514.9A
Authority: CN
Inventors: 杨梓琨; 顾斐
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2023-03-03
Filing date: 2023-03-03
Publication date: 2023-07-07
Anticipated expiration: 2043-03-03
Also published as: CN116403645B

Abstract

Embodiments of this specification provide a method and a device for predicting transcription factor binding sites. The method comprises: receiving a prediction request for the transcription factor binding sites of a cell to be detected; in response to the prediction request, acquiring the gene sequence of the cell to be detected and target histone modification information of the cell to be detected; and inputting the gene sequence and the target histone modification information into a binding site prediction model to obtain predicted binding site information output by the model, wherein the binding site prediction model is used for predicting the transcription factor binding sites of a cell gene sequence. With this method, histone modification information is added to the prediction process; it gives the model additional information related to transcription factor binding sites and thereby improves prediction accuracy.

Description

Method and device for predicting transcription factor binding site

Technical Field

The embodiment of the specification relates to the technical field of biological information detection, in particular to a method for predicting a transcription factor binding site.

Background

A transcription factor (TF) is a protein that binds to specific regions of DNA (deoxyribonucleic acid) to initiate and control the transcription of the corresponding genes, which is a key step in transcriptional regulation in specific cells. The specific DNA fragments to which transcription factors bind are called transcription factor binding sites, and finding these binding sites in specific cells is of great importance for the study of gene transcription regulation and expression.

The transcription factor binding sites known at present mainly come from biological experiments. However, such experiments have found only a very small fraction of human transcription factor binding sites, and the experiments themselves are very complicated. Because transcription factor binding sites follow specific sequence arrangement rules, a new prediction method is needed to determine them more efficiently and accurately.

Disclosure of Invention

In view of this, embodiments of the present specification provide a method for predicting transcription factor binding sites. One or more embodiments of the present specification also relate to a device for predicting transcription factor binding sites, a computing device, a computer-readable storage medium, and a computer program, which address the technical shortcomings of the prior art.

According to a first aspect of embodiments of the present specification, there is provided a method of predicting a transcription factor binding site, comprising:

receiving a prediction request for a transcription factor binding site of a cell to be detected;

responding to the prediction request, and acquiring a gene sequence of a cell to be detected and target histone modification information of the cell to be detected;

inputting the cell gene sequence to be detected and target histone modification information into a binding site prediction model to obtain predicted binding site information output by the binding site prediction model, wherein the binding site prediction model is used for predicting the transcription factor binding site of the cell gene sequence.

According to a second aspect of embodiments of the present specification, there is provided a device for predicting a transcription factor binding site, comprising:

a receiving module configured to receive a prediction request for a transcription factor binding site of a cell to be detected;

an acquisition module configured to acquire a cell gene sequence to be detected and target histone modification information of the cell to be detected in response to the prediction request;

and the prediction module is configured to input the cell gene sequence to be detected and the target histone modification information into a binding site prediction model to obtain predicted binding site information output by the binding site prediction model, wherein the binding site prediction model is used for predicting the transcription factor binding site of the cell gene sequence.

According to a third aspect of embodiments of the present specification, there is provided a computing device comprising:

a memory and a processor;

the memory is configured to store computer-executable instructions that, when executed by the processor, perform the steps of the transcription factor binding site prediction method described above.

According to a fourth aspect of embodiments of the present specification, there is provided a computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the steps of the above-described transcription factor binding site prediction method.

According to a fifth aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described method of predicting a transcription factor binding site.

The method for predicting the transcription factor binding site provided by the embodiment of the specification comprises the steps of receiving a prediction request for the transcription factor binding site of a cell to be detected; responding to the prediction request, and acquiring a gene sequence of a cell to be detected and target histone modification information of the cell to be detected; inputting the cell gene sequence to be detected and target histone modification information into a binding site prediction model to obtain predicted binding site information output by the binding site prediction model, wherein the binding site prediction model is used for predicting the transcription factor binding site of the cell gene sequence.

With the method provided by the embodiments of this specification, histone modification information is added to the process of predicting transcription factor binding sites. The histone modification information gives the model additional information related to the binding sites, which improves prediction accuracy. Because histone modifications occur at different sites in different cells, they also give the binding site prediction model the ability to predict cell-specific transcription factor binding sites, and at the same time make it possible to analyze unknown biological samples.

Drawings

FIG. 1 is a schematic diagram of a method for predicting a transcription factor binding site according to one embodiment of the present disclosure;

FIG. 2 is a flow chart of a method for predicting a transcription factor binding site according to one embodiment of the present disclosure;

FIG. 3 is a flow chart of a binding site predictive model training method provided in one embodiment of the present disclosure;

FIG. 4 is a flow chart of a method for predicting transcription factor binding sites, applied to a gastric mucosal cell scenario, according to one embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a device for predicting a transcription factor binding site according to one embodiment of the present disclosure;

FIG. 6 is a block diagram of a computing device provided in one embodiment of the present description.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. The specification can, however, be implemented in many forms other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the specification is therefore not limited by the specific implementations disclosed below.

The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.

It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present specification. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in this specification are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related country and region, and are provided with corresponding operation entries for the user to select authorization or rejection.

First, terms related to one or more embodiments of the present specification will be explained.

TF: Transcription Factor.

TFBS: Transcription Factor Binding Site.

A transcription factor (TF) is a protein that binds to specific regions of DNA (deoxyribonucleic acid) to initiate and control the transcription of the corresponding genes. This binding is a critical step in the regulation of transcription in specific cells. The specific DNA fragments to which transcription factors bind are called transcription factor binding sites, and finding these sites in different cells and tissues is of great importance for the study of transcription regulation, gene expression, cell differentiation, and the formation of gene phenotypic characteristics.

In the present specification, a method for predicting a transcription factor binding site is provided. The present specification also relates to a device for predicting a transcription factor binding site, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.

FIG. 1 is a schematic diagram of a method for predicting a transcription factor binding site according to an embodiment of the present disclosure. As shown in FIG. 1, the method is applied to a terminal 100, which may be a terminal device such as a notebook computer, an intelligent terminal, a server, or a cloud server.

In the terminal 100, the gene sequence of a cell to be detected is obtained. The cell to be detected is a cell, or a cell of a tissue, whose transcription factor binding sites are to be predicted, such as a cardiovascular cell, a cerebrovascular cell, a pulmonary cell, or a lymphocyte. The gene sequence of the cell to be detected refers to the whole genome sequence collected from that cell. In practice, the genome sequence is the same across the cells of one organism, but different cells have different transcription factor binding sites. For example, for a gene sequence with sites {1, 2, 3, 4, ...}, the transcription factor binding sites of a cardiovascular cell may be sites {1, 2, 3, 4, ...}, while those of a cerebrovascular cell may be sites {21, 22, 23, 24, ...}, and so on. In the prior art, the transcription factor binding sites of a cell to be detected cannot be predicted from its gene sequence alone.

In the method provided by this specification, in addition to the gene sequence of the cell to be detected, histone modification information is also used as input. The histone modification information represents a histone modification value for each gene site, and the histone modification information corresponding to each cell gene sequence can be obtained from a public data set.

In practice, transcription factor binding sites follow specific sequence arrangement rules, and these rules can be represented by histone modification information. The method therefore inputs the cell gene sequence to be detected and the corresponding histone modification information into a binding site prediction model together. An embedding layer of the model converts the gene sequence and the histone modification information into corresponding vectors; an embedding information balance layer calculates a weight value for each vector; the two vectors are fused based on the weight values and input into an encoder for feature extraction; and finally an output layer outputs the predicted binding site information.

With this method, histone modification information is combined into the process of predicting binding site information, which gives the prediction an additional reference and improves its accuracy. Because histone modification information differs between cells, the binding site information of each cell to be detected is predicted from the histone modification information corresponding to that cell, which greatly improves prediction accuracy.

FIG. 2 shows a flowchart of a method for predicting a transcription factor binding site according to an embodiment of the present disclosure, which specifically includes the following steps.

Step 202: receive a prediction request for a transcription factor binding site of a cell to be detected.

The cell to be detected is the cell whose transcription factor binding sites are to be predicted, such as a cardiovascular cell, a cerebrovascular cell, or a lymphocyte. In practice, every type of cell carries the same gene sequence, but each type of cell has different transcription factor binding sites. At present, the binding sites of each type of cell are determined through biological experiments, which are cumbersome and consume considerable labor and material resources. A new way of determining the transcription factor binding sites of a cell to be detected is therefore needed.

The prediction request is a request to predict the transcription factor binding sites of the cell to be detected. For example, for a cell A of unknown type whose transcription factor binding sites need to be predicted, a prediction request for the transcription factor binding sites of cell A is received.

Based on this, the method provided in the present specification may be applied to a terminal, which may be a personal computer, an intelligent terminal, a server, a cloud server, or the like. When the terminal is a personal computer or a server, it can directly obtain the prediction request that the user sends for the transcription factor binding sites of the cell to be detected. When the terminal is a cloud server, the user can send the prediction request to the cloud server from an end-side device, and the cloud server receives the prediction request.

In a specific embodiment provided in the present disclosure, taking the case where the terminal is a cloud server as an example, the cloud server receives a prediction request sent by a user through an end-side device, where the prediction request is for predicting the transcription factor binding sites of a cell A to be detected.

Step 204: in response to the prediction request, acquire the gene sequence of the cell to be detected and the target histone modification information of the cell to be detected.

After receiving the prediction request, the terminal obtains the gene sequence of the cell to be detected and the target histone modification information of the cell to be detected according to the request. It should be noted that the gene sequence of the cell to be detected can be the whole genome sequence of the cell or a part of that whole genome sequence.

The gene sequence of the cell to be detected can be obtained by processing the cell accordingly. The target histone modification information corresponding to the cell can be obtained from a public data set or from biological experiment data. For example, when the cell to be detected is a cardiovascular cell, a query of the public data set may show that the target histone modification information corresponding to cardiovascular cells is histone modification information 1; when the cell to be detected is a cerebrovascular cell, the query may show that the corresponding target histone modification information is histone modification information 2, and so on. It should be noted that histone modification information is a set of data, and the labels "histone modification information 1" and "histone modification information 2" used in this embodiment only distinguish the different histone modification information corresponding to different cells.

In a specific embodiment provided in the present specification, taking a cardiovascular cell as the cell to be detected, the gene sequence a of the cell to be detected is extracted from the cardiovascular cell by a gene extraction technology, and a query of the public data set shows that the target histone modification information corresponding to the cardiovascular cell is a-a.
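As an illustration of this acquisition step, the following minimal Python sketch pairs a gene sequence with the histone modification data of its cell type. The names `PUBLIC_HISTONE_DATA`, `ModelInput`, and `acquire_inputs` are hypothetical, the numeric values are invented for illustration, and an in-memory dictionary merely stands in for a public data set; the specification does not prescribe any of these details.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in for a public data set: cell type -> one histone
# modification value per gene site. The numbers are made up for illustration.
PUBLIC_HISTONE_DATA: Dict[str, List[float]] = {
    "cardiovascular": [0.8, 0.1, 0.0, 0.5],
    "cerebrovascular": [0.0, 0.7, 0.9, 0.2],
}


@dataclass
class ModelInput:
    gene_sequence: str
    histone_values: List[float]


def acquire_inputs(cell_type: str, gene_sequence: str) -> ModelInput:
    """Pair a gene sequence with the histone data recorded for its cell type."""
    if cell_type not in PUBLIC_HISTONE_DATA:
        raise KeyError(f"no histone modification data for cell type {cell_type!r}")
    histone_values = PUBLIC_HISTONE_DATA[cell_type]
    # Assumption: the data set stores exactly one value per gene site.
    if len(histone_values) != len(gene_sequence):
        raise ValueError("expected one histone value per gene site")
    return ModelInput(gene_sequence, list(histone_values))


# The same sequence yields different model inputs for different cell types.
heart = acquire_inputs("cardiovascular", "ATCG")
brain = acquire_inputs("cerebrovascular", "ATCG")
```

The point of the sketch is that the gene sequence is identical for both cell types, while the histone values differ and therefore carry the cell-specific signal.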

Step 206: inputting the cell gene sequence to be detected and target histone modification information into a binding site prediction model to obtain predicted binding site information output by the binding site prediction model, wherein the binding site prediction model is used for predicting the transcription factor binding site of the cell gene sequence.

After the cell gene sequence to be detected and the target histone modification information are obtained, they are input into the binding site prediction model, which performs the prediction and outputs the predicted binding site information corresponding to the cell gene sequence to be detected.

In practice, the binding site prediction model is trained to predict the binding site information in the cell gene sequence to be detected based on that sequence and the target histone modification information. The binding site prediction model is a neural network model based on the BERT model. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model in deep learning, and the encoder of the BERT model uses MHA (multi-head attention), so that features of the input vectors can be extracted from multiple dimensions, which facilitates subsequent feature processing.
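To make the idea of multi-head attention concrete, here is a simplified pure-Python sketch. It is not the model of this specification: the learned query, key, value, and output projections that a real BERT encoder uses are omitted (each head attends directly over its own slice of the input), and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return output


def multi_head_self_attention(x: Matrix, num_heads: int) -> Matrix:
    """Split the feature dimension into heads, attend per head, concatenate."""
    dim = len(x[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        part = [row[lo:hi] for row in x]  # this head's slice of every position
        heads.append(attention(part, part, part))
    # Concatenate the head outputs position by position.
    return [sum((head[i] for head in heads), []) for i in range(len(x))]


tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
attended = multi_head_self_attention(tokens, num_heads=2)
```

Each head sees a different slice of the feature vector, which is the sense in which features are extracted "from multiple dimensions"; the output keeps the shape of the input.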

Specifically, the binding site prediction model comprises an embedding layer, an embedding information balance layer, a coding layer, and an output layer.

Correspondingly, inputting the gene sequence of the cell to be detected and the target histone modification information into the binding site prediction model to obtain the predicted binding site information output by the binding site prediction model comprises steps S2062 to S2068:

S2062, inputting the cell gene sequence to be detected and the target histone modification information into the embedding layer to obtain the gene sequence characteristic information corresponding to the cell gene sequence to be detected and the histone characteristic information corresponding to the target histone modification information.

The embedding layer is used for carrying out embedding treatment on the cell gene sequence to be detected and the target histone modification information to obtain the characteristic information of the gene sequence corresponding to the cell gene sequence to be detected and obtain the characteristic information of the histone corresponding to the target histone modification information.

In one embodiment provided in the present specification, taking the cell gene sequence to be detected { A, T, C, G, ..., A, T, C, G } as an example, the sequence is input into the embedding layer for processing to obtain the corresponding gene sequence characteristic information 1, whose dimension is n. At the same time, the target histone modification information is input into the embedding layer for processing to obtain the histone characteristic information 0 corresponding to the target histone modification information.

Specifically, inputting the cell gene sequence to be detected and the target histone modification information into the embedded layer to obtain the characteristic information of the gene sequence corresponding to the cell gene sequence to be detected and the characteristic information of the histone corresponding to the target histone modification information, which comprises the following steps:

generating a to-be-detected gene subsequence set according to the to-be-detected cell gene sequence, and inputting the to-be-detected gene subsequence set and the target histone modification information into the embedded layer to obtain a gene sequence feature matrix corresponding to the to-be-detected cell gene sequence and a histone feature matrix corresponding to the target histone modification information.

In practical application, in order to better extract the characteristic information of the cell gene sequence to be detected, a plurality of gene subsequences to be detected can be generated according to the cell gene sequence to be detected to form a gene subsequence set to be detected.

Inputting the gene subsequence set to be detected into an embedding layer for embedding treatment to obtain a gene sequence feature matrix corresponding to the gene subsequence set to be detected, and inputting the target histone modification information into the embedding layer for embedding treatment to obtain a histone feature matrix corresponding to the target histone modification information.

When processing the cell gene sequence to be detected, k-mer tokenization can be used to divide the sequence into a plurality of gene subsequences to be detected. Each gene subsequence is embedded, and the embedding results are then combined into the gene sequence feature matrix corresponding to the cell gene sequence to be detected. In a specific embodiment provided in this specification, generating the gene subsequence set to be detected according to the cell gene sequence to be detected comprises:

determining preset segmentation length information;

and cutting the cell gene sequence to be detected according to the preset cutting length information to obtain a gene subsequence set to be detected.

In the embodiment provided in this specification, let the preset segmentation length be k; cutting the cell gene sequence to be detected based on the length k yields a plurality of gene subsequences to be detected, each of length k. For example, if the preset segmentation length is 3 sites, cutting the cell gene sequence to be detected yields a plurality of gene subsequences that are each 3 sites long.

Further, cutting the cell gene sequence to be detected according to the preset segmentation length information to obtain the gene subsequence set to be detected comprises:

determining a cutting start site in the cell gene sequence to be detected;

cutting the cell gene sequence to be detected based on the cutting start site and the preset segmentation length information, and adding the cutting result to the gene subsequence set to be detected.

The cutting start site is the site at which the current cut begins. In practice, cutting start sites can be determined one after another in the order of the cell gene sequence to be detected. In one embodiment provided herein, the sites of the cell gene sequence to be detected are {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} and the sequence itself is { A, T, C, G, A, T, C, G, A, T }. The first cut takes site 1 as its cutting start site, the second cut takes site 2, the third cut takes site 3, and so on.

After the cutting start site is determined, the cell gene sequence to be detected is cut using the preset segmentation length information k determined in the above steps. Taking the sequence { A, T, C, G, A, T, C, G, A, T } and k = 3 as an example, the first cut starts at site 1 and yields the gene subsequence { A, T, C }; the second cut starts at site 2 and yields the gene subsequence { T, C, G }; and so on.

After each cut is finished, the cutting result (that is, the gene subsequence to be detected) is added to the gene subsequence set to be detected.

Through the segmentation process, one cell gene sequence to be detected is segmented into a plurality of gene subsequences to be detected, so that the characteristic information of the cell gene sequence to be detected can be better extracted, and the processing accuracy is improved when corresponding processing is carried out according to the characteristic information.
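The cutting procedure described above, in which the start site advances by one site per cut and each cut has length k, can be sketched in Python as follows. The function name `split_into_kmers` is hypothetical, and stopping once fewer than k sites remain is an assumption, since the specification does not state how the end of the sequence is handled.

```python
from typing import List


def split_into_kmers(sequence: str, k: int) -> List[str]:
    """Cut `sequence` into overlapping subsequences of length k.

    The start site advances by one position per cut. Start positions are
    0-based here, whereas the text numbers sites from 1.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    kmers = []
    # Assumption: stop when fewer than k sites remain after the start site.
    for start in range(len(sequence) - k + 1):
        kmers.append(sequence[start:start + k])
    return kmers


# The 10-site example sequence from the text, with k = 3.
kmers = split_into_kmers("ATCGATCGAT", 3)
print(kmers[:2])  # ['ATC', 'TCG']
```

For the 10-site example with k = 3, this produces 8 overlapping subsequences, the first two of which match the subsequences { A, T, C } and { T, C, G } given in the text.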

In one embodiment provided herein, the embedding layer includes a first embedding layer and a second embedding layer;

correspondingly, inputting the gene subsequence set to be detected and the target histone modification information into the embedding layer to obtain a gene sequence feature matrix corresponding to the cell gene sequence to be detected and a histone feature matrix corresponding to the target histone modification information comprises:

inputting the gene subsequence set to be detected into the first embedding layer to obtain the gene sequence feature matrix;

inputting the target histone modification information into the second embedding layer to obtain the histone feature matrix.

The first embedding layer embeds the gene subsequence set to be detected, and the second embedding layer embeds the target histone modification information. That is, the first embedding layer processes the gene subsequence set to be detected to obtain the gene sequence feature matrix, and the second embedding layer processes the target histone modification information to obtain the histone feature matrix.

In practice, the first embedding layer generates a feature vector for each gene subsequence to be detected, and the gene sequence feature matrix corresponding to the cell gene sequence to be detected is generated from these feature vectors. Similarly, the target histone modification information is a group of numbers; the second embedding layer embeds each number to obtain a corresponding feature vector, and the histone feature matrix corresponding to the target histone modification information is generated from these feature vectors.
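A minimal sketch of the two embedding layers is given below. It assumes that the first layer is a lookup table with one vector per possible k-mer and that the second layer maps each histone value to a vector by scaling a single direction vector; both choices, the dimension `EMBED_DIM`, and all names are illustrative assumptions. In a trained model these parameters would be learned, whereas here they are random.

```python
import random
from itertools import product
from typing import Dict, List

EMBED_DIM = 4  # assumed embedding dimension
K = 3          # assumed k-mer length

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_vector(dim: int) -> List[float]:
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# First embedding layer (assumption): one vector per possible k-mer.
KMER_TABLE: Dict[str, List[float]] = {
    "".join(bases): random_vector(EMBED_DIM) for bases in product("ATCG", repeat=K)
}

# Second embedding layer (assumption): each number scales one direction vector.
HISTONE_DIRECTION = random_vector(EMBED_DIM)


def embed_kmers(kmers: List[str]) -> List[List[float]]:
    """Return the gene sequence feature matrix, one row per k-mer."""
    return [KMER_TABLE[kmer] for kmer in kmers]


def embed_histone(values: List[float]) -> List[List[float]]:
    """Return the histone feature matrix, one row per histone value."""
    return [[value * w for w in HISTONE_DIRECTION] for value in values]


gene_matrix = embed_kmers(["ATC", "TCG", "CGA"])
histone_matrix = embed_histone([0.8, 0.1, 0.0])
```

Both matrices end up with one row per position and `EMBED_DIM` columns, which is what allows them to be weighted and fused in the later steps.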

S2064, inputting the gene sequence characteristic information and the histone characteristic information into the embedding information balance layer to obtain the gene sequence weight of the gene sequence characteristic information and the histone weight of the histone characteristic information.

The embedding information balance layer assigns weight information according to the gene sequence feature matrix and the histone feature matrix. It adjusts the proportion between the two matrices, which facilitates the subsequent fusion.

In practice, the embedding information balance layer assigns a gene sequence weight and a histone weight according to the gene sequence feature matrix and the histone feature matrix. It balances information from different dimensions and can dynamically adjust the balance between the gene sequence feature matrix and the histone feature matrix.

It should be noted that the embedding information balance layer may assign different gene sequence weights and histone weights for different gene sequence feature matrices and histone feature matrices. For example, when the gene sequence feature matrix is E1 and the histone feature matrix is E0, the embedding information balance layer assigns gene sequence weight a1 and histone weight b1; when the gene sequence feature matrix is E2 and the histone feature matrix is E0, it assigns gene sequence weight a2 and histone weight b2; and so on.
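The specification does not state how the balance layer computes its weights. One possible realization, shown purely as an assumption, is to mean-pool each feature matrix, score the pooled vector with a scoring vector, and apply a softmax over the two scores so that the weights are input-dependent and sum to 1. The names `balance_weights` and `score_vector` are hypothetical, and the scoring vector would be a learned parameter in a real model.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def mean_pool(matrix: Matrix) -> List[float]:
    """Average the rows of a matrix into a single vector."""
    rows = len(matrix)
    return [sum(column) / rows for column in zip(*matrix)]


def balance_weights(
    gene_matrix: Matrix, histone_matrix: Matrix, score_vector: List[float]
) -> Tuple[float, float]:
    """Return (gene sequence weight, histone weight); the two sum to 1."""
    gene_score = sum(x * w for x, w in zip(mean_pool(gene_matrix), score_vector))
    histone_score = sum(x * w for x, w in zip(mean_pool(histone_matrix), score_vector))
    # Softmax over the two scores, shifted by the maximum for stability.
    peak = max(gene_score, histone_score)
    gene_exp = math.exp(gene_score - peak)
    histone_exp = math.exp(histone_score - peak)
    total = gene_exp + histone_exp
    return gene_exp / total, histone_exp / total


E0 = [[1.0, 0.0], [0.0, 1.0]]  # histone feature matrix (toy values)
E1 = [[0.5, 0.5], [0.5, 0.5]]  # first gene sequence feature matrix
E2 = [[2.0, 0.0], [2.0, 0.0]]  # second gene sequence feature matrix
score_vector = [1.0, -1.0]     # hypothetical scoring parameters

a1, b1 = balance_weights(E1, E0, score_vector)
a2, b2 = balance_weights(E2, E0, score_vector)
```

With these toy values the same histone matrix E0 receives a different weight depending on whether it is paired with E1 or E2, which mirrors the behavior described for (a1, b1) and (a2, b2).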

S2066, inputting the gene sequence characteristic information, the histone characteristic information, the gene sequence weight and the histone weight into the coding layer to obtain the gene sequence coding characteristic information corresponding to the gene sequence characteristic information.

After the characteristic information of the gene sequence, the characteristic information of the histone, the weight of the gene sequence and the weight of the histone are determined, the information is input into the coding layer for processing, and the characteristic information of the gene sequence coding corresponding to the characteristic information of the gene sequence is obtained.

Specifically, the gene sequence characteristic information, the histone characteristic information, the gene sequence weight and the histone weight are spliced to generate a combination vector, and then the combination vector is input into a coding layer for coding treatment, so that the gene sequence coding characteristic information output by the coding layer is obtained.

Based on this, inputting the gene sequence characteristic information, the histone characteristic information, the gene sequence weight and the histone weight to the coding layer, obtaining gene sequence coding characteristic information corresponding to the gene sequence characteristic information, including:

combining the gene sequence characteristic information and the histone characteristic information according to the gene sequence weight and the histone weight to obtain gene combination characteristic information to be detected;

inputting the gene combination characteristic information to be detected into the coding layer to obtain the gene sequence coding characteristic information corresponding to the gene sequence characteristic information.

The gene combination characteristic information to be detected specifically refers to the characteristic information obtained by fusing the gene sequence characteristic information and the histone characteristic information according to the gene sequence weight and the histone weight. The fusion may be performed by weighted summation according to the respective weights, or by feature splicing after weighting according to the respective weights. In the embodiments provided in the present specification, the fusion mode is not limited and depends on the actual application.

In one embodiment provided in this specification, weighted summation is taken as an example of fusion: the gene sequence characteristic information and the histone characteristic information are weighted and summed according to the gene sequence weight and the histone weight to obtain the gene combination characteristic information to be detected.

In a specific embodiment provided in the present specification, taking weighted summation as an example, with the gene sequence characteristic information E1 and the histone characteristic information E0, where the gene sequence weight corresponding to E1 is a1 and the histone weight corresponding to E0 is b1, the gene combination characteristic information to be detected E = a1 × E1 + b1 × E0 can be obtained by weighted summation.
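The two fusion modes mentioned in this embodiment, weighted summation and weighting followed by feature splicing, can be sketched as follows. The sketch assumes that E1 and E0 are matrices of equal shape and that a1 and b1 are scalars; the function names are illustrative only.

```python
def weighted_sum(E1, E0, a1, b1):
    """Element-wise E = a1 * E1 + b1 * E0 for two matrices of equal shape."""
    return [[a1 * x + b1 * y for x, y in zip(r1, r0)] for r1, r0 in zip(E1, E0)]

def weighted_concat(E1, E0, a1, b1):
    """Alternative fusion: scale each matrix, then splice the rows feature-wise."""
    return [[a1 * x for x in r1] + [b1 * y for y in r0] for r1, r0 in zip(E1, E0)]

E1 = [[1.0, 2.0], [3.0, 4.0]]        # gene sequence characteristic information (toy data)
E0 = [[10.0, 20.0], [30.0, 40.0]]    # histone characteristic information (toy data)
a1, b1 = 0.5, 0.25                   # illustrative weights

E_sum = weighted_sum(E1, E0, a1, b1)      # [[3.0, 6.0], [9.0, 12.0]]
E_cat = weighted_concat(E1, E0, a1, b1)   # each row has twice as many features
```

Weighted summation keeps the feature width unchanged, whereas splicing doubles it, so the coding layer that follows would need to be sized accordingly.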

In the method provided by the embodiment of the specification, the coding layer is based on a multi-head attention mechanism and includes a multi-head attention layer, which extracts features from the gene combination characteristic information to be detected and finally outputs the gene sequence coding characteristic information corresponding to the gene sequence characteristic information.
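As a rough illustration of a multi-head attention layer, the following sketch applies scaled dot-product self-attention per head and concatenates the head outputs. It is a simplification under stated assumptions: the query, key and value projections are taken to be the identity (a trained layer would learn them), there is no output projection, and the function names and toy input are not from the specification.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of row vectors."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def multi_head_self_attention(x, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate."""
    dim = len(x[0])
    assert dim % num_heads == 0, "feature width must divide evenly into heads"
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * head_dim:(h + 1) * head_dim] for row in x]
        heads.append(attention(part, part, part))  # identity projections (assumption)
    # Concatenate the per-head outputs row by row.
    return [sum((heads[h][i] for h in range(num_heads)), []) for i in range(len(x))]

fused = [[1.0, 0.0, 0.5, 0.5],     # gene combination characteristic information (toy data)
         [0.0, 1.0, 0.5, 0.5],
         [1.0, 1.0, 0.0, 1.0]]
encoded = multi_head_self_attention(fused, num_heads=2)
```

Each output row is a weighted average of the input rows within each head, so the output keeps the shape of the input while mixing information across positions.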

S2068, inputting the gene sequence coding characteristic information into the output layer to obtain a prediction result corresponding to the gene sequence coding characteristic information.

After the gene sequence coding characteristic information is obtained, the gene sequence coding characteristic information can be input into an output layer of a binding site prediction model to obtain a prediction result corresponding to the gene sequence coding characteristic information, and then whether the cell gene sequence to be detected is a transcription factor binding site is judged.

In practical application, the prediction result for the gene sequence coding characteristic information can be expressed in various ways. For example, it may be a prediction probability, that is, the probability that the cell gene sequence to be detected corresponding to the gene sequence coding characteristic information is a transcription factor binding site; it may also be yes or no, that is, the cell gene sequence to be detected corresponding to the gene sequence coding characteristic information either is or is not a transcription factor binding site.

Preferably, the output layer includes a classifier;

inputting the gene sequence coding characteristic information into the output layer to obtain a prediction result corresponding to the gene sequence coding characteristic information, wherein the method comprises the following steps:

inputting the gene sequence coding characteristic information into the classifier;

and obtaining a classification result corresponding to the gene sequence coding characteristic information output by the classifier, wherein the classification result is yes or no.

In a preferred embodiment provided in the present specification, the output layer includes a classifier, preferably a binary classifier. The gene sequence coding characteristic information is input into the classifier, and the classifier determines whether the cell gene sequence to be detected corresponding to the gene sequence coding characteristic information is a transcription factor binding site; if so, the prediction result is output as yes, otherwise the prediction result is output as no.
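A yes/no output layer of this kind can be sketched as mean pooling followed by a linear layer, a sigmoid and a threshold. The pooling step, the threshold of 0.5 and the parameter values are assumptions made for illustration; the specification only states that the classifier outputs yes or no.

```python
import math

def classify(encoded, weights, bias, threshold=0.5):
    """Mean-pool the encoded features, apply a linear layer and a sigmoid.

    Returns the predicted probability and the corresponding yes/no result.
    """
    pooled = [sum(col) / len(encoded) for col in zip(*encoded)]
    logit = sum(p * w for p, w in zip(pooled, weights)) + bias
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, ("yes" if probability >= threshold else "no")

encoded = [[0.2, 0.8], [0.6, 0.4]]   # gene sequence coding characteristic information (toy data)
prob_pos, label_pos = classify(encoded, weights=[2.0, 2.0], bias=0.0)
prob_neg, label_neg = classify(encoded, weights=[-2.0, -2.0], bias=0.0)
```

The same function covers both expression modes described above: the probability can be reported directly, or thresholded into yes or no.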

In another embodiment provided in the present specification, referring to fig. 3, fig. 3 shows a flowchart of a method for training a binding site prediction model according to an embodiment of the present specification, wherein the binding site prediction model is obtained by training the following steps:

step 302: the method comprises the steps of obtaining a sample cell gene sequence, sample histone modification information and a sample sequence classification label, wherein the sample histone modification information comprises positive sample histone modification information and negative sample histone modification information, and the sample sequence classification label comprises a sample sequence positive label and a sample sequence negative label.

In practical application, the binding site prediction model is pre-trained, and in a specific embodiment provided in the present specification, sample data for training the binding site prediction model is obtained first, where the sample data includes a sample cell gene sequence, sample histone modification information, and a sample sequence classification tag, and the sample cell gene sequence specifically refers to a gene sequence of a certain cell; the sample histone modification information specifically refers to histone modification information for reference, and the sample histone modification information comprises positive sample histone modification information and negative sample histone modification information; the sample sequence classification tag specifically refers to tag information used for determining whether a sample cell gene sequence is a transcription factor binding site in a model training process, and comprises a sample sequence positive tag and a sample sequence negative tag.

Furthermore, positive sample histone modification information and a positive label of a sample sequence appear in pairs to be used as a positive sample of a sample cell gene sequence; negative sample histone modification information and sample sequence negative tags appear in pairs as negative samples of sample cell gene sequences.

In the examples provided herein, the binding site prediction model is model trained by positive and negative samples of sample cellular gene sequences.

In practical application, obtaining a sample cell gene sequence, sample histone modification information and a sample sequence classification tag comprises:

obtaining a sample cell gene sequence;

determining a sample sequence positive tag and a sample sequence negative tag in the sample cell gene sequence;

and determining histone modification information corresponding to the positive tag of the sample sequence as positive sample histone modification information, and determining histone modification information except the positive sample histone modification information as negative sample histone modification information corresponding to the negative tag of the sample sequence.

Specifically, the cells with transcription factor binding sites determined after biological experiments are taken as sample cells, the sample cell gene sequences corresponding to the sample cells are extracted, and meanwhile, the positive labels and the negative labels of the sample sequences are determined based on the transcription factor binding sites corresponding to the sample cells. Further, the positive tag of the sample sequence includes the determined transcription factor binding site, and the negative tag of the sample sequence does not include the transcription factor binding site.

In practical applications, determining a sample sequence positive tag and a sample sequence negative tag in the sample cell gene sequence comprises:

determining the binding site information in the sample cell gene sequence as a sample sequence positive tag;

randomly determining a sample sequence negative tag in other sites in the sample cell gene sequence except for the binding site information.

Site information within a preset range of a transcription factor binding site in the sample cell gene sequence is determined as a sample sequence positive tag, and a sample sequence negative tag is randomly determined among the gene sites other than the transcription factor binding sites.

After the positive sample sequence tag and the negative sample sequence tag are determined, taking histone modification information corresponding to the positive sample sequence tag as positive sample histone modification information; and taking other histone modification information except the positive sample histone modification information as negative sample histone modification information.
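The construction of positive and negative samples can be sketched as follows. The sketch assumes that the histone modification information is a per-position numeric signal, that the preset range is a fixed-length window centred on each known binding site, and that one negative window is drawn for each positive window; `build_samples` and its parameters are hypothetical names.

```python
import random

def build_samples(sequence_length, histone_signal, binding_sites, window, seed=0):
    """Return balanced (start, histone_slice, label) tuples.

    Positive windows are centred on known binding sites; negative windows
    are drawn at random from start positions that overlap no positive window.
    """
    rng = random.Random(seed)
    half = window // 2
    positives, blocked = [], set()
    for site in binding_sites:
        start = max(0, min(site - half, sequence_length - window))
        positives.append((start, histone_signal[start:start + window], 1))
        # Any start within (start - window, start + window) would overlap.
        blocked.update(range(start - window + 1, start + window))
    candidates = [s for s in range(sequence_length - window + 1) if s not in blocked]
    negatives = [(s, histone_signal[s:s + window], 0)
                 for s in rng.sample(candidates, len(positives))]
    return positives + negatives

signal = [float(i % 5) for i in range(100)]   # toy per-position histone values
samples = build_samples(100, signal, binding_sites=[20, 70], window=10)
```

Drawing as many negative windows as positive windows yields a data set with equal numbers of positive and negative samples.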

In practical application, in order to improve the training efficiency of the model, a plurality of sample cells that have undergone biological experiments can be used to generate sample histone modification information and sample sequence classification labels, and the positive samples and negative samples corresponding to the plurality of sample cells are trained jointly. It should be noted that the method provided in the present specification preferably trains a binary classifier, so the positive samples and negative samples corresponding to the sample cells can be constructed into a binary classification data set balanced between positive and negative samples, that is, the number of positive samples is the same as the number of negative samples.

Step 304: and inputting the sample cell gene sequence and the sample histone modification information into a binding site prediction model to obtain a prediction sequence classification label corresponding to the sample cell gene sequence.

After training data of a model are obtained, inputting a sample cell gene sequence and sample histone modification information into a binding site prediction model to obtain a predicted sequence classification tag output by the binding site prediction model, and when the sample cell gene sequence and the positive sample histone modification information are input into the binding site prediction model, obtaining a predicted sequence classification tag of a positive sample; when the sample cell gene sequence and negative sample histone modification information are input into the binding site prediction model, a negative sample predicted sequence classification tag is obtained.

Step 306: and calculating a model loss value according to the prediction sequence classification label and the sample sequence classification label.

After the predicted sequence classification tag is obtained, the model loss value can be calculated according to the predicted sequence classification tag and the sample sequence classification tag. In the method provided in the present specification, the model loss value can be calculated in many ways, for example, with a cross entropy loss function, a maximum loss function, an average loss function, and the like. The specific form of the loss function is not limited in the present specification and depends on the actual application.
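Taking the cross entropy loss function as an example, a minimal sketch for the binary case is given below. It assumes that the predicted sequence classification tag is available as a probability and that the sample sequence classification tag is 1 for a positive tag and 0 for a negative tag.

```python
import math

def binary_cross_entropy(predicted, labels, eps=1e-12):
    """Mean binary cross entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predicted, labels):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 0, 1, 0]                                      # sample sequence classification tags
good = binary_cross_entropy([0.9, 0.1, 0.8, 0.2], labels)  # predictions close to the tags
bad = binary_cross_entropy([0.4, 0.6, 0.3, 0.7], labels)   # predictions far from the tags
```

Predictions that agree with the tags give a smaller loss value than predictions that disagree, which is what the parameter adjustment in the next step relies on.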

Step 308: and adjusting model parameters of the binding site prediction model according to the model loss value, and continuing training the binding site prediction model until a model training stopping condition is reached.

After the model loss value is obtained, the binding site prediction model can be adjusted according to the model loss value, specifically, the model loss value is subjected to back propagation to sequentially update model parameters in the binding site prediction model, and the model parameters are used for assisting the binding site prediction model in generating predicted binding site information according to the cell gene sequence and the target histone modification information.

After the model parameters are adjusted, the above steps can be repeated to continue training the binding site prediction model until the training stop condition is reached. In practical application, the training stop condition of the binding site prediction model includes:

the model loss value is smaller than a preset threshold value; and/or

the number of training rounds reaches the preset number of training rounds.

Specifically, in the process of training the binding site prediction model, the training stop condition may be set as the model loss value being smaller than the preset threshold value, or as reaching a preset number of training rounds, for example, 10 training rounds. In the present specification, the preset threshold value of the loss value and/or the preset number of training rounds are not specifically limited and depend on the actual application.
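The two stop conditions can be illustrated with a short training loop. A one-feature logistic regression trained by gradient descent stands in for the binding site prediction model here; this substitution, the learning rate and the toy data are assumptions made only so that the loop is runnable.

```python
import math

def train(features, labels, loss_threshold, max_rounds, lr=1.0):
    """Train a logistic-regression stand-in; return (params, loss, rounds)."""
    w, b = [0.0] * len(features[0]), 0.0
    loss, rounds = float("inf"), 0
    while rounds < max_rounds:                     # stop condition: preset training rounds
        probs = []
        for x in features:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            probs.append(min(max(p, 1e-12), 1.0 - 1e-12))
        loss = -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                    for p, y in zip(probs, labels)) / len(labels)
        if loss < loss_threshold:                  # stop condition: loss below threshold
            break
        # Gradient of the mean cross entropy, used to update the parameters.
        errors = [p - y for p, y in zip(probs, labels)]
        for j in range(len(w)):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, features)) / len(labels)
        b -= lr * sum(errors) / len(labels)
        rounds += 1
    return (w, b), loss, rounds

features = [[2.0], [1.5], [-1.5], [-2.0]]
labels = [1, 1, 0, 0]
_, loss_a, rounds_a = train(features, labels, loss_threshold=0.5, max_rounds=1000)
_, loss_b, rounds_b = train(features, labels, loss_threshold=0.0, max_rounds=10)
```

The first call stops as soon as the loss falls below the threshold; the second call uses a threshold that cannot be reached and therefore stops after the preset 10 rounds.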

An embedded information balance layer and a coding layer based on a multi-head attention mechanism are added in the binding site prediction model, and the embedded information balance layer can well adjust the proportion between the characteristic information of the gene sequence and the characteristic information of the histone, so that the binding site can be predicted accurately.

The following describes a method for predicting a transcription factor binding site, taking the application of the method for predicting a transcription factor binding site provided in the present specification to a gastric mucosal cell scenario as an example, with reference to fig. 4. FIG. 4 is a flowchart showing a method for predicting a transcription factor binding site according to one embodiment of the present disclosure, and specifically includes the following steps.

Step 402: a predictive request for a transcription factor binding site for a gastric mucosal cell is received.

Step 404: obtaining a cell gene sequence A to be detected and target histone modification information B of the gastric mucosa cells.

Step 406: and inputting the cell gene sequence A to be detected and the target histone modification information B into an embedding layer of a binding site prediction model to obtain gene sequence characteristic information [ E1, E2 … … En ] corresponding to the cell gene sequence A to be detected and histone characteristic information [ D1, D2 … … Dm ] corresponding to the target histone modification information B.

Step 408: inputting the gene sequence characteristic information [ E1, E2 … … En ] and the histone characteristic information [ D1, D2 … … Dm ] into an embedded information balance layer of the binding site prediction model to obtain the gene sequence weights [ a1, a2 … … an ] of the gene sequence characteristic information and the histone weights [ b1, b2 … … bm ] of the histone characteristic information.

Step 410: inputting the gene sequence characteristic information [ E1, E2 … … En ], the histone characteristic information [ D1, D2 … … Dm ], the gene sequence weight [ a1, a2 … … an ] and the histone weight [ b1, b2 … … bm ] into the coding layer to obtain the gene combination characteristic information [ DE1, DE2, … … DEn ] to be detected.

It should be noted that, in the process of feature information fusion, the dimension of the fused feature information is determined by the sizes of n and m. Specifically, the number of items of gene combination characteristic information to be detected is the larger of n and m: when n is greater than or equal to m, the gene combination characteristic information to be detected is [ DE1, DE2, … … DEn ], and when n is less than m, it is [ DE1, DE2, … … DEm ].
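The length rule above can be sketched with a weighted summation in which the shorter list is treated as contributing zero at the missing positions. The zero-padding is an assumption; the specification states only that the number of fused items equals the larger of n and m, and `fuse_sequences` is a hypothetical name.

```python
def fuse_sequences(E, D, a, b):
    """Fuse [E1..En] and [D1..Dm] into [DE1..DEk], where k = max(n, m).

    Assumption: positions missing from the shorter list contribute zero.
    """
    k = max(len(E), len(D))
    dim = len(E[0]) if E else len(D[0])
    zero = [0.0] * dim
    fused = []
    for i in range(k):
        e, wa = (E[i], a[i]) if i < len(E) else (zero, 0.0)
        d, wb = (D[i], b[i]) if i < len(D) else (zero, 0.0)
        fused.append([wa * x + wb * y for x, y in zip(e, d)])
    return fused

E = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # n = 3 gene sequence feature vectors
D = [[8.0, 4.0], [4.0, 8.0]]               # m = 2 histone feature vectors
a = [0.5, 0.5, 0.5]                        # gene sequence weights
b = [0.25, 0.25]                           # histone weights

DE = fuse_sequences(E, D, a, b)            # n >= m, so DE has n items
DE_swapped = fuse_sequences(D, E, b, a)    # n < m case, so the result has m items
```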

Step 412: inputting the gene combination characteristic information [ DE1, DE2, … … DEn ] to be detected into the output layer to obtain the classification result corresponding to the gene sequence coding characteristic information.

Step 414: and determining whether the cell gene sequence A to be detected is predicted binding site information of the gastric mucosa cells according to the classification result.

Corresponding to the above method embodiments, the present disclosure further provides an embodiment of a device for predicting a transcription factor binding site, and fig. 5 shows a schematic structural diagram of a device for predicting a transcription factor binding site according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus includes:

a receiving module 502 configured to receive a prediction request for a transcription factor binding site of a cell to be detected;

an obtaining module 504 configured to obtain a cell gene sequence to be detected and target histone modification information of the cell to be detected in response to the prediction request;

a prediction module 506 configured to input the cellular gene sequence to be detected and the target histone modification information into a binding site prediction model, and obtain predicted binding site information output by the binding site prediction model, wherein the binding site prediction model is used for predicting a transcription factor binding site of the cellular gene sequence.
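The division of work among the three modules can be sketched as three small classes chained together. Everything concrete in the sketch is hypothetical: the request format, the in-memory store of sequences and histone values, and in particular `stub_model`, which is a placeholder threshold rule and not the binding site prediction model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class PredictionRequest:
    cell_id: str   # identifies the cell to be detected (hypothetical field)

class ReceivingModule:
    """Receives a prediction request for a cell to be detected."""
    def receive(self, raw: Dict[str, str]) -> PredictionRequest:
        return PredictionRequest(cell_id=raw["cell_id"])

class ObtainingModule:
    """Looks up the gene sequence and histone modification information."""
    def __init__(self, store: Dict[str, Tuple[str, List[float]]]):
        self.store = store
    def obtain(self, request: PredictionRequest) -> Tuple[str, List[float]]:
        return self.store[request.cell_id]

class PredictionModule:
    """Feeds both inputs into a binding site prediction model."""
    def __init__(self, model: Callable[[str, List[float]], str]):
        self.model = model
    def predict(self, sequence: str, histone: List[float]) -> str:
        return self.model(sequence, histone)

def stub_model(sequence: str, histone: List[float]) -> str:
    """Placeholder only: thresholds the mean histone value."""
    return "yes" if sum(histone) / len(histone) > 1.0 else "no"

store = {"cell-1": ("ACGTACGT", [1.2, 1.8, 0.9])}   # toy data
request = ReceivingModule().receive({"cell_id": "cell-1"})
sequence, histone = ObtainingModule(store).obtain(request)
result = PredictionModule(stub_model).predict(sequence, histone)
```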

Optionally, the binding site prediction model includes an embedded layer, an embedded information balance layer, a coding layer, and an output layer;

the prediction module 506 is further configured to:

inputting the cell gene sequence to be detected and the target histone modification information into the embedding layer to obtain gene sequence characteristic information corresponding to the cell gene sequence to be detected and histone characteristic information corresponding to the target histone modification information;

inputting the gene sequence characteristic information and the histone characteristic information into the embedded information balance layer to obtain a gene sequence weight of the gene sequence characteristic information and a histone weight of the histone characteristic information;

inputting the gene sequence characteristic information, the histone characteristic information, the gene sequence weight and the histone weight into the coding layer to obtain gene sequence coding characteristic information corresponding to the gene sequence characteristic information;

inputting the gene sequence coding characteristic information into the output layer to obtain a prediction result corresponding to the gene sequence coding characteristic information.

Optionally, the prediction module 506 is further configured to:

Optionally, the embedded layer includes a first embedded layer and a second embedded layer;

accordingly, the prediction module 506 is further configured to:

inputting the gene subsequence set to be detected and the target histone modification information into the embedding layer to obtain a gene sequence feature matrix corresponding to the cell gene sequence to be detected and a histone feature matrix corresponding to the target histone modification information, wherein the method comprises the following steps:

Optionally, the prediction module 506 is further configured to:

determining preset segmentation length information;

Optionally, the prediction module 506 is further configured to:

determining a cutting start site in the cell gene sequence to be detected;

Optionally, the prediction module 506 is further configured to:

Optionally, the output layer includes a classifier;

optionally, the prediction module 506 is further configured to:

Optionally, the apparatus further includes: a training module configured to:

obtaining a sample cell gene sequence, sample histone modification information and a sample sequence classification tag, wherein the sample histone modification information comprises positive sample histone modification information and negative sample histone modification information, and the sample sequence classification tag comprises a sample sequence positive tag and a sample sequence negative tag;

Inputting the sample cell gene sequence and sample histone modification information into a binding site prediction model to obtain a prediction sequence classification label corresponding to the sample cell gene sequence;

calculating a model loss value according to the prediction sequence classification tag and the sample sequence classification tag;

and adjusting model parameters of the binding site prediction model according to the model loss value, and continuing training the binding site prediction model until a model training stopping condition is reached.

Optionally, the training module is further configured to:

obtaining a sample cell gene sequence;

Optionally, the training module is further configured to:

The device for predicting the transcription factor binding site provided by the embodiment of the specification comprises the steps of receiving a prediction request of the transcription factor binding site of a cell to be detected; responding to the prediction request, and acquiring a gene sequence of a cell to be detected and target histone modification information of the cell to be detected; inputting the cell gene sequence to be detected and target histone modification information into a binding site prediction model to obtain predicted binding site information output by the binding site prediction model, wherein the binding site prediction model is used for predicting the transcription factor binding site of the cell gene sequence.

By means of the device provided by the embodiment of the specification, histone modification information is added in the process of predicting the transcription factor binding site. The histone modification information provides the model with additional information related to the transcription factor binding site, which improves the prediction accuracy. Because histone modification information has different site characteristics in different cells, it gives the binding site prediction model the ability to predict transcription factor binding sites across cells, and detection of an unknown biological sample can also be achieved.

The above is a schematic scheme of a device for predicting a transcription factor binding site of the present embodiment. It should be noted that, the technical scheme of the device for predicting a transcription factor binding site and the technical scheme of the method for predicting a transcription factor binding site belong to the same concept, and details of the technical scheme of the device for predicting a transcription factor binding site, which are not described in detail, can be referred to the description of the technical scheme of the method for predicting a transcription factor binding site.

Fig. 6 illustrates a block diagram of a computing device 600 provided in accordance with one embodiment of the present description. The components of computing device 600 include, but are not limited to, memory 610 and processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to hold data.

Computing device 600 also includes access device 640, which enables computing device 600 to communicate via one or more networks 660. Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 640 may include one or more of any type of network interface, wired or wireless, such as a network interface controller (NIC), an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, or a near field communication (NFC) interface.

In one embodiment of the present description, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 6 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.

Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or personal computer (PC). Computing device 600 may also be a mobile or stationary server.

The processor 620 is configured to execute computer-executable instructions that, when executed by the processor, perform the steps of the method for predicting a transcription factor binding site described above. The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the method for predicting the transcription factor binding site belong to the same concept; for details of the technical solution of the computing device that are not described in detail, reference may be made to the description of the technical solution of the method for predicting the transcription factor binding site.

An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the steps of the method for predicting a transcription factor binding site described above.

The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium and the technical solution of the above method for predicting a transcription factor binding site belong to the same concept, and details of the technical solution of the storage medium, which are not described in detail, can be referred to the description of the technical solution of the above method for predicting a transcription factor binding site.

An embodiment of the present disclosure also provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described method for predicting a transcription factor binding site.

The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical scheme of the computer program and the technical scheme of the method for predicting the transcription factor binding site belong to the same concept, and details of the technical scheme of the computer program, which are not described in detail, can be referred to the description of the technical scheme of the method for predicting the transcription factor binding site.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content included in the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunications signals.

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.

In the foregoing embodiments, each embodiment is described with its own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.

The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims

1. A method of predicting a transcription factor binding site, comprising:

2. The method of claim 1, wherein the binding site prediction model comprises an embedded layer, an embedded information balance layer, an encoding layer, and an output layer;

wherein inputting the cell gene sequence to be detected and the target histone modification information into the binding site prediction model to obtain the predicted binding site information output by the binding site prediction model comprises:

3. The method according to claim 2, wherein inputting the cell gene sequence to be detected and the target histone modification information into the embedded layer to obtain gene sequence characteristic information corresponding to the cell gene sequence to be detected and histone characteristic information corresponding to the target histone modification information comprises:

4. The method of claim 3, wherein the embedded layer comprises a first embedded layer and a second embedded layer;

5. The method of claim 3, wherein generating a set of gene subsequences to be detected from the cell gene sequence to be detected comprises:

determining preset segmentation length information;

6. The method according to claim 5, wherein slicing the cell gene sequence to be detected according to the preset segmentation length information to obtain the set of gene subsequences to be detected comprises:

determining a cutting start site in the cell gene sequence to be detected;

7. The method of claim 2, wherein inputting the gene sequence characteristic information, the histone characteristic information, the gene sequence weight, and the histone weight into the encoding layer to obtain gene sequence encoding characteristic information corresponding to the gene sequence characteristic information comprises:

8. The method of claim 7, wherein the output layer comprises a classifier;

9. The method of claim 1, wherein the binding site prediction model is trained by the following steps:

10. The method of claim 9, wherein obtaining a sample cell gene sequence, sample histone modification information, and a sample sequence classification tag comprises:

obtaining a sample cell gene sequence;

11. The method of claim 10, wherein determining a sample sequence positive tag and a sample sequence negative tag in the sample cell gene sequence comprises:

12. A device for predicting a transcription factor binding site, comprising:

13. A computing device, comprising:

a memory and a processor;

the memory is configured to store computer executable instructions, and the processor is configured to execute the computer executable instructions, which, when executed by the processor, implement the steps of the method of any one of claims 1 to 11.

14. A computer readable storage medium storing computer executable instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 11.
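
For illustration only, the slicing recited in claims 5 and 6 (determining preset segmentation length information, determining a cutting start site, and slicing the cell gene sequence to be detected into a set of gene subsequences) might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `slice_gene_sequence`, its parameters, the use of non-overlapping windows, and the dropping of a trailing fragment shorter than the segmentation length are all hypothetical choices that the text above does not specify.

```python
from typing import List


def slice_gene_sequence(sequence: str, segment_length: int, start_site: int = 0) -> List[str]:
    """Cut `sequence` into consecutive subsequences of `segment_length` bases.

    Assumptions (not stated in the claims): windows do not overlap, slicing
    begins at `start_site`, and a trailing fragment shorter than
    `segment_length` is dropped.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    if not 0 <= start_site <= len(sequence):
        raise ValueError("start_site must lie within the sequence")
    # Last index at which a full-length window still fits in the sequence.
    last_start = len(sequence) - segment_length
    return [
        sequence[i:i + segment_length]
        for i in range(start_site, last_start + 1, segment_length)
    ]


# Hypothetical toy input: a 12-base sequence, 4-base segments, start site 2.
subsequences = slice_gene_sequence("ACGTACGTACGT", segment_length=4, start_site=2)
print(subsequences)
```

Other schemes, such as overlapping windows or padding the final fragment, would fit the same claim language equally well; the sketch picks the simplest one.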