CN117292208B

CN117292208B - Method and system for classifying error patterns in data processing process

Info

Publication number: CN117292208B
Application number: CN202311579646.5A
Authority: CN
Inventors: 朱慧敏; 黎晖
Original assignee: Guangzhou University of Traditional Chinese Medicine
Current assignee: Guangzhou University of Traditional Chinese Medicine
Priority date: 2023-11-24
Filing date: 2023-11-24
Publication date: 2024-02-23
Anticipated expiration: 2043-11-24
Also published as: CN117292208A

Abstract

The invention discloses a method and a system for classifying error patterns in a data processing process, in the technical field of signal processing in information science. The method comprises the following steps: generating an original data signal, and performing data processing with the original data signal as actual input data; obtaining input observation data and output observation data of the data processing process based on the actual input data; constructing an error pattern classification Bayesian model based on the input observation data and the output observation data; converting the error pattern classification Bayesian model into a factor graph form to obtain a factor graph message transfer model; and iteratively solving the factor graph message transfer model by using a message transfer EM algorithm to obtain the type of the error pattern in the data processing process. The invention can eliminate the influence of data randomness and random noise interference and classify the error patterns accurately.

Description

Method and system for classifying error patterns in data processing process

Technical Field

The present invention relates to the field of signal processing in information science, and more particularly, to a method and a system for classifying error patterns in a data processing process.

Background

In fields such as machinery, circuits and biological information, when a fault occurs during the processing of a 0/1 digital sequence because of factors such as the environment or noise, a 0 bit is flipped to a 1 bit, or a 1 bit is flipped to a 0 bit, at specific positions of the data sequence. Detecting the error pattern means estimating the sites at which bit flips occur when a fault occurs. There are three main difficulties in detecting error patterns: first, the input data sequence of the data processing process has a certain random character; second, the output data sequence of the data processing process is the "modulo-2 addition" of the input data sequence and the error pattern, rather than a conventional decimal addition; finally, when an electronic instrument is used to measure the input data sequence and the output data sequence of the data processing process, the thermal noise inside the instrument must be taken into account in the conversion between logic signals and physical signals. The conventional error pattern detection method generally subtracts the measured input data from the measured output data of the data processing process to obtain an error pattern estimate; the estimates are collected into a set, and the set is clustered with methods such as K-Means or hierarchical clustering to obtain class labels for the faults in the data processing process.
The above method has the following problems: (1) it neglects the interference caused by the physical conversion process and by random noise during measurement with the electronic instrument, so the error pattern estimate obtained by subtracting the measured input data from the measured output data is not the true error pattern; (2) it ignores the inherent correlation between the different data generation channels; in essence, the data generated by every data generation channel should be the same, but when the input data sequence to be generated is long, random noise can flip the values at some sites, which reduces the accuracy of subsequent estimation.
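
The second difficulty and problem (1) can be made concrete with a small sketch. It assumes noise-free logic values and uses hypothetical names (`xor_bits`, `inp`, `pattern`) that do not appear in the patent; it only illustrates why modulo-2 addition, rather than decimal subtraction, relates the input, the output and the error pattern.

```python
def xor_bits(a, b):
    """Element-wise modulo-2 addition (XOR) of two equal-length 0/1 lists."""
    return [x ^ y for x, y in zip(a, b)]

inp = [0, 1, 1, 0, 1]         # illustrative input data sequence
pattern = [0, 1, 0, 0, 1]     # illustrative error pattern: 1 marks a flipped site
out = xor_bits(inp, pattern)  # output of the faulty process

# Decimal subtraction gives a sign-dependent result, not a 0/1 pattern.
diff = [o - i for o, i in zip(out, inp)]

# XOR of output and input recovers the error pattern exactly.
recovered = xor_bits(out, inp)
```

Even in this idealized setting the subtraction yields negative entries wherever a 1 was flipped to 0; with measurement noise added on top, the difference moves further from the true pattern.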

Disclosure of Invention

To overcome the defect in the prior art that error patterns cannot be classified accurately because of data randomness and random noise interference in the data processing process, the invention provides a method and a system for classifying error patterns in a data processing process, which can eliminate the influence of data randomness and random noise interference and classify the error patterns accurately.

In order to solve the technical problems, the technical scheme of the invention is as follows:

The invention provides a method for classifying error patterns in a data processing process, comprising the following steps:

S1: generating an original data signal, and performing data processing with the original data signal as actual input data;

S2: obtaining input observation data and output observation data of the data processing process based on the actual input data;

S3: constructing an error pattern classification Bayesian model based on the input observation data and the output observation data;

S4: converting the error pattern classification Bayesian model into a factor graph form to obtain a factor graph message transfer model;

S5: iteratively solving the factor graph message transfer model by using a message transfer EM algorithm to obtain the type of the error pattern in the data processing process.

Preferably, in step S1, the specific method for generating the original data signal is as follows:

generating an original data sequence, each entry of which is the value at one site and the length of which is the total number of sites; the value at each site of the original data sequence obeys a 0/1 binomial distribution with a first parameter, and the prior probability is:

Setting a random noise sequence for each generation channel, the original data signal is then:

where the original data signal of each channel consists of one value per site and is obtained from the original data sequence and the random noise sequence of that channel by an exclusive-or operation; each entry of the random noise sequence of a channel describes the random noise situation at one site of that channel, equal to 1 when random noise occurs and 0 when no random noise occurs, and obeys a 0/1 binomial distribution with a second parameter, with prior probability:

The original data signal is used as the actual input data for data processing.
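
A minimal sketch of this generation step follows. The function name `generate_signals` and the parameter names `p_one` (standing in for the first parameter) and `p_noise` (standing in for the second parameter) are hypothetical, and the values used are purely illustrative.

```python
import random

def generate_signals(num_sites, num_channels, p_one, p_noise, seed=0):
    """Draw one 0/1 data sequence and XOR it with per-channel 0/1 noise."""
    rng = random.Random(seed)
    # Original data sequence: each site is 1 with probability p_one.
    base = [1 if rng.random() < p_one else 0 for _ in range(num_sites)]
    # One noise sequence per channel: 1 means the site is flipped by noise.
    noise = [[1 if rng.random() < p_noise else 0 for _ in range(num_sites)]
             for _ in range(num_channels)]
    # Original data signal of each channel = data sequence XOR channel noise.
    signals = [[b ^ w for b, w in zip(base, row)] for row in noise]
    return base, noise, signals

base, noise, signals = generate_signals(num_sites=8, num_channels=3,
                                        p_one=0.5, p_noise=0.1)
```

Because every channel starts from the same sequence, the channel signals differ only where the noise entry is 1, which is the inter-channel correlation the method aims to exploit.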

Preferably, the original data sequence is generated by any one of a sender, a source encoder or a channel encoder. The random noise sequence is obtained by performing a process equivalent to channel fading, channel noise interference, modulation demodulation, or channel equalization.

Preferably, in step S3, the specific method for constructing the error pattern classification bayesian model based on the input observation data and the output observation data is as follows:

the original data signal is used as the actual input data for data processing; a fault occurs during the data processing and flips the original data signal; with a flip pattern set, the actual output data of the data processing process is:

where the actual output data of each channel consists of one value per site, and the flip pattern corresponding to the data of each channel consists of one flip indicator per site;

a number of error patterns are set to exist in total in the data processing process, and the flip pattern of each channel is one of these error patterns:

where each error pattern consists of one entry per site, and the situation at each site of an error pattern obeys a 0/1 binomial distribution with a third parameter, with a corresponding prior probability; a coding vector indicates, for the flip pattern of each channel, whether it is a given error pattern, one entry value marking that the flip pattern of the channel is that error pattern and the other marking that it is not; the error pattern type of the flip pattern of each channel obeys a multinomial distribution with a corresponding prior probability;

The electronic instrument is used to measure the actual input data and the actual output data respectively, obtaining the input observation data and the output observation data;
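
The relation between input, flip pattern and output can be sketched as follows. This is an illustration under stated assumptions: the coding vector is taken to be one-hot with 1 marking the selected error pattern, and the names `apply_faults`, `error_patterns` and `types` are hypothetical.

```python
def apply_faults(signals, error_patterns, types):
    """Flip each channel's signal with the error pattern selected for it."""
    num_types = len(error_patterns)
    # One-hot coding vector per channel: entry k is 1 for the chosen pattern.
    coding = [[1 if k == t else 0 for k in range(num_types)] for t in types]
    # Flip pattern of a channel = the error pattern its coding vector selects.
    flips = [error_patterns[t] for t in types]
    # Actual output data = actual input data XOR flip pattern.
    outputs = [[s ^ e for s, e in zip(sig, flip)]
               for sig, flip in zip(signals, flips)]
    return coding, flips, outputs

signals = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
error_patterns = [[1, 0, 0, 0], [0, 0, 1, 1]]   # two illustrative patterns
types = [0, 1, 0]                                # pattern index per channel
coding, flips, outputs = apply_faults(signals, error_patterns, types)
```

In the classification problem the roles are reversed: `error_patterns` and `types` are unknown and must be inferred from noisy measurements of `signals` and `outputs`.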

Establishing a first transition probability between the input observation data and the actual input data:

where the two expressions give the probability of an input observation when the corresponding site value of the original data signal takes each of its two values; the input observation data of each channel consists of one input observation per site, each associated with the value at the same site of the original data signal of that channel; a first mean estimate and a first variance estimate belong to one case, and a second mean estimate and a second variance estimate belong to the other;

establishing a second transition probability between the output observation data and the actual output data:

where the two expressions give the probability of an output observation when the corresponding site value of the actual output data takes each of its two values; the output observation data of each channel consists of one output observation per site, each associated with the value at the same site of the actual output data of that channel; a third mean estimate and a third variance estimate belong to one case, and a fourth mean estimate and a fourth variance estimate belong to the other;

this completes the construction of the error pattern classification Bayesian model.
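
As an illustration of how a transition probability links a measured value to a logic value, the sketch below assumes a Gaussian observation model, which is suggested by the mean and variance estimates named above but is not confirmed by the text available here. The names `gaussian_pdf` and `bit_posterior` and all numeric values are hypothetical.

```python
import math

def gaussian_pdf(r, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(r - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bit_posterior(r, prior_one, mean0, var0, mean1, var1):
    """Posterior probability that the logic value is 1 given observation r."""
    # Assumed model: r ~ N(mean0, var0) if the bit is 0, N(mean1, var1) if it is 1.
    like0 = gaussian_pdf(r, mean0, var0) * (1.0 - prior_one)
    like1 = gaussian_pdf(r, mean1, var1) * prior_one
    return like1 / (like0 + like1)

# An observation halfway between the two means is uninformative.
p_mid = bit_posterior(0.5, 0.5, 0.0, 1.0, 1.0, 1.0)
# An observation at the mean of the "1" case favors the value 1.
p_high = bit_posterior(1.0, 0.5, 0.0, 1.0, 1.0, 1.0)
```

Working with such posteriors, rather than with hard thresholded bits, is what lets measurement noise be handled inside the model instead of being ignored.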

Preferably, in step S4, the specific method for converting the error pattern classification bayesian model into a factor graph form and obtaining a factor graph message transmission model is as follows:

the factor graph message transfer model comprises a plurality of variable nodes, a plurality of factor nodes and undirected edges for connecting the variable nodes and the factor nodes;

the seven variables of the error pattern classification Bayesian model are taken, respectively, as a first variable node, a second variable node, a third variable node, a fourth variable node, a fifth variable node, a sixth variable node and a seventh variable node;

one prior probability is taken as a first factor node; three conditional probabilities between variables are taken as a second factor node, a third factor node and a fourth factor node, respectively; a further prior probability is taken as a fifth factor node; the first transition probability is taken as a sixth factor node; the second transition probability is taken as a seventh factor node; and two further prior probabilities are taken as an eighth factor node and a ninth factor node, respectively;

the first factor node, the first variable node, the second factor node, the second variable node, the third factor node, the fourth variable node, the fourth factor node, the seventh variable node and the ninth factor node are sequentially connected;

the second factor node is also connected with a fifth variable node and a fifth factor node in sequence;

the third factor node is also connected with a third variable node and a seventh factor node in sequence;

the fourth variable node is also connected with a sixth factor node;

the fourth factor node is also connected with a sixth variable node and an eighth factor node in sequence;

and the factor graph is divided at the second variable node into a main factor subgraph and a classification factor subgraph.
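
The connection list above can be written down as a bipartite graph and checked for cycles. In this sketch the labels `v1` to `v7` and `f1` to `f9` are shorthand for the ordinal node names in the text, and `build_adjacency` and `is_tree` are hypothetical helpers.

```python
from collections import deque

# Edges follow the connection list above; v = variable node, f = factor node.
edges = [
    ("f1", "v1"), ("v1", "f2"), ("f2", "v2"), ("v2", "f3"),
    ("f3", "v4"), ("v4", "f4"), ("f4", "v7"), ("v7", "f9"),   # main chain
    ("f2", "v5"), ("v5", "f5"),
    ("f3", "v3"), ("v3", "f7"),
    ("v4", "f6"),
    ("f4", "v6"), ("v6", "f8"),
]

def build_adjacency(edge_list):
    """Undirected adjacency map from a list of node pairs."""
    adj = {}
    for a, b in edge_list:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def is_tree(adj):
    """A connected graph with (nodes - 1) edges has no cycles."""
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    start = next(iter(adj))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(adj) and num_edges == len(adj) - 1

adjacency = build_adjacency(edges)
```

At the level of these node types the graph has 16 nodes and 15 edges and is cycle-free. The actual model replicates the nodes over channels and sites, so this check says nothing about cycles in the fully expanded graph.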

Preferably, in step S5, the factor graph message passing model is solved iteratively by using a message passing EM algorithm, and the specific method for obtaining the type of the error pattern in the data processing process is as follows:

S5.1: based on the factor graph message transfer model, initializing the message ratio from the second factor node to the second variable node, the message ratio from the fifth variable node to the second factor node and the message ratio from the first variable node to the second factor node, and initializing a parameter set;

S5.2: performing message transfer iteration on the main factor subgraph based on the initialized message ratio of the second factor node to the second variable node, the message ratio of the fifth variable node to the second factor node and the message ratio of the first variable node to the second factor node to obtain posterior distribution of the third variable node, the fourth variable node, the sixth variable node and the seventh variable node and the message ratio of the third factor node to the second variable node;

S5.3: based on the initialized message ratio from the second factor node to the second variable node, the message ratio from the fifth variable node to the second factor node, the message ratio from the first variable node to the second factor node, and the iterated message ratio from the third factor node to the second variable node, performing message transfer iteration on the classification factor subgraph to obtain the posterior distributions of the first variable node and the fifth variable node;

S5.4: updating the parameter set based on the posterior distributions of the first variable node, the third variable node, the fourth variable node, the fifth variable node, the sixth variable node and the seventh variable node;

S5.5: repeating steps S5.2 to S5.4 until the preset number of iterations is reached;

S5.6: and taking the corresponding error pattern type as a final classification result to obtain the type of the error pattern in the data processing process.

The invention also provides a system for classifying error patterns in a data processing process, which is used to implement the above classification method and comprises:

the signal generation module is used for generating an original data signal and performing data processing by taking the original data signal as actual input data;

the data processing module is used for acquiring input observation data and output observation data in the data processing process based on the actual input data;

the model construction module is used for constructing an error pattern classification Bayesian model based on the input observation data and the output observation data;

the model conversion module is used for converting the error pattern classification Bayesian model into a factor graph form to obtain a factor graph message transfer model;

and the data classification module is used for carrying out iterative solution on the factor graph message transfer model by using a message transfer EM algorithm to obtain the type of the error pattern in the data processing process.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

the invention processes data by generating an original data signal and using the original data signal as actual input data; based on the actual input data, input observation data and output observation data in the data processing process are obtained; constructing an error pattern classification Bayesian model based on the input observation data and the output observation data; converting the error pattern classification Bayesian model into a factor graph form to obtain a factor graph message transfer model; and carrying out iterative solution on the factor graph message transfer model by using a message transfer EM algorithm to obtain the type of the error pattern in the data processing process. The invention can eliminate the influence of data randomness and random noise interference and accurately classify the error patterns.

Drawings

FIG. 1 is a flow chart of the method for classifying error patterns in a data processing process according to embodiment 1;

FIG. 2 is a schematic diagram of the factor graph message passing model according to embodiment 2;

FIG. 3 is a schematic diagram of the main factor subgraph according to embodiment 2;

FIG. 4 is a schematic diagram of the classification factor subgraph according to embodiment 2;

FIG. 5 is a schematic diagram of the system for classifying error patterns in a data processing process according to embodiment 4.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;

for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;

it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.

Example 1

The embodiment provides a method for classifying error patterns in a data processing process, as shown in fig. 1, including:

In a specific implementation process, the embodiment performs data processing by generating an original data signal and using the original data signal as actual input data; based on the actual input data, input observation data and output observation data in the data processing process are obtained; constructing an error pattern classification Bayesian model based on the input observation data and the output observation data; converting the error pattern classification Bayesian model into a factor graph form to obtain a factor graph message transfer model; and carrying out iterative solution on the factor graph message transfer model by using a message transfer EM algorithm to obtain the type of the error pattern in the data processing process. The embodiment can eliminate the influence of data randomness and random noise interference and accurately classify the error patterns.

Example 2

This embodiment provides a method for classifying error patterns in a data processing process; the technical scheme is described by taking as an example the detection of the error patterns of an information processing module that processes 0/1 digital sequences in a circuit. The method comprises the following steps:

S1: generating an original data signal, and performing data processing with the original data signal as actual input data; specifically:

the data generator generates a batch of 0/1 binary data as the original data sequence, each entry of which is the value at one site; the value at each site of the original data sequence obeys a 0/1 binomial distribution with a first parameter, with prior probability:

The original data sequence is transmitted to the information processing module through different channels; in each channel it is disturbed to some extent by random noise, and the disturbed values are flipped, that is, changed from 0 to 1 or from 1 to 0;

where the original data signal of each channel consists of one value per site; each entry of the random noise sequence of a channel describes the random noise situation at one site of that channel, equal to 1 when random noise occurs and 0 when no random noise occurs, and obeys a 0/1 binomial distribution with a second parameter, with prior probability:

The original data signal is fed into the information processing module as the actual input data for data processing.

S2: obtaining input observation data and output observation data of the data processing process based on the actual input data; specifically:

the information processing module may fail during processing, so that the actual input data is flipped again, after which the information processing module outputs the actual output data; with a flip pattern set, the actual output data is:

where the actual output data of each channel consists of one value per site, and the flip pattern corresponding to the data of each channel consists of one flip indicator per site.

The information processing module is known to be able to produce a certain number of error patterns in total, but the specific flips of each error pattern are unknown, and the flip pattern of each channel is one of these error patterns; then:

where each error pattern consists of one entry per site, and the situation at each site of an error pattern obeys a 0/1 binomial distribution with a third parameter, with prior probability:

A coding vector indicates, for the flip pattern of each channel, whether it is a given error pattern, one entry value marking that the flip pattern of the channel is that error pattern and the other marking that it is not; the error pattern type of the flip pattern of each channel obeys a multinomial distribution with a parameter, and its prior probability is recorded as follows:

where each component of the parameter is an estimate of the proportion of the corresponding error pattern, and an abbreviated notation is used;

the electronic instrument is used to measure the actual input data and the actual output data respectively, obtaining the input observation data and the output observation data.

S3: constructing an error pattern classification Bayesian model based on the input observation data and the output observation data; specifically:

a first transition probability between the input observation data and the actual input data is established, where the two expressions give the probability of an input observation when the corresponding site value of the original data signal takes each of its two values; the input observation data of each channel consists of one input observation per site, each associated with the value at the same site of the original data signal of that channel; a first mean estimate and a first variance estimate belong to one case, and a second mean estimate and a second variance estimate belong to the other;

a second transition probability between the output observation data and the actual output data is established, where the two expressions give the probability of an output observation when the corresponding site value of the actual output data takes each of its two values; the output observation data of each channel consists of one output observation per site, each associated with the value at the same site of the actual output data of that channel; a third mean estimate and a third variance estimate belong to one case, and a fourth mean estimate and a fourth variance estimate belong to the other;

S4: converting the error pattern classification Bayesian model into a factor graph form to obtain a factor graph message transfer model; specifically:

as shown in FIG. 2, the factor graph message passing model includes a plurality of variable nodes, a plurality of factor nodes, and undirected edges connecting the variable nodes and the factor nodes;

the seven variables of the error pattern classification Bayesian model are taken, respectively, as a first variable node, a second variable node, a third variable node, a fourth variable node, a fifth variable node, a sixth variable node and a seventh variable node; one prior probability is taken as a first factor node; three conditional probabilities between variables are taken as a second factor node, a third factor node and a fourth factor node, respectively; a further prior probability is taken as a fifth factor node; the first transition probability is taken as a sixth factor node; the second transition probability is taken as a seventh factor node; and two further prior probabilities are taken as an eighth factor node and a ninth factor node, respectively;

wherein the conditional probability between one pair of these variables is calculated as:

and the conditional probability between another pair of these variables is calculated as:

where the expressions give the probabilities under the condition that the variables concerned take different values and under the condition that they take the same value; the pulse function used in these formulas has the following property:

where the argument of the pulse function is a variable: when the variable is 0 the pulse function takes the value 1, and otherwise it takes the value 0.
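
The pulse function described above is simple enough to state in code. The second helper, `xor_factor`, is a hypothetical illustration of how such a function can encode an exclusive-or relation between three 0/1 variables as a hard constraint; it is an assumption about the form of the conditional probabilities, whose formulas are not available in this text.

```python
def delta(x):
    """Pulse function as described above: 1 when x is 0, otherwise 0."""
    return 1 if x == 0 else 0

def xor_factor(y, x, e):
    """Hypothetical hard constraint: 1 only when y equals x XOR e."""
    return delta(y ^ x ^ e)

# Tabulate the factor over all eight 0/1 configurations.
table = {(y, x, e): xor_factor(y, x, e)
         for y in (0, 1) for x in (0, 1) for e in (0, 1)}
```

For every pair `(x, e)` exactly one value of `y` is accepted, so the table contains four ones and four zeros.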

the fourth variable node is also connected with a sixth factor node;

FIG. 3 is a schematic diagram of the main factor subgraph, and FIG. 4 is a schematic diagram of the classification factor subgraph.

S5: iteratively solving the factor graph message transfer model by using a message transfer EM algorithm to obtain the type of the error pattern in the data processing process; specifically:

S5.1: based on the factor graph message passing model, the second factor node is given a label, and the message ratio passed from the second factor node to the second variable node is initialized:

where the initial message ratio from the second factor node to the second variable node is expressed with an all-ones matrix of a specified size;

the message ratio passed from the fifth variable node to the second factor node is initialized:

where the initial message ratio from the fifth variable node to the second factor node is expressed with an all-ones matrix of a specified size and a hyperparameter that serves as an initial value;

the message ratio passed from the first variable node to the second factor node is initialized:

where the initial message ratio from the first variable node to the second factor node is expressed with an all-ones matrix of a specified size;

and the parameter set is initialized; in this embodiment, the initialization values are;

S5.2: based on the initialized message ratio from the second factor node to the second variable node, the message ratio from the fifth variable node to the second factor node and the message ratio from the first variable node to the second factor node, performing message transfer iteration on the main factor subgraph to obtain the posterior distributions of the third variable node, the fourth variable node, the sixth variable node and the seventh variable node, and the message ratio from the third factor node to the second variable node; specifically:

as shown in FIG. 3, in the main factor subgraph, the second variable node, the second factor node, the third variable node and the seventh factor node are jointly given a common label, and the corresponding message ratio is then:

the fourth factor node is given a label, and the message ratio passed from the fourth factor node to the seventh variable node is:

the posterior distribution of the seventh variable node with respect to the set of input observations and output observations is:

where the expression denotes the value of this posterior distribution at 1;

the message ratio passed from the seventh variable node to the fourth factor node is:

the message ratio then passed to the sixth variable node is:

the posterior distribution of the sixth variable node with respect to the set of input observations and output observations is:

the message ratio then passed to the sixth variable node is:

the message ratio passed from the fourth factor node to the fourth variable node is:

the posterior distribution of the fourth variable node with respect to the set of input observations and output observations is:

where the expressions denote the values of this posterior distribution at 1 and at 0, respectively;

the message ratio then passed onward is:

the third factor node is given a label, and the message ratio passed from the third factor node to the third variable node is:

the posterior distribution of the third variable node with respect to the set of input observations and output observations is:

the message ratio passed from the third factor node to the second variable node is:

where the expression denotes the message ratio from the third factor node to the second variable node;

S5.3: based on the initialized message ratio from the second factor node to the second variable node, the message ratio from the fifth variable node to the second factor node, the message ratio from the first variable node to the second factor node, and the iterated message ratio from the third factor node to the second variable node, performing message transfer iteration on the classification factor subgraph to obtain the posterior distributions of the first variable node and the fifth variable node; specifically:

as shown in FIG. 4, in the classification factor subgraph, the message ratio passed from the second variable node to the second factor node is equal to the message ratio passed from the third factor node to the second variable node, namely:

the message ratio then passed from the second factor node to the first variable node is:

where a first intermediate expression is used, specifically:

the message ratio then passed to the first variable node is:

the posterior distribution of the first variable node with respect to the set of input observations and output observations is:

the message ratio passed from the first variable node to the second factor node is:

after iteration, the message ratio passed from the second factor node to the second variable node is:

the message ratio passed from the second factor node to the fifth variable node is:

where a second intermediate expression is used, specifically:

the posterior distribution of the fifth variable node with respect to the set of input observations and output observations is then:

after iteration, the message ratio passed from the fifth variable node to the second factor node is:

where the expression denotes the message ratio from the fifth variable node to the second factor node after iteration;

S5.4: updating the parameter set based on the posterior distributions of the first variable node, the third variable node, the fourth variable node, the fifth variable node, the sixth variable node and the seventh variable node; the specific update is as follows:

S5.5: repeating steps S5.2 to S5.4 until the preset number of iterations is reached; in this embodiment, the preset number of iterations is 30 to 50;

S5.6: taking the corresponding error pattern type as the final classification result to obtain the type of the error pattern in the data processing process; specifically:

when the preset number of iterations is reached, the posterior distribution of the fifth variable node at that iteration is used to compute the error pattern type of the flip pattern of each channel as the final classification result, namely:

where the expression denotes the final error pattern type of the flip pattern of each channel, which is taken as the final classification result; that is, the flip pattern of the channel is finally classified as belonging to the corresponding error pattern. The information processing module is then handled, repaired or replaced according to this classification.
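
The final decision can be sketched as picking, for each channel, the error pattern type with the largest posterior probability. Taking the maximum is an assumption about the missing formula, and `classify_channels` and the posterior values below are hypothetical.

```python
def classify_channels(type_posteriors):
    """Return, per channel, the index of the most probable error pattern type."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in type_posteriors]

# Illustrative posteriors: 3 channels, 2 candidate error pattern types.
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
final_types = classify_channels(posteriors)
```

Channels that share a final type are then treated as having suffered the same kind of fault.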

Example 3

This embodiment provides a method for classifying error patterns in a data processing process and describes the technical scheme using the detection of error patterns in human methylation data as an example. N6-methyladenosine (m6A) is a base modification that occurs widely on mRNA, and abnormal m6A modification can cause a series of diseases. Cancer is a highly complex and heterogeneous disease and can be classified into different subtypes based on characteristics such as histomorphology, molecular profile, and specific mutations. The method comprises the following steps:

When the method is applied to the data processing process of human methylation data, the following operations are also required:

obtaining a published TCGA data set, or generating the user's own sample pair data with reference to a published TCGA data set, wherein the sample pair data comprise tumor tissue m6A methylation data and paired paracancerous tissue m6A methylation data; the sample pair data are taken as the actual input data;

preprocessing the sample pair data and calculating a paired paracancerous tissue methylation observation sequence and a tumor tissue methylation observation sequence, which serve as the input observation data and the output observation data of the data processing process, respectively; specifically:

acquisition ofThe data of the sample will be +.>Tumor tissue m6A methylation data of individual samples were recorded as +.>First->Paired paracancerous tissue m6A methylation data for each sample was recorded as +.>The method comprises the steps of carrying out a first treatment on the surface of the Each sample has +.>And calculating methylation observed values of tumor tissue m6A methylation data and paired paracancerous tissue m6A methylation data of each sample pair data by site, wherein the calculation formula is as follows:

in the method, in the process of the invention,indicate->No.>Methylation observations at individual sites, +.>Indicate->No. >M6A methylation data for individual sites; />Indicate->Sample No. I of tumor tissue>Methylation observations at individual sites, +.>Indicate->Sample No. I of tumor tissue>M6A methylation data for individual sites; />，/>；

constructing the paired paracancerous tissue methylation observation sequence from the methylation observations at all sites of the paired paracancerous tissue of all samples, and constructing the tumor tissue methylation observation sequence from the methylation observations at all sites of the tumor tissue of all samples; the two sequences serve as the input observation data and the output observation data, respectively;

constructing an error pattern classification Bayesian model based on the tumor tissue methylation observation sequence and the paired paracancerous tissue methylation observation sequence; specifically:

setting the true state of each site of each sample pair as either methylated or unmethylated, represented by 1 and 0 respectively; the methylation states of the paired paracancerous tissues of the different sample pairs all evolve from one normal cell sequence, i.e. the original data sequence, and the value of each site of the normal cell sequence obeys a 0/1 binomial distribution governed by the first parameter, with prior probability:

Methylation of paired paracancerous tissues was used as the raw data signal, namely:

In the method, in the process of the invention,indicate->No.>Methylation of individual sites,/->Indicates the normal cell sequence->Value of individual site,/->Representing exclusive OR operation, ++>Indicate->No.>Individual site non-pathogenic mutation case,/->When it indicates that a non-pathogenic mutation has occurred, < >>When it indicates that no non-pathogenic mutation has occurred, and +.>Subject to the second parameter +.>Is 0/1 binomial distribution of a priori probability +.>：

The methylation state of the tumor tissue results from pathogenic mutation of the methylation state of the corresponding paired paracancerous tissue; given the pathogenic mutation, the methylation state of the tumor tissue is:

tumor is treated byMethylation of the tissue is used as the actual output data, where,indicate->Sample No. I of tumor tissue>Methylation of individual sites,/->Indicate->Sample No. I of tumor tissue>Site-pathogenic mutation situation, i.e. flip pattern,/->Pathogenic mutations occur in the case of->Pathogenic mutation did not occur at this time;

setting commonsThe abnormal subtype is the wrong pattern, and the pathogenic mutation of the same abnormal subtype is the same; setting the pathogenic mutation status of each abnormal subtype, the methylation status of the tumor tissue is one of the pathogenic mutation status of each abnormal subtype, namely:

In the method, in the process of the invention,indicate->No. 2 of the abnormal subtype>Site-pathogenic mutation situation obeying the third parameter +.>Is 0/1 binomial distribution of a priori probability +.>：

For coding vectors, express +.>Whether or not the sample belongs to->Abnormal subtype of species, ->Indicate->The sample belongs to->Abnormal subtype of species, ->Indicate->The sample belongs to->Abnormal subtype of species, ->Representation->Abnormal subtype species of individual samples, subject to the parameter +.>Is recorded with a priori probability +.>The following steps are:

in the parameters ofIndicate->Number proportion estimate of the species abnormality subtype, +.>The representation is abbreviated as;

establishing a first transition probability between the methylation state of the paired paracancerous tissue and the paired paracancerous tissue methylation observation sequence:

if at firstNo.>Methylation of individual sites>Methylated, corresponding firstNo.>Methylation observations at the individual sites +.>Obeying Gaussian distribution->The method comprises the steps of carrying out a first treatment on the surface of the If at firstNo.>Methylation of individual sites>Unmethylated, corresponding->No.>Methylation observations at the individual sites +.>Obeying Gaussian distribution->Specific:

In the method, in the process of the invention,representation->Under the condition of taking different values +.>Probability of->Representation->Under the condition of taking different values +.>Probability of->Mean and variance estimates representing unmethylated sites in paired paracancerous tissue, respectively;mean and variance estimates representing methylated sites in paired paracancerous tissue, respectively;

establishing a second transition probability between the methylation status of the tumor tissue and the methylation observed value sequence of the tumor tissue:

if at firstSample No. I of tumor tissue>Methylation of individual sites>Methylated, corresponding->Sample No. I of tumor tissue>Methylation M values of the individual sites obey Gaussian distribution +.>The method comprises the steps of carrying out a first treatment on the surface of the If%>Sample No. I of tumor tissue>Methylation of individual sites>Unmethylated, corresponding->Sample No. I of tumor tissue>Methylation M values of the individual sites obey Gaussian distribution +.>Specific:

in the method, in the process of the invention,representation->Under the condition of taking different values +.>Probability of->Representation->Under the condition of taking different values +.>Probability of->Mean and variance estimates representing unmethylated sites in tumor tissue, respectively; />Mean and variance estimates representing methylated sites in tumor tissue, respectively;
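
The two transition probabilities above attach one Gaussian to the unmethylated state and one to the methylated state. A minimal sketch of how such a model turns a single observation into a posterior probability of methylation, via Bayes' rule, is given here; the function names and all numeric values are illustrative assumptions, not values from the patent.

```python
import math


def gaussian_pdf(y: float, mean: float, var: float) -> float:
    """Density of a Gaussian with the given mean and variance, evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_methylated(y: float, prior_1: float,
                         mean0: float, var0: float,
                         mean1: float, var1: float) -> float:
    """P(state = 1 | y) when state 0 and state 1 each emit a Gaussian observation."""
    weight1 = prior_1 * gaussian_pdf(y, mean1, var1)          # methylated branch
    weight0 = (1.0 - prior_1) * gaussian_pdf(y, mean0, var0)  # unmethylated branch
    return weight1 / (weight0 + weight1)


# Hypothetical parameters: unmethylated ~ N(-2, 1), methylated ~ N(2, 1), prior 0.5
print(posterior_methylated(0.0, 0.5, -2.0, 1.0, 2.0, 1.0))  # midpoint, equally likely
print(posterior_methylated(2.0, 0.5, -2.0, 1.0, 2.0, 1.0))  # close to 1
```

For observations far from both means the two weights can underflow to zero; a log-domain version would be the safer choice in practice.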

S4: converting the error pattern classification Bayesian model into a factor graph form to obtain a factor graph message transfer model; the factor graph message transfer model comprises a plurality of variable nodes, a plurality of factor nodes and undirected edges for connecting the variable nodes and the factor nodes;

taking the pathogenic mutation status of the abnormal subtypes as the first variable node, the pathogenic mutation status of the tumor tissue as the second variable node, the methylation state of the tumor tissue as the third variable node, the methylation state of the paired paracancerous tissue as the fourth variable node, the abnormal subtype class of the sample as the fifth variable node, the non-pathogenic mutation status of the paired paracancerous tissue as the sixth variable node, and the normal cell sequence as the seventh variable node;

prior probability of pathogenic mutation conditions of abnormal subtypeAs a first factor node, will +.>And (3) withConditional probability between->As a second factor node, will +.>And->Conditional probability between->As a third factor node, will +.>And->Conditional probability between->As a fourth factor node, the prior probability of an abnormal subtype class is +.>As a fifth factor node, the first relation As a sixth factor node, the second relation +.>As a seventh factor node, the prior probability of the non-pathogenic mutation situation of the paracancerous tissue will be paired +.>As an eighth factor node, the prior probability of the site value of the normal cell sequence is +.>As a ninth factor node;

in the method, in the process of the invention,representation->Under the condition of taking different values +.>Probability of->Representation->Under the condition of taking different values +.>Probability of->Representation->Under the condition of taking the same valueProbability of (2); />As a pulse function, there are the following properties:

in the method, in the process of the invention,representing the variable, i.e. when the variable->When the value is 0, the weight is added>The value is 1, otherwise->The value is 0. For example, whenWhen (I)>；/>In the time-course of which the first and second contact surfaces,。

The fourth variable node is also connected with a sixth factor node;
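
The factor graph of step S4 can be pictured as a bipartite adjacency structure. The sketch below is an assumption-laden reading of the text: only the edges between the second factor node and the first, second, and fifth variable nodes, between the third factor node and the second variable node, and between the sixth factor node and the fourth variable node are stated explicitly through the messages and connections described; the remaining edges are inferred from the role of each factor and may differ from the patent's figure. The labels `v1` to `v7` and `f1` to `f9` are hypothetical shorthand.

```python
from typing import Dict, List

# Factor node -> neighbouring variable nodes (per sample and per site).
FACTOR_NEIGHBOURS: Dict[str, List[str]] = {
    "f1": ["v1"],              # prior of the error patterns (inferred edge)
    "f2": ["v1", "v2", "v5"],  # flip pattern chosen among error patterns by the class
    "f3": ["v2", "v3", "v4"],  # output = input XOR flip pattern (v3, v4 inferred)
    "f4": ["v4", "v6", "v7"],  # input = original XOR noise (inferred edges)
    "f5": ["v5"],              # prior of the class (inferred edge)
    "f6": ["v4"],              # likelihood of the input observation
    "f7": ["v3"],              # likelihood of the output observation (inferred edge)
    "f8": ["v6"],              # prior of the noise (inferred edge)
    "f9": ["v7"],              # prior of the original sequence (inferred edge)
}


def variable_neighbours(factor_neighbours: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the adjacency so each variable node lists the factors it touches."""
    result: Dict[str, List[str]] = {}
    for factor, variables in factor_neighbours.items():
        for variable in variables:
            result.setdefault(variable, []).append(factor)
    return result


print(variable_neighbours(FACTOR_NEIGHBOURS)["v4"])  # ['f3', 'f4', 'f6']
```

Each undirected edge in this structure carries one message in each direction during message passing.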

S5: iteratively solving the factor graph message transfer model using the message transfer EM algorithm to obtain the type of the error pattern in the data processing process of the human methylation data, i.e. the final classification result of the methylation data.

In a specific implementation, the classification performance of the method provided by this embodiment is evaluated using the adjusted Rand index (ARI), normalized mutual information (NMI), AE, and AS;

1) Constructing a pure data set as sample pair data of a user, and setting the number of samplesSite number->Abnormal subtype number->2/3/4, i.e. 100 samples per sample pair, with +.>The abnormal subtypes are respectively 2, 3 and 4 sites; initializing parameters:randomly generated coincidence parameter is +>Is of Bernoulli distribution>The coincidence parameter is->Is of Bernoulli distribution>The coincidence parameter is->Is of Bernoulli distribution>According to polynomial distribution->Is->The method comprises the steps of carrying out a first treatment on the surface of the Go->Calculation to get->If->Then->At mean value +. >Variance is->Gaussian distribution of (1) if +.>Then->At mean value +.>Variance is->Is sampled in a gaussian distribution; calculate->Further calculateThe method comprises the steps of carrying out a first treatment on the surface of the If->Then->At mean value +.>Variance is->Sampling in Gaussian distribution; if->Then->At mean value +.>Variance is->Sampling in Gaussian distribution; finally use->And testing accuracy. 10 independent experiments were performed on 2, 3, and 4 abnormal subtypes, respectively, and the experimental results are shown in table 1 below:

TABLE 1

The parameter predictions are shown in table 2 below:

TABLE 2

The data in brackets of the experimental column are variance values;

It can be seen from the tables above that when the number of abnormal subtypes is 2, 3, or 4, the classification accuracy indices are 1, AE and AS are 100%, and the predicted parameter values equal the true parameter values, which indicates that the method provided by this embodiment can classify the abnormal subtypes accurately.
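
The construction of the clean data set in experiment 1) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patent's exact procedure: the parameter values are hypothetical, class labels are drawn uniformly rather than from the estimated class proportions, and one pair of Gaussians is shared by the input and output observations, whereas the text uses separate estimates for paracancerous and tumor tissue.

```python
import random
from typing import List, Tuple


def sample_dataset(n_samples: int, n_sites: int, n_types: int,
                   p_base: float, p_noise: float, p_flip: float,
                   mean0: float, mean1: float, var: float,
                   seed: int = 0) -> Tuple[List[int], List[List[float]], List[List[float]]]:
    """Draw labels, input observations, and output observations from the model."""
    rng = random.Random(seed)

    def bern(p: float) -> int:
        return 1 if rng.random() < p else 0

    std = var ** 0.5
    base = [bern(p_base) for _ in range(n_sites)]  # original data sequence
    patterns = [[bern(p_flip) for _ in range(n_sites)] for _ in range(n_types)]
    labels, obs_in, obs_out = [], [], []
    for _ in range(n_samples):
        k = rng.randrange(n_types)                  # class label (uniform, an assumption)
        x = [b ^ bern(p_noise) for b in base]       # input = original XOR noise
        y = [xi ^ ei for xi, ei in zip(x, patterns[k])]  # output = input XOR error pattern
        obs_in.append([rng.gauss(mean1 if xi else mean0, std) for xi in x])
        obs_out.append([rng.gauss(mean1 if yi else mean0, std) for yi in y])
        labels.append(k)
    return labels, obs_in, obs_out


labels, obs_in, obs_out = sample_dataset(100, 20, 3, 0.5, 0.05, 0.3, -2.0, 2.0, 1.0)
print(len(labels), len(obs_in[0]), len(obs_out[0]))  # 100 20 20
```

Fixing the seed keeps repeated runs comparable, which is convenient when averaging over independent experiments.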

2) Since tumor heterogeneity affects the distribution of subtypes in the clinic, it is often of interest in data analysis procedures that tumor tissue contains multiple abnormal subtypes or is contaminated with paired paracancerous tissue cells during dissection. To test our method's performance on "dirty" data sets, we run 9 additional experiments, still setting the same sample number, number of bits, number of abnormal subtypes and initialization parameter set; the settings contain different numbers of true subtypes and different proportions of components, Is composed of three proportions and +.>>/>,/>>/>The method comprises the steps of carrying out a first treatment on the surface of the Randomly generated coincidence parameter +.>Is of Bernoulli distribution>The coincidence parameter is->Is of Bernoulli distribution>Go on->Calculation to get->The method comprises the steps of carrying out a first treatment on the surface of the Re-random generation of coincidence parameter +.>Is of Bernoulli distribution>Go on->Calculation to get->The method comprises the steps of carrying out a first treatment on the surface of the Order theGo on->Calculation to get->The method comprises the steps of carrying out a first treatment on the surface of the If->Then->At mean value +.>Variance is->Gaussian distribution of (1) if +.>Then->At mean value +.>Variance is->Gaussian distributed lining sampling of (c) and the same thing obtains +.>The method comprises the steps of carrying out a first treatment on the surface of the Final calculation->：

Generated by the same methodThe method comprises the steps of carrying out a first treatment on the surface of the Randomly generated coincidence parameter +.>Is of Bernoulli distribution>According to polynomial distribution +.>Is->Utilize->Testing accuracy; 10 independent experiments were performed on 2, 3, and 4 abnormal subtypes, respectively, and the average experimental results were obtained for the 10 independent experiments, as shown in table 3 below:

TABLE 3

It can be seen that the method provided by this embodiment can accurately classify the abnormal subtypes on the contaminated data sets; moreover, the factor graph message transfer model and the message transfer EM algorithm exhibit low randomness, so the results are more stable, convergence is faster, and the parameter estimates fluctuate little after convergence.
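
For reference, the adjusted Rand index used in the evaluation above compares a predicted labelling with the true one while correcting for chance agreement. The following is a small standard-library sketch of the usual pair-counting formula; the function name `adjusted_rand_index` is an illustrative choice, and the degenerate case is handled by returning 1.0, which is an assumption about convention.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(true_labels: Sequence[Hashable],
                        pred_labels: Sequence[Hashable]) -> float:
    """Pair-counting adjusted Rand index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs that fall in the same cell of the contingency table
    index = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (index - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0, label names do not matter
```

A value of 1 means the two partitions agree exactly up to a renaming of the classes, which is why permuted labels still score 1.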

Example 4

The present embodiment provides a classification system for error patterns in a data processing process, which is configured to implement the classification method described in embodiments 1, 2 or 3, as shown in fig. 5, including:

In the signal generation module, the specific method for generating the original data signal comprises the following steps:

generating an original data sequence，/>Representing the +.>A personal site value; original data sequence->The value of each site obeys the first parameter +.>Is 0/1 binomial distribution of a priori probability +.>：

in the method, in the process of the invention,indicate->Original data signal of individual channels,/ >，/>Indicate->The +.o. of the original data signal of the individual channels>A personal site value; />Indicate->A random noise sequence of the individual channels,，/>indicate->Random noise sequence of individual channels +.>Random noise situation of individual sites,/->When it indicates that random noise occurs, < >>When it indicates no random noise and +.>Subject to the second parameter +.>Is 0/1 binomial distribution of a priori probability +.>：

To the original data signalAnd performing data processing as actual input data.

In the model construction module, the specific method for constructing the error pattern classification Bayesian model based on the input observation data and the output observation data comprises the following steps:

in the method, in the process of the invention,indicate->Actual output data of the individual channels,/>，/>Indicate->The +.o. of the actual output data of the individual channels>A personal site value; />Indicate->The flip patterns corresponding to the data of the individual channels,，/>indicate->Personal channel data>A flip pattern corresponding to the location;

setting commonality in data processingError pattern, flip pattern +. >One of the error patterns is:

For coding vectors, express +.>Whether the flip pattern of the individual channels is +.>Error pattern->Represent the firstThe flip pattern of the individual channels is +.>Error pattern->Indicate->The flip pattern of the individual sample channels is not +.>Error pattern->Indicate->The error pattern type of the flip pattern of each channel obeys polynomial distribution, and the prior probability is that：

in the method, in the process of the invention,representation->Under the condition of taking different values +.>Probability of->Representation->Under the condition of taking different values +.>Probability of (2); />Indicate->Individual channel->Input observations of individual sites, +.>Indicate->Input observations of individual channels, < >>Indicate->The +.o. of the original data signal of the individual channels >A personal site value; />Respectively representing a first mean estimate and a first variance estimate; />Representing a second mean estimate and a second variance estimate, respectivelyCounting;

In the model conversion module, the error pattern classification Bayesian model is converted into a factor graph form, and the specific method for obtaining the factor graph message transmission model is as follows:

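taking the error patterns, the flip pattern, the actual output data, the original data signal, the coding vector, the random noise sequence, and the original data sequence as the first variable node, the second variable node, the third variable node, the fourth variable node, the fifth variable node, the sixth variable node, and the seventh variable node, respectively;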

will be a priori probabilityAs a first factor node, will +.>And->Conditional probability betweenAs a second factor node, will +.>And->Conditional probability between->As a third factor node, will +.>And->Conditional probability between->As a fourth factor node, the prior probability +.>As a fifth factor node, the first transition probability +.>As a sixth factor node, the second transition probability +.>As a seventh factor node, the prior probability +.>As an eighth factor node, the prior probability is calculatedAs a ninth factor node;

the fourth variable node is also connected with a sixth factor node;

In the data classification module, the factor graph message transfer model is iteratively solved by using a message transfer EM algorithm to obtain the type of the error pattern in the data processing process, and the method specifically comprises the following steps:

the initialization submodule is used for initializing a message ratio of a second factor node to a second variable node, a message ratio of a fifth variable node to the second factor node and a message ratio of a first variable node to the second factor node based on a factor graph message transfer model, and initializing a parameter set;

the first iteration submodule is used for carrying out message transmission iteration on the main factor subgraph based on the initialized message ratio of the second factor node to the second variable node, the message ratio of the fifth variable node to the second factor node and the message ratio of the first variable node to the second factor node to obtain posterior distribution of the third variable node, the fourth variable node, the sixth variable node and the seventh variable node and the message ratio of the third factor node to the second variable node;

the second iteration submodule is used for carrying out message transmission iteration on the split factor subgraphs based on the initialized message ratio of the second factor node to the second variable node, the message ratio of the fifth variable node to the second factor node, the message ratio of the first variable node to the second factor node and the iteration message ratio of the third factor node to the second variable node, so as to obtain posterior distribution of the first variable node and the fifth variable node;

A parameter updating sub-module, configured to update a parameter set based on posterior distribution of a first variable node, a third variable node, a fourth variable node, a fifth variable node, a sixth variable node, and a seventh variable node;

the iteration updating sub-module is used for returning to the first iteration sub-module until the preset iteration times are reached;

the classifying result obtaining sub-module is used for taking the corresponding error pattern type as a final classifying result to obtain the type of the error pattern in the data processing process.

The same or similar reference numerals correspond to the same or similar components;

the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;

It is to be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above teachings will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement that comes within the spirit and principles of the invention is intended to be protected by the following claims.

Claims

1. A method for classifying an error pattern in a data processing process, comprising:

S1: generating an original data signal, and performing data processing by taking the original data signal as actual input data; the specific method for generating the original data signal is as follows:

in the method, in the process of the invention,indicate->Original of individual channelsStart data signal,/->，/>Indicate->The +.o. of the original data signal of the individual channels>A personal site value; />Representing an exclusive-or operation; />Indicate->A random noise sequence of the individual channels,，/>indicate->Random noise sequence of individual channels +.>Random noise situation of individual sites,/->When it indicates that random noise occurs, < >>When it indicates no random noise and +.>Obeying the second parameter asIs 0/1 binomial distribution of a priori probability +.>：

To the original data signalData processing is carried out as actual input data;

S3: an error pattern classification Bayesian model is constructed based on input observation data and output observation data, and the specific method comprises the following steps:

setting commonality in data processingError pattern, flip pattern +.>One of the error patterns, then:

in the method, in the process of the invention,indicate->Error pattern->，/>Indicate->Error pattern->The site situation obeys a third parameter of +.>Is 0/1 binomial distribution of a priori probability +.>；/>For coding vectors, express +.>Whether the flip pattern of the individual channels is +.>Error pattern->Indicate->The flip pattern of the individual channels is +.>Error pattern->Indicate->The flip pattern of the individual channels is not +.>Error pattern- >Indicate->Error pattern type of flip pattern of each channel obeys polynomial distribution, and the prior probability is +.>；

Measuring actual input data and actual output data by using an electronic instrument to obtain input observation data and output observation data;

completing the construction of an error pattern classification Bayesian model;

2. The method for classifying an error pattern in a data processing process according to claim 1, wherein in step S4, the specific method for converting the error pattern classification bayesian model into a factor graph form and obtaining a factor graph message passing model is as follows:

will be a priori probabilityAs a first factor node, will +.>And->Conditional probability between As a second factor node, will +.>And->Conditional probability between->As a third factor node, will +.>And->Conditional probability between->As a fourth factor node, the prior probability +.>As a fifth factor node, the first transition probability +.>As a sixth factor node, the second transition probability +.>As a seventh factor node, the prior probability +.>As an eighth factor node, the prior probability is calculatedAs a ninth factor node;

the fourth variable node is also connected with a sixth factor node;

3. The method for classifying an error pattern in a data processing process according to claim 2, wherein in step S5, the factor graph message passing model is iteratively solved by using a message passing EM algorithm, and the specific method for obtaining the type of the error pattern in the data processing process is as follows:

4. A classification system for error patterns in a data processing process for implementing the classification method of any one of claims 1-3, comprising:

the signal generation module is used for generating an original data signal and performing data processing by taking the original data signal as actual input data; the specific method for generating the original data signal is as follows:

generating an original data sequence，/>Representing the +.>Individual site value, ->Is the total number of sites; original data sequence->The value of each site obeys the first parameter +.>Is 0/1 binomial distribution of a priori probability +.>：

in the method, in the process of the invention,indicate->Original data signal of individual channels,/>，/>Indicate->The +.o. of the original data signal of the individual channels >A personal site value; />Representing an exclusive-or operation; />Indicate->A random noise sequence of the individual channels,，/>indicate->Random noise sequence of individual channels +.>Random noise situation of individual sites,/->When it indicates that random noise occurs, < >>When it indicates no random noise and +.>Obeying the second parameter asIs 0/1 binomial distribution of a priori probability +.>：

To the original data signalData processing is carried out as actual input data;

the model construction module is used for constructing an error pattern classification Bayesian model based on the input observation data and the output observation data, and the specific method comprises the following steps:

raw data signalAs actual input data to process data, the data processing process has faults, the original data signal is overturned, and the overturned pattern is set, so that the actual input of the data processing processThe output data are:

in the method, in the process of the invention,indicate->Actual output data of the individual channels,/>，/>Indicate->The +.o. of the actual output data of the individual channels>A personal site value; />Indicate->The flip patterns corresponding to the data of the individual channels,，/>indicate->Personal channel data >A flip pattern corresponding to the location;

in the method, in the process of the invention,indicate->Error pattern->，/>Indicate->Error pattern->The site situation obeys a third parameter of +.>Is 0/1 binomial distribution of a priori probability +.>；/>For codingVector, express->Whether the flip pattern of the individual channels is +.>Error pattern->Indicate->The flip pattern of the individual channels is +.>Error pattern->Indicate->The flip pattern of the individual channels is not +.>Error pattern->Indicate->Error pattern type of flip pattern of each channel obeys polynomial distribution, and the prior probability is +.>；

in the method, in the process of the invention,representation->Under the condition of taking different values +.>Probability of->Representation->Under the condition of taking different values +.>Probability of (2); />Indicate->Individual channel->Input observations of individual sites, +.>Indicate->Input observations of individual channels, < >>Indicate- >The +.o. of the original data signal of the individual channels>A personal site value; />Respectively representing a first mean estimate and a first variance estimate; />Representing a second mean estimate and a second variance estimate, respectively;

completing the construction of an error pattern classification Bayesian model;

5. The system for classifying an error pattern in a data processing process according to claim 4, wherein the model transformation module transforms the error pattern classification bayesian model into a factor graph form, and the specific method for obtaining a factor graph message transmission model is as follows:

the fourth variable node is also connected with a sixth factor node;

6. The system for classifying an error pattern in a data processing process according to claim 5, wherein the data classifying module performs iterative solution to a factor graph message passing model by using a message passing EM algorithm to obtain a type of the error pattern in the data processing process, and specifically comprises: