CN111199782B - Etiology analysis method, device, storage medium and electronic equipment

Info

Publication number: CN111199782B
Application number: CN201911396700.6A
Authority: CN
Inventors: 孙浩; 侯广健; 刘满兰; 刘志鹏; 邹存璐; 王�锋
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2023-09-29
Anticipated expiration: 2039-12-30
Also published as: CN111199782A

Abstract

The present disclosure relates to an etiology analysis method, apparatus, storage medium, and electronic device, which provide a new etiology analysis method and implement automated etiology analysis. The method comprises the following steps: acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises the attribute items of each sample and the value data of the sample under each attribute item, and every case in the case group has the same symptoms; determining the data type of each attribute item according to the value data under that attribute item; and inputting the attribute item information of the control group and the case group into a data processing model to obtain a target attribute item that is output by the data processing model and is related to the symptoms. The attribute item information comprises the value data of an attribute item and the data type of the attribute item, and the data processing model processes the value data of an attribute item with the data processing algorithm corresponding to the data type of that attribute item.

Description

Etiology analysis method, device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to a method, an apparatus, a storage medium, and an electronic device for etiology analysis.

Background

Etiology analysis is an important research direction in the field of medical science, and mainly explores the reasons for occurrence of diseases, the mutual effects among related factors and the influence of each factor on occurrence and development of the diseases.

In the related art, the etiology analysis process mainly includes three steps: general data analysis, single factor analysis and multi-factor analysis. For general data analysis, researchers must label every variable in the data to be analyzed; the variables are then classified according to the labels, and a different algorithm is applied to each class of variable. For single factor analysis and multi-factor analysis, the variables must be classified and labeled again before the reclassified variables are analyzed. This requires a large amount of manual labeling work. When a large number of variables are labeled by hand, variable types are easily mislabeled, and mislabeled types can make the etiology analysis result clearly inaccurate. When that happens, researchers must either check the labels of a large number of variables, which is time-consuming, or relabel all variables, which also takes a great deal of time. Manual labeling therefore makes etiology analysis costly in labor and low in efficiency.

Disclosure of Invention

The invention aims to provide a method, a device, a storage medium and electronic equipment for analyzing etiology, so as to provide a novel method for analyzing the etiology and realize automatic analysis of the etiology.

To achieve the above object, according to a first aspect of embodiments of the present disclosure, there is provided an etiology analysis method including:

acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises various attribute items of a sample and value data of the sample under each attribute item, and the symptoms of each case in the case group are the same;

determining the data type of each attribute item according to the value data under each attribute item;

inputting the information of each attribute item of the control group and the case group into a data processing model to obtain a target attribute item which is output by the data processing model and is related to the symptoms;

the attribute item information comprises value data of an attribute item and a data type of the attribute item, and the data processing model is used for processing the value data of the attribute item according to a data processing algorithm corresponding to the data type of the attribute item.

Optionally, the determining the data type of each attribute item according to the value data under each attribute item includes:

determining that the data type of an attribute item whose value data take exactly two distinct values is a qualitative comparable type;

determining that the data type of an attribute item whose value data do not take exactly two distinct values, are numerical data, and conform to a normal distribution is a quantitative type;

determining that the data type of an attribute item whose value data do not take exactly two distinct values, are numerical data, and do not conform to a normal distribution is the qualitative comparable type;

determining that the data type of an attribute item whose value data do not take exactly two distinct values, are non-numerical data, and do not exist in a knowledge base is a qualitative incomparable type;

and determining that the data type of an attribute item whose value data do not take exactly two distinct values, are non-numerical data, and exist in the knowledge base is the qualitative comparable type.

Optionally, the processing of the value data of each attribute item by the data processing model includes:

for attribute items whose data type is the quantitative type, testing by at least one of a rank sum test, a T test and a T' test to obtain a first intermediate attribute item;

for attribute items whose data type is a qualitative type, testing by a chi-square test algorithm to obtain a second intermediate attribute item, wherein the qualitative types comprise the qualitative comparable type and the qualitative incomparable type;

and carrying out single factor analysis on the first intermediate attribute item and the second intermediate attribute item to obtain a first target attribute item related to the symptoms, wherein the target attribute item comprises the first target attribute item.

Optionally, the single factor analysis includes performing a segmentation discretization process on the value data of each attribute item in the first intermediate attribute item, where a segmentation process in the segmentation discretization process includes:

determining a numerical interval of the attribute item according to the maximum value and the minimum value of the attribute item;

segmenting the numerical interval according to each hyperparameter in a preset hyperparameter space to obtain a set of segmented interval sequences covering all segmentation cases;

and calculating a P value representing the statistical significance of each segmented interval sequence in the segmented interval sequence set, and taking the segmented interval sequence with the minimum P value as a segmentation result.
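As a non-limiting illustration, the segmentation search described above may be sketched in Python as follows. The sketch rests on assumptions that the disclosure does not state: each hyperparameter is taken to be a number of equal-width segments, the P value of a segmented interval sequence is taken from a chi-square test of independence between segment membership and group, and all function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    tail = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df + 1) // 2))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def segment_edges(low, high, k):
    """Split [low, high] into k equal-width segments (assumed segmentation rule)."""
    width = (high - low) / k
    return [low + width * i for i in range(k)] + [high]


def segment_index(value, edges):
    """Index of the segment holding value; the last segment includes the maximum."""
    for i in range(len(edges) - 2):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2


def segmentation_p(case_values, control_values, edges):
    """P value of a chi-square test between segment membership and group."""
    k = len(edges) - 1
    groups = (case_values, control_values)
    table = [[0] * k for _ in groups]
    for row, values in enumerate(groups):
        for value in values:
            table[row][segment_index(value, edges)] += 1
    used = [j for j in range(k) if table[0][j] + table[1][j] > 0]  # skip empty segments
    if len(used) < 2:
        return 1.0
    total = len(case_values) + len(control_values)
    statistic = 0.0
    for row, values in enumerate(groups):
        for j in used:
            expected = len(values) * (table[0][j] + table[1][j]) / total
            statistic += (table[row][j] - expected) ** 2 / expected
    return chi2_sf(statistic, len(used) - 1)


def best_segmentation(case_values, control_values, hyperparameters=(2, 3, 4, 5)):
    """Return (P value, edges) of the segmentation with the minimum P value."""
    if not case_values or not control_values:
        raise ValueError("both groups must contain at least one sample")
    pooled = list(case_values) + list(control_values)
    low, high = min(pooled), max(pooled)
    candidates = (segment_edges(low, high, k) for k in hyperparameters)
    return min((segmentation_p(case_values, control_values, e), e) for e in candidates)


case_ages = [60, 62, 65, 70, 72, 75, 78, 80]
control_ages = [20, 22, 25, 30, 32, 35, 38, 40]
p_value, edges = best_segmentation(case_ages, control_ages)
```

In this sketch the hyperparameter space is a small tuple of segment counts; a real hyperparameter space could equally hold explicit cut points, which the disclosure leaves open.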

Optionally, the processing of the value data of each attribute item by the data processing model further includes:

Performing multi-factor analysis on the first target attribute item to obtain a second target attribute item, wherein the target attribute item comprises the second target attribute item;

wherein the multi-factor analysis comprises:

generating, for each attribute item in the first target attribute item whose data type is the qualitative incomparable type, a corresponding number of dummy variables according to the distinct values of its value data;

and generating a comparability coefficient corresponding to each value data under the attribute item according to each dummy variable of the attribute item.
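As a non-limiting illustration, the generation of dummy variables for a qualitative incomparable attribute item may be sketched in Python as follows. The sketch assumes conventional dummy coding, in which one distinct value serves as the reference level and every other distinct value receives one 0/1 indicator; the disclosure only states that "a corresponding number" of dummy variables is generated, so this count is an assumption. The function name is hypothetical, and the sketch does not cover the subsequent generation of coefficients.

```python
def dummy_variables(values, reference=None):
    """Dummy-code one attribute item.

    Returns (levels, rows): levels are the non-reference distinct values,
    and each row holds one 0/1 indicator per level for one sample.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]  # assumed default: first value in sorted order
    levels = [category for category in categories if category != reference]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


levels, rows = dummy_variables(["Han", "Miao", "Hui", "Han"], reference="Han")
```

Here the ethnicity attribute item takes three distinct values, so two dummy variables are produced and a sample with the reference value is encoded as all zeros.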

According to a second aspect of embodiments of the present disclosure, there is provided an etiology analysis device, the device comprising:

the acquisition module is used for acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises various attribute items of a sample and value data of the sample under each attribute item, and the symptoms of each case in the case group are the same;

the determining module is used for determining the data type of each attribute item according to the value data under each attribute item;

the input module is used for inputting the information of each attribute item of the control group and the case group into a data processing model to obtain a target attribute item which is output by the data processing model and is related to the symptoms;

Optionally, the determining module includes:

the first determination submodule is used for determining that the data type of an attribute item whose value data take exactly two distinct values is a qualitative comparable type;

the second determination submodule is used for determining that the data type of an attribute item whose value data do not take exactly two distinct values, are numerical data, and conform to a normal distribution is a quantitative type;

the third determination submodule is used for determining that the data type of an attribute item whose value data do not take exactly two distinct values, are numerical data, and do not conform to a normal distribution is the qualitative comparable type;

the fourth determination submodule is used for determining that the data type of an attribute item whose value data do not take exactly two distinct values, are non-numerical data, and do not exist in a knowledge base is a qualitative incomparable type;

and the fifth determination submodule is used for determining that the data type of an attribute item whose value data do not take exactly two distinct values, are non-numerical data, and exist in the knowledge base is the qualitative comparable type.

Optionally, the data processing model is configured to:

Optionally, the data processing model is further configured to:

wherein the multi-factor analysis comprises:

By adopting the technical scheme, at least the following technical effects can be achieved:

Sample data of a control group and sample data of a case group are acquired, wherein the sample data comprises the attribute items of each sample and the value data of the sample under each attribute item, and the data type of each attribute item is determined according to the value data under that attribute item. In this way, the data type of each attribute item is determined automatically, without manually classifying and labeling each attribute item. The attribute item information of the control group and the case group, with the data types determined, is then input into a data processing model for processing, and a target attribute item that is output by the data processing model and is related to the symptoms of the case group is obtained. This etiology analysis approach requires no manual participation in the analysis process, realizes automated etiology analysis, and avoids the problems of the related art.

Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.

Drawings

The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate the disclosure and together with the description serve to explain, but do not limit the disclosure. In the drawings:

FIG. 1 is a flow chart illustrating a method of etiology analysis according to an exemplary embodiment of the present disclosure.

FIG. 2 is a flow chart illustrating one method of determining a data type of an attribute item according to an exemplary embodiment of the present disclosure.

FIG. 3 is a flow chart illustrating another method of determining a data type of an attribute item according to an exemplary embodiment of the present disclosure.

FIG. 4 is a block diagram illustrating an etiology analysis apparatus according to an exemplary embodiment of the present disclosure.

FIG. 5 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the disclosure, are not intended to limit the disclosure.

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.

Etiology analysis is an important research direction in the field of medical science, and mainly explores the reasons for occurrence of diseases, the mutual effects among related factors and the influence of each factor on occurrence and development of the diseases. That is, the etiology analysis is to conduct scientific research on the disease cause of the patient after diagnosing the disease.

In the related art, the etiology analysis process is mainly divided into three parts: general data analysis, single factor analysis and multi-factor analysis. When general data analysis is performed, scientific researchers are required to label each variable word in the data to be analyzed, all variables in the data to be analyzed are classified based on labeling results, and then different algorithms are adopted for analyzing each type of variables. After the end of the general data analysis, single-factor analysis and multi-factor analysis are performed based on the result of the general data analysis. The classification requirements of the single-factor analysis and the multi-factor analysis methods on the variables are different from those of the general data analysis, so that when the single-factor analysis and the multi-factor analysis are carried out, the variables are required to be classified and marked again, and then the reclassified variables of different types are analyzed.

As a result, a great deal of manpower is needed to label every variable by hand during etiology analysis. When a large number of variables are labeled manually, variable types are easily mislabeled, and mislabeled types can make the etiology analysis result inaccurate. When the result is inaccurate, researchers must either check the labels of a large number of variables, which is time-consuming, or relabel all variables, which also takes a great deal of time. Manual labeling therefore makes etiology analysis costly in labor and low in efficiency. Moreover, an inaccurate etiology analysis result has no value and cannot provide leads for targeted experimental designs such as randomized controlled trials (RCTs) and cohort studies.

In view of the above, the embodiments of the present disclosure provide a method, an apparatus, a storage medium, and an electronic device for analyzing a cause of a disease, so as to provide a new method for analyzing a cause of a disease, and implement automated analysis of a cause of a disease, thereby solving the problems in the related art.

FIG. 1 is a flow chart illustrating a method of etiology analysis according to an exemplary embodiment of the present disclosure. As shown in FIG. 1, the method comprises:

S101, acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises various attribute items of a sample and value data of the sample under each attribute item, and the symptoms of all cases in the case group are the same.

When performing etiology analysis, it is first determined which condition is to be analyzed. Two groups of sample data are then selected: a case group, whose members have been diagnosed with the condition, and a control group, whose members do not have the condition of the case group. It is worth noting that every case in the case group has the same condition. For example, to analyze the etiology of gastric cancer, two groups of sample data are selected: a gastric cancer group and a non-gastric cancer group.

The two selected groups of sample data comprise a plurality of attribute items of each sample and the value data of each sample under each attribute item. By way of example, the attribute items may include the name, sex, ethnicity, age, education level, systolic blood pressure, diastolic blood pressure and blood glucose level of each sample. The value data of a sample under an attribute item is the specific value that the sample takes for that attribute item. For example, the value data of sample A in the control group under the name attribute item is Zhang San, and the value data of sample B in the case group under the name attribute item is Li Si; likewise, the value data of sample A under the systolic blood pressure attribute item is 110 mmHg, and that of sample B is 100 mmHg.

It should be understood by those of ordinary skill in the art that, when performing etiology analysis, every sample in the control group and the case group has the same set of attribute items. For example, if sample A has 100 attribute items, sample B has the same 100 attribute items.
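As a non-limiting illustration, the sample data and the requirement that all samples share the same attribute items may be sketched in Python as follows. The in-memory representation (one dictionary per sample, mapping attribute item name to value data), the attribute names and the function name are hypothetical and are not prescribed by the disclosure.

```python
from typing import Any

# Hypothetical representation: attribute item name -> value data of one sample.
Sample = dict[str, Any]


def check_same_attribute_items(control: list[Sample], case: list[Sample]) -> list[str]:
    """Return the shared attribute item names, or raise if any sample differs."""
    samples = control + case
    if not samples:
        raise ValueError("both groups are empty")
    expected = set(samples[0])
    for index, sample in enumerate(samples):
        if set(sample) != expected:
            raise ValueError(f"sample {index} has different attribute items")
    return sorted(expected)


control_group = [{"sex": "male", "age": 52, "systolic_mmHg": 110}]
case_group = [{"sex": "female", "age": 61, "systolic_mmHg": 100}]
attribute_items = check_same_attribute_items(control_group, case_group)
```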

In one implementation, sample data for the control group and the case group may be obtained from a clinical data center CDR database.

S102, determining the data type of each attribute item according to the value data under each attribute item.

The data type of each attribute item can be determined according to the value data under that attribute item. For example, if the attribute item is ethnicity, the value data of the samples under it may be Han, Miao, Hui, and so on; the data type of the ethnicity attribute item can be determined from these value data across all samples.

As another example, if the attribute item is diastolic blood pressure, its value data may be 120 mmHg, 100 mmHg, 80 mmHg, and so on; the data type of the diastolic blood pressure attribute item can be determined from these value data.

Because the data type of each attribute item is determined from its value data, researchers do not need to label the data type of each attribute item manually, which reduces the labor cost of the etiology analysis process.

S103, inputting the information of each attribute item of the control group and the case group into a data processing model to obtain a target attribute item which is output by the data processing model and is related to the symptoms.

After the data type of each attribute item of the control group and the case group is determined, inputting the information of each attribute item into the data processing model for analysis and processing to obtain the target attribute item which is output by the data processing model and is related to the symptoms of the case group.

The target attribute item is the result of the etiology analysis, that is, a risk factor for the disease. For example, in an etiology analysis of gastric cancer, the target attribute items associated with gastric cancer may be diet, staying up late, and the like.

It will be appreciated by those skilled in the art that, during etiology analysis, attribute items of different data types may have their value data processed by different data processing algorithms. For example, in the related art, the value data of the systolic blood pressure, diastolic blood pressure and age attribute items first undergo a normality test; the attribute items that pass the normality test are then subjected to a T test, and the attribute items that fail it are subjected to a rank sum test.
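As a non-limiting illustration, the rank sum test mentioned above may be sketched in Python as follows. The sketch computes the two-sided P value of the Wilcoxon rank sum (Mann-Whitney) test using the normal approximation with a tie correction and without a continuity correction; the choice of approximation and the function name are assumptions, since the disclosure does not specify how the test is computed.

```python
import math


def rank_sum_p(case_values, control_values):
    """Two-sided P value of the rank sum test (normal approximation)."""
    n1, n2 = len(case_values), len(control_values)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one sample")
    pooled = sorted([(v, 0) for v in case_values] + [(v, 1) for v in control_values])
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j shared by a tie
        for k in range(i, j):
            ranks[k] = average_rank
        tie_size = j - i
        tie_term += tie_size ** 3 - tie_size
        i = j
    case_rank_sum = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    u = case_rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


p_value = rank_sum_p(list(range(11, 21)), list(range(1, 11)))
```

The normal approximation is generally used for moderately large samples; for very small groups an exact computation would be more appropriate.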

Thus, in the present disclosure, the attribute item information input into the data processing model includes the value data of the attribute item and the data type of the attribute item. The data processing model selects the data processing algorithm corresponding to the data type of the attribute item and uses it to process the value data of the attribute item.

With this method, sample data of the control group and sample data of the case group are acquired, wherein the sample data comprises the attribute items of each sample and the value data of the sample under each attribute item, and the data type of each attribute item is determined according to the value data under that attribute item. In this way, the data type of each attribute item is determined automatically, without manually classifying and labeling each attribute item. The attribute item information of the control group and the case group, with the data types determined, is then input into the data processing model for processing, and the target attribute item that is output by the data processing model and is related to the symptoms of the case group is obtained. This etiology analysis approach requires no manual participation in the analysis process, thereby realizing automated etiology analysis and avoiding the problems caused by manually labeling attribute items in the related art.

In a possible implementation manner, as shown in fig. 2, the determining the data type of each attribute item according to the value data under each attribute item may include the following steps:

S201, determining that the data type of an attribute item whose value data take exactly two distinct values is a qualitative comparable type.

If the value data of an attribute item take exactly two distinct values, the data type of the attribute item is determined to be the qualitative comparable type. That is, if every value of the attribute item is either A or B, the data type of the attribute item is the qualitative comparable type. For example, if the value data of an attribute item are 0 or 1, yes or no, or 0.01 or 0.02, the data type of that attribute item is determined to be the qualitative comparable type.

S202, determining that the data type of an attribute item whose value data do not take exactly two distinct values, are numerical data, and conform to a normal distribution is a quantitative type.

That the value data of an attribute item do not take exactly two distinct values means that they take one distinct value, or three or more distinct values.

Numerical data, often denoted by the letter N, is data composed of digits, a decimal point, a sign, and the letter E.

If the value data of an attribute item do not take exactly two distinct values, are numerical data, and conform to a normal distribution, the data type of the attribute item is determined to be the quantitative type. In one implementation, whether the value data of the attribute item conform to a normal distribution can be verified by a normality test.

S203, determining that the data type of an attribute item whose value data do not take exactly two distinct values, are numerical data, and do not conform to a normal distribution is the qualitative comparable type.

If the value data of an attribute item do not take exactly two distinct values, are numerical data, and do not conform to a normal distribution, the data type of the attribute item is determined to be the qualitative comparable type.

S204, determining that the data type of an attribute item whose value data do not take exactly two distinct values, are non-numerical data, and do not exist in the knowledge base is a qualitative incomparable type.

Non-numerical data is single-character or character-string data that cannot take part in arithmetic, such as Chinese characters, English letters, numeric characters, and ASCII characters.

The knowledge base is a medical knowledge base, built by applying text analysis, word segmentation, part-of-speech tagging and similar processing to medical data. In the knowledge base, the value range of an attribute item is divided into several value intervals, and each value interval corresponds to a value representing a conclusion, for example the conclusion words high, medium and low.

By way of example, it will be appreciated by those of ordinary skill in the art that, in the medical field, the value intervals of certain attribute items each correspond to a conclusion category of that attribute item. For example, a systolic blood pressure value in the 120-130 mmHg interval corresponds to the conclusion category normal systolic pressure; a value in the 130-140 mmHg interval corresponds to slightly high systolic pressure; and a value above 140 mmHg corresponds to high systolic pressure.

Then, for an attribute item whose value data are normal systolic pressure, slightly high systolic pressure and high systolic pressure, if these value data do not exist in the knowledge base, the data type of the attribute item is determined to be the qualitative incomparable type.

It should be noted that, where available, the knowledge base may also be a well-maintained medical knowledge graph with a more complex structure.
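As a non-limiting illustration, the correspondence between value intervals and conclusion categories held in the knowledge base may be sketched in Python as follows. The interval bounds are taken from the systolic blood pressure example above; treating each interval as half-open, assigning exactly 140 mmHg to the highest category, and the names used are assumptions of this sketch rather than features of the disclosure.

```python
# Hypothetical knowledge-base entry: (lower bound, upper bound, conclusion category).
# Each interval is assumed to include its lower bound and exclude its upper bound.
SYSTOLIC_INTERVALS = [
    (120.0, 130.0, "normal systolic pressure"),
    (130.0, 140.0, "slightly high systolic pressure"),
    (140.0, float("inf"), "high systolic pressure"),
]


def conclusion_for(value, intervals=SYSTOLIC_INTERVALS):
    """Return the conclusion category of a value, or None if no interval covers it."""
    for lower, upper, conclusion in intervals:
        if lower <= value < upper:
            return conclusion
    return None


category = conclusion_for(135)
```

Values below 120 mmHg return None here because the example in the text does not define a category for them.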

S205, determining that the data type of an attribute item whose value data do not take exactly two distinct values, are non-numerical data, and exist in the knowledge base is the qualitative comparable type.

For an attribute item whose value data are normal systolic pressure, slightly high systolic pressure and high systolic pressure, if these value data exist in the knowledge base, the data type of the attribute item is determined to be the qualitative comparable type.

It should be noted that, since the knowledge base directly affects the data types determined in steps S204 and S205, in one implementation the data types determined in steps S204 and S205 may be checked manually. For example, an attribute item determined to be of the qualitative comparable type in step S205 may be adjusted to the qualitative incomparable type. If the knowledge base is well maintained, the data types determined in steps S204 and S205 are more accurate, so the results of both steps need not be adjusted.

In addition, it should be noted that, for the determination result of the data type of the attribute item in steps S202 and S203, manual adjustment may also be performed. For example, the data type of the attribute item of the qualitative comparable type determined in step S203 may be readjusted to a quantitative type.

It should be noted here that the present disclosure is not limited to the order of steps S201 to S205.

This method of classifying the data types of the attribute items replaces the manual classification and labeling of attribute items in the related art, and thereby reduces labor cost.
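As a non-limiting illustration, the determination of steps S201 to S205 may be sketched in Python as follows. The sketch makes several assumptions that the disclosure does not fix: the knowledge base is modeled as a plain set of known value strings, value data are treated as existing in the knowledge base only if every distinct value is found in it, the normality test is a Jarque-Bera test with a significance level of 0.05, and constant data are treated as non-normal. All names are hypothetical, and a different normality test can be supplied through the `normality_p` parameter.

```python
import math


def jarque_bera_p(values):
    """P value of the Jarque-Bera normality test (asymptotic chi-square, 2 df)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0  # constant data: treated as non-normal in this sketch
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    return math.exp(-statistic / 2)  # chi-square upper tail for 2 df


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def determine_data_type(values, knowledge_base, alpha=0.05, normality_p=jarque_bera_p):
    """Classify one attribute item from its value data (steps S201 to S205)."""
    if not values:
        raise ValueError("an attribute item needs at least one value")
    distinct = set(values)
    if len(distinct) == 2:                                   # S201
        return "qualitative comparable"
    if all(is_number(v) for v in values):
        if normality_p([float(v) for v in values]) > alpha:  # S202
            return "quantitative"
        return "qualitative comparable"                      # S203
    if distinct.issubset(knowledge_base):                    # S205
        return "qualitative comparable"
    return "qualitative incomparable"                        # S204


known_values = {"normal systolic pressure", "slightly high systolic pressure",
                "high systolic pressure"}
ethnicity_type = determine_data_type(["Han", "Miao", "Hui"], known_values)
```

Because the two-value check comes first, numeric attribute items such as 0.01 or 0.02 are classified as qualitative comparable before any normality test is run, consistent with the example given for step S201.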

FIG. 3 is a flow chart illustrating another method of determining a data type of an attribute item according to an exemplary embodiment of the present disclosure. FIG. 3 illustrates a specific implementation flow of the method of FIG. 2 for determining the data type of an attribute item.

In one possible implementation manner, the processing of the value data of each attribute item by the data processing model includes: for attribute items whose data type is the quantitative type, testing by at least one of a rank sum test, a T test and a T' test to obtain a first intermediate attribute item.

In the related art, general data analysis of the sample data requires all attribute items in the sample data to be divided into two classes, after which the analysis proceeds according to the classification. In outline, the value data of the first class of attribute items undergo a normality test; the value data of attribute items that conform to a normal distribution are then subjected to a T test or a T' test, and the value data of attribute items that do not conform to a normal distribution are subjected to a rank sum test. The value data of the second class of attribute items are subjected to a chi-square test.

Accordingly, with respect to this general data analysis approach in the related art, the present disclosure defines the data type of the first class of attribute items as the quantitative type and the data type of the second class as the qualitative type. Attribute items whose data type is the quantitative type are then tested by at least one of a rank sum test, a T test and a T' test to obtain the first intermediate attribute item, and attribute items whose data type is a qualitative type are tested by a chi-square test algorithm to obtain the second intermediate attribute item.
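As a non-limiting illustration, the chi-square test applied to a qualitative attribute item may be sketched in Python as follows. The sketch computes a Pearson chi-square test of independence between group membership (case or control) and the value of the attribute item, without a continuity correction; this specific variant, the closed-form tail probability and the function names are assumptions, since the disclosure names only "a chi-square test algorithm".

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^i / i!
        return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: complementary error function plus a finite sum of half-integer powers.
    tail = sum(half ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, (df + 1) // 2))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def chi_square_p(case_values, control_values):
    """P value of a chi-square test between group and attribute value."""
    if not case_values or not control_values:
        raise ValueError("both groups must contain at least one sample")
    case_counts = Counter(case_values)
    control_counts = Counter(control_values)
    categories = set(case_counts) | set(control_counts)
    n_case, n_control = len(case_values), len(control_values)
    total = n_case + n_control
    statistic = 0.0
    for category in categories:
        column_total = case_counts[category] + control_counts[category]
        for observed, row_total in ((case_counts[category], n_case),
                                    (control_counts[category], n_control)):
            expected = row_total * column_total / total
            statistic += (observed - expected) ** 2 / expected
    degrees_of_freedom = len(categories) - 1
    if degrees_of_freedom < 1:
        return 1.0  # a single category carries no information about the groups
    return chi2_sf(statistic, degrees_of_freedom)


# Hypothetical attribute item with values "yes" / "no" in each group.
p_value = chi_square_p(["yes"] * 30 + ["no"] * 10, ["yes"] * 10 + ["no"] * 30)
```

An attribute item whose P value falls below a chosen significance level would be kept as part of the second intermediate attribute item; the threshold itself is not specified in this passage.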

The first intermediate attribute term and the second intermediate attribute term characterize the results of general profile analysis in the related art. The number of attribute items included in the first intermediate attribute item and the second intermediate attribute item is smaller than the number of attribute items included in the sample data of the control group and the case group.

And carrying out single factor analysis on all attribute items in the first intermediate attribute item and the second intermediate attribute item to obtain a first target attribute item related to the symptoms in the case group. The target attribute items include the first target attribute item, that is, each attribute item in the result of the one-factor analysis may be a result of the etiology analysis.

The first intermediate attribute item and the second intermediate attribute item are thus obtained by determining the data type of each attribute item and then performing general data analysis on the attribute items of the quantitative type and the qualitative type. No researcher needs to label and classify each attribute item, which reduces labor cost compared with the related art.

When single-factor analysis is performed on the first intermediate attribute item and the second intermediate attribute item, the attribute items of the qualitative type have already been further divided into the qualitative comparable type and the qualitative incomparable type in the above steps. The single-factor analysis can therefore be performed directly on the attribute items of the quantitative type, the qualitative comparable type, and the qualitative incomparable type in the first intermediate attribute item and the second intermediate attribute item. Unlike the related art, there is no need to reclassify each attribute item, which further reduces labor costs.

It should be noted that, in the related art, single-factor analysis generally includes discretizing attribute items of the quantitative type, performing Logistic regression analysis on attribute items of the qualitative comparable type, performing dummy-coding analysis on attribute items of the qualitative incomparable type, and the like.

It will be appreciated by those skilled in the art that single-factor analysis relies primarily on Logistic regression and the corresponding OR and P values to analyze the impact of a single attribute item on disease occurrence. The OR value, an important index, measures the factor by which the disease risk increases when the value data of the attribute item increases by one unit of granularity.
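For reference, the standard relation between a fitted Logistic regression coefficient and the OR value can be written as follows. The symbols $p$, $x$, $b_0$ and $b_1$ are introduced here for illustration only: $p$ is the probability of disease, $x$ is the value of a single attribute item, and $b_0$, $b_1$ are the fitted intercept and coefficient.

$$
\ln\frac{p}{1-p} = b_0 + b_1 x, \qquad \mathrm{OR} = e^{b_1}
$$

Under this model, increasing $x$ by one unit multiplies the odds $p/(1-p)$ of disease by $e^{b_1}$. Strictly, the OR value therefore describes a change in odds, which approximates a change in risk only when the disease is rare.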

Therefore, when single-factor analysis is performed on the first intermediate attribute item and the second intermediate attribute item, applying segmented discretization to the attribute items of the quantitative type can give them better statistical significance (P value). The statistical significance of a result is an estimate of how far the result can be trusted to be true (that is, to represent the population). The greater the P value, the less the association between attribute items in the sample can be regarded as a reliable indicator of the association between attribute items in the population. For example, a P value of 0.05 indicates that there is a five percent chance that the association between attribute items in the sample is accidental.

In an implementation manner, the single factor analysis includes performing a segmentation discretization processing on the value data of each attribute item in the first intermediate attribute item, where a segmentation process in the segmentation discretization processing includes:

First, the numerical interval of the attribute item is determined according to the maximum value and the minimum value of the attribute item.

For example, if the maximum value of an age attribute item is 100 and the minimum value is 0, the numerical interval of the age attribute item is [0, 100].

Then, the numerical interval is segmented according to each hyperparameter in a preset hyperparameter space to obtain the segment interval sequence set under all segmentation conditions.

For example, if the hyperparameter space is (2, 10), then the hyperparameters in the hyperparameter space are 2, 3, 4, 5, 6, 7, 8, 9, and 10.

The numerical interval of the attribute item is segmented according to each hyperparameter. For example, dividing the numerical interval [0, 100] into two segments according to hyperparameter 2 yields every way of splitting the interval into two segments, such as [0, 1], [2, 100]; [0, 2], [3, 100]; [0, 3], [4, 100]; and so on (not all two-segment cases are listed here). As another example, dividing the numerical interval [0, 100] into three segments according to hyperparameter 3 yields every way of splitting the interval into three segments, such as [0, 1], [2, 3], [4, 100]; [0, 2], [3, 4], [5, 100]; [0, 3], [4, 5], [6, 100]; and so forth. Segmenting the numerical interval of the attribute item according to every hyperparameter in this way yields the segment interval sequence set under all segmentation conditions.
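The exhaustive segmentation described above can be sketched as follows. The function names `segmentations` and `all_segmentations` are hypothetical, and the sketch assumes integer-valued data with a granularity of 1, as in the age example.

```python
from itertools import combinations

def segmentations(lo, hi, k):
    """Yield every split of the integer interval [lo, hi] into k
    consecutive, non-empty segments, each as a list of (start, end) pairs."""
    # Each cut is the last value of a segment, so cuts lie in lo..hi-1.
    for cuts in combinations(range(lo, hi), k - 1):
        starts = [lo] + [c + 1 for c in cuts]
        ends = list(cuts) + [hi]
        yield list(zip(starts, ends))

def all_segmentations(lo, hi, hyperparameters):
    """Collect the segment interval sequences for every hyperparameter."""
    return [seq for k in hyperparameters for seq in segmentations(lo, hi, k)]

# Every two-segment split of a small interval.
small = list(segmentations(0, 3, 2))
# small == [[(0, 0), (1, 3)], [(0, 1), (2, 3)], [(0, 2), (3, 3)]]
```

The number of candidates grows combinatorially with the number of segments, so enumerating every hyperparameter from 2 to 10 over [0, 100] in full is practical only for small intervals or few segments.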

In one feasible implementation, segmenting the numerical interval of the attribute item according to each hyperparameter to obtain the segment interval sequence set under all segmentation conditions may be implemented using a Bayesian optimization algorithm.

And then, calculating a P value representing the statistical significance of each segmented interval sequence in the segmented interval sequence set, and taking the segmented interval sequence with the minimum P value as a segmented result.

In one implementation, the P value for each sequence of segment intervals may be calculated as follows:

First, each segment interval sequence in the segment interval sequence set is input into a Logistic regression model for analysis to obtain the variable coefficient and the variable standard error corresponding to each segment interval sequence.

In the related art, Logistic regression analysis is a generalized linear regression analysis model. It will be understood by those skilled in the art that when each segment interval sequence in the segment interval sequence set is input into the Logistic regression model for analysis, a set of variable coefficients and variable standard errors is obtained for each segment interval sequence.

Then, from each obtained variable coefficient and variable standard error, the corresponding Wald χ² value is calculated according to the following formula: $\text{Wald } \chi^2 = (b_j / s_j)^2$, where $b_j$ denotes the variable coefficient and $s_j$ denotes the variable standard error. The P value corresponding to each calculated Wald χ² value is then obtained by looking up a table.
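A minimal sketch of this calculation follows. The function names and the example coefficient are hypothetical. In place of a table lookup, the sketch uses the closed-form upper tail of the chi-square distribution with one degree of freedom, which assumes that each Wald χ² value is built from a single coefficient.

```python
import math

def wald_chi2(b, s):
    """Wald chi-square value from a variable coefficient b and its
    variable standard error s: (b / s) squared."""
    return (b / s) ** 2

def p_value_1df(wald):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom, used here instead of a printed table."""
    return math.erfc(math.sqrt(wald / 2.0))

# Hypothetical coefficient and standard error for one segment variable.
w = wald_chi2(0.8, 0.25)   # approximately 10.24
p = p_value_1df(w)         # well below 0.05
```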

Next, the segment interval sequence having the smallest P value is selected as the segmentation result, and each segment interval of that result is then converted in turn. Illustratively, if the segmentation result is [0, 2], [3, 4], [5, 100], the segment interval [0, 2] of the attribute item is converted to 1, the segment interval [3, 4] is converted to 2, and the segment interval [5, 100] is converted to 3. This completes the segmented discretization of the quantitative-type attribute item.
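The selection and conversion steps can be sketched as follows. The function names are hypothetical, and the P values in the example are made-up placeholders rather than results of any real analysis.

```python
def select_segmentation(p_values):
    """Return the candidate with the smallest P value.

    p_values maps each candidate, a tuple of (start, end) pairs, to the
    P value computed for it.
    """
    return min(p_values, key=p_values.get)

def encode(value, segments):
    """Convert a raw value to the 1-based index of its segment."""
    for code, (start, end) in enumerate(segments, start=1):
        if start <= value <= end:
            return code
    raise ValueError(f"{value} lies outside every segment")

# Placeholder P values for two candidate segmentations of [0, 100].
candidates = {
    ((0, 2), (3, 4), (5, 100)): 0.01,
    ((0, 50), (51, 100)): 0.20,
}
best = select_segmentation(candidates)
codes = [encode(v, best) for v in (1, 4, 50)]  # [1, 2, 3]
```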

With this method, the quantitative-type attribute items are subjected to segmented discretization so that they have statistical significance, and the result of single-factor analysis on the quantitative-type attribute items can therefore be more accurate. The quantitative-type attribute items in the single-factor analysis result can also better explain the disease of the case group.

In one possible implementation, the processing of the value data of each attribute item by the data processing model further includes: performing multi-factor analysis on the first target attribute item to obtain a second target attribute item, wherein the target attribute item includes the second target attribute item.

After the single-factor analysis, multi-factor analysis can be performed on the result of the single-factor analysis to analyze the effect of combinations of multiple attribute items on the symptoms of the case group. That is, multi-factor analysis analyzes whether a combination of multiple attribute items is a cause of the disease.

In the related art, multi-factor analysis mainly uses Logistic regression to analyze the degree to which a combination of multiple attribute items influences disease occurrence. When multi-factor analysis is performed on the attribute items, the attribute items of the qualitative incomparable type are dummy coded, and each dummy code is then input into a Logistic regression model for analysis.

However, this approach may affect the multi-factor analysis result. For example, one dummy code of an attribute item may be treated as an attribute item related to the condition of the case group, while another dummy code of the same attribute item is treated as unrelated to the condition of the case group.

In view of this, in the present disclosure, performing the multi-factor analysis for attribute items of qualitatively incomparable type includes:

For example, if the value data of the attribute item takes n kinds of values, n dummy variables are generated.
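The generation of n dummy variables for an attribute item with n kinds of values can be sketched as follows. The function name `dummy_encode` and the blood-type values are hypothetical examples, not taken from the disclosure.

```python
def dummy_encode(values):
    """Generate one 0/1 dummy variable per distinct value.

    Returns (categories, rows): categories is the sorted list of the n
    distinct values, and each row holds the n dummy variables of a sample.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical qualitative incomparable attribute item with n = 3 values.
categories, rows = dummy_encode(["A", "B", "O", "A"])
# categories == ["A", "B", "O"]
# rows == [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```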

A comparability coefficient corresponding to each value datum under the attribute item is then generated according to each dummy variable of the attribute item. Specifically, a Logistic model coefficient is generated for each dummy code according to each dummy variable of the attribute item.

Then, the comparability coefficient (Logistic model coefficient) of each value datum under the attribute item is input into the following formula to obtain the corresponding Wald χ² value: $\text{Wald } \chi^2 = (Q\beta)^{T}\,[\,Q\,\mathrm{var}(\beta)\,Q^{T}\,]\,(Q\beta)$;

It should be noted that the hypothesis underlying this formula is $\beta_0 = \beta_1 = \dots = \beta_{n-1} = 0$, where $\beta_0, \beta_1, \dots, \beta_{n-1}$ denote the Logistic model coefficients corresponding to the respective dummy variables.

It will be appreciated by those of ordinary skill in the art that a hypothesis needs to be set when performing Logistic regression analysis. When the hypothesis that is set differs, the derived Wald χ² formula differs as well.

In the Wald χ² formula, β denotes the dummy variable coefficients, var(β) denotes the standard error corresponding to the coefficients, T denotes the matrix transpose, and Q is defined as a matrix with n-1 rows and n columns whose first column is all 0, where n denotes the number of kinds of values of the attribute item.

The P value corresponding to the calculated Wald χ² value is obtained by looking up a table, and the obtained P value determines whether the attribute item of the qualitative incomparable type is excluded from the multi-factor analysis. For example, assuming that the preset threshold is 0.05, the attribute item is excluded if the obtained P value is greater than 0.05.
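The whole-attribute test can be sketched as follows under several labeled assumptions. First, only the shape of Q and its all-zero first column are stated in the text; the sketch assumes the remaining columns form an identity block, so that Q selects every coefficient except the first. Second, the sketch uses the standard textbook form of the Wald statistic, in which the bracketed term Q var(β) Qᵀ is inverted; the formula as printed above shows no inverse. Third, var(β) is taken to be the full covariance matrix of the coefficients, and n-1 degrees of freedom are assumed. All function names and numbers are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence on the regularized upper incomplete gamma function.
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def joint_wald(beta, cov):
    """Wald chi-square and P value for one attribute item as a whole."""
    n = len(beta)
    # Assumed Q: (n-1) x n, first column zero, identity block after it.
    q = [[1.0 if j == i + 1 else 0.0 for j in range(n)] for i in range(n - 1)]
    qb = [sum(q[i][j] * beta[j] for j in range(n)) for i in range(n - 1)]
    qv = [[sum(q[i][k] * cov[k][j] for k in range(n)) for j in range(n)]
          for i in range(n - 1)]
    qvq = [[sum(qv[i][k] * q[j][k] for k in range(n)) for j in range(n - 1)]
           for i in range(n - 1)]
    y = solve(qvq, qb)  # applies the assumed inverse without forming it
    wald = sum(u * v for u, v in zip(qb, y))
    return wald, chi2_sf(wald, n - 1)

# Hypothetical coefficients and covariance for an attribute item with n = 3.
beta = [0.3, 1.0, -2.0]
cov = [[0.04, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 1.0]]
wald, p = joint_wald(beta, cov)
keep_attribute = p <= 0.05  # excluded when P exceeds the preset threshold
```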

In this way, by taking all the value data of the attribute items of the qualitative incomparable type as a whole and then calculating the P value of the whole, the problem caused by calculating the P value for each value data of the attribute item in the related art can be avoided.

Based on the same inventive concept, the embodiments of the present disclosure further provide an etiology analysis apparatus. As shown in Fig. 4, the apparatus 400 includes:

the obtaining module 410 is configured to obtain sample data of a control group and sample data of a case group, where the sample data includes multiple attribute items of a sample and value data of the sample under each attribute item, and conditions of each case in the case group are the same;

the determining module 420 is configured to determine a data type of each attribute item according to the value data under each attribute item;

the input module 430 is configured to input information of each attribute item of the control group and the case group into a data processing model, so as to obtain a target attribute item related to the disorder output by the data processing model;

With this apparatus, sample data of a control group and sample data of a case group are obtained, the sample data including multiple attribute items of a sample and the value data of the sample under each attribute item, and the data type of each attribute item is determined according to the value data under that attribute item. In this way, the data type of each attribute item is determined automatically, without manually classifying and labeling each attribute item. The information of each attribute item of the control group and the case group, with its data type determined, is then input into a data processing model for processing, and the target attribute item output by the data processing model and related to the symptoms of the case group is obtained. This etiology analysis approach requires no manual participation in the analysis process, thereby achieving automated etiology analysis and avoiding the problems caused by manually labeling attribute items in the related art.

Optionally, the determining module 420 includes:

Optionally, the data processing model is configured to:

Optionally, the data processing model is further configured to:

wherein the multi-factor analysis comprises:

The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the embodiments of the method, and will not be described in detail herein.

Fig. 5 is a block diagram of an electronic device 700, according to an example embodiment. As shown in Fig. 5, the electronic device 700 may include a processor 701 and a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.

The processor 701 is configured to control the overall operation of the electronic device 700 so as to perform all or part of the steps of the etiology analysis method described above. The memory 702 is used to store various types of data to support operation of the electronic device 700; such data may include, for example, instructions for any application or method operating on the electronic device 700, as well as application-related data such as contact data, sent and received messages, pictures, audio, and video. The memory 702 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 703 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 702 or transmitted through the communication component 705. The audio component further includes at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, which may be a keyboard, a mouse, buttons, and the like. The buttons may be virtual buttons or physical buttons. The communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or the like, or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 705 may include a Wi-Fi module, a Bluetooth module, an NFC module, and so on.

In an exemplary embodiment, the electronic device 700 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated ASIC), digital signal processor (Digital Signal Processor, abbreviated DSP), digital signal processing device (Digital Signal Processing Device, abbreviated DSPD), programmable logic device (Programmable Logic Device, abbreviated PLD), field programmable gate array (Field Programmable Gate Array, abbreviated FPGA), controller, microcontroller, microprocessor, or other electronic components for performing the above-described etiology analysis method.

In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the etiology analysis method described above is also provided. For example, the computer readable storage medium may be the memory 702 including program instructions described above, which are executable by the processor 701 of the electronic device 700 to perform the etiology analysis method described above.

In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described etiology analysis method when executed by the programmable apparatus.

The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present disclosure within the scope of the technical concept of the present disclosure, and all the simple modifications belong to the protection scope of the present disclosure.

In addition, the specific features described in the above embodiments may be combined in any suitable manner without contradiction. The various possible combinations are not described further in this disclosure in order to avoid unnecessary repetition.

Moreover, the various embodiments of the present disclosure may be combined in any manner, and any such combination that does not depart from the spirit of the present disclosure should likewise be regarded as part of the content disclosed herein.

Claims

1. A method of etiology analysis, the method comprising:

acquiring sample data of a control group and sample data of a case group, wherein the two groups of sample data comprise various attribute items of a sample and value data of the sample under each attribute item, and the symptoms of each case in the case group are the same;

the attribute item information comprises value data of an attribute item and a data type of the attribute item, and the data processing model is used for processing the value data of the attribute item according to a data processing algorithm corresponding to the data type of the attribute item;

the processing of the data processing model to the value data of each attribute item comprises the following steps:

for the attribute items with the data types being quantitative types, checking at least one of rank sum checking, T checking and T' checking to obtain a first intermediate attribute item;

for the attribute items with the data types being qualitative types, checking through a chi-square checking algorithm to obtain second intermediate attribute items, wherein the qualitative types comprise qualitative comparable types and qualitative incomparable types;

performing single factor analysis on the first intermediate attribute item and the second intermediate attribute item to obtain a first target attribute item related to the disorder, wherein the target attribute item comprises the first target attribute item;

The single factor analysis comprises the step of carrying out segmentation discretization processing on the value data of each attribute item in the first intermediate attribute item, wherein the segmentation process in the segmentation discretization processing comprises the following steps:

2. The method of claim 1, wherein determining the data type of each of the attribute items based on the value data under each of the attribute items comprises:

3. The method of claim 1, wherein the processing of the value data of each attribute item by the data processing model further comprises:

wherein the multi-factor analysis comprises:

4. An etiology analysis device, the device comprising:

the acquisition module is used for acquiring sample data of a control group and sample data of a case group, wherein the two groups of sample data comprise various attribute items of a sample and value data of the sample under each attribute item, and the symptoms of each case in the case group are the same;

the data processing model is used for:

5. The apparatus of claim 4, wherein the means for determining comprises:

6. The apparatus of claim 4, wherein the data processing model is further configured to:

wherein the multi-factor analysis comprises:

7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any of claims 1-3.

8. An electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1-3.