CN111199782A - Etiology analysis method, etiology analysis device, storage medium, and electronic apparatus

Info

Publication number: CN111199782A
Application number: CN201911396700.6A
Authority: CN
Inventors: 孙浩; 侯广健; 刘满兰; 刘志鹏; 邹存璐; 王�锋
Original assignee: Neusoft Corp
Current assignee: Neusoft Corp
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2020-05-26
Anticipated expiration: 2039-12-30
Also published as: CN111199782B

Abstract

The present disclosure relates to a method, an apparatus, a storage medium and an electronic device for analyzing etiology, which provides a new etiology analysis method to realize automated analysis of etiology. The method comprises the following steps: acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises multiple attribute items of the samples and value data of the samples under the attribute items, and the symptoms of the cases in the case group are the same; determining the data type of each attribute item according to the value data under each attribute item; inputting the information of each attribute item of the control group and the case group into a data processing model to obtain a target attribute item which is output by the data processing model and is related to the disease; the attribute item information comprises value data of the attribute item and a data type of the attribute item, and the data processing model is used for processing the value data of the attribute item according to a data processing algorithm corresponding to the data type of the attribute item.

Description

Etiology analysis method, etiology analysis device, storage medium, and electronic apparatus

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a storage medium, and an electronic device for analyzing a cause of disease.

Background

Etiology analysis is an important research direction in the field of medical research, and mainly explores the causes of diseases, the mutual effects among related factors and the influence of all factors on the occurrence and development of the diseases.

In the related art, the etiology analysis process mainly comprises three steps: general data analysis, single-factor analysis and multi-factor analysis. When general data analysis is performed, researchers are required to label each variable in the data to be analyzed, all variables in the data to be analyzed are classified based on the labeling results, and then a different algorithm is adopted for analyzing each type of variable. When single-factor analysis and multi-factor analysis are performed, the variables need to be classified and labeled again, and then the reclassified variables are analyzed by type. In this manner, a significant amount of manpower is required to manually label each variable. In the process of manually labeling a large number of variables, variable types are easily mislabeled, and mislabeled variable types make the etiology analysis result inaccurate. If the result of the etiology analysis is obviously inaccurate, researchers must check the labels of a large number of variables, which requires a large amount of time, and re-labeling all the variables is equally time-consuming. Therefore, manual labeling leads to high labor cost and low efficiency in etiology analysis.

Disclosure of Invention

The invention aims to provide a method, a device, a storage medium and an electronic device for analyzing etiology, so as to provide a new etiology analysis mode and realize automatic analysis of etiology.

In order to achieve the above object, according to a first aspect of embodiments of the present disclosure, there is provided an etiology analysis method, the method including:

acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises multiple attribute items of the samples and value data of the samples under the attribute items, and the symptoms of the cases in the case group are the same;

determining the data type of each attribute item according to the value data under each attribute item;

inputting the information of each attribute item of the control group and the case group into a data processing model to obtain a target attribute item which is output by the data processing model and is related to the disease;

the attribute item information comprises value data of the attribute item and a data type of the attribute item, and the data processing model is used for processing the value data of the attribute item according to a data processing algorithm corresponding to the data type of the attribute item.

Optionally, the determining the data type of each attribute item according to the value data under each attribute item includes:

determining that the data type of an attribute item whose value data has exactly two distinct values is a qualitative comparable type;

determining that the data type of an attribute item whose value data does not have exactly two distinct values, is numerical data and conforms to a normal distribution is a quantitative type;

determining that the data type of an attribute item whose value data does not have exactly two distinct values, is numerical data and does not conform to a normal distribution is the qualitative comparable type;

determining that the data type of an attribute item whose value data does not have exactly two distinct values, is non-numerical data and does not exist in the knowledge base is a qualitative non-comparable type;

and determining that the data type of an attribute item whose value data does not have exactly two distinct values, is non-numerical data and is stored in the knowledge base is the qualitative comparable type.

Optionally, the processing, by the data processing model, of the value data of each attribute item includes:

for the attribute item whose data type is the quantitative type, performing a test through at least one of a rank-sum test, a T test and a T' test to obtain a first intermediate attribute item;

for the attribute item whose data type is a qualitative type, performing a test through a chi-square test algorithm to obtain a second intermediate attribute item, wherein the qualitative types comprise the qualitative comparable type and the qualitative non-comparable type;

and performing single-factor analysis on the first intermediate attribute item and the second intermediate attribute item to obtain a first target attribute item related to the disease, wherein the target attribute item comprises the first target attribute item.

Optionally, the single-factor analysis includes performing segmentation discretization on value data of each attribute item in the first intermediate attribute item, where a segmentation process in the segmentation discretization includes:

determining a value interval of the attribute item according to the maximum value and the minimum value of the attribute item;

segmenting the value interval according to each hyper-parameter in a preset hyper-parameter space to obtain a set of segment interval sequences covering all segmentation conditions;

and calculating a P value representing the statistical significance of each segment interval sequence in the segment interval sequence set, and taking the segment interval sequence with the minimum P value as a segment result.
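The segmentation search described above can be sketched as follows. This is only an illustration under stated assumptions: the hyper-parameter is taken to be the number of equal-width segments, the P value is taken from a chi-square test of segment against group membership, and the names `best_segmentation` and `segment_counts` are hypothetical. The text does not specify the hyper-parameters or the test used to obtain the P value.

```python
import numpy as np
from scipy.stats import chi2_contingency


def best_segmentation(values, is_case, segment_counts=(2, 3, 4, 5)):
    """Return (p_value, edges) of the equal-width split with the smallest P value.

    segment_counts stands in for the preset hyper-parameter space (assumption).
    Returns None if no candidate split can be tested.
    """
    values = np.asarray(values, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    lo, hi = values.min(), values.max()  # value interval from minimum and maximum
    best = None
    for k in segment_counts:
        edges = np.linspace(lo, hi, k + 1)
        # Segment index 0..k-1 for every value; the maximum falls in the last segment.
        idx = np.clip(np.digitize(values, edges[1:-1]), 0, k - 1)
        # Rows are segments, columns are (case count, control count).
        table = np.array([
            [np.sum((idx == s) & is_case), np.sum((idx == s) & ~is_case)]
            for s in range(k)
        ])
        table = table[table.sum(axis=1) > 0]  # drop empty segments
        if table.shape[0] < 2 or (table.sum(axis=0) == 0).any():
            continue  # the chi-square test is undefined for this split
        p_value = chi2_contingency(table)[1]
        if best is None or p_value < best[0]:
            best = (p_value, edges)
    return best
```

Calling `best_segmentation(systolic_values, case_flags)` on one quantitative attribute item would return the segment boundaries to use when discretizing that item.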

Optionally, the processing, by the data processing model, of the value data of each attribute item further includes:

performing multi-factor analysis on the first target attribute item to obtain a second target attribute item, wherein the target attribute item comprises the second target attribute item;

wherein the multi-factor analysis comprises:

for each attribute item in the first target attribute item whose data type is the qualitative non-comparable type, generating a corresponding number of dummy variables according to the number of distinct values of the attribute item;

and generating, according to the dummy variables of the attribute item, a comparative coefficient corresponding to each value under the attribute item.
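Dummy-variable generation for a qualitative non-comparable attribute item can be sketched as below. The function name `make_dummies` and the choice of keeping one value as a reference level (so that k distinct values yield k - 1 dummy variables) are assumptions for illustration; the text only states that the number of dummy variables depends on the number of distinct values.

```python
def make_dummies(values, reference=None):
    """Encode a qualitative non-comparable attribute item as 0/1 dummy variables.

    One value is kept as the reference level (all dummies 0). This convention
    is an assumption, not something the text specifies.
    Returns (levels, rows): the value behind each dummy column, and one row of
    0/1 flags per sample.
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]
    levels = [c for c in categories if c != reference]
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Illustrative ethnicity values for four samples.
levels, rows = make_dummies(["Han", "Miao", "Hui", "Han"], reference="Han")
```

In a regression fitted on these columns, each dummy variable receives its own coefficient, which compares that value against the reference level.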

According to a second aspect of embodiments of the present disclosure, there is provided an etiology analysis device, the device including:

the acquisition module is used for acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises a plurality of attribute items of the sample and value data of the sample under each attribute item, and the symptoms of the cases in the case group are the same;

the determining module is used for determining the data type of each attribute item according to the value data under each attribute item;

the input module is used for inputting the information of each attribute item of the control group and the case group into a data processing model to obtain a target attribute item which is output by the data processing model and is related to the disease;

Optionally, the determining module includes:

the first determining submodule is used for determining that the data type of an attribute item whose value data has exactly two distinct values is a qualitative comparable type;

the second determining submodule is used for determining that the data type of an attribute item whose value data does not have exactly two distinct values, is numerical data and conforms to a normal distribution is a quantitative type;

the third determining submodule is used for determining that the data type of an attribute item whose value data does not have exactly two distinct values, is numerical data and does not conform to a normal distribution is the qualitative comparable type;

the fourth determining submodule is used for determining that the data type of an attribute item whose value data does not have exactly two distinct values, is non-numerical data and does not exist in the knowledge base is a qualitative non-comparable type;

and the fifth determining submodule is used for determining that the data type of an attribute item whose value data does not have exactly two distinct values, is non-numerical data and is stored in the knowledge base is the qualitative comparable type.

Optionally, the data processing model is for:

Optionally, the data processing model is further configured to:

wherein the multi-factor analysis comprises:

By adopting the technical scheme, the following technical effects can be at least achieved:

acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises multiple attribute items of the samples and value data of the samples under the attribute items; determining the data type of each attribute item according to the value data under each attribute item; in this way, the data type of each attribute item is determined automatically, without manually classifying and labeling each attribute item. The information of each attribute item of the control group and the case group, with its data type determined, is then input into a data processing model for processing, to obtain a target attribute item which is output by the data processing model and is related to the disease of the case group. This etiology analysis method requires no manual participation in the analysis process and realizes automated etiology analysis, which avoids the problems in the related art.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

Fig. 1 is a flow chart illustrating a method of etiology analysis according to an exemplary embodiment of the present disclosure.

Fig. 2 is a flow chart illustrating a method for determining a data type of an attribute item according to an exemplary embodiment of the present disclosure.

Fig. 3 is a flow chart illustrating another method for determining a data type of an attribute item according to an exemplary embodiment of the present disclosure.

Fig. 4 is a block diagram illustrating an etiology analysis device according to an exemplary embodiment of the present disclosure.

Fig. 5 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Etiology analysis is an important research direction in the field of medical research, and mainly explores the causes of diseases, the mutual effects among related factors and the influence of all factors on the occurrence and development of the diseases. That is, the etiology analysis is to investigate the disease cause of the patient after the disease is diagnosed.

In the related art, the etiology analysis process is mainly divided into three parts: general data analysis, single factor analysis and multi-factor analysis. When general data analysis is performed, scientific research personnel are required to label each variable word in the data to be analyzed, all variables in the data to be analyzed are classified based on labeling results, and then different algorithms are adopted for analyzing each type of variables. After the general data analysis is finished, single-factor analysis and multi-factor analysis are performed based on the result of the general data analysis. The classification requirements of the single-factor analysis and multi-factor analysis methods on the variables are different from the classification requirements of general data analysis on the variables, so that the variables need to be classified and labeled again when the single-factor analysis and the multi-factor analysis are carried out, and then the reclassified variables are analyzed according to different types.

In this way, a lot of manpower is needed to manually label each variable in the etiology analysis process. In the process of manually labeling a large number of variables, variable types are easily mislabeled, and mislabeled variable types make the etiology analysis result inaccurate. If the result of the etiology analysis is inaccurate, researchers need a large amount of time to check the labels of a large number of variables, and re-labeling all the variables also takes a large amount of time. Therefore, manual labeling leads to high labor cost and low efficiency in etiology analysis. When the result of the etiology analysis is inaccurate, the obtained data is worthless and cannot provide clues for targeted experimental designs such as randomized controlled trials (RCTs) and cohort studies.

In view of the above, embodiments of the present disclosure provide a method, an apparatus, a storage medium, and an electronic device for analyzing a cause of disease, so as to provide a new cause of disease analysis method and implement automated cause of disease analysis, thereby solving the problems in the related art.

Fig. 1 is a flow chart illustrating a method of etiology analysis according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the method comprises:

S101, acquiring sample data of a control group and sample data of a case group, wherein the sample data comprises multiple attribute items of the sample and value data of the sample under each attribute item, and the symptoms of the cases in the case group are the same.

When analyzing the etiology, it is first necessary to determine which disease is to be analyzed. Two sets of sample data are then selected: one set is the case group, which has been diagnosed with the disease, and the other is the control group, which has not been diagnosed with the disease. It is worth mentioning that the disease of each case in the case group is the same. Illustratively, the two sets of sample data selected to analyze the etiology of gastric cancer are a gastric cancer group and a non-gastric cancer group.

The two selected sets of sample data comprise multiple attribute items of each sample and the value data of each sample under the attribute items. For example, the attribute items included in the sample data may be the name, gender, ethnicity, age, education level, systolic blood pressure, diastolic blood pressure, blood glucose content, and the like of each sample; the value data of each sample under an attribute item refers to the specific value of that sample under that attribute item. For example, the value data of sample A in the control group under the name attribute item is Zhang San, and the value data of sample B in the case group under the name attribute item is Li Si; for another example, the value data of sample A under the systolic blood pressure attribute item is 110 mmHg, and the value data of sample B under the systolic blood pressure attribute item is 100 mmHg.

It will be understood by those of ordinary skill in the art that, when the etiology analysis is performed, every sample in the control group and the case group has the same set of attribute items. For example, sample A has 100 attribute items, and sample B has the same 100 attribute items.

In one implementable embodiment, sample data for the control group and the case group may be obtained from the database of a clinical data center (CDR).

S102, determining the data type of each attribute item according to the value data under each attribute item.

According to the value data under each attribute item, the data type of each attribute item can be determined. For example, if the attribute item is the ethnicity attribute item, the value data of the ethnicity attribute item of each sample may be Han, Miao, Hui, or the like; the data type of the ethnicity attribute item is determined according to the value data of the ethnicity attribute item of all the samples.

For another example, if the attribute item is the diastolic blood pressure, the value data of the diastolic blood pressure may be 120 mmHg, 100 mmHg, 80 mmHg, or the like; according to the value data of the diastolic blood pressure attribute item, the data type of the diastolic blood pressure attribute item can be determined.

The data type of each attribute item is determined according to the value data of each attribute item, scientific research personnel do not need to manually mark the data type of each attribute item, and therefore labor cost in the etiology analysis process is reduced.

S103, inputting the information of each attribute item of the control group and the case group into a data processing model to obtain a target attribute item which is output by the data processing model and is related to the disease.

After the data type of each attribute item of the control group and the case group is determined, the information of each attribute item is input into a data processing model for analysis processing, and a target attribute item which is output by the data processing model and is related to the disease condition of the case group is obtained.

The target attribute items are the result of the etiology analysis and are the risk factors that cause the disease. For example, in an etiology analysis of gastric cancer, the target attribute items that cause gastric cancer may be diet, staying up late, and the like.

It should be understood by those skilled in the art that, in the process of etiology analysis, different data processing algorithms are used to process the value data of attribute items of different data types. For example, in the related art, a normality test is performed on the value data of the systolic blood pressure attribute item, the diastolic blood pressure attribute item, and the age attribute item; then the T test is performed on the attribute items that pass the normality test, and the rank-sum test is performed on the attribute items that do not pass the normality test.

Therefore, in the present disclosure, the attribute item information input into the data processing model includes the value data of the attribute item and the data type of the attribute item. And the data processing model is used for selecting a corresponding data processing algorithm according to the data type of the attribute item to process the value data of the attribute item.

By adopting the method, the sample data of the control group and the sample data of the case group are obtained, wherein the sample data comprises multiple attribute items of the sample and value data of the sample under each attribute item; the data type of each attribute item is determined according to the value data under each attribute item; in this way, the data type of each attribute item is determined automatically, without manually classifying and labeling each attribute item. The information of each attribute item of the control group and the case group, with its data type determined, is then input into a data processing model for processing, to obtain a target attribute item which is output by the data processing model and is related to the disease of the case group. This etiology analysis method realizes automated etiology analysis without manual participation in the analysis process, which avoids the problems caused by manual labeling of attribute items in the related art.

In a possible implementation manner, as shown in fig. 2, the determining a data type of each attribute item according to the value data under each attribute item may include the following steps:

S201, determining that the data type of an attribute item whose value data has exactly two distinct values is a qualitative comparable type.

If the value data of the attribute item has exactly two distinct values, the data type of the attribute item is determined to be a qualitative comparable type. That is, if the value data of the attribute item is either A or B, the data type of the attribute item is a qualitative comparable type. For example, if the value data of the attribute item is 0 or 1, yes or no, or 0.01 or 0.02, the data type of the attribute item is determined to be a qualitative comparable type.

S202, determining that the data type of an attribute item whose value data does not have exactly two distinct values, is numerical data and conforms to a normal distribution is a quantitative type.

The types of the value data of the attribute items are not two, that is, the types of the value data are one, three or more.

Numeric data, often denoted by the letter N, is data consisting of digits, a decimal point, a sign, and the letter E.

If the value data of the attribute item does not have exactly two distinct values, is numerical data, and conforms to a normal distribution, the data type of the attribute item is determined to be a quantitative type. In an implementable embodiment, whether the value data of the attribute item conforms to a normal distribution can be verified through a normality test.

S203, determining that the data type of an attribute item whose value data does not have exactly two distinct values, is numerical data and does not conform to a normal distribution is the qualitative comparable type.

And if the types of the value data of the attribute item are not two, the value data is numerical data, and the value data does not conform to the normal distribution, determining the data type of the attribute item as a qualitative comparable type.

S204, determining that the data type of an attribute item whose value data does not have exactly two distinct values, is non-numerical data and does not exist in the knowledge base is a qualitative non-comparable type.

Non-numerical data refers to single-character data or character-string data that cannot be used in arithmetic, such as Chinese characters, English characters, numeric characters, ASCII characters, and the like.

The knowledge base refers to a medical knowledge base, which is established after text analysis, word segmentation, part-of-speech tagging and the like are performed on medical data. In the knowledge base, the value range of an attribute item is segmented into a plurality of value intervals, and each value interval corresponds to a value representing a conclusion, such as the conclusion words high, medium and low.

It will be understood by those of ordinary skill in the art that, in the medical field, the value intervals of some attribute items correspond to conclusion categories of those attribute items. For example, when the systolic blood pressure is in the 120-130 mmHg interval, the corresponding conclusion category is normal systolic pressure; when the systolic blood pressure is in the 130-140 mmHg interval, the corresponding conclusion category is mild high systolic pressure; when the systolic blood pressure is above 140 mmHg, the corresponding conclusion category is high systolic pressure.

If the value data of the attribute item does not have exactly two distinct values, is non-numerical data, and does not exist in the knowledge base, the data type of the attribute item is determined to be the qualitative non-comparable type.

It is worth mentioning that, in a possible case, the knowledge base may also be a well-maintained medical knowledge graph with a complex structure.
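The interval-to-conclusion mapping held in the knowledge base can be sketched as a simple lookup. The boundaries and conclusion words below come from the systolic blood pressure example; the names `SYSTOLIC_EDGES`, `SYSTOLIC_LABELS` and `systolic_conclusion`, the treatment of each boundary as belonging to the higher interval, and the `None` result below 120 mmHg are assumptions, since the text does not define those cases.

```python
from bisect import bisect_right

# Interval boundaries in mmHg, taken from the example in the text.
SYSTOLIC_EDGES = [120, 130, 140]
# One label per interval; values below 120 mmHg are not covered by the example.
SYSTOLIC_LABELS = [
    None,
    "normal systolic pressure",
    "mild high systolic pressure",
    "high systolic pressure",
]


def systolic_conclusion(mmhg):
    """Map a systolic pressure reading to its conclusion category."""
    return SYSTOLIC_LABELS[bisect_right(SYSTOLIC_EDGES, mmhg)]
```

Because the conclusion words produced this way are ordered (normal, mild high, high), an attribute item whose values all appear in such a mapping can be compared, which is why step S205 treats it as a qualitative comparable type.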

S205, determining that the value types of the value data are not two, the value data are non-numerical data, and the data type of the attribute item of which the value data are stored in the knowledge base is the qualitative comparable type.

In a possible case, if the value data of an attribute item is normal systolic pressure, mild high systolic pressure and high systolic pressure, and this value data is stored in the knowledge base, the data type of the attribute item is determined to be a qualitative comparable type.

It should be noted that, since the knowledge base directly affects the result of the determination of the data type of the attribute item in steps S204 and S205, in an implementable embodiment, manual proofreading can be performed on the result of the determination of the data type of the attribute item in steps S204 and S205. For example, the attribute items of the qualitative comparable type determined in step S205 may be adjusted to the qualitative non-comparable type. If the knowledge base is a well-maintained knowledge base, the determination results of the data types of the attribute items in steps S204 and S205 are also more accurate, and therefore the determination results of the two may not be adjusted.

It should be noted that, as to the determination results of the data types of the attribute items in steps S202 and S203, manual adjustment may be performed. For example, the data type of the attribute items of the qualitatively comparable type determined in step S203 may be readjusted to be of the quantitative type.

It should be noted here that the present disclosure does not limit the sequence of the above steps S201 to S205.

This method for classifying the data types of the attribute items replaces the manual classification and labeling of attribute items in the related art, and thereby reduces the labor cost.
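The decision rules of steps S201 to S205 can be sketched as a single function. This is a minimal illustration, assuming that the normality test is the Shapiro-Wilk test at a 0.05 threshold, that the knowledge base is a plain set of known values, and that "stored in the knowledge base" means every distinct value is found there. The names `classify_attribute`, `knowledge_base` and `alpha` are hypothetical.

```python
from scipy.stats import shapiro

QUANTITATIVE = "quantitative"
QUAL_COMPARABLE = "qualitative comparable"
QUAL_NON_COMPARABLE = "qualitative non-comparable"


def _is_number(value):
    """True if the value can be read as a number (digits, sign, point, letter E)."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def classify_attribute(values, knowledge_base=frozenset(), alpha=0.05):
    """Determine the data type of one attribute item from its value data."""
    distinct = set(values)
    if len(distinct) == 2:  # S201: exactly two distinct values
        return QUAL_COMPARABLE
    if all(_is_number(v) for v in values):
        numbers = [float(v) for v in values]
        # Shapiro-Wilk needs at least 3 samples and some spread (assumed test).
        is_normal = (
            len(distinct) > 1
            and len(numbers) >= 3
            and shapiro(numbers).pvalue > alpha
        )
        return QUANTITATIVE if is_normal else QUAL_COMPARABLE  # S202 / S203
    if distinct <= set(knowledge_base):  # S205: all values known
        return QUAL_COMPARABLE
    return QUAL_NON_COMPARABLE  # S204
```

Because the five conditions are mutually exclusive, the order of the checks only affects efficiency, which is consistent with the statement that the sequence of steps S201 to S205 is not limited.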

Fig. 3 is a flow chart illustrating another method for determining a data type of an attribute item according to an exemplary embodiment of the present disclosure. Fig. 3 shows the flow of a specific implementation of the method for determining the data type of the attribute item in Fig. 2.

In a possible implementation manner, the processing of the value data of each attribute item by the data processing model includes: for the attribute item whose data type is the quantitative type, performing a test through at least one of a rank-sum test, a T test and a T' test to obtain a first intermediate attribute item.

In the related art, when general data analysis is performed on sample data, all attribute items in the sample data need to be classified into two classes, and then analysis is performed based on the classification result. The general data analysis process includes performing a normality test on the value data of the first class of attribute items in the sample data, and performing a T test or a T' test on the value data of the attribute items that conform to a normal distribution; a rank-sum test is performed on the value data of the attribute items that do not conform to a normal distribution. A chi-square test is performed on the value data of the second class of attribute items in the sample data.

Therefore, for this general data analysis manner in the related art, the present disclosure defines the data type of the first class of attribute items as a quantitative type, and the data type of the second class of attribute items as a qualitative type. Then, for the attribute item whose data type is a quantitative type, a test is performed through at least one of a rank-sum test, a T test and a T' test to obtain a first intermediate attribute item; and for the attribute item whose data type is a qualitative type, a test is performed through a chi-square test algorithm to obtain a second intermediate attribute item.
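The choice of test by data type can be sketched with SciPy as below. This is an illustration under assumptions rather than the disclosed implementation: the T' test is taken to be Welch's unequal-variance t-test, Levene's test is used to choose between the T test and the T' test, Shapiro-Wilk is used as the normality test, and the function names and the `alpha` threshold are hypothetical.

```python
import numpy as np
from scipy import stats


def compare_quantitative(case, control, alpha=0.05):
    """Return (test_name, p_value) for one quantitative attribute item."""
    normal = (
        stats.shapiro(case).pvalue > alpha
        and stats.shapiro(control).pvalue > alpha
    )
    if not normal:
        # Non-normal data: Wilcoxon rank-sum test.
        return "rank-sum", stats.ranksums(case, control).pvalue
    # Normal data: T test if variances look equal, otherwise T' (Welch) test.
    equal_var = stats.levene(case, control).pvalue > alpha
    name = "t" if equal_var else "t'"
    return name, stats.ttest_ind(case, control, equal_var=equal_var).pvalue


def compare_qualitative(case, control):
    """Return the chi-square P value for one qualitative attribute item."""
    categories = sorted(set(case) | set(control))
    # Rows are the two groups, columns are the counts of each category.
    table = np.array(
        [[list(group).count(c) for c in categories] for group in (case, control)]
    )
    return stats.chi2_contingency(table)[1]
```

A caller would run one of these functions per attribute item, passing the case-group values and the control-group values of that item, and use the resulting P values to decide which items to carry forward.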

The first intermediate attribute item and the second intermediate attribute item characterize the results of general data analysis in the related art. The number of attribute items included in the first intermediate attribute item and the second intermediate attribute item is smaller than the number of attribute items included in the sample data of the control group and the case group.

And performing single-factor analysis on all the attribute items in the first intermediate attribute item and the second intermediate attribute item to obtain a first target attribute item related to the disease symptoms in the case group. The target attribute item includes the first target attribute item, that is, each attribute item in the result of the single-factor analysis can be the result of the etiology analysis.

The first intermediate attribute item and the second intermediate attribute item are obtained by determining the data type of each attribute item and then performing general data analysis on the attribute items of the quantitative type and the qualitative type. In this way, researchers do not need to label and classify each attribute item, which reduces labor cost compared with the related art.

When single-factor analysis is performed on the first intermediate attribute item and the second intermediate attribute item, because the attribute items of the qualitative type have already been further divided into the qualitative comparable type and the qualitative non-comparable type in the above steps, single-factor analysis can be performed directly on the attribute items of the quantitative type, the qualitative comparable type, and the qualitative non-comparable type in the first intermediate attribute item and the second intermediate attribute item. Unlike the related art, the method does not need to label and classify each attribute item again, which further reduces labor cost.

It should be noted that, in the related art, the process of single-factor analysis roughly includes discretization of attribute items of the quantitative type, Logistic regression analysis of attribute items of the qualitative comparable type, dummy coding analysis of attribute items of the qualitative non-comparable type, and the like.

It will be appreciated by those skilled in the art that single-factor analysis is based primarily on Logistic regression and the corresponding OR value and P value, and analyzes the effect of an individual attribute item on the occurrence of the disease. The OR value, an important index, measures the factor by which the disease risk increases when the value data of the attribute item increases by one unit.
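As a small illustration of the OR value, the sketch below uses two standard facts: in Logistic regression the OR for a one-unit increase equals the exponential of the coefficient, and for a binary attribute item it equals the cross-product ratio of the 2x2 table. The function names and the counts in the usage note are hypothetical and are not taken from the present disclosure.

```python
import math


def odds_ratio_from_coefficient(coefficient):
    """OR for a one-unit increase of the attribute item: exp(b)."""
    return math.exp(coefficient)


def odds_ratio_2x2(exposed_cases, exposed_controls,
                   unexposed_cases, unexposed_controls):
    """Cross-product odds ratio of a 2x2 case/control table."""
    return (exposed_cases * unexposed_controls) / (
        exposed_controls * unexposed_cases)
```

With hypothetical counts of 30 exposed cases, 10 exposed controls, 20 unexposed cases, and 40 unexposed controls, `odds_ratio_2x2(30, 10, 20, 40)` gives 6.0, which corresponds to a Logistic coefficient of `math.log(6.0)`.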

Therefore, when single-factor analysis is performed on the first intermediate attribute item and the second intermediate attribute item, applying segmented discretization to the attribute items of the quantitative type can give those attribute items better statistical significance (P value). The statistical significance of a result is an estimate of how well the result reflects the truth, that is, of how well it represents the population. The larger the P value, the less the association of the attribute items in the sample can be regarded as a reliable indicator of the association of the attribute items in the population. For example, a P value of 0.05 means that, if there were no association in the population, an association at least as strong as the one observed in the sample would still arise by chance in about five percent of samples.

In an implementation manner, the single-factor analysis includes performing segmentation discretization on the value data of each attribute item in the first intermediate attribute item, where a segmentation process in the segmentation discretization includes:

First, a value interval of the attribute item is determined according to the maximum value and the minimum value of the attribute item.

For example, if the maximum value of the age attribute item is 100 and the minimum value is 0, the value interval of the age attribute item is [0, 100].

Then, the value interval is segmented according to each hyper-parameter in a preset hyper-parameter space to obtain a set of segment interval sequences covering all segmentation cases.

Illustratively, if the hyper-parameter space is (2, 10), then the hyper-parameters in the hyper-parameter space are 2, 3, 4, 5, 6, 7, 8, 9, 10.

The value interval of the attribute item is segmented according to each hyper-parameter. Illustratively, according to the hyper-parameter 2, the value interval [0, 100] is divided into two segments, and all two-segment cases are obtained, such as [0, 1], [2, 100]; [0, 2], [3, 100]; [0, 3], [4, 100]; and so on (not all two-segment cases are listed here). Likewise, according to the hyper-parameter 3, the value interval [0, 100] is divided into three segments, and all three-segment cases are obtained, such as [0, 1], [2, 3], [4, 100]; [0, 2], [3, 4], [5, 100]; [0, 3], [4, 5], [6, 100]; and so on. Once the value interval of the attribute item has been segmented according to every hyper-parameter, the set of segment interval sequences covering all segmentation cases is obtained.
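The enumeration described above can be sketched as follows. The sketch assumes integer-valued attribute data and treats each hyper-parameter as a number of segments; the function names `enumerate_segmentations` and `all_segmentations` are hypothetical.

```python
from itertools import combinations


def enumerate_segmentations(lo, hi, k):
    """Yield every split of the integers lo..hi into k contiguous segments.

    A cut point c means that one segment ends at c and the next one
    starts at c + 1, so k segments need k - 1 distinct cut points.
    """
    for cuts in combinations(range(lo, hi), k - 1):
        starts = [lo] + [c + 1 for c in cuts]
        ends = list(cuts) + [hi]
        yield list(zip(starts, ends))


def all_segmentations(lo, hi, hyper_parameters):
    """Collect the segment interval sequences for every hyper-parameter."""
    return [segmentation
            for k in hyper_parameters
            for segmentation in enumerate_segmentations(lo, hi, k)]
```

For the interval [0, 100], the hyper-parameter 2 alone yields 100 candidate sequences, and the count grows combinatorially with the number of segments (on the order of 10^12 sequences for ten segments), so exhaustive enumeration is only practical for small intervals or small hyper-parameters.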

In a feasible implementation, segmenting the value interval of the attribute item according to each hyper-parameter to obtain the set of segment interval sequences covering all segmentation cases can be realized by using a Bayesian optimization algorithm.

Then, for each segment interval sequence in the segment interval sequence set, a P value representing the statistical significance of the segment interval sequence is calculated, and the segment interval sequence with the minimum P value is used as a segment result.

In an implementation, the P value of each segment interval sequence can be calculated as follows:

First, each segment interval sequence in the set of segment interval sequences is input into a Logistic regression model for analysis, and a variable coefficient and a variable standard error corresponding to each segment interval sequence are obtained.

In the related art, Logistic regression analysis is a generalized linear regression analysis model. It will be understood by those skilled in the art that when each segment interval sequence in the segment interval sequence set is input into the Logistic regression model for analysis, a set of variable coefficients and variable standard errors are obtained for each segment interval sequence.

Then, according to each set of variable coefficient and variable standard error, the corresponding Wald χ² value is calculated by the following formula: $\text{Wald } \chi^2 = (b_j / s_j)^2$, where $b_j$ denotes the variable coefficient and $s_j$ denotes the standard error of the variable. For each calculated Wald χ² value, the corresponding P value is obtained by looking up a table.
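The calculation can be sketched as below. Instead of the table lookup described in the text, the sketch computes the P value in closed form, assuming the Wald χ² value of a single coefficient follows a chi-square distribution with one degree of freedom; the function names are hypothetical.

```python
import math


def wald_chi2(coefficient, standard_error):
    """Wald chi-square value (b_j / s_j) ** 2 of one coefficient."""
    return (coefficient / standard_error) ** 2


def p_value_df1(wald):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For one degree of freedom, P(X > w) = erfc(sqrt(w / 2)), which
    replaces the table lookup (an assumption of this sketch).
    """
    return math.erfc(math.sqrt(wald / 2.0))
```

For example, a coefficient of 1.0 with a standard error of 0.5 gives a Wald χ² value of 4.0 and a P value slightly below 0.05.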

Then, the segment interval sequence with the smallest P value is selected as the segmentation result, and each segment interval in the segmentation result is converted into an ordinal value. Exemplarily, if the segmentation result is [0, 2], [3, 4], [5, 100], the segment interval [0, 2] of the attribute item is converted into 1, the segment interval [3, 4] is converted into 2, and the segment interval [5, 100] is converted into 3. In this way, the segmented discretization of the attribute items of the quantitative type is completed.
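A minimal sketch of the selection and conversion steps is given below. The names `select_segmentation` and `to_ordinal` are hypothetical, and `p_value_of` stands for any caller-supplied function that returns the P value of a segment interval sequence (for example, one based on a fitted Logistic regression model).

```python
def select_segmentation(candidates, p_value_of):
    """Return the segment interval sequence with the smallest P value."""
    return min(candidates, key=p_value_of)


def to_ordinal(value, segmentation):
    """Map a raw value to the 1-based index of the segment that contains it."""
    for index, (start, end) in enumerate(segmentation, start=1):
        if start <= value <= end:
            return index
    raise ValueError("value %r lies outside every segment" % (value,))
```

With the segmentation result [0, 2], [3, 4], [5, 100] from the example, an age of 3 is converted into 2 and an age of 60 is converted into 3.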

With this method, the attribute items of the quantitative type are subjected to segmented discretization so that they have statistical significance, which in turn makes the result of the single-factor analysis of these attribute items more accurate. The attribute items of the quantitative type in the single-factor analysis result can therefore better explain the disease of the case group.

In a possible implementation manner, the processing, by the data processing model, of the value data of each attribute item further includes: performing multi-factor analysis on the first target attribute item to obtain a second target attribute item, wherein the target attribute item comprises the second target attribute item.

After the single factor analysis, multi-factor analysis can be performed on the result of the single factor analysis to analyze the influence effect of the combination of the plurality of attribute items on the symptoms of the case group as a whole. That is, the multi-factor analysis is to analyze whether a combination of a plurality of attribute items is a cause of a disease.

In the related technology, the multi-factor analysis is mainly based on Logistic regression analysis of the influence degree of the combination of multiple attribute items on the occurrence of diseases. When multi-factor analysis is carried out on the attribute items, the attribute items of the qualitative incomparable type need to be subjected to dummy coding, and then each dummy coding is input into a Logistic regression model for analysis.

However, this method may affect the result of the multi-factor analysis, for example, a certain dummy code of the attribute item may be used as the attribute item related to the disease condition of the case group, and another dummy code may be used as the attribute item unrelated to the disease condition of the case group.

In view of this, in the present disclosure, performing the multi-factor analysis on attribute items of the qualitative non-comparable type includes:

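First, dummy variables are generated for the attribute item according to the number of distinct values of its value data. For example, if the value data of the attribute item takes n distinct values, n dummy variables are generated.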
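The generation of n dummy variables for n distinct values can be sketched as follows. The function name `dummy_encode` and the blood-type values in the usage note are hypothetical illustrations, not part of the present disclosure.

```python
def dummy_encode(values):
    """Generate one 0/1 dummy variable per distinct value of an attribute item.

    Returns the sorted list of distinct values (levels) and, for each
    sample, a row holding one indicator per level.
    """
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels]
            for value in values]
    return levels, rows
```

For the samples `["A", "B", "O", "A"]`, three dummy variables are generated and the first sample is encoded as `[1, 0, 0]`.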

Next, a comparable coefficient corresponding to each value of the attribute item is generated according to each dummy variable of the attribute item. Specifically, the Logistic model coefficient of each dummy code is generated according to each dummy variable of the attribute item.

Then, the comparable coefficient (Logistic model coefficient) of each value of the attribute item is substituted into the following formula to obtain the corresponding Wald χ² value: $\text{Wald } \chi^2 = (Q\beta)^T \left[ Q \, \mathrm{var}(\beta) \, Q^T \right]^{-1} (Q\beta)$.

It should be noted that the hypothesis underlying this formula is $\beta_0 = \beta_1 = \dots = \beta_{n-1} = 0$, where $\beta_0, \beta_1, \dots, \beta_{n-1}$ denote the Logistic model coefficients corresponding to the respective dummy variables.

It will be understood by those skilled in the art that a hypothesis may be set when performing Logistic regression analysis, and that different hypotheses lead to different derived Wald χ² formulas.

In the above Wald χ² formula, β denotes the dummy variable coefficients, var(β) denotes the variance of the coefficients (obtained from their standard errors), T denotes the transpose of a matrix, and Q is a matrix with n-1 rows and n columns whose first column is all 0, where n denotes the number of distinct values of the attribute item.

The corresponding P value is then obtained by table lookup according to the calculated Wald χ² value, and whether to exclude the attribute item of the qualitative non-comparable type from the multi-factor analysis is determined according to the obtained P value. For example, assuming that the preset threshold is 0.05, the attribute item is excluded if the obtained P value is greater than 0.05.

In this manner, all the value data of an attribute item of the qualitative non-comparable type are treated as a whole and a single overall P value is calculated, which avoids the problem caused in the related art by calculating a separate P value for each value of the attribute item.
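The overall test can be sketched for an attribute item with three distinct values. The sketch rests on several assumptions that go beyond the text: Q is taken to be a zero first column followed by an identity block (the text only gives its shape and its zero first column), the middle matrix is inverted as in the standard Wald statistic, var(β) is passed in as a full covariance matrix, and the P value is computed in closed form for two degrees of freedom instead of by table lookup. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(column) for column in zip(*m)]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def joint_wald_three_levels(beta, cov):
    """Overall Wald chi-square value and P value for n = 3 coefficients.

    beta is [b0, b1, b2]; cov is the 3x3 covariance matrix of beta.
    """
    # Assumed form of Q: n - 1 = 2 rows, n = 3 columns, first column all 0.
    q = [[0, 1, 0],
         [0, 0, 1]]
    q_beta = matmul(q, [[b] for b in beta])                    # 2 x 1
    middle = inverse_2x2(matmul(matmul(q, cov), transpose(q)))  # 2 x 2
    wald = matmul(matmul(transpose(q_beta), middle), q_beta)[0][0]
    # Chi-square upper tail with 2 degrees of freedom is exp(-w / 2).
    p_value = math.exp(-wald / 2.0)
    return wald, p_value
```

With hypothetical coefficients `[0.5, 1.0, 2.0]` and a diagonal covariance matrix with entries 1.0, 0.25, and 1.0, the overall Wald χ² value is 8.0 and the P value is below the example threshold of 0.05, so the attribute item as a whole would be retained.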

Based on the same inventive concept, an embodiment of the present disclosure further provides an etiology analysis device. As shown in fig. 4, the device 400 includes:

the obtaining module 410 is configured to obtain sample data of a control group and sample data of a case group, where the sample data includes multiple attribute items of a sample and value data of the sample under each attribute item, and a disease condition of each case in the case group is the same;

the determining module 420 is configured to determine a data type of each attribute item according to the value data of each attribute item;

the input module 430 is configured to input information of each attribute item of the control group and the case group into a data processing model, so as to obtain a target attribute item output by the data processing model and related to the disease condition;

With this device, the sample data of the control group and the sample data of the case group are obtained, wherein the sample data comprises multiple attribute items of the samples and the value data of the samples under each attribute item, and the data type of each attribute item is determined according to the value data under that attribute item. In this way, the data type of each attribute item is determined automatically, without manually classifying and labeling each attribute item. The information of each attribute item of the control group and the case group, with its data type determined, is then input into the data processing model for processing to obtain the target attribute item that is output by the data processing model and is related to the disease of the case group. This etiology analysis manner realizes automated analysis of etiology without manual participation in the analysis process, and thereby avoids the problems caused in the related art by manual labeling of attribute items.

Optionally, the determining module 420 includes:

Optionally, the data processing model is for:

Optionally, the data processing model is further configured to:

wherein the multi-factor analysis comprises:

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 5 is a block diagram illustrating an electronic device 700 according to an example embodiment. As shown in fig. 5, the electronic device 700 may include: a processor 701 and a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.

The processor 701 is configured to control the overall operation of the electronic device 700 so as to complete all or part of the steps of the above-mentioned etiology analysis method. The memory 702 is used to store various types of data to support operation of the electronic device 700, such as instructions for any application or method operating on the electronic device 700 and application-related data, for example contact data, transmitted and received messages, pictures, audio, and video. The memory 702 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 703 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 702 or transmitted through the communication component 705. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, such as a keyboard, a mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or the like, or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 705 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.

In an exemplary embodiment, the electronic device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described etiology analysis method.

In another exemplary embodiment, there is also provided a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the above-described etiological analysis method. For example, the computer readable storage medium may be the memory 702 described above including program instructions that are executable by the processor 701 of the electronic device 700 to perform the etiology analysis method described above.

In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described method of etiological analysis when executed by the programmable apparatus.

The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.

It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.

In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims

1. A method of etiological analysis, said method comprising:

2. The method according to claim 1, wherein the determining a data type of each attribute item according to the value data under each attribute item includes:

3. The method of claim 2, wherein the processing of the value data of each attribute item by the data processing model comprises:

4. The method according to claim 3, wherein the single-factor analysis includes performing segmentation discretization on value data of each attribute item in the first intermediate attribute item, and a segmentation process in the segmentation discretization includes:

5. The method according to claim 3 or 4, wherein the processing of the value data of each attribute item by the data processing model further comprises:

wherein the multi-factor analysis comprises:

6. An etiological analysis device, the device comprising:

7. The apparatus of claim 6, wherein the determining module comprises:

8. The apparatus of claim 7, wherein the data processing model is configured to:

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.

10. An electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 5.