CN110223770A

CN110223770A - Gastroenterology electronic data analysis method

Info

Publication number: CN110223770A
Application number: CN201910454851.6A
Authority: CN
Inventors: 李霄剑; 王亚雷; 丁帅; 孙斌; 张宏敏; 李杨
Original assignee: Hefei University of Technology; First Affiliated Hospital of Anhui Medical University
Current assignee: Hefei University of Technology; First Affiliated Hospital of Anhui Medical University
Priority date: 2019-05-29
Filing date: 2019-05-29
Publication date: 2019-09-10

Abstract

An embodiment of the present invention discloses a gastroenterology electronic data analysis method, comprising: obtaining a colonoscopy result and a pathological examination conclusion, and comparing the two to obtain an intestinal cancer analysis result; constructing, from the statistical features of the intestinal cancer analysis result, a non-homogeneous Poisson process (NHPP) disease analysis reliability growth model; and training a support vector machine (SVM) model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model at each time as the penalty factor so that it optimizes the SVM model jointly with the loss function, and performing intestinal cancer judgment. The embodiment can analyze gastroenterology electronic data and obtain a more accurate judgment result.

Description

Gastroenterology electronic data analysis method

Technical field

The present invention relates to the field of data analysis, and in particular to a gastroenterology electronic data analysis method.

Background

Medical data such as intestinal examination results reflect the condition of the intestinal tract, and analyzing such data makes it possible to assess the intestinal cancer situation. How to analyze medical data such as intestinal examination results is a technical problem that currently needs to be solved.

Summary of the invention

Embodiments of the present invention provide a gastroenterology electronic data analysis method that can analyze gastroenterology electronic data and obtain an analysis result.

The embodiments of the present invention adopt the following technical solution:

A gastroenterology electronic data analysis method, comprising:

obtaining a colonoscopy result and a pathological examination conclusion, and comparing the colonoscopy result with the pathological examination conclusion to obtain an intestinal cancer analysis result;

constructing, from the statistical features of the intestinal cancer analysis result, a non-homogeneous Poisson process (NHPP) disease analysis reliability growth model; and training a support vector machine (SVM) model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model at each time as the penalty factor so that it optimizes the SVM model jointly with the loss function, and performing intestinal cancer judgment.

Optionally, obtaining the colonoscopy result and the pathological examination conclusion, and comparing the colonoscopy result with the pathological examination conclusion to obtain the intestinal cancer analysis result, comprises:

extracting feature information from the colonoscopy report and the pathological examination report of the same patient, and splicing and integrating the reports according to the corresponding feature information; during data integration, the pathological examination result prevails; wherein extracting feature information comprises: extracting text-type features, extracting temporal features, and extracting basic patient information features;

integrating the text-type features to construct a feature space, and extracting from the pathological examination report the descriptors of the corresponding disease category as labels (Label) to construct an output space (Label Space); setting numeric coding rules for the attribute values of each feature in the feature space and for the output space; and, using the numeric coding rules so set, converting the integrated report data into a numeric representation, that is, into numeric data (Numerical Data) that computers and algorithm models can recognize and learn from;
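The numeric coding step can be sketched as follows. This is only a minimal illustration, assuming a simple first-seen integer coding rule; the field names (`finding`, `site`, `pathology_label`) and the sample values are hypothetical and are not taken from the patent.

```python
def build_codebook(values):
    """Map each distinct attribute value to an integer code, in first-seen order."""
    codebook = {}
    for value in values:
        if value not in codebook:
            codebook[value] = len(codebook)
    return codebook


def encode_reports(reports, feature_names, label_name):
    """Encode integrated reports into numeric feature vectors X and labels y."""
    # One coding rule per feature of the feature space (hypothetical rule).
    feature_codebooks = {
        name: build_codebook(r[name] for r in reports) for name in feature_names
    }
    # One coding rule for the output (label) space.
    label_codebook = build_codebook(r[label_name] for r in reports)
    X = [[feature_codebooks[name][r[name]] for name in feature_names] for r in reports]
    y = [label_codebook[r[label_name]] for r in reports]
    return X, y, feature_codebooks, label_codebook


# Hypothetical integrated reports; the label comes from the pathology report.
reports = [
    {"finding": "polyp", "site": "sigmoid colon", "pathology_label": "adenoma"},
    {"finding": "mass", "site": "rectum", "pathology_label": "adenocarcinoma"},
    {"finding": "polyp", "site": "rectum", "pathology_label": "adenoma"},
]
X, y, feature_codebooks, label_codebook = encode_reports(
    reports, ["finding", "site"], "pathology_label"
)
```

Keeping the codebooks alongside `X` and `y` lets the same coding rules be reapplied to later reports.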

after the data have been converted into numeric form, taking the pathological examination conclusion as the final conclusion, comparing the colonoscopy conclusion with the pathological examination conclusion, and, with the month as the time division, counting the intestinal cancer analysis results for each month.
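The monthly comparison can be sketched as below. It assumes that reports are matched by admission number and that each report has already been reduced to a boolean cancer conclusion; the record layout and the sample data are hypothetical.

```python
from collections import defaultdict


def monthly_error_counts(colonoscopy, pathology):
    """Count, per month, the compared reports and the colonoscopy conclusions
    that disagree with pathology (pathology is taken as the final conclusion)."""
    counts = defaultdict(lambda: {"total": 0, "errors": 0})
    for admission_no, col in colonoscopy.items():
        path = pathology.get(admission_no)
        if path is None:
            continue  # no pathology report to compare against
        month = col["date"][:7]  # assumes ISO dates, e.g. '2018-03-05' -> '2018-03'
        counts[month]["total"] += 1
        if col["cancer"] != path["cancer"]:
            counts[month]["errors"] += 1
    return dict(counts)


# Hypothetical records keyed by admission number.
colonoscopy = {
    "A001": {"date": "2018-03-05", "cancer": True},
    "A002": {"date": "2018-03-20", "cancer": False},
    "A003": {"date": "2018-04-02", "cancer": False},
    "A004": {"date": "2018-04-11", "cancer": True},
}
pathology = {
    "A001": {"cancer": True},
    "A002": {"cancer": True},
    "A003": {"cancer": False},
}
stats = monthly_error_counts(colonoscopy, pathology)
```

The per-month error counts are the kind of statistic that the reliability growth model is then fitted to.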

Optionally, extracting text-type features comprises: the text processing applied to colonoscopy text data is referred to as medical language processing; it mainly performs word segmentation and noise removal on the colonoscopy text data, and extracts positive descriptions of specific diseases as feature information;

extracting temporal features comprises: the examination date in an examination report is string-type data divided by separators and includes the date; the month is extracted from it as the temporal feature;

extracting basic patient information features comprises: the extracted features include gender, age, and occupation.
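The temporal and basic-information features can be sketched as follows, assuming the date string is ordered year, month, day and separated by `-`, `/`, `.` or whitespace. The gender mapping is a hypothetical example of unifying expressions that differ between data sources.

```python
import re

# Hypothetical mapping used to unify gender values from different sources.
GENDER_MAP = {"m": "M", "male": "M", "男": "M", "f": "F", "female": "F", "女": "F"}


def extract_month(date_string):
    """Return the month (1-12) from a separator-delimited date string."""
    parts = re.split(r"[-/.\s]+", date_string.strip())
    return int(parts[1])  # assumes year, month, day order


def normalize_gender(raw):
    """Map differently written gender values onto one representation."""
    return GENDER_MAP.get(raw.strip().lower(), "unknown")


month_a = extract_month("2018-03-05")
month_b = extract_month("2018/11/20")
```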

Optionally, constructing, from the statistical features of the intestinal cancer analysis result, the non-homogeneous Poisson process (NHPP) disease analysis reliability growth model comprises:

setting the non-homogeneous Poisson process NHPP;

constructing the framework of the NHPP calculation result reliability growth model;

constructing an NHPP disease analysis reliability growth model that conforms to the actual situation;

estimating the parameters of the NHPP disease analysis reliability growth model that conforms to the actual situation.

Optionally, setting the non-homogeneous Poisson process NHPP comprises:

setting a random counting process $\{N(t), t \ge 0\}$ that satisfies conditions A1 to A4, where $N$ denotes a counting process that represents a count and $t$ denotes time:

A1: $N(0) = 0$;

A2: $\{N(t), t \ge 0\}$ is an independent-increment process;

A3: $P\{N(t+\Delta t) - N(t) = 1\} = \lambda(t)\,\Delta t + o(\Delta t)$;

where $\lambda(t)$ is the intensity function of the non-homogeneous Poisson process, $\Delta t$ is a time interval, and $o(\Delta t)$ is a higher-order infinitesimal of $\Delta t$;

A4: $P\{N(t+\Delta t) - N(t) \ge 2\} = o(\Delta t)$;

Then $\{N(t), t \ge 0\}$ is called a non-homogeneous Poisson process with intensity $\lambda(t)$. When $\lambda(t) = \lambda$ is constant, the non-homogeneous Poisson process is the ordinary homogeneous Poisson process.

The probability distribution of the non-homogeneous Poisson process is as follows:

$$P\{N(t+s) - N(t) = n\} = \frac{[m(t+s) - m(t)]^{n}}{n!}\,\exp\{-[m(t+s) - m(t)]\}, \quad n = 0, 1, 2, \ldots$$

where $s$ denotes a further time, with the same meaning as $t$, and $m(t) = E[N(t)]$ is the mean function;

$N(t)$: the cumulative number of analysis errors found in the period $[0, t]$;

$m(t)$: the expected cumulative number of misdiagnoses in the period $[0, t]$, $m(t) = E[N(t)]$;

$x(t)$: the number of analysis errors detected up to time $t$ that are repeated misdiagnoses;

$a(t)$: the misdiagnosis total function, that is, the total number of disease analysis errors counted in the cases up to time $t$;

$a_0$: the number of analysis errors present in the cases when counting starts;

$b$: the analysis error rate, that is, the probability that each misdiagnosis is counted in the cases;

$p(t)$: the misdiagnosis repetition rate function, that is, the probability at time $t$ that each detected analysis error is a repeat;

$R(x \mid t)$: the disease reliability function, that is, the reliability of disease analysis in the period from time $t$ to $t + x$.
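As a small numerical illustration of the NHPP increment distribution, the sketch below evaluates the standard result that $N(t+s) - N(t)$ is Poisson distributed with mean $m(t+s) - m(t)$. The linear mean function used in the example is a hypothetical choice for checking the code, not the model of the patent.

```python
import math


def nhpp_increment_pmf(n, mean_fn, t, s):
    """P{N(t+s) - N(t) = n} for an NHPP with mean function mean_fn."""
    mu = mean_fn(t + s) - mean_fn(t)  # expected number of events in (t, t+s]
    return mu ** n * math.exp(-mu) / math.factorial(n)


def linear_mean(t):
    """Hypothetical mean function m(t) = 2t, i.e. constant intensity 2."""
    return 2.0 * t


p_zero = nhpp_increment_pmf(0, linear_mean, 0.0, 1.0)
total = sum(nhpp_increment_pmf(n, linear_mean, 0.0, 1.0) for n in range(50))
```

With a constant intensity the process is homogeneous, so `p_zero` equals `exp(-2)` and the probabilities sum to one up to rounding.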

Optionally, constructing the framework of the NHPP calculation result reliability growth model comprises:

on the basis of the probability distribution of the non-homogeneous Poisson process, assuming:

B1: the cumulative number of analysis errors $N(t)$ up to time $t$ follows a Poisson process with mean function $m(t)$. The expected number of misdiagnoses occurring in any time interval from $t$ to $t + \Delta t$ is proportional to the number of misdiagnoses remaining at time $t$.

B2: the number of misdiagnoses differs between environments and between times; the total number of analysis errors changes over time.

B3: the same misdiagnosis situation may occur in different time periods; the misdiagnosis repetition rate is a function of time.

B4: the analysis errors in the cases are mutually independent, and the severity of the consequences caused by each analysis error differs;

From assumption B1, B5 holds:

B5: $m(t+\Delta t) - m(t) = b\,(a(t) - x(t))\,\Delta t + o(\Delta t)$

where $(a(t) - x(t))$ is the number of analysis errors up to time $t$ that are not repeated misdiagnoses.

This gives the differential equation B6:

B6: $\dfrac{dm(t)}{dt} = b\,(a(t) - x(t))$

From assumption B3, B7 holds:

B7: $\dfrac{dx(t)}{dt} = p(t)\,\dfrac{dm(t)}{dt}$

where $\dfrac{dx(t)}{dt}$ denotes the derivative of $x(t)$ with respect to $t$; the same notation is used below.

From equations B6 and B7, B8 is obtained:

B8: $\dfrac{dx(t)}{dt} = b\,p(t)\,(a(t) - x(t))$

The initial conditions of equations B6 and B7 are B9 and B10, B9: $m(0) = 0$, B10: $x(0) = 0$;

From formulas B8 and B10, B11 is obtained:

B11: $x(t) = \exp(-B(t)) \displaystyle\int_0^t b\,p(u)\,a(u)\,\exp(B(u))\,du$, with $B(t) = \displaystyle\int_0^t b\,p(u)\,du$

where $\exp$ denotes the exponential function, $t$ and $u$ are different symbols for time, and $dt$, $du$ denote integration;

From formulas B6 and B9, the mean function of the cumulative analysis errors of the model is solved as B12:

B12: $m(t) = \displaystyle\int_0^t b\,(a(u) - x(u))\,du$

Since the cumulative number of analysis errors $N(t)$ up to time $t$ follows a non-homogeneous Poisson distribution with mean $m(t)$, B13 holds:

B13: $P\{N(t) = n\} = \dfrac{[m(t)]^{n}}{n!}\,\exp[-m(t)], \quad n = 0, 1, 2, \ldots$

According to the properties of the non-homogeneous Poisson distribution, the reliability function is B14:

B14: $R(x \mid t) = 1 - P\{N(t+x) - N(t) = 0\} = 1 - \exp[-(m(t+x) - m(t))]$.
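Formula B14 can be evaluated directly once a mean function $m(t)$ is available. The sketch below implements B14 exactly as it is written in this text; the mean function used in the example is the Goel-Okumoto form $a(1 - e^{-bt})$ with made-up parameters, chosen only as a stand-in and not the model derived here.

```python
import math


def reliability_b14(mean_fn, t, x):
    """R(x|t) as written in formula B14: 1 - exp(-(m(t+x) - m(t)))."""
    return 1.0 - math.exp(-(mean_fn(t + x) - mean_fn(t)))


def placeholder_mean(t, a=100.0, b=0.1):
    """Hypothetical stand-in mean function (Goel-Okumoto form), for illustration only."""
    return a * (1.0 - math.exp(-b * t))


r_one_month = reliability_b14(placeholder_mean, 6.0, 1.0)
r_two_months = reliability_b14(placeholder_mean, 6.0, 2.0)
```

Any other non-decreasing mean function, such as one obtained from B12, can be passed in place of `placeholder_mean`.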

Optionally, constructing the NHPP disease analysis reliability growth model that conforms to the actual situation comprises:

describing the analysis-error total function by the following function, B15: $a(t) = a_0(1 + \alpha t)$

where $\alpha < 0$, and the magnitude of $\alpha$ determines how quickly the misdiagnosis total function decreases;

the misdiagnosis repetition rate function $p(t)$ should satisfy the following conditions: $p(t) \in [0, 1]$, $p(t)$ is a decreasing function, and $p(t) \to 0$ as $t \to \infty$. Therefore, the following function can be chosen to define the misdiagnosis repetition rate function, B16:

where $k > 0$, and the magnitude of $k$ determines how quickly the misdiagnosis repetition rate changes;

substituting formulas B15 and B16 into formulas B11 and B12 gives:

B17:

B18:

After the mean function $m(t)$ of the cumulative number of analysis errors is obtained, the parameters in $m(t)$ can be obtained by a parameter estimation method.
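Where a closed form is not at hand, the mean function can also be approximated numerically. The sketch below is an illustration under stated assumptions, not the patent's formulas B17 and B18: it assumes the framework equations take the form $dm/dt = b\,(a(t) - x(t))$ and $dx/dt = p(t)\,dm/dt$ with $m(0) = x(0) = 0$, uses $a(t) = a_0(1 + \alpha t)$ from B15, and takes $p(t) = e^{-kt}$ as one hypothetical function that meets the stated conditions on $p(t)$. Forward Euler integration is used for simplicity.

```python
import math


def solve_mean_function(a0, alpha, b, p, t_end, steps=10000):
    """Forward-Euler integration of the assumed system
    dm/dt = b * (a(t) - x(t)),  dx/dt = p(t) * dm/dt,  m(0) = x(0) = 0,
    with a(t) = a0 * (1 + alpha * t). Returns (m(t_end), x(t_end))."""
    dt = t_end / steps
    m = 0.0
    x = 0.0
    t = 0.0
    for _ in range(steps):
        a_t = a0 * (1.0 + alpha * t)
        dm = b * (a_t - x) * dt   # expected new errors in [t, t + dt]
        m += dm
        x += p(t) * dm            # share of those errors that are repeats
        t += dt
    return m, x


def assumed_repetition_rate(t, k=0.3):
    """Hypothetical p(t): lies in [0, 1], decreases, and tends to 0 as t grows."""
    return math.exp(-k * t)


# Hypothetical parameter values over a 12-month horizon.
m_12, x_12 = solve_mean_function(
    a0=50.0, alpha=-0.01, b=0.05, p=assumed_repetition_rate, t_end=12.0
)
```

A smaller step size or a higher-order integrator would reduce the discretization error if more accuracy is needed.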

Optionally, estimating the parameters of the NHPP disease analysis reliability growth model that conforms to the actual situation comprises:

estimating the parameters in formula B18 by maximum likelihood estimation; from the misdiagnosis mean function $m(t)$, the likelihood function B19 is obtained:

B19: $L(\text{parameters} \mid (t_i, n_i)) = \displaystyle\prod_{i=1}^{J} \frac{[m(t_i) - m(t_{i-1})]^{\,n_i - n_{i-1}}}{(n_i - n_{i-1})!}\,\exp\{-[m(t_i) - m(t_{i-1})]\}$

where the pairs $(t_i, n_i)$, $i = 1, \ldots, J$, occur together, with $t_0 = 0$ and $n_0 = 0$; $n_i$ is the total number of misdiagnoses counted in the cases up to time $t_i$; $t_i$ is the time at which the counted total of misdiagnoses is $n_i$; $L(\text{parameters} \mid (t_i, n_i))$ is the likelihood function, $\prod$ is the product symbol, and $\exp$ denotes the exponential function;

taking the natural logarithm of formula B19 gives B20:

B20: $\ln L = \displaystyle\sum_{i=1}^{J} \Big\{ (n_i - n_{i-1}) \ln[m(t_i) - m(t_{i-1})] - [m(t_i) - m(t_{i-1})] - \ln[(n_i - n_{i-1})!] \Big\}$

where $\ln$ denotes the natural logarithm, $\sum$ denotes summation, and $!$ denotes the factorial;

the estimate of each parameter is obtained by differentiating formula B20 with respect to that parameter and setting the derivative to zero.
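The maximum likelihood step can be illustrated with the standard grouped-data NHPP log-likelihood, in which the count in each interval is Poisson with mean $m(t_i) - m(t_{i-1})$. In the sketch below the one-parameter mean function $m(t) = \lambda t$ and the grid search are hypothetical simplifications used to keep the example checkable; the data are made up.

```python
import math


def log_likelihood(mean_fn, times, cumulative_counts):
    """Grouped-data NHPP log-likelihood for pairs (t_i, n_i), with t_0 = 0, n_0 = 0."""
    ll = 0.0
    prev_t, prev_n = 0.0, 0
    for t_i, n_i in zip(times, cumulative_counts):
        mu = mean_fn(t_i) - mean_fn(prev_t)  # expected errors in (t_{i-1}, t_i]
        k = n_i - prev_n                     # observed errors in that interval
        ll += k * math.log(mu) - mu - math.lgamma(k + 1)  # lgamma(k+1) = ln(k!)
        prev_t, prev_n = t_i, n_i
    return ll


# Hypothetical monthly data: cumulative misdiagnosis counts at months 1..4.
times = [1.0, 2.0, 3.0, 4.0]
cumulative_counts = [3, 5, 9, 12]

# Grid search over the single rate parameter of the stand-in model m(t) = rate * t.
candidates = [i / 10 for i in range(1, 101)]
best_rate = max(
    candidates,
    key=lambda rate: log_likelihood(lambda t: rate * t, times, cumulative_counts),
)
```

For this stand-in model the maximizer is the total count divided by the total time, which gives a simple check on the implementation; a multi-parameter mean function would need a numerical optimizer instead of a grid.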

Optionally, training the support vector machine (SVM) model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model at each time as the penalty factor so that it optimizes the SVM model jointly with the loss function, and performing intestinal cancer judgment, comprises:

training the SVM model with the learning algorithm of the linearly separable support vector machine, that is, the maximum margin method, to find the optimal separating hyperplane. The algorithm is described as follows:

Input: a linearly separable training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} = \mathbb{R}^n$ and $y_i \in \mathcal{Y} = \{-1, +1\}$; $x_i$ is the $i$-th feature vector, also called an instance, and $y_i$ is the class label of $x_i$; when $y_i = +1$, $x_i$ is called a positive example; when $y_i = -1$, $x_i$ is called a negative example; $(x_i, y_i)$ is called a sample point;

Output: the maximum margin separating hyperplane and the classification decision function;

(1) construct and solve the constrained optimization problem:

$\min_{w,b} \ \dfrac{1}{2}\|w\|^2$    B21

s.t. $y_i(w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, N$    B22

and obtain the optimal solution $w^*$, $b^*$; $\min$ denotes minimization, $w$ and $b$ are the two parameters that define the maximum margin separating hyperplane, and $x_i$ is an instance in the training data set;

(2) the separating hyperplane is thus obtained as:

$w^* \cdot x + b^* = 0$    B23

and the classification decision function is:

$f(x) = \operatorname{sign}(w^* \cdot x + b^*)$    B24

where $\|w\|$ is the $L_2$ norm of $w$, and $(w, b)$ is a given hyperplane;

by training the support vector machine model with the above algorithm, the optimal separating hyperplane is found and the feature space is divided into two parts, one for the positive class and one for the negative class, so that the data set is classified.
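The constraint B22 and the decision function B24 can be checked for a given hyperplane with a few lines of code. The sketch below does not solve the optimization problem; it only verifies a candidate $(w, b)$ on a small made-up data set. Treating a score of exactly zero as the positive class is an assumption of this sketch.

```python
def score(w, b, x):
    """Signed value w . x + b for one instance."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def decision(w, b, x):
    """Classification decision function f(x) = sign(w . x + b), formula B24.
    A score of exactly 0 is mapped to +1 here (assumption of this sketch)."""
    return 1 if score(w, b, x) >= 0 else -1


def satisfies_b22(w, b, samples):
    """Check y_i * (w . x_i + b) - 1 >= 0 for every sample point (formula B22)."""
    return all(y * score(w, b, x) - 1 >= 0 for x, y in samples)


# Made-up linearly separable sample points and a candidate hyperplane.
samples = [((3.0, 3.0), 1), ((4.0, 3.0), 1), ((1.0, 1.0), -1)]
w, b = (0.5, 0.5), -2.0

feasible = satisfies_b22(w, b, samples)
predictions = [decision(w, b, x) for x, _ in samples]
```

For this data set the candidate hyperplane satisfies every margin constraint, with the first and last points lying exactly on the margin.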

Optionally, after the hyperplane is obtained, the method further comprises:

introducing a slack variable $\xi_i \ge 0$ for each sample point $(x_i, y_i)$, so that the functional margin plus the slack variable is greater than or equal to 1; the constraint then becomes $y_i(w \cdot x_i + b) \ge 1 - \xi_i$. At the same time, a cost $\xi_i$ is paid for each slack variable $\xi_i$, and the objective function changes from the original $\dfrac{1}{2}\|w\|^2$ to $\dfrac{1}{2}\|w\|^2 + C \displaystyle\sum_{i=1}^{N} \xi_i$, where $N$ takes values in the set of non-negative integers;

where $C > 0$ is called the penalty parameter, and its value differs with the problem setting: a large $C$ increases the penalty for misclassification, and a small $C$ reduces it. The modified objective function has two aims, keeping the margin as large as possible and keeping the number of misclassified points as small as possible, and $C$ is the variable that balances the two;

the choice of $C$ is combined with the disease reliability problem: the value of $C$ is the value of the disease reliability function computed for each time period, and the objective function is the soft-margin objective above with $C$ replaced by that value. When the reliability computed from the data is large, the penalty for disease misclassification increases; when the reliability is small, the penalty decreases. In this way the NHPP disease analysis reliability growth model is applied to the optimization of the basic SVM model, and a multi-granularity colonoscopy report analysis model that meets the requirements is trained.
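One way to read the reliability-based penalty is as a per-sample weight on the slack term, where each sample takes the reliability value of its time period. The sketch below is a minimal illustration under that assumption: it minimizes $\frac{1}{2}\|w\|^2 + \sum_i C_i \max(0, 1 - y_i(w \cdot x_i + b))$ by plain subgradient descent, which is a technique chosen for this example rather than the solver described in the patent. The reliability values, months, and data points are hypothetical.

```python
def train_weighted_svm(X, y, C, epochs=5000, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + sum_i C[i] * hinge_i (linear SVM)."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for xi, yi, ci in zip(X, y, C):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin, so its hinge term is active
                for j in range(dim):
                    grad_w[j] -= ci * yi * xi[j]
                grad_b -= ci * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Decision function sign(w . x + b); a zero score is mapped to +1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical samples, the month each report belongs to, and made-up
# reliability values per month used as the penalty for that month's samples.
X = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0), (0.0, 1.0)]
y = [1, 1, -1, -1]
months = ["2018-03", "2018-03", "2018-04", "2018-04"]
reliability_by_month = {"2018-03": 0.9, "2018-04": 0.6}
C = [reliability_by_month[m] for m in months]

w, b = train_weighted_svm(X, y, C)
predictions = [predict(w, b, x) for x in X]
```

Samples from periods with higher computed reliability contribute a larger penalty when they are misclassified, which matches the behaviour described above; model accuracy on held-out data would then serve as the evaluation index.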

The gastroenterology electronic data analysis method based on the above technical solution obtains a colonoscopy result and a pathological examination conclusion and compares the two to obtain an intestinal cancer analysis result; constructs, from the statistical features of the intestinal cancer analysis result, a non-homogeneous Poisson process (NHPP) disease analysis reliability growth model; and trains a support vector machine (SVM) model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model at each time as the penalty factor so that it optimizes the SVM model jointly with the loss function, and performs intestinal cancer judgment. Gastroenterology electronic data can thus be analyzed and a more accurate intestinal cancer judgment result obtained.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the present invention.

Fig. 1 is the flow chart of the gastroenterology electronic data analysis method shown in the embodiment of the present invention.

Fig. 2 is the feature extraction schematic diagram shown in the embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments are described in detail here, and examples thereof are shown in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.

As shown in Fig. 1, an embodiment of the present invention provides a gastroenterology electronic data analysis method, comprising:

11. Obtaining a colonoscopy result and a pathological examination conclusion, and comparing the colonoscopy result with the pathological examination conclusion to obtain an intestinal cancer analysis result.

For example, the monthly analysis error rate for intestinal cancer is obtained.

In this embodiment of the present invention, the colonoscopy report and the pathological examination report are integrated to form a sample data set, and the integrated examination report data undergo processing such as feature extraction and numeric representation of the reports;

12. Constructing, from the statistical features of the intestinal cancer analysis result, a non-homogeneous Poisson process (NHPP) disease analysis reliability growth model.

In this embodiment of the present invention, the model takes into account both that the same misdiagnosis situation may recur during disease analysis and that the misdiagnosis repetition probability and the total number of misdiagnoses may change over time, so that the model conforms more closely to the actual situation.

13. Training a support vector machine (SVM) model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model at each time as the penalty factor so that it optimizes the SVM model jointly with the loss function, and performing intestinal cancer judgment.

The gastroenterology electronic data analysis method of this embodiment obtains a colonoscopy result and a pathological examination conclusion and compares the two to obtain an intestinal cancer analysis result; constructs, from the statistical features of the intestinal cancer analysis result, an NHPP disease analysis reliability growth model; and trains an SVM model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model at each time as the penalty factor so that it optimizes the SVM model jointly with the loss function, and performs intestinal cancer judgment. Gastroenterology electronic data can thus be analyzed and a more accurate intestinal cancer judgment result obtained.

In one embodiment, obtaining the colonoscopy result and the pathological examination conclusion, and comparing the colonoscopy result with the pathological examination conclusion to obtain the intestinal cancer analysis result, comprises:

The colonoscopy report and the pathological examination report are integrated; that is, feature information is extracted from the colonoscopy report and the pathological examination report of the same patient, and the reports are spliced and integrated according to the corresponding feature information field (such as the admission number). The colonoscopy report contains several types of fields, such as personal patient information, examination findings, the colonoscopy analysis result, and the examination date; the pathological examination report contains information such as the pathological examination site, the pathological examination findings, and the pathological examination record. In the medical field, the pathological examination result is regarded as the "gold standard": whether canceration has occurred can be determined by pathological examination, which correctly distinguishes "diseased" from "disease-free". Therefore, during data integration the pathological examination result prevails, and a colonoscopy result that is inconsistent with the pathological examination result is regarded as an erroneous colonoscopy result. Extracting feature information comprises extracting text-type features, extracting temporal features, and extracting basic patient information features.

Extraction of text-type features: the text processing applied to colonoscopy text data is referred to as medical language processing; it mainly performs word segmentation and noise removal on the colonoscopy text data, and extracts positive descriptions of specific diseases as feature information.

Extraction of temporal features: disease incidence has a seasonal component, that is, the incidence of certain diseases usually shows seasonal, periodic variation, so there is considered to be an inherent, implicit connection between disease onset and season. The examination date in an examination report is string-type data divided by separators and includes the date; the month is extracted from it as the temporal feature.

Extraction of basic patient information features: the processing mainly unifies expressions that are inconsistent because the data come from different sources; the extracted features include gender, age, occupation, and the like.

The examination reports before and after feature extraction are shown in Fig. 2.

The extracted text-type features are integrated to construct the feature space. At the same time, the descriptors of the corresponding disease category are extracted from the pathological examination report as labels (Label) to construct the output space (Label Space). Numeric coding rules are set for the attribute values of each feature in the feature space and for the output space, and the integrated report data are converted by these rules into a numeric representation, that is, into numeric data (Numerical Data) that computers and algorithm models can recognize and learn from;

After the data have been converted into numeric form, the pathological examination conclusion is taken as the final conclusion, the colonoscopy conclusion is compared with the pathological examination conclusion, and, with the month as the time division, the intestinal cancer analysis results are counted for each month, which facilitates the subsequent construction of the intestinal cancer reliability growth model.

In one embodiment, constructing, from the statistical features of the intestinal cancer analysis result, the non-homogeneous Poisson process (NHPP) disease analysis reliability growth model comprises:

setting the non-homogeneous Poisson process NHPP;

constructing the framework of the NHPP calculation result reliability growth model;

An embodiment of the present invention proposes a new NHPP analysis reliability growth model. The model takes into account both that the same analysis error situation may recur and that the analysis error repetition probability and the total number of misdiagnoses may change over time. This improves the prediction and evaluation capability of this class of reliability growth model and makes the model conform more closely to the actual situation.

The non-homogeneous Poisson process (NHPP) is a generalization of the Poisson process; its definition is given below. In one embodiment, setting the non-homogeneous Poisson process NHPP comprises: setting a random counting process $\{N(t), t \ge 0\}$ that satisfies conditions A1 to A4, where $N$ denotes a counting process that represents a count and $t$ denotes time:

A1: $N(0) = 0$;

A2: $\{N(t), t \ge 0\}$ is an independent-increment process;

A3: $P\{N(t+\Delta t) - N(t) = 1\} = \lambda(t)\,\Delta t + o(\Delta t)$;

A4: $P\{N(t+\Delta t) - N(t) \ge 2\} = o(\Delta t)$;

$s$ denotes a further time, with the same meaning as $t$;

$N(t)$: the cumulative number of analysis errors found in the period $[0, t]$;

$m(t)$: the expected cumulative number of analysis errors in the period $[0, t]$, $m(t) = E[N(t)]$;

$a(t)$: the misdiagnosis total function, that is, the total number of disease analysis errors counted in the cases up to time $t$;

$a_0$: the number of analysis errors present in the cases when counting starts;

The reliability growth model established in this embodiment mainly addresses the following two issues:

(1) the disease analysis process continually introduces new misdiagnosed cases, that is, the disease analysis error total function $a(t)$ changes over time;

(2) there is no guarantee that a misdiagnosed case will not recur, that is, there is the question of how large the misdiagnosis repetition rate is. Moreover, as time passes, the department's medical equipment becomes more advanced and the doctors gain experience, so the misdiagnosis repetition rate function also changes over time.

To construct the framework of the NHPP disease analysis reliability growth model on the basis of the original reliability model, the classical NHPP models need to be modified and supplemented, which leads to the following assumptions:

In one embodiment, constructing the framework of the NHPP calculation result reliability growth model comprises:

From assumption B1, B5 holds, B5: $m(t+\Delta t) - m(t) = b\,(a(t) - x(t))\,\Delta t + o(\Delta t)$

This gives the differential equation B6: $\dfrac{dm(t)}{dt} = b\,(a(t) - x(t))$

From assumption B3, B7 holds, B7: $\dfrac{dx(t)}{dt} = p(t)\,\dfrac{dm(t)}{dt}$

where $\dfrac{dx(t)}{dt}$ denotes the derivative of $x(t)$ with respect to $t$; the same notation is used below;

From equations B6 and B7, B8 is obtained, B8: $\dfrac{dx(t)}{dt} = b\,p(t)\,(a(t) - x(t))$

From formulas B8 and B10, B11 is obtained:

B11: $x(t) = \exp(-B(t)) \displaystyle\int_0^t b\,p(u)\,a(u)\,\exp(B(u))\,du$, with $B(t) = \displaystyle\int_0^t b\,p(u)\,du$

$R(x \mid t) = 1 - P\{N(t+x) - N(t) = 0\} = 1 - \exp[-(m(t+x) - m(t))]$.

In one embodiment, constructing the NHPP disease analysis reliability growth model that conforms to the actual situation comprises:

substituting formulas B15 and B16 into formulas B11 and B12 gives:

B17:

B18:

In one embodiment, estimating the parameters of the NHPP disease analysis reliability growth model that conforms to the actual situation comprises:

taking the natural logarithm of formula B19 gives B20:

The support vector machine (SVM) model is a binary classification model and needs to be extended specifically when solving multi-class problems. The model performs very well on text classification tasks and on high-dimensional data, and has become a mainstream machine learning technique. Its basic form is a linear classifier with the maximum margin in a particular space. Its core idea is to map the original training set into a high-dimensional feature space, in which nonlinear separation is replaced by a linear discriminant function in the high-dimensional space. It is a supervised learning method that is widely applied to statistical classification and regression analysis. It also shows many distinctive advantages in small-sample, nonlinear, and high-dimensional pattern recognition, and can be applied to other machine learning problems such as function fitting. The support vector machine model is based on the VC dimension theory and the structural risk minimization principle of statistical learning theory: given limited sample information, it seeks the best trade-off between model complexity (that is, the learning accuracy on a specific training sample) and learning ability (that is, the ability to identify arbitrary samples without error), in order to obtain the best generalization ability. In an SVM model, the kernel function directly determines the final performance of the support vector machine and of kernel methods, but the choice of kernel function depends on the specific problem; multiple kernel learning can also be used, in which the optimal combination obtained by learning several kernel functions serves as the final kernel function to improve model performance.

In one embodiment, training the support vector machine (SVM) model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model at each time as the penalty factor so that it optimizes the SVM model jointly with the loss function, and performing intestinal cancer judgment, comprises:

Input: a linearly separable training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} = \mathbb{R}^n$ and $y_i \in \mathcal{Y} = \{-1, +1\}$; $x_i$ is the $i$-th feature vector, also called an instance, and $y_i$ is the class label of $x_i$; when $y_i = +1$, $x_i$ is called a positive example; when $y_i = -1$, $x_i$ is called a negative example; $(x_i, y_i)$ is called a sample point;

(1) construct and solve the constrained optimization problem:

s.t. $y_i(w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, N$    B22

and obtain the optimal solution $w^*$, $b^*$; $\min$ denotes minimization, $w$ and $b$ are the two parameters that define the maximum margin separating hyperplane, and $x_i$ is an instance in the training data set;

(2) the separating hyperplane is thus obtained as:

$w^* \cdot x + b^* = 0$    B23

and the classification decision function is:

$f(x) = \operatorname{sign}(w^* \cdot x + b^*)$    B24

where $\|w\|$ is the $L_2$ norm of $w$, and $(w, b)$ is a given hyperplane;

Many existing studies have applied the support vector machine (SVM) model in clinical settings, where it can classify and identify diseases and predict the probability that a disease occurs. An embodiment of the present invention constructs a multi-granularity colonoscopy report analysis model; the text description data of a patient's colonoscopy are input to this model, and the analysis result is generated automatically.

In one embodiment, after the hyperplane is obtained, the method further comprises:

introducing a slack variable $\xi_i \ge 0$ for each sample point $(x_i, y_i)$, so that the functional margin plus the slack variable is greater than or equal to 1; the constraint then becomes $y_i(w \cdot x_i + b) \ge 1 - \xi_i$. At the same time, a cost $\xi_i$ is paid for each slack variable $\xi_i$, and the objective function changes from the original $\dfrac{1}{2}\|w\|^2$ to $\dfrac{1}{2}\|w\|^2 + C \displaystyle\sum_{i=1}^{N} \xi_i$;

the choice of $C$ is combined with the disease reliability problem: the value of $C$ is the value of the disease reliability function computed for each time period, and the objective function is the soft-margin objective above with $C$ replaced by that value, where $N$ takes values in the set of non-negative integers. When the reliability computed from the data is large, the penalty for disease misclassification increases; when the reliability is small, the penalty decreases. In this way the NHPP disease analysis reliability growth model is applied to the optimization of the basic SVM model, and a multi-granularity colonoscopy report analysis model that meets the requirements is trained.

Various embodiments of the present invention have been described above. The description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their improvement over technology found in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein.

Claims

1. A gastroenterology electronic data analysis method, characterized by comprising:

obtaining an enteroscopy result and a pathological examination conclusion, and comparing the enteroscopy result with the pathological examination conclusion to obtain an intestinal cancer analysis result;

constructing a non-homogeneous Poisson process (NHPP) disease analysis reliability growth model based on the statistical characteristics of the intestinal cancer analysis result;

training a support vector machine (SVM) model with model accuracy as the evaluation index, and using the reliability result of the NHPP disease analysis reliability growth model for each time period as the penalty factor, which together with the loss function optimizes the SVM model, to perform intestinal cancer judgment.

2. The method according to claim 1, characterized in that obtaining the enteroscopy result and the pathological examination conclusion, and comparing the enteroscopy result with the pathological examination conclusion to obtain the intestinal cancer analysis result, comprises:

extracting feature information from the enteroscopy report and the pathological examination report of the same patient, and splicing and integrating the reports according to the corresponding feature information; during data integration, the pathological examination result prevails; wherein extracting the feature information comprises extracting text-type features, extracting temporal features, and extracting basic patient information features;

integrating the text-type features to construct a feature space; extracting the descriptors of the corresponding disease categories from the pathological examination report as labels to construct an output space; setting numeric coding rules for the attribute values of each feature in the feature space and for the output space; and numerically encoding the integrated report data according to the numeric coding rules, so that it becomes numeric data that computers and algorithm models can recognize and learn from;

after the data are numerically encoded, taking the pathological examination conclusion as the final conclusion, comparing the enteroscopy conclusion with the pathological examination conclusion, and counting the intestinal cancer analysis results for each month, with the month as the time division dimension.
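By way of non-limiting illustration, the monthly comparison of enteroscopy and pathological conclusions may be sketched in Python as follows. The record layout, the `YYYY-MM-DD` date format, and the 0/1 label coding are assumptions made for the example only.

```python
from collections import defaultdict
from datetime import datetime


def monthly_analysis_errors(records):
    """Count, per month, how often the enteroscopy conclusion disagrees
    with the pathological conclusion (the latter is treated as final).

    records: iterable of (date_string, enteroscopy_label, pathology_label),
    with dates assumed to be in YYYY-MM-DD form and labels already
    numerically encoded (1 = cancer, 0 = no cancer in this sketch).
    """
    stats = defaultdict(lambda: {"total": 0, "errors": 0})
    for date_string, enteroscopy_label, pathology_label in records:
        # The month is the time division dimension.
        month = datetime.strptime(date_string, "%Y-%m-%d").strftime("%Y-%m")
        stats[month]["total"] += 1
        if enteroscopy_label != pathology_label:
            stats[month]["errors"] += 1
    return dict(stats)


records = [
    ("2019-01-05", 1, 1),
    ("2019-01-20", 0, 1),
    ("2019-02-11", 0, 0),
]
monthly = monthly_analysis_errors(records)
```

The per-month error counts produced this way are the kind of statistic from which cumulative analysis error counts over time can be accumulated.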

3. The method according to claim 2, characterized in that:

the extraction of the text-type features comprises: text processing of the enteroscopy text data, referred to as medical language processing, which mainly performs word segmentation and noise removal on the enteroscopy text data and extracts the positive descriptions of specific conditions as feature information;

the extraction of the temporal features comprises: the examination date in the examination report is character-string data divided by interval coding and includes the date, from which the month is extracted as the temporal feature.

4. The method according to claim 1, characterized in that constructing the NHPP disease analysis reliability growth model based on the statistical characteristics of the intestinal cancer analysis result comprises:

defining the non-homogeneous Poisson process NHPP;

constructing the framework of the NHPP calculation result reliability growth model;

constructing an NHPP disease reliability growth model that matches the actual situation;

estimating the parameters of the NHPP disease reliability growth model that matches the actual situation.

5. The method according to claim 4, characterized in that defining the non-homogeneous Poisson process NHPP comprises:

A1: $N(0) = 0$;

A2: $\{N(t), t \ge 0\}$ is an independent increment process;

A3: $P\{N(t+\Delta t) - N(t) = 1\} = \lambda(t)\Delta t + o(\Delta t)$;

λ(t) denotes the intensity function of the non-homogeneous Poisson process, Δt denotes a time interval, and o(Δt) denotes a higher-order infinitesimal of Δt;

A4: $P\{N(t) - N(s) \ge 2\} = o(\Delta t)$;

then {N(t), t ≥ 0} is called a non-homogeneous Poisson process with intensity λ(t); when λ(t) = λ, the non-homogeneous Poisson process is the ordinary homogeneous Poisson process;

s denotes the next moment and has the same meaning as t;

N(t): the cumulative number of analysis errors found in the time period [0, t];

m(t): the expected value of the cumulative number of analysis errors in the time period [0, t], m(t) = E[N(t)];

x(t): the number of analysis errors detected up to time t that are repeated misdiagnoses;

a(t): the total misdiagnosis function, i.e. the total number of analysis errors counted in the cases up to time t;

a₀: the number of calculation errors present in the cases when the statistics begin;

b: the calculation error rate, i.e. the probability that each error in the cases is counted;

p(t): the error repetition rate function, i.e. the probability that an error detected at time t is a repeated one.
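By way of non-limiting illustration, a counting process satisfying A1 to A4 can be simulated with the standard thinning technique for non-homogeneous Poisson processes. Thinning is a general simulation method and is not part of the claimed subject matter; the decaying intensity function, its bound, and the 12-unit horizon are hypothetical choices.

```python
import math
import random


def simulate_nhpp(intensity, horizon, intensity_upper_bound, rng):
    """Return sorted event times of an NHPP on [0, horizon] via thinning.

    intensity: function t -> lambda(t), with
    0 <= lambda(t) <= intensity_upper_bound on [0, horizon].
    """
    times = []
    t = 0.0
    while True:
        # Candidate events come from a homogeneous process at the bounding rate.
        t += rng.expovariate(intensity_upper_bound)
        if t > horizon:
            return times
        # Keep the candidate with probability lambda(t) / upper bound.
        if rng.random() <= intensity(t) / intensity_upper_bound:
            times.append(t)


def cumulative_count(times, t):
    """N(t): number of events in [0, t]."""
    return sum(1 for s in times if s <= t)


def decaying_intensity(t):
    """Hypothetical lambda(t): analysis errors become rarer over time."""
    return 5.0 * math.exp(-0.3 * t)


event_times = simulate_nhpp(decaying_intensity, horizon=12.0,
                            intensity_upper_bound=5.0, rng=random.Random(0))
```

Here `cumulative_count(event_times, t)` plays the role of N(t), and averaging it over many simulated runs would approximate m(t).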

6. The method according to claim 4 or 5, characterized in that constructing the framework of the NHPP calculation result reliability growth model comprises:

B1: the cumulative number of analysis errors N(t) up to time t follows a Poisson process with mean function m(t), and the expected number of analysis errors occurring in any time interval from t to t+Δt is proportional to the number of analysis errors remaining at time t;

B2: the number of disease misdiagnoses differs across environments and across moments, and the total number of disease analysis errors changes over time;

B4: the analysis errors in the cases are independent of one another, and the severity of the consequences caused by each analysis error differs;

From assumption B1, B5 follows, B5: $m(t+\Delta t) - m(t) = b\,(a(t) - x(t))\,\Delta t + o(\Delta t)$,

where (a(t) - x(t)) denotes the number of analysis errors up to time t that have been detected and are not repeated misdiagnoses, which gives the differential equation B6: $\frac{dm(t)}{dt} = b\,(a(t) - x(t))$

From assumption B3, B7 follows, B7:

B8 can be obtained from equations B6 and B7, B8:

where $\frac{dx(t)}{dt}$ denotes the derivative of x(t) with respect to t;

B11 can be obtained from formulas B8 and B10,

B11:

Since the cumulative number of analysis errors N(t) up to time t follows a non-homogeneous Poisson distribution with mean m(t), B13 follows:

$R(x \mid t) = 1 - P\{N(t+x) - N(t) = 0\} = 1 - \exp[-(m(t+x) - m(t))]$.
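By way of non-limiting illustration, B13 can be computed once a mean function m(t) is available. Because the mean function of this method (B18) is not reproduced here, the sketch below substitutes the exponential form m(t) = a(1 - e^{-bt}) that is commonly used for NHPP growth models. That form and the default values of `a` and `b` are assumptions of the example only.

```python
import math


def mean_errors(t, a=100.0, b=0.1):
    """Hypothetical mean function m(t) = a * (1 - exp(-b * t)); an assumption."""
    return a * (1.0 - math.exp(-b * t))


def reliability_b13(t, x, mean_function=mean_errors):
    """R(x | t) as written in B13: 1 - exp(-(m(t + x) - m(t)))."""
    return 1.0 - math.exp(-(mean_function(t + x) - mean_function(t)))


# One value per month could then be used as the penalty factor of that month.
monthly_reliability = [reliability_b13(float(month), 1.0) for month in range(12)]
```

Any other mean function can be passed through the `mean_function` argument without changing the B13 computation itself.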

7. The method according to claim 6, characterized in that constructing the NHPP disease analysis reliability growth model that matches the actual situation comprises:

the misdiagnosis repetition rate function p(t) should satisfy the following conditions: p(t) ∈ [0,1], p(t) is a decreasing function, and p(t) → 0 as t → ∞; therefore, the following function can be chosen to define the misdiagnosis repetition rate function, B16:

substituting formulas B15 and B16 into formulas B11 and B12 gives:

B17:

B18:

after the mean function m(t) of the cumulative number of analysis errors is obtained, the parameters in m(t) can be obtained by a parameter estimation method.

8. The method according to claim 7, characterized in that estimating the parameters of the NHPP disease analysis reliability growth model that matches the actual situation comprises:

estimating the parameters in formula B18 by maximum likelihood estimation, where the likelihood function obtained from the misdiagnosis mean function m(t) is B19:

wherein (t_i, n_i) occur in pairs; n_i denotes the total number of misdiagnoses counted in the cases up to time t_i; t_i denotes the time at which the counted total number of misdiagnoses is n_i; L(parameters | (t_i, n_i)) denotes the likelihood function, the symbol that follows it is the product sign, and exp denotes the exponential function;

taking the natural logarithm of formula B19 gives B20:
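By way of non-limiting illustration, the log-likelihood of paired observations (t_i, n_i) under an NHPP can be written in the standard grouped-data form, in which the increments over disjoint intervals are independent Poisson variables. This general form is offered as a sketch and is not asserted to be identical to B19 or B20. The mean function is passed in as a callable, and the linear mean m(t) = a·t used in the grid search is a stand-in chosen only to keep the example small.

```python
import math


def nhpp_log_likelihood(times, counts, mean_function):
    """Log-likelihood of cumulative counts n_i observed at times t_i.

    Increments over disjoint intervals are independent Poisson variables
    with mean m(t_i) - m(t_{i-1}); t_0 = 0 and n_0 = 0 are assumed.
    """
    log_likelihood = 0.0
    previous_time, previous_count = 0.0, 0
    for t_i, n_i in zip(times, counts):
        expected = mean_function(t_i) - mean_function(previous_time)
        observed = n_i - previous_count
        # log of the Poisson probability of the observed increment
        log_likelihood += (observed * math.log(expected) - expected
                           - math.lgamma(observed + 1))
        previous_time, previous_count = t_i, n_i
    return log_likelihood


def fit_rate_by_grid(times, counts, candidate_rates):
    """Pick the rate a that maximizes the log-likelihood for m(t) = a * t."""
    return max(candidate_rates,
               key=lambda a: nhpp_log_likelihood(times, counts, lambda t: a * t))


best_rate = fit_rate_by_grid([1.0, 2.0], [2, 4], [1.0, 2.0, 4.0])
```

In practice a numerical optimizer would replace the grid search, and the callable would be the mean function of the chosen growth model with its unknown parameters.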

9. The method according to claim 1, characterized in that training the support vector machine SVM model with model accuracy as the evaluation index, using the reliability result of the disease analysis reliability growth model for each time period as the penalty factor, which together with the loss function optimizes the SVM model, and performing intestinal cancer judgment, comprises:

using the learning algorithm of the linearly separable support vector machine, i.e. the maximum margin method, to train the SVM model and find the optimal separating hyperplane, the algorithm being described as follows:

input: a linearly separable training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \chi = R^n$ and $y_i \in \gamma = \{-1, +1\}$; $x_i$ is the i-th feature vector, also called an instance; $y_i$ is the class label of $x_i$; when $y_i = +1$, $x_i$ is called a positive example; when $y_i = -1$, $x_i$ is called a negative example; $(x_i, y_i)$ is called a sample point;

(1) construct and solve the constrained optimization problem:

$\min_{w,b}\ \frac{1}{2}\|w\|^2$ B21

s.t. $y_i(w \cdot x_i + b) - 1 \ge 0,\ i = 1, 2, \ldots, N$ B22

to obtain the optimal solution $w^*$, $b^*$; min denotes minimization, w and b are the two parameters that define the maximum-margin separating hyperplane, and $x_i$ denotes an instance in the training data set;

(2) the separating hyperplane is thus obtained as:

$w^* \cdot x + b^* = 0$ B23

and the classification decision function is:

$f(x) = \mathrm{sign}(w^* \cdot x + b^*)$ B24

wherein ||w|| is the L₂ norm of w, and (w, b) is the given hyperplane;

by training the support vector machine model with the above algorithm, the optimal separating hyperplane is found and the feature space is divided into two parts, one being the positive class and the other the negative class, so that the data set is classified.
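By way of non-limiting illustration, the decision function B24 and the constraint B22 can be checked for a given hyperplane as in the following Python sketch. The toy data set and the values of `w_star` and `b_star` are hypothetical and were chosen by hand so that the constraints hold; no solver is shown.

```python
def decision_function(w, b, x):
    """f(x) = sign(w . x + b), returning +1 or -1 (a score of 0 maps to +1 here)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def satisfies_margin_constraints(w, b, dataset):
    """Check y_i * (w . x_i + b) - 1 >= 0 for every sample (constraint B22)."""
    return all(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) - 1 >= 0
        for x, y in dataset
    )


dataset = [([3.0, 3.0], +1), ([4.0, 3.0], +1), ([1.0, 1.0], -1)]
w_star, b_star = [0.5, 0.5], -2.0
predictions = [decision_function(w_star, b_star, x) for x, _ in dataset]
```

A hyperplane that fails `satisfies_margin_constraints` is not feasible for the problem B21 and B22, whatever its norm.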

10. The method according to claim 9, characterized in that, after the hyperplane is obtained, the method further comprises:

introducing a slack variable $\xi_i \ge 0$ for each sample point $(x_i, y_i)$, so that the functional margin plus the slack variable is greater than or equal to 1, whereupon the constraint becomes $y_i(w \cdot x_i + b) \ge 1 - \xi_i$; meanwhile, a cost $\xi_i$ is paid for each slack variable, and the objective function changes from the original $\frac{1}{2}\|w\|^2$ to $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$, where N takes values in the set of non-negative integers;

wherein C > 0 is called the penalty parameter, and its value differs with the problem setting; a large C increases the penalty for misclassification and a small C reduces it; the modified objective function has two aims, making the margin as large as possible and making the number of misclassified points as small as possible, and C is the variable that balances the two;

the choice of C is combined with the disease reliability problem, the value of C being the result of the disease reliability function computed for each time period, so that the objective function becomes $\frac{1}{2}\|w\|^2 + R(x \mid t)\sum_{i=1}^{N}\xi_i$, in which the penalty C is replaced by the reliability value R(x | t) of the corresponding time period; the penalty for disease misclassification increases when the computed reliability is high and decreases when it is low; the NHPP disease analysis reliability growth model is thereby applied to the optimization of the basic SVM model, and a multi-granularity enteroscopy report analysis model that meets the requirements is trained.
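By way of non-limiting illustration, a soft-margin SVM whose penalty varies with the reliability of each sample's time period can be trained by plain subgradient descent, as sketched below in Python. The optimizer, the learning rate, the number of epochs, and the per-sample reading of the penalty are assumptions of the example, and the toy data and penalty values are hypothetical.

```python
def train_weighted_svm(samples, penalties, epochs=200, learning_rate=0.01):
    """Subgradient descent on
    0.5 * ||w||^2 + sum_i C_i * max(0, 1 - y_i * (w . x_i + b)).

    samples: list of (x, y) with y in {-1, +1}.
    penalties: list of C_i, here the reliability value of each sample's period.
    """
    dimension = len(samples[0][0])
    w = [0.0] * dimension
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of 0.5 * ||w||^2
        grad_b = 0.0
        for (x, y), c in zip(samples, penalties):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # the sample violates the margin, so its slack is positive
                for j in range(dimension):
                    grad_w[j] -= c * y * x[j]
                grad_b -= c * y
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


samples = [([2.0, 2.0], +1), ([3.0, 3.0], +1), ([-2.0, -2.0], -1), ([-3.0, -3.0], -1)]
penalties = [0.9, 0.9, 0.6, 0.6]  # hypothetical reliability values per sample
w, b = train_weighted_svm(samples, penalties)
```

Samples from periods with a higher reliability value pull the hyperplane more strongly when they violate the margin, which mirrors the behaviour described for large and small values of the penalty.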