CN111143435B

CN111143435B - Medicine cloud platform big data abnormity online early warning method based on statistical generation model

Info

Publication number: CN111143435B
Application number: CN201911379506.7A
Authority: CN
Inventors: 张宸宇; 陈海波
Original assignee: Hangzhou Zedaxin Pharmaceutical Alliance Information Technology Co ltd
Current assignee: Hangzhou Zedaxin Pharmaceutical Alliance Information Technology Co ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2021-04-13
Anticipated expiration: 2039-12-27
Also published as: CN111143435A

Abstract

The invention discloses an online early warning method for big data anomalies on a medicine cloud platform, based on a statistical generation model. To search for anomalous early warning samples, the method adopts an online Gaussian mixture statistical generation model. The model fits the probability distribution of the full life cycle of the medical data, so the occurrence probability of each real-time sequence sample can be calculated, and low-probability sequences are selected as early warning samples, thereby realizing online early warning of big data anomalies on the medicine cloud platform.

Description

Medicine cloud platform big data abnormity online early warning method based on statistical generation model

Technical Field

The invention relates to a method for judging and giving early warning of big data anomalies on a medicine cloud platform, and in particular to such a method based on a statistical generation model.

Background

A medicine cloud platform stores a large amount of data on drug manufacturing, storage and circulation, as well as data on patients' medication habits and patterns. These data reflect the spatio-temporal distribution characteristics and future development trends of various drugs and their associated diseases. Industry practitioners may be concerned with changes in the spatio-temporal distribution of a certain type or brand of drug, or may wish to discover potential causal relationships among those changes. Faced with massive data, the conventional reliance on periodic reports cannot meet industry requirements for timeliness and operability, so spatio-temporal big data mining algorithms are needed.

At present, the main difficulties of outlier mining for spatio-temporal event data are feature extraction and anomalous sample search. The former refers to filtering massive raw data and extracting key feature points, generally using PLS (partial line segment) and its variant algorithms; the latter uses Dynamic Time Window (DTW) or clustering methods to find statistically distant samples as anomalous samples, based on various distance definitions in Euclidean space. Because data on production and manufacturing, logistics, regional circulation and the like change gently in the medical field, the feature points extracted by conventional filtering remain too dense and retain many similar, repeated features, so feature extraction fails to improve algorithm execution efficiency. Methods based on dynamic time windows or clustering depend on a reasonable distance metric being defined for the sample sequences, and no ideal distance metric currently exists for medicine cloud platform data.

Disclosure of Invention

The invention aims to address the defects of the prior art by providing an online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model. The method adopts a direction-smoothing feature point filtering method, which removes a large amount of gently varying spatio-temporal feature data and retains a small number of feature points. For the search for anomalous early warning samples, the method provides an online Gaussian mixture statistical generation model that fits the probability distribution of the full life cycle of the medical data, calculates the occurrence probability of real-time sequence samples, and selects low-probability sequences as early warning samples.

The purpose of the invention is achieved by the following technical scheme: an online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model, comprising the following steps:

(1) feature filtering, including affine transformation and direction smoothing filtering, as follows:

(1.1) the medicine cloud spatio-temporal data consist of a time sequence of fixed-length feature vectors; let the feature vector at time t be $D_t = \langle d_{t1}, d_{t2}, \dots, d_{tp} \rangle$; then $D = \langle D_1, D_2, \dots, D_T \rangle$ forms a sequence segment, where T is the maximum length of the sequence segment.

(1.2) performing an affine transformation on each feature vector to map it into a p-dimensional finite space, and denoting the affine-transformed feature vector at time t as $D'_t$.

(1.3) performing feature filtering in the mapped pixel space, wherein the specific process is as follows:

(1.3.1) input: the time sequence segment $D = \langle D_1, D_2, \dots, D_T \rangle$ and the affine-transformed time sequence segment $D' = \langle D'_1, D'_2, \dots, D'_T \rangle$;

output: the filtered time sequence segment $DA = \langle Da_{r_1}, Da_{r_2}, \dots, Da_{r_k} \rangle$, where $r_1, r_2, \dots, r_k \in \{1, 2, \dots, T\}$ and $k \le T$;

(1.3.2) sequentially traversing each component $D'_i$ ($i = 1, 2, \dots, T$) of $D'$;

(1.3.2.1) if $i = 1$ or $i = T$, adding $D_i$ to DA;

(1.3.2.2) calculating the Euclidean distance between the vectors $D'_{i-1}$ and $D'_i$; if the Euclidean distance is greater than the distance threshold minDis, adding $D_i$ to DA.
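The distance-threshold pass of step (1.3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `distance_filter` and the list-of-lists data layout are hypothetical, and the last point is added only once even though steps (1.3.2.1) and (1.3.2.2) could both select it.

```python
import math

def distance_filter(raw, mapped, min_dis):
    """Keep raw[i] when mapped[i] lies farther than min_dis from mapped[i-1].

    raw    -- original feature vectors D_1..D_T
    mapped -- affine-transformed vectors D'_1..D'_T (same length as raw)
    The first and last points are always kept, as in step (1.3.2.1).
    """
    kept = []
    last = len(mapped) - 1
    for i, point in enumerate(mapped):
        if i == 0 or i == last:
            kept.append(raw[i])
        elif math.dist(mapped[i - 1], point) > min_dis:
            # Step (1.3.2.2): compare with the immediately preceding point.
            kept.append(raw[i])
    return kept

points = [[0], [1], [2], [30], [31]]
print(distance_filter(points, points, min_dis=5))
```

Note that the comparison is against the immediately preceding point, as the step is worded, rather than against the last point that was kept.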

(1.4) direction smoothing filtering: first find the weighted main direction of the time sequence segment, then filter according to that direction; the specific process is as follows:

(1.4.1) input: the time sequence segment DA produced by the previous filtering step; output: the direction-smoothed, filtered time sequence segment DA';

(1.4.2) adding $Da_{r_1}$ to DA';

(1.4.3) setting the variable index to $r_1$ and lastAngle to -1;

(1.4.4) sequentially traversing each component $Da_{r_i}$ ($i = 2, \dots, k-1$) of DA;

(1.4.4.1) calculating the distance from $Da_{index}$ to $Da_{r_i}$, denoted $DIS_{r_i}$;

(1.4.4.2) calculating the weighted angle from $Da_{index}$ to $Da_{r_i}$, denoted $Angle_{r_i}$;

(1.4.4.3) if lastAngle is not equal to -1 and the absolute value of the difference between lastAngle and $Angle_{r_i}$ is greater than the threshold (the threshold expression is missing from the source text), adding $Da_{r_i}$ to DA' and setting index to $r_i$; otherwise, filtering out the point;

(1.4.4.4) setting lastAngle = $Angle_{r_i}$;

(1.4.5) finally, adding $Da_{r_k}$ to DA'.
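The control flow of the direction smoothing filter in step (1.4) can be sketched as below. The weighted angle formula and the angle threshold are not reproduced in this text, so `weighted_angle` is a hypothetical stand-in (the unweighted angle of the displacement against the first axis, which ignores $DIS_{r_i}$) and `angle_threshold` is an assumed parameter; only the loop structure follows the steps above.

```python
import math

def weighted_angle(a, b):
    """Hypothetical stand-in for Angle_ri: angle in radians between the
    displacement a -> b and the first coordinate axis (range [0, pi])."""
    norm = math.dist(a, b)
    if norm == 0.0:
        return 0.0
    cos_value = (b[0] - a[0]) / norm
    return math.acos(max(-1.0, min(1.0, cos_value)))

def direction_smooth(da, angle_threshold, angle_fn=weighted_angle):
    """Keep the end points plus every point where the angle changes sharply."""
    if len(da) <= 2:
        return list(da)
    result = [da[0]]                      # step (1.4.2)
    index = 0                             # step (1.4.3)
    last_angle = -1.0
    for i in range(1, len(da) - 1):       # step (1.4.4)
        angle = angle_fn(da[index], da[i])            # step (1.4.4.2)
        if last_angle != -1.0 and abs(last_angle - angle) > angle_threshold:
            result.append(da[i])          # step (1.4.4.3): keep the point
            index = i
        last_angle = angle                # step (1.4.4.4)
    result.append(da[-1])                 # step (1.4.5)
    return result

path = [(0, 0), (1, 0), (2, 0), (3, 3), (4, 3)]
print(direction_smooth(path, angle_threshold=0.3))
```

Passing a different `angle_fn` lets the actual weighted angle be substituted once its formula is known.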

(2) statistical generation model calculation: a probability distribution model of the time sequence segments is generated from historical data, where the probability distribution of a time sequence segment is assumed a priori to be a Gaussian mixture function, defined as follows:

$$P(D) = \sum_{i=1}^{M} k_i \, N(D \mid u_i, \Sigma_i)$$

where M is the number of Gaussian components in the Gaussian mixture function, and $k_i$ is the weight of the i-th Gaussian component and satisfies

$$\sum_{i=1}^{M} k_i = 1$$

$N(D \mid u_i, \Sigma_i)$ is the i-th Gaussian function, $u_i$ is the mean of the i-th Gaussian component, and $\Sigma_i$ is the covariance matrix of the i-th Gaussian component. A real-time online learning method is adopted, so that the Gaussian mixture model is dynamically corrected as data accumulate; the specific process is as follows:

(2.1) the initial M takes a value in [1, 5]; N time sequence segments $D^{(1)}, D^{(2)}, \dots, D^{(N)}$ are selected from the historical data, and an initial Gaussian mixture model is generated using the standard EM algorithm.
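Step (2.1) relies on the standard EM algorithm. A minimal sketch for the one-dimensional case is given below; the restriction to scalar samples, the function name `em_1d`, the fixed iteration count and the variance floor are simplifying assumptions for illustration, whereas the method above works on multi-dimensional time sequence segments with covariance matrices.

```python
import math

def em_1d(data, init_means, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    m = len(init_means)
    weights = [1.0 / m] * m
    means = list(init_means)
    variances = [1.0] * m
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [
                w * math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
                for w, mu, var in zip(weights, means, variances)
            ]
            total = sum(dens)
            if total == 0.0:
                resp.append([1.0 / m] * m)  # every density underflowed
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for i in range(m):
            n_i = sum(r[i] for r in resp)
            if n_i == 0.0:
                continue  # empty component: keep its previous parameters
            weights[i] = n_i / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var, var_floor)
    return weights, means, variances

weights, means, variances = em_1d([-0.1, 0.0, 0.1, 9.9, 10.0, 10.1], [1.0, 9.0])
print(weights, means, variances)
```

In practice a library implementation of Gaussian mixture fitting would normally be preferred over a hand-written loop like this one.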

(2.2) continuously updating the initial Gaussian mixture model as new time sequence segment data arrive, wherein the updating process is as follows:

(2.2.1) waiting until R new time sequence segments have arrived, denoted $ND^{(1)}, ND^{(2)}, \dots, ND^{(R)}$;

(2.2.2) setting $j = 1$ and $L = \{\}$, and letting H be the current Gaussian mixture model;

(2.2.3) $E^{(j)} = \{E_1, E_2, \dots, E_M\} = \{N(ND^{(j)} \mid u_i, \Sigma_i) \mid i \in \{1, 2, \dots, M\}\}$, i.e., for each newly arrived segment $ND^{(j)}$, calculating the value of each Gaussian component;

(2.2.4) normalizing $E^{(j)}$;

(2.2.5) $I = \mathrm{argmax}(E^{(j)})$, $V = \max(E^{(j)})$;

(2.2.6) if $V > 0.5$, then $L = L \cup \{ND^{(j)}\}$; otherwise, executing step (2.2.8);

(2.2.7) if $|L| \ge N$, performing Gaussian mixture clustering on all data in L using the EM algorithm to obtain a new model HL, setting $H = H \cup HL$, and setting $L = \{\}$;

(2.2.8) assigning $ND^{(j)}$ to the I-th Gaussian component of H, and recalculating the mean of the I-th Gaussian component;

(2.2.9) setting $j = j + 1$; if $j > R$, the algorithm ends; otherwise, returning to step (2.2.3).
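The update loop of steps (2.2.2) to (2.2.9) can be sketched as follows. Several points are assumptions made only for illustration: components are stored as dictionaries with a diagonal covariance, normalization is the min-max form spelled out in the detailed description, the mean update is an ordinary running mean (the update formula is not reproduced in this text), `fit_em` is a hypothetical callback that returns new components, and step (2.2.8) is read as running after step (2.2.7) as well.

```python
import math

def gaussian_density(x, mean, var):
    """Density of a diagonal-covariance Gaussian (a simplifying assumption)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p -= 0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def min_max(values):
    """Min-max normalization of step (2.2.4); all-equal input maps to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def process_batch(components, batch, n_refit, fit_em, v_threshold=0.5):
    """Process one batch ND^(1..R); mutates components, returns the set L."""
    pending = []                                           # step (2.2.2): L = {}
    for x in batch:
        e = min_max([gaussian_density(x, c["mean"], c["var"])
                     for c in components])                 # steps (2.2.3)-(2.2.4)
        best = max(range(len(e)), key=e.__getitem__)       # step (2.2.5)
        if e[best] > v_threshold:                          # step (2.2.6)
            pending.append(x)
            if len(pending) >= n_refit:                    # step (2.2.7)
                components.extend(fit_em(pending))
                pending = []
        comp = components[best]                            # step (2.2.8)
        comp["count"] += 1
        comp["mean"] = [m + (xi - m) / comp["count"]
                        for m, xi in zip(comp["mean"], x)]
    return pending

model = [{"mean": [0.0], "var": [1.0], "count": 1},
         {"mean": [10.0], "var": [1.0], "count": 1}]
left_over = process_batch(model, [[2.0]], n_refit=5, fit_em=lambda samples: [])
print(model[0]["mean"], left_over)
```

With min-max normalization the largest entry equals 1 whenever the component densities differ, so under this literal reading the `V > 0.5` test almost always passes; the threshold is therefore kept as a parameter rather than hard-coded.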

(3) early warning judgment: if the size of the set L has remained smaller than N after T batches of new data have arrived, early warning judgment is started and early warnings are issued for low-probability time sequence segments.

Further, in step (1.2), an affine transformation is performed on each feature vector to map it into a p-dimensional finite space. Let the maximum length of each dimension be $L_i$, $i \in \{1, 2, \dots, p\}$, so that the value range of each dimension is $[0, L_i]$. Denoting the affine-transformed feature vector at time t as $D'_t$, the affine transformation is defined by the following formula:

where $d'_{ti}$ ($i = 1, 2, \dots, p$) is the i-th dimension component of $D'_t$.
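The transformation formula itself is not reproduced in this text. One affine map that satisfies the stated value range $[0, L_i]$ is per-dimension min-max scaling, sketched below purely as an assumption; the function name `affine_map` and the use of the observed minimum and maximum of each dimension are hypothetical choices, not taken from the patent.

```python
def affine_map(vectors, max_lengths):
    """Scale each dimension i of the input vectors into [0, max_lengths[i]].

    Assumed form: d'_ti = L_i * (d_ti - min_i) / (max_i - min_i).
    A constant dimension is mapped to 0.
    """
    p = len(max_lengths)
    lows = [min(v[i] for v in vectors) for i in range(p)]
    highs = [max(v[i] for v in vectors) for i in range(p)]
    mapped = []
    for v in vectors:
        row = []
        for i in range(p):
            span = highs[i] - lows[i]
            scale = max_lengths[i] / span if span else 0.0
            row.append((v[i] - lows[i]) * scale)
        mapped.append(row)
    return mapped

print(affine_map([[0, 10], [5, 20], [10, 30]], max_lengths=[100, 50]))
```

Because every dimension ends up on a bounded, comparable scale, a single Euclidean threshold such as minDis can be applied across dimensions.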

Further, in step (1.4.4.2), the weighted angle $Angle_{r_i}$ is calculated as follows:

In the above formula, x denotes the dot product of vectors, and d denotes the Euclidean distance between two vectors.

Further, in step (2.2.8), the mean of the I-th component is recalculated according to the following formula:

Further, in step (3), the early warning judgment substitutes each new time sequence segment into the Gaussian mixture model; if the calculated value is less than 0.1, a low-probability time sequence segment has occurred, and an early warning is issued for that segment.
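The early warning test of step (3) amounts to evaluating the mixture density at the new segment and comparing it with 0.1. A minimal sketch follows; the diagonal covariance and the names `mixture_density` and `should_warn` are assumptions made to keep the example free of external libraries.

```python
import math

def mixture_density(x, weights, means, variances):
    """Value of sum_i k_i * N(x | u_i, Sigma_i) with diagonal Sigma_i."""
    total = 0.0
    for k, mean, var in zip(weights, means, variances):
        log_n = 0.0
        for xi, mi, vi in zip(x, mean, var):
            log_n -= 0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        total += k * math.exp(log_n)
    return total

def should_warn(x, weights, means, variances, threshold=0.1):
    """True when the segment is a low-probability one under the model."""
    return mixture_density(x, weights, means, variances) < threshold

k = [0.5, 0.5]
u = [[0.0], [4.0]]
s = [[1.0], [1.0]]
print(should_warn([0.0], k, u, s), should_warn([10.0], k, u, s))
```

For high-dimensional segments the raw density can become very small, so a real implementation would likely compare log densities instead; that variation is not part of the text above.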

The invention has the beneficial effects that:

1. The method filters the sequence segment data with a two-step filter consisting of affine transformation and direction smoothing filtering. This removes similar points from the data and retains a small number of feature points, which reduces the volume of data to analyze and provides the data basis for the statistical generation model.

2. An online Gaussian mixture statistical generation model is further adopted. The model fits the probability distribution of the time sequence segment data, making it possible to estimate the occurrence probability of a time sequence segment and to issue early warnings.

Drawings

FIG. 1 shows the feature filtering effect of an embodiment of the present invention.

FIG. 2 shows the feature distribution of the time sequence data in an embodiment of the present invention.

Detailed Description

To make the above objects, features and advantages of the present invention easier to understand, embodiments are described in detail below with reference to the drawings.

In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention. The present invention may, however, be practiced in ways other than those specifically described, and those of ordinary skill in the art can make similar extensions without departing from its spirit; the present invention is therefore not limited to the specific embodiments disclosed below.

The invention provides an online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model, comprising the following steps:

(1) feature filtering method

(1.2) performing an affine transformation on each feature vector to map it into a p-dimensional finite space; let the maximum length of each dimension be $L_i$, $i \in \{1, 2, \dots, p\}$, so that the value range of each dimension is $[0, L_i]$; denoting the affine-transformed feature vector at time t as $D'_t$, the affine transformation is defined by the following formula:

(1.3) the affine transformation maps the feature vectors into a pseudo-pixel space; points that lie too close to each other in this space are strongly similar, so only one of them is retained, which achieves the purpose of feature filtering; the specific process is as follows:

(1.3.2) sequentially traversing each component $D'_i$ ($i = 1, 2, \dots, T$) of $D'$;

(1.3.2.1) if $i = 1$ or $i = T$, adding $D_i$ to DA;

(1.3.2.2) calculating the Euclidean distance between the vectors $D'_{i-1}$ and $D'_i$; if the Euclidean distance is greater than the distance threshold minDis, adding $D_i$ to DA; minDis usually takes a value in [5, 25].

(1.4) direction smoothing filtering: this filter considers the angle between successive feature vectors; unlike other smoothing methods, direction smoothing first finds the weighted main direction of the time sequence segment and then filters according to that direction; the steps are as follows:

(1.4.2) adding $Da_{r_1}$ to DA';

(1.4.4.1) calculating the distance from $Da_{index}$ to $Da_{r_i}$, denoted $DIS_{r_i}$;

(1.4.4.2) calculating the weighted angle from $Da_{index}$ to $Da_{r_i}$, denoted $Angle_{r_i}$, according to the following formula:

in the formula, x denotes the dot product of vectors, and d denotes the Euclidean distance between two vectors;

(1.4.4.4) setting lastAngle = $Angle_{r_i}$;

(1.4.5) finally, adding $Da_{r_k}$ to DA'.

(2) statistical generation model calculation: a probability distribution model of the time sequence segments is generated from historical data, where the probability distribution of a time sequence segment is assumed a priori to be a Gaussian mixture function, defined as follows:

$$P(D) = \sum_{i=1}^{M} k_i \, N(D \mid u_i, \Sigma_i)$$

$N(D \mid u_i, \Sigma_i)$ is the i-th Gaussian function, $u_i$ is the mean of the i-th Gaussian component, and $\Sigma_i$ is the covariance matrix of the i-th Gaussian component. M and all of $k_i$, $u_i$, $\Sigma_i$ are unknown and must be learned from historical data. Considering that system data continuously grow and change in practical applications, a real-time online learning method is designed so that the Gaussian mixture model can be dynamically corrected as data accumulate; the specific process is as follows:

(2.2.2) setting $j = 1$ and $L = \{\}$, and letting H be the current Gaussian mixture model;

(2.2.4) normalizing $E^{(j)}$:

$E^{(j)} = \{(E_1 - \min(E^{(j)})) / (\max(E^{(j)}) - \min(E^{(j)})), \dots, (E_M - \min(E^{(j)})) / (\max(E^{(j)}) - \min(E^{(j)}))\}$, where min and max are the functions returning the minimum and maximum values, respectively;

(2.2.5) $I = \mathrm{argmax}(E^{(j)})$, $V = \max(E^{(j)})$;

(2.2.6) if $V > 0.5$, then $L = L \cup \{ND^{(j)}\}$; otherwise, executing step (2.2.8);

(2.2.8) assigning $ND^{(j)}$ to the I-th Gaussian component of H, and recalculating the mean of the I-th component according to the following formula:

(3) early warning judgment: if the size of the set L has remained smaller than N after T batches of new data have arrived (T usually takes a value between 2R and 10R), the early warning judgment process can be started. The judgment substitutes each new time sequence segment into the Gaussian mixture model; if the calculated value is less than 0.1, a low-probability time sequence segment has occurred, and an early warning is issued for that segment.

A specific application example of the present invention is given below. Some acute infectious diseases spread quickly, have long incubation periods, and are easily misdiagnosed. For example, tuberculosis, a Class B infectious disease, is spread by droplets, has an incubation period of 2-3 weeks after infection, and is easily misdiagnosed as a viral cold. This makes such diseases very difficult to prevent and treat, and timely early warning is especially necessary when an infectious disease spreads rapidly on a large scale.

Using the method of the invention, the regional dosage of anti-tuberculosis drugs and antiviral cold drugs, such as ethambutol, quinolone and loratadine, is monitored online, and a statistical generation model is built to search for low-probability anomalous time sequence data, providing early warning of the potential spread of disease. The steps are as follows:

1. Seven years of data for 34 anti-tuberculosis drugs and antiviral cold drugs in a certain region are selected. For effective monitoring, the dosage is calculated per hour, with the hour as the basic unit, and the 24 hourly dosages of one day form a minimum time sequence segment. This gives 34 × 7 × 365 = 86,870 time sequence segments and 34 × 7 × 365 × 24 = 2,084,880 data items in total.

2. Because dosage data can be influenced by external factors such as population and economy, the data must be normalized to eliminate the influence of these factors. Specifically, the mean and standard deviation are calculated for each whole year, and each data item is normalized by subtracting the mean and dividing by the standard deviation.

3. Feature filtering is performed on the yearly time sequence segments (12 minimum time sequence segments) using the feature filtering method of the present invention; FIG. 1 shows the difference before and after filtering. The filtering method preserves the direction-change characteristics of the time sequence data and removes gently varying data items.

4. The method of the invention is then used to estimate the probability distribution of the time sequence segments, with the minimum time sequence segment as the basic unit of estimation. The probability distribution data are shown in FIG. 2.

All time sequence segments with probability density values below 0.1 are selected; there are two in this example. In the first, the quinolone dosage in the region in November rose and fell markedly, beyond the range seen in previous years (marked (1) in FIG. 2); the average probability density value of this time sequence segment is 0.061. In the second, the cycloserine dosage in the same month rose suddenly compared with previous years (marked (2) in FIG. 2); the average probability density value of this time sequence segment is 0.0396. Based on the difference in probability density values, the anomalies of the two drugs can be displayed visually and an early warning automatically sent to the relevant industry managers, helping them extract more valuable information from the large volume of drug data.
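Steps 1 and 2 of the example can be checked with a short sketch. The segment and data item counts follow directly from the figures given above; the function name `normalize_year` and the use of the population standard deviation are assumptions, since the text does not say which form of standard deviation is used.

```python
import statistics

DRUGS, YEARS, DAYS_PER_YEAR, HOURS_PER_DAY = 34, 7, 365, 24

# One minimum time sequence segment = the 24 hourly dosages of one day.
SEGMENTS = DRUGS * YEARS * DAYS_PER_YEAR
DATA_ITEMS = SEGMENTS * HOURS_PER_DAY

def normalize_year(values):
    """Subtract the yearly mean and divide by the yearly standard deviation.

    Uses the population standard deviation (an assumption); a constant
    series is mapped to all zeros to avoid dividing by zero.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]

print(SEGMENTS, DATA_ITEMS)
print(normalize_year([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Normalizing each year separately removes slow drifts such as population growth, so that segments from different years can be compared under one model.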

The foregoing is only a preferred embodiment of the present invention; although the invention has been disclosed through preferred embodiments, they are not intended to limit it. Those skilled in the art can, using the methods and technical content disclosed above, make many possible variations and modifications to the technical solution of the invention, or modify it into equivalent embodiments, without departing from the scope of the technical solution. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, remains within the scope of protection of the technical solution of the invention.

Claims

1. An online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model, characterized by comprising the following steps:

(1) collecting medicine cloud spatio-temporal data as the input of the early warning method; the medicine cloud spatio-temporal data consist of a time sequence of fixed-length feature vectors and include patients' drug dosage and medication time data; let the medication feature vector at time t be $D_t = \langle d_{t1}, d_{t2}, \dots, d_{tp} \rangle$; then $D = \langle D_1, D_2, \dots, D_T \rangle$ forms a sequence segment, where T is the maximum length of the sequence segment;

(2) feature filtering, including affine transformation and direction smoothing filtering, as follows:

(2.1) performing an affine transformation on each feature vector to map it into a p-dimensional finite space, and denoting the affine-transformed feature vector at time t as $D'_t$;

(2.2) performing feature filtering in the mapped pixel space, wherein the specific process is as follows:

(2.2.1) input: the medication time sequence segment $D = \langle D_1, D_2, \dots, D_T \rangle$ and the affine-transformed time sequence segment $D' = \langle D'_1, D'_2, \dots, D'_T \rangle$;

(2.2.2) sequentially traversing each component $D'_i$ ($i = 1, 2, \dots, T$) of $D'$;

(2.2.2.1) if $i = 1$ or $i = T$, adding $D_i$ to DA;

(2.2.2.2) calculating the Euclidean distance between the vectors $D'_{i-1}$ and $D'_i$; if the Euclidean distance is greater than the distance threshold minDis, adding $D_i$ to DA;

(2.3) direction smoothing filtering: first find the weighted main direction of the time sequence segment, then filter according to that direction; the specific process is as follows:

(2.3.1) input: the time sequence segment DA produced by the previous filtering step; output: the direction-smoothed, filtered time sequence segment DA';

(2.3.2) adding $Da_{r_1}$ to DA';

(2.3.3) setting the variable index to $r_1$ and lastAngle to -1;

(2.3.4) sequentially traversing each component $Da_{r_i}$ ($i = 2, \dots, k-1$) of DA;

(2.3.4.1) calculating the distance from $Da_{index}$ to $Da_{r_i}$, denoted $DIS_{r_i}$;

(2.3.4.2) calculating the weighted angle from $Da_{index}$ to $Da_{r_i}$, denoted $Angle_{r_i}$;

(2.3.4.3) if lastAngle is not equal to -1 and the absolute value of the difference between lastAngle and $Angle_{r_i}$ is greater than the threshold (the threshold expression is missing from the source text), adding $Da_{r_i}$ to DA' and setting index to $r_i$; otherwise, $Da_{r_i}$ is filtered out;

(2.3.4.4) setting lastAngle = $Angle_{r_i}$;

(2.3.5) finally, adding $Da_{r_k}$ to DA';

(3) statistical generation model calculation: generating a probability distribution model of the time sequence segments based on the medicine cloud spatio-temporal historical data of step (1), wherein the probability distribution of a time sequence segment is assumed a priori to be a Gaussian mixture function, defined as follows:

(3.1) the initial M takes a value in [1, 5]; N time sequence segments $D^{(1)}, D^{(2)}, \dots, D^{(N)}$ are selected from the historical data, and an initial Gaussian mixture model is generated using the standard EM algorithm;

(3.2) continuously updating the initial Gaussian mixture model as new time sequence segment data arrive, wherein the updating process is as follows:

(3.2.1) waiting until R new time sequence segments have arrived, denoted $ND^{(1)}, ND^{(2)}, \dots, ND^{(R)}$;

(3.2.2) setting $j = 1$ and $L = \{\}$, and letting H be the current Gaussian mixture model;

(3.2.3) $E^{(j)} = \{E_1, E_2, \dots, E_M\} = \{N(ND^{(j)} \mid u_i, \Sigma_i) \mid i \in \{1, 2, \dots, M\}\}$, i.e., for each newly arrived segment $ND^{(j)}$, calculating the value of each Gaussian component;

(3.2.4) normalizing $E^{(j)}$;

(3.2.5) $I = \mathrm{argmax}(E^{(j)})$, $V = \max(E^{(j)})$;

(3.2.6) if $V > 0.5$, then $L = L \cup \{ND^{(j)}\}$; otherwise, executing step (3.2.8);

(3.2.7) if $|L| \ge N$, performing Gaussian mixture clustering on all data in L using the EM algorithm to obtain a new model HL, setting $H = H \cup HL$, and setting $L = \{\}$;

(3.2.8) assigning $ND^{(j)}$ to the I-th Gaussian component of H, and recalculating the mean of the I-th Gaussian component;

(3.2.9) setting $j = j + 1$; if $j > R$, the algorithm ends; otherwise, returning to step (3.2.3);

(4) early warning judgment: if the size of the set L has remained smaller than N after T batches of new data have arrived, early warning judgment is started and early warnings are issued for low-probability time sequence segments.

2. The online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model according to claim 1, wherein in step (2.1), an affine transformation is performed on each feature vector to map it into a p-dimensional finite space; the maximum length of each dimension is $L_i$, $i \in \{1, 2, \dots, p\}$, and the value range of each dimension is $[0, L_i]$; denoting the affine-transformed feature vector at time t as $D'_t$, the affine transformation is defined by the following formula:

3. The online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model according to claim 1, wherein in step (2.3.4.2), the weighted angle $Angle_{r_i}$ is calculated as follows:

4. The online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model according to claim 1, wherein in step (3.2.8), the mean of the I-th component is recalculated according to the following formula:

5. The online early warning method for big data anomalies on a medicine cloud platform based on a statistical generation model according to claim 1, wherein in step (4), the early warning judgment substitutes each new time sequence segment into the Gaussian mixture model; if the calculated value is less than 0.1, a low-probability time sequence segment has occurred, and an early warning is issued for that segment.