CN110347727B

CN110347727B - Filtering method for correlation of health and air quality data based on multi-level mutual information

Info

Publication number: CN110347727B
Application number: CN201910656088.5A
Authority: CN
Inventors: 强星; 乐卫清; 潘卫东; 花月明
Original assignee: Nanjing Meihua Software System Co ltd
Current assignee: Nanjing Meihua Software System Co ltd
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2023-04-07
Anticipated expiration: 2039-07-19
Also published as: CN110347727A

Abstract

A method for filtering correlations between health data and air quality data based on multi-level mutual information. Human health data and air quality index data are input into an adaptive multi-level mutual information calculation method. The time series corresponding to the input data set is scanned and sampled by a sliding time window, level by level, at multiple time window granularities; the mutual information value of each window is calculated with the KSG estimation algorithm; and a mutual information threshold, set using a two-step standardization method, is used to filter out the set of windows in which the human health data and the air quality index data are correlated. Compared with traditional mutual information calculation methods, the method has higher processing accuracy, is less constrained by the input data, and under the same hardware conditions computes mutual information faster than most methods.

Description

Filtering method for correlation of health and air quality data based on multi-level mutual information

Technical Field

The invention belongs to the technical field of health information, and particularly relates to a filtering method for data correlation between human health data and an air quality index based on an adaptive multi-level mutual information calculation method.

Background

In recent years, the development and maturation of smart wearable devices has made round-the-clock collection of human health data far more convenient, and the increasingly abundant human health big data creates the conditions for intelligent analysis. If massive, rich information is only analyzed superficially or merely summarized, data resources and the information they contain are largely wasted. In addition, data of a single dimension can only reveal problems within that dimension, and most of its intrinsic value is difficult to mine. By introducing air quality index data and combining it with human health data, the rich potential correlations between the two dimensions can be mined and the core advantages of big data can be realized. Because air quality index forecasting is mature, combining its predictability with the potential associations between human health data and air quality data makes it possible to offer effective advice for personal behavior decisions that maximizes physical health, enabling judgments to be made in advance and delivering considerable value.

The collected human health and air quality index data are large in volume, grow rapidly, have complex interrelations, and are noisy, which poses many challenges to existing data processing technology. For human health and air quality index data with large volumes and long time series, correlation analysis between data of different dimensions remains a difficult problem: because the data span long periods and carry a huge amount of information, traditional correlation analysis methods struggle to find the valuable time windows in which to analyze the correlation between the human health data and the air quality index data. Even when a valuable time window is chosen, the challenge remains to determine under what conditions the two sets of data are most strongly correlated, rather than simply whether a correlation exists within the window. Beyond the sheer amount of data over a long time series, the speed of data generation and the complexity of the data also challenge existing correlation analysis techniques, since large amounts of noisy, structured and unstructured data are generated rapidly. Existing correlation analysis across multiple groups of big data cannot deliver efficient analysis or establish a ranking of correlations. In existing mutual information calculation methods, setting the mutual information threshold depends on prior knowledge of the characteristics of the input human health data and air quality index data; the bounds of the threshold are difficult to determine, and trial and error is often needed to approach the ideal threshold range.

Traditional correlation analysis between multidimensional data can mine part of the intrinsic correlation between human health data and air quality index data, but it falls short when faced with today's huge, complex, long-term, and rapidly generated data.

Disclosure of Invention

The invention aims to overcome the above defects and provides a method for filtering the correlation between health data and air quality data based on multi-level mutual information. The method inputs human health data and air quality index data into an adaptive multi-level mutual information calculation method; scans and samples the time series corresponding to the input data set with a sliding time window, level by level, at multiple time window granularities; calculates the mutual information value of each window with the KSG (Kraskov-Stögbauer-Grassberger) estimation algorithm, a k-nearest-neighbor estimator; and filters with a mutual information threshold set using a two-step standardization method to obtain the set of windows in which the human health data and the air quality index data are correlated. Compared with traditional mutual information calculation methods, the method has higher processing accuracy, is less constrained by the input data, and under the same hardware conditions computes mutual information faster than most methods.

The technical scheme of the invention is as follows:

The invention provides a method for filtering the data correlation between human health data and an air quality index based on an adaptive multi-level mutual information algorithm model, comprising the following steps:

S1, selecting one index from the long-period human health data and one index from the long-period air quality index data as an input data set, and inputting it into the adaptive multi-level mutual information algorithm model;

S2, computing with the adaptive multi-level mutual information algorithm model: at each of multiple time window granularity levels, sampling the time series corresponding to the input data set through a time window; calculating the mutual information value of all sampled data points in the time window; comparing the mutual information value with a mutual information threshold (σ) to obtain all data points that satisfy the mutual information threshold condition at that granularity level; sinking the filtered data list, which holds the data that does not satisfy the threshold condition at that level, to the next-level time window granularity; and repeating the process at the reduced granularity until the time window granularity reaches the minimum time window granularity or the filtered data list is empty;

S3, outputting all data points that satisfy the mutual information threshold condition at each time window granularity level as the set of correlated time windows. The output set is generally ordered by the processing order of the adaptive multi-level mutual information calculation method; the invention also supports sorting by the mutual information value of the time windows in the set, so that the windows are listed from strongest to weakest correlation in the output.

The long-period human health data is a long-time-series (multi-year) human health data set for a population in a single city, at minute resolution, including indexes such as heartbeat, blood oxygen saturation, and blood pressure. The long-period air quality index data is a long-time-series (multi-year) air quality index data set for the city where that population lives, at minute resolution, including indexes such as inhalable suspended particulate concentration, smoke and dust concentration, and nitrogen dioxide concentration.

Adaptive means that the filtered data list of the current level (that is, the data that does not satisfy the correlation screening condition) is used as the input data of the next level, and the time window granularity of the next level is reduced.

Further, before the input data set is input into the adaptive multi-level mutual information algorithm model, the maximum time window granularity, the minimum time window granularity, and the sliding step of the time window used to sample the input data set are selected in advance in the model.

Further, the step S2 specifically includes:

S21, starting from the maximum time window granularity level, sampling the data points of the time series corresponding to the input data set through a time window at the current granularity, and calculating the mutual information value of all sampled data points in the time window using the KSG (Kraskov-Stögbauer-Grassberger) estimation method; comparing the mutual information value with the pre-selected mutual information threshold (σ); storing the data points that satisfy the threshold condition (mutual information value ≥ σ) and removing them from the current time series, which reduces the remaining data and hence the subsequent computational load, with the next time window sampling from the leftmost end of the unscanned data; storing the data points that do not satisfy the threshold condition (mutual information value < σ) in a filtered data list; the data points that satisfy the threshold condition are stored in a list of time windows satisfying the threshold condition, which is an incremental list into which all qualifying data points are inserted until the loop ends;

S22, sliding the time window along the time series of the input data set by the sliding step and repeating the above process until the whole time series corresponding to the input data set has been scanned at the current granularity, thereby obtaining all data points that satisfy the mutual information threshold condition at the current granularity together with the filtered data list for the current granularity, which completes the calculation at the current time window granularity;

S23, sinking the filtered data list of the current granularity to the next-level time window granularity; then taking the filtered data list as the input data and the next-level granularity as the current granularity, and repeating the above process to complete the calculation at each granularity level, until the current time window granularity is reduced to the minimum time window granularity or the filtered data list is empty.
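By way of illustration only, the loop of steps S21 to S23 may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `adaptive_multilevel_filter`, the `(t, x, y)` tuple layout, and measuring granularity and step in data points are all hypothetical, and the `mi` argument stands in for the KSG estimate. After accepted windows are removed, the remaining points are simply concatenated, so a window at a finer level may span a gap in time.

```python
def adaptive_multilevel_filter(series, mi, sigma, max_gran, min_gran, step):
    """Return (accepted_windows, residual_points).

    series: time-ordered list of (t, x, y) tuples (hypothetical layout).
    mi:     callable (xs, ys) -> float, a stand-in for the KSG estimate.
    Granularities and step are counted in data points (an assumption).
    """
    if step < 1 or min_gran < 1:
        raise ValueError("step and min_gran must be at least 1")
    residual = list(series)           # data not yet explained by any window
    accepted = []                     # (granularity, t_first, t_last, mi_value)
    gran = max_gran
    while gran >= min_gran and residual:
        kept = []                     # filtered data list for this level
        start = 0
        while start + gran <= len(residual):
            window = residual[start:start + gran]
            value = mi([p[1] for p in window], [p[2] for p in window])
            if value >= sigma:
                # Window satisfies the threshold: store it and remove its points.
                accepted.append((gran, window[0][0], window[-1][0], value))
                start += gran         # resume at the leftmost unscanned point
            else:
                # Points that slide out of the window stay in the filtered list.
                kept.extend(residual[start:start + step])
                start += step
        kept.extend(residual[start:])  # tail too short for a full window
        residual = kept               # sink to the next, finer level
        gran -= step                  # next granularity = current minus step
    return accepted, residual
```

Calling the function with a real estimator in place of `mi` yields the accepted windows in processing order; sorting the returned list by its last field gives the ordering by mutual information value mentioned in step S3.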

Taking inhalable suspended particulate concentration data $X = \{x_t\}$ and blood oxygen saturation index data $Y = \{y_t\}$, $t \in T$, over a given long time series as an example, a time window $w_{X,Y}$ of $(X, Y)$ is a structured, time-ordered sequence of time-stamped data points $(x_t, y_t)$ collected over a continuous time interval; wherein $x_t$ denotes the $t$-th data point of the inhalable suspended particulate concentration data over the given long time series; $y_t$ denotes the $t$-th data point of the blood oxygen saturation index data over the given long time series; $N$ denotes the number of sample data points of the inhalable suspended particulate concentration data or of the blood oxygen saturation index data over the given long time series; and $T$ denotes the time-ordered sequence of the sample data points of those data.

The time window granularity is the temporal unit of the time window; for example, for data sets collected in windows of hours, days, weeks, and months, the time window granularities are hour, day, week, and month respectively. The sliding step is the distance the time window moves from the current time window to the next. The selected maximum time window granularity is a granularity chosen to cover the whole long time series; the granularity is then reduced step by step to uncover data correlations in depth.
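The relationship between granularity and sliding step can be illustrated with a short sketch. It assumes, for illustration only, that the series is held at minute resolution so that granularities are expressed as a number of minutes; the names `MINUTES_PER` and `sliding_windows` are hypothetical.

```python
# Granularities expressed in minutes, assuming minute-resolution data.
MINUTES_PER = {"hour": 60, "day": 24 * 60, "week": 7 * 24 * 60}


def sliding_windows(n_points, granularity, step):
    """Yield (start, end) index pairs, end exclusive, for each full window."""
    if granularity < 1 or step < 1:
        raise ValueError("granularity and step must be at least 1")
    start = 0
    while start + granularity <= n_points:
        yield start, start + granularity
        start += step


# Ten points, windows of four points, sliding two points at a time.
print(list(sliding_windows(10, 4, 2)))  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

A step smaller than the granularity produces overlapping windows, which is the case in which only the newly added points need to be processed.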

Further, data that satisfies the mutual information threshold condition is data with strong correlation, and data that does not satisfy it is data without strong correlation; data with strong correlation is data whose mutual information value is greater than or equal to the mutual information threshold.

Further, in step S23, the next-level time window granularity is smaller than the current-level granularity (the difference between the two is generally fixed and equal to the sliding step), and the filtered data list is scanned by sliding at the finer next-level granularity; the finer granularity is the new interval size obtained by subtracting the preset sliding step from the granularity of the current sliding scan.

Further, in step S21, the step of calculating the mutual information value includes:

S211, for each sampled data point $p_i = (x_i, y_i)$, searching for its k nearest neighbors using a grid-assisted algorithm, where k is the number of nearest neighbors;

S212, for each sampled data point $p_i = (x_i, y_i)$, tracking the changes that the addition of new data points or the removal of old data points within $(x_i \pm d_x, y_i \pm d_y)$ causes to the k nearest neighbors of $p_i$ and to the edge-region data point counts $(n_x, n_y)$; where $d_i = (d_x, d_y)$ is the distance from the data point $p_i = (x_i, y_i)$ to the boundary of its grid region, and $(n_x, n_y)$ are the numbers of edge-region data points within that distance in each dimension;

S213, when an added new data point or a removed old data point lies within $(x_i \pm d_x, y_i \pm d_y)$ and changes the k nearest neighbors of $p_i$, searching again for the new k nearest neighbors to complete the update of the k nearest neighbors; when the added new data point or removed old data point does not lie within $(x_i \pm d_x, y_i \pm d_y)$ and the k nearest neighbors of $p_i$ remain unchanged, recounting the edge-region data points of $p_i$ to complete the update of $(n_x, n_y)$;

S214, using the latest k nearest neighbors and the latest edge-region data point counts $(n_x, n_y)$ as parameters, calculating the mutual information value of all sampled data points in the time window:

$$I(X, Y) = \psi(k) - \frac{1}{k} - \langle \psi(n_x) + \psi(n_y) \rangle + \psi(N)$$

where $I(X, Y)$ is the mutual information value of all sampled data points in the time window, $\psi$ is the digamma function, $k$ is the number of nearest neighbors, $(n_x, n_y)$ are the numbers of edge-region data points within the distance $d$ in each dimension, $N$ is the total number of sampled data points, and $\langle \psi(n_x) + \psi(n_y) \rangle$ is the mean of $\psi(n_x) + \psi(n_y)$.
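For illustration, the estimate of step S214 can be sketched in plain Python. This sketch assumes the estimator takes the form of the second KSG estimator, $I = \psi(k) - 1/k - \langle \psi(n_x) + \psi(n_y) \rangle + \psi(N)$, with neighbors measured in the maximum norm; it uses a brute-force neighbor search instead of the grid-assisted, incremental search of steps S211 to S213, and the names `digamma_int` and `ksg_mutual_information` are hypothetical.

```python
EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def ksg_mutual_information(xs, ys, k=4):
    """Brute-force KSG-style estimate of I(X, Y) in nats (O(N^2 log N))."""
    n = len(xs)
    if n != len(ys) or n <= k:
        raise ValueError("need equal-length series with more than k points")
    total = 0.0
    for i in range(n):
        # Max-norm distances from point i to every other point.
        dists = sorted(
            (max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j])), j)
            for j in range(n) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        # Extent of the k-neighbor rectangle in each dimension.
        eps_x = max(abs(xs[i] - xs[j]) for j in neighbors)
        eps_y = max(abs(ys[i] - ys[j]) for j in neighbors)
        # Marginal counts within those extents (each is at least k).
        n_x = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) <= eps_x)
        n_y = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) <= eps_y)
        total += digamma_int(n_x) + digamma_int(n_y)
    return digamma_int(k) - 1.0 / k - total / n + digamma_int(n)
```

Because all arguments of $\psi$ are positive integers here, the harmonic-sum form of the digamma function suffices and no numerical library is needed. The estimate can be slightly negative for independent data, which is a known property of nearest-neighbor estimators rather than an error.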

Further, when the current time window contains data points of the time series that were also contained in the previous time window, the mutual information calculation is performed only for the data points newly added in the current time window relative to the previous time window.
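The bookkeeping behind this incremental update can be illustrated as follows. The sketch only identifies which indices enter and leave when one window follows another; how the estimator then updates neighbors and counts is not shown. The name `window_delta` and the half-open `(start, end)` convention are assumptions made for the example.

```python
def window_delta(prev, curr):
    """Return (added, removed) index lists between two half-open windows."""
    prev_start, prev_end = prev
    curr_start, curr_end = curr
    # Indices present in the current window but not in the previous one.
    added = [t for t in range(curr_start, curr_end)
             if not prev_start <= t < prev_end]
    # Indices present in the previous window but not in the current one.
    removed = [t for t in range(prev_start, prev_end)
               if not curr_start <= t < curr_end]
    return added, removed


# A window of four points sliding forward by two points.
print(window_delta((0, 4), (2, 6)))  # ([4, 5], [0, 1])
```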

Further, the mutual information threshold is a non-negative minimum value indicating that the data is correlated; for example, a correlation value of a time window that is greater than or equal to σ indicates that the data is correlated within that time window. The mutual information threshold is pre-selected by a two-step normalization method or a data coverage method.

Further, when the characteristics of the input human health data and air quality index data are not known in advance, the first method, two-step normalization, can be selected to calculate the mutual information threshold. The two-step normalization method filters windows using normalized entropy and normalized mutual information; its advantage is that it provides a reliable range $[0, 1]$, so the user does not need to know the characteristics of the input human health data and air quality index data in advance. The two-step normalization method for selecting the mutual information threshold is as follows. For a time window $\omega_{X,Y} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with n pairs of data point samples obtained by sliding sampling, first, the entropy of the time window is normalized by the maximum possible entropy to obtain the maximum-possible-entropy-normalized entropy $\hat{H}_\omega$, expressed as:

$$\hat{H}_\omega = \frac{H_\omega}{\max(H_\omega)} = \frac{H_\omega}{\log(n)}$$

where $H_\omega$ is the entropy of the time window; $\max(H_\omega)$ is the maximum possible entropy of the time window; $\log(n)$ is the entropy when the data point samples in the time window are uniformly distributed; and $\hat{H}_\omega$ is the maximum-possible-entropy-normalized entropy;

second, the mutual information value is normalized by the maximum possible entropy of the time window to obtain the maximum-possible-entropy-normalized mutual information value $\hat{I}_\omega$, expressed as:

$$\hat{I}_\omega = \frac{I_\omega}{\max(H_\omega)} = \frac{I_\omega}{\log(n)}$$

where $I_\omega$ is the mutual information value of all sampled data points within the time window; $H_\omega$ is the entropy of the time window; $\hat{I}_\omega$ is the maximum-possible-entropy-normalized mutual information value; $\max(H_\omega)$ is the maximum possible entropy of the time window; and $\log(n)$ is the entropy when the data point samples in the time window are uniformly distributed;

then, the mutual information value of all sampled data points in the time window is normalized by the entropy of the time window to obtain the entropy-normalized mutual information value $\tilde{I}_\omega$, expressed as:

$$\tilde{I}_\omega = \frac{I_\omega}{H_\omega}$$

where $I_\omega$ is the mutual information value of all sampled data points within the time window; $H_\omega$ is the entropy of the time window; and $\tilde{I}_\omega$ is the entropy-normalized mutual information value, which represents the actual reduction in uncertainty about one variable given the other within the time window;

finally, the time windows satisfying the conditions are filtered out, and the mutual information value of all sampled data points in such a time window is taken as the mutual information threshold, as follows:

(1) if $\hat{H}_\omega \geq \sigma_H$, the time window is selected;

(2) for a time window selected in step (1), if $\hat{I}_\omega \geq \sigma_I$ or $\tilde{I}_\omega \geq \sigma_I$, the time window is selected as a time window satisfying the conditions, and the mutual information value of all sampled data points in that time window is taken as the mutual information threshold;

where $\sigma_H$ is the normalized entropy threshold and $\sigma_I$ is the normalized mutual information threshold, usually taken as $\sigma_H = \sigma_I = 0.2$. Based on a large number of experimental results, setting $\sigma_H = \sigma_I = 0.2$ ensures that the threshold is neither so low that time windows with no valuable correlation between the human health data and the air quality index data are selected, nor so high that time windows with valuable correlation are missed.
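The two-step test can be illustrated with a short sketch. It assumes that the entropy is normalized as $H_\omega / \log(n)$, that the two normalized mutual information values are $I_\omega / \log(n)$ and $I_\omega / H_\omega$, and that a window passes a step when the normalized value is at least the corresponding threshold; these directions are an interpretation of the text, and the function name `passes_two_step` is hypothetical. The entropy and mutual information of the window are taken as inputs rather than computed here.

```python
import math


def passes_two_step(h_w, i_w, n, sigma_h=0.2, sigma_i=0.2):
    """Return True if a window with entropy h_w and mutual information i_w
    (both in nats, from n sample pairs) passes both normalization steps."""
    if n < 2:
        return False                  # log(n) would be zero or undefined
    max_h = math.log(n)               # entropy of a uniform distribution
    # Step 1: normalized entropy must reach the entropy threshold.
    if h_w / max_h < sigma_h:
        return False
    # Step 2: either normalized mutual information must reach its threshold.
    i_hat = i_w / max_h
    i_tilde = i_w / h_w if h_w > 0 else 0.0
    return i_hat >= sigma_i or i_tilde >= sigma_i
```

Because both normalized quantities lie in a bounded range, the same default thresholds can be applied to windows of different sizes without prior knowledge of the data.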

Further, when the characteristics of the input human health data and air quality index data are known in advance, the second method, data coverage, can be selected to calculate the mutual information threshold (σ):

$$\text{data coverage} = \frac{\text{amount of data covered by the selected time windows}}{\text{total amount of data in the total sample}}$$

the data coverage is the ratio of the amount of data covered by the selected time windows to the total amount of data in the total sample;

the user selects a time window from the pair of long-time-series human health data and air quality index data; the number of samples in that time window is obtained through the adaptive multi-level mutual information calculation method; the data coverage of the time window over the whole time series is obtained from the data coverage formula; and the user adjusts the mutual information threshold (σ) and reruns the multi-level mutual information calculation method to approach a satisfactory data coverage, thereby obtaining the ideal mutual information threshold (σ).

The satisfactory data coverage depends on the type of data collected: 1) for artificially synthesized data (generated from known correlations), the expected coverage may be relatively high; 2) for real collected data, which has a high noise content and in which correlations are rare, the expected data coverage is low. The human health data and air quality index data addressed by the invention belong to the latter.
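The threshold adjustment by data coverage can be illustrated as follows. This is a simplified sketch: instead of rerunning the multi-level calculation for every candidate threshold, it assumes a precomputed list of non-overlapping windows given as `(n_points, mi_value)` pairs, and it picks the candidate whose coverage is closest to a target. The names `data_coverage` and `tune_threshold`, the candidate list, and the target value are all hypothetical.

```python
def data_coverage(windows, sigma, total_points):
    """Fraction of all data points lying in windows with MI >= sigma.

    windows: list of (n_points, mi_value) for non-overlapping windows.
    """
    covered = sum(n for n, mi in windows if mi >= sigma)
    return covered / total_points


def tune_threshold(windows, total_points, target, candidates):
    """Return the candidate threshold whose coverage is closest to target."""
    return min(
        candidates,
        key=lambda s: abs(data_coverage(windows, s, total_points) - target),
    )
```

For noisy real-world data the target coverage would be set low, in line with the expectation stated above that correlations are rare in such data.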

Further, the selected index of the human health data is any one of heartbeat, blood oxygen saturation, and blood pressure; the selected index of the air quality index data is any one of inhalable suspended particulate concentration, smoke and dust concentration, and nitrogen dioxide concentration; the input data set is a combination of any one index of the collected human health data with any one index of the collected air quality index data.

The invention has the beneficial effects that:

Through an adaptive multi-level mutual information calculation method, the invention effectively filters the valuable correlated data between human health data and air quality index data, extracts the intrinsic correlation between data of different dimensions, and outputs the most valuable correlated data. The invention uses an adaptive hierarchical model: the time window granularity for data sampling decreases level by level so that sampling precision increases level by level, data that satisfies the correlation condition is removed from the time series, and only the remaining data is passed to the next level, which effectively addresses the problems of large data volume and heavy computational load. The invention uses an efficient KSG estimation method, improved with a grid-assisted algorithm for the k-nearest-neighbor search and with incremental calculation, which raises the efficiency and precision of the mutual information calculation and effectively addresses the problem of noisy data. The invention integrates a two-step normalization method, which compares the normalized entropy with the entropy threshold and the two normalized mutual information values with the mutual information threshold, and effectively addresses the problem that the characteristics of the input human health data and air quality index data are not known in advance.

Drawings

FIG. 1 is a basic flow diagram of the present invention.

FIG. 2 is a schematic diagram of sliding sampling of a time window in the adaptive multi-level mutual information calculation method according to the present invention.

FIG. 3 is a schematic diagram of the k-nearest-neighbor search performed by the enhanced grid-assisted algorithm of the present invention.

FIG. 4 is a schematic diagram of how adding a data point in the enhanced grid-assisted algorithm of the present invention triggers adjustment of the affected region and the affected edge region.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

Example 1

Referring to FIG. 1, a method for filtering the data correlation between human health data and an air quality index based on an adaptive multi-level mutual information algorithm model first selects, in the model, the maximum time window granularity, the minimum time window granularity, and the sliding step of the time window used to sample the input data set, and then calculates and filters the data correlation between the human health data and the air quality index in the model. The method comprises the following steps:

S1, selecting one index from the long-period human health data and one index from the long-period air quality index data as an input data set, and inputting it into the adaptive multi-level mutual information algorithm model; the selected index of the human health data is any one of heartbeat, blood oxygen saturation, and blood pressure; the selected index of the air quality index data is any one of inhalable suspended particulate concentration, smoke and dust concentration, and nitrogen dioxide concentration; the input data set is a combination of any one index of the collected human health data with any one index of the collected air quality index data.

S2, computing with the adaptive multi-level mutual information algorithm model: at each of multiple time window granularity levels, sampling the time series corresponding to the input data set through a time window; calculating the mutual information value of all sampled data points in the time window; comparing the mutual information value with the mutual information threshold to obtain all data points that satisfy the mutual information threshold condition at that granularity level; sinking the filtered data list, which holds the data that does not satisfy the threshold condition at that level, to the next-level time window granularity; and repeating the process at the reduced granularity until the time window granularity reaches the minimum time window granularity or the filtered data list is empty. When the current time window contains data points of the time series that were also contained in the previous time window, the mutual information calculation is performed only for the data points newly added in the current time window relative to the previous time window. Specifically:

S21, starting from the maximum time window granularity level, sampling the data points of the long time series corresponding to the input data set through a time window at the current granularity, and calculating the mutual information value of all sampled data points in the time window using the KSG (Kraskov-Stögbauer-Grassberger) estimation method; comparing the obtained mutual information value with the pre-selected mutual information threshold; storing the data points that satisfy the threshold condition and removing them from the current time series, and storing the data points that do not satisfy the threshold condition in a filtered data list. Data that satisfies the mutual information threshold condition is data with strong correlation, and data that does not satisfy it is data without strong correlation; data with strong correlation is data whose mutual information value is greater than or equal to the mutual information threshold.

Wherein, the step of calculating the mutual information value is as follows:

S211, for each sampled data point $p_i = (x_i, y_i)$, searching for its k nearest neighbors using a grid-assisted algorithm, where k is the number of nearest neighbors;

S212, for each sampled data point $p_i = (x_i, y_i)$, tracking the changes that the addition of new data points or the removal of old data points within $(x_i \pm d_x, y_i \pm d_y)$ causes to the k nearest neighbors of $p_i$ and to the edge-region data point counts $(n_x, n_y)$; where $d_i = (d_x, d_y)$ is the distance from the data point $p_i = (x_i, y_i)$ to the boundary of its grid region, and $(n_x, n_y)$ are the numbers of edge-region data points within that distance in each dimension;

S213, when an added new data point or a removed old data point lies within $(x_i \pm d_x, y_i \pm d_y)$ and changes the k nearest neighbors of $p_i$, searching again for the new k nearest neighbors to complete the update of the k nearest neighbors; when the added new data point or removed old data point does not lie within $(x_i \pm d_x, y_i \pm d_y)$ and the k nearest neighbors of $p_i$ remain unchanged, recounting the edge-region data points of $p_i$ to complete the update of $(n_x, n_y)$;

S214, using the latest k nearest neighbors and the latest edge-region data point counts $(n_x, n_y)$ as parameters, calculating the mutual information value of all sampled data points in the time window:

$$I(X, Y) = \psi(k) - \frac{1}{k} - \langle \psi(n_x) + \psi(n_y) \rangle + \psi(N)$$

where $I(X, Y)$ is the mutual information value of all sampled data points in the time window, $\psi$ is the digamma function, $k$ is the number of nearest neighbors, $(n_x, n_y)$ are the numbers of edge-region data points within the distance $d$ in each dimension, $N$ is the total number of sampled data points, and $\langle \psi(n_x) + \psi(n_y) \rangle$ is the mean of $\psi(n_x) + \psi(n_y)$.

S23, sinking the filtered data list of the current granularity to the next-level time window granularity; then taking the filtered data list as the input data and the next-level granularity as the current granularity, and repeating the above process to complete the calculation at each granularity level, until the current time window granularity is reduced to the minimum time window granularity or the filtered data list is empty. The next-level time window granularity is smaller than the current-level granularity, and the filtered data list is scanned by sliding at the finer next-level granularity; the finer granularity is the new interval size obtained by subtracting the preset sliding step from the granularity of the current sliding scan.

And S3, outputting all data points which meet the mutual information threshold condition on the granularity of each level of time window as a time window set with correlation.
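The grid-assisted k-nearest-neighbor search used in the mutual information calculation of step S2 can be illustrated with a uniform grid. The grid layout is not spelled out in the text, so this sketch shows one common realization under assumptions: points are hashed into square cells of side `cell`, and the search visits square rings of cells around the query point until no unvisited cell can hold a closer neighbor in the maximum norm. The names `build_grid` and `knn_grid` are hypothetical, and the incremental tracking of added and removed points is not shown.

```python
from collections import defaultdict


def build_grid(points, cell):
    """Hash each point index into the square cell that contains it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(idx)
    return grid


def knn_grid(points, grid, cell, i, k):
    """Return the indices of the k nearest neighbors of point i (max norm)."""
    if len(points) <= k:
        raise ValueError("need more than k points")
    xi, yi = points[i]
    cx, cy = int(xi // cell), int(yi // cell)
    found = []                        # (distance, index) pairs seen so far
    ring = 0
    while True:
        # Visit only the cells on the square ring at cell distance `ring`.
        for gx in range(cx - ring, cx + ring + 1):
            for gy in range(cy - ring, cy + ring + 1):
                if max(abs(gx - cx), abs(gy - cy)) != ring:
                    continue
                for j in grid.get((gx, gy), ()):
                    if j != i:
                        xj, yj = points[j]
                        found.append((max(abs(xi - xj), abs(yi - yj)), j))
        found.sort()
        # Every point in an unvisited ring is farther than ring * cell away,
        # so the search can stop once the k-th distance is within that bound.
        if len(found) >= k and found[k - 1][0] <= ring * cell:
            return [j for _, j in found[:k]]
        ring += 1
```

With a cell size matched to the typical neighbor distance, most queries finish after one or two rings, which is where the saving over a full scan comes from.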

The mutual information threshold value is a minimum value which is not negative and represents that data has correlation, and is pre-selected by a two-step standardization method or a data coverage rate method.

When the characteristics of the input human health data and air quality index data are not known in advance, the first method, two-step normalization, can be selected to calculate the mutual information threshold. The two-step normalization method filters windows using normalized entropy and normalized mutual information, and selects the mutual information threshold as follows. For a time window $\omega_{X,Y} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with n pairs of data point samples obtained by sliding sampling, first, the entropy of the time window is normalized by the maximum possible entropy to obtain the maximum-possible-entropy-normalized entropy $\hat{H}_\omega$, expressed as:

$$\hat{H}_\omega = \frac{H_\omega}{\max(H_\omega)} = \frac{H_\omega}{\log(n)}$$

where $\hat{H}_\omega$ is the maximum-possible-entropy-normalized entropy;

second, the mutual information value is normalized by the maximum possible entropy of the time window to obtain the maximum-possible-entropy-normalized mutual information value $\hat{I}_\omega$, expressed as:

$$\hat{I}_\omega = \frac{I_\omega}{\max(H_\omega)} = \frac{I_\omega}{\log(n)}$$

where $I_\omega$ is the mutual information value of all sampled data points within the time window and $H_\omega$ is the entropy of the time window;

then, the mutual information value of all sampled data points in the time window is normalized by the entropy of the time window to obtain the entropy-normalized mutual information value $\tilde{I}_\omega$, expressed as:

$$\tilde{I}_\omega = \frac{I_\omega}{H_\omega}$$

where $I_\omega$ is the mutual information value of all sampled data points within the time window and $H_\omega$ is the entropy of the time window;

(1) if $\hat{H}_\omega \geq \sigma_H$, the time window is selected;

(2) for a time window selected in step (1), if $\hat{I}_\omega \geq \sigma_I$ or $\tilde{I}_\omega \geq \sigma_I$, the time window is selected as a time window satisfying the conditions;

where $\sigma_H$ is the normalized entropy threshold and $\sigma_I$ is the normalized mutual information threshold, usually taken as $\sigma_H = \sigma_I = 0.2$.

When the characteristics of the input human health data and air quality index data are known in advance, the second method, the data coverage rate method, can be selected to calculate the mutual information threshold (σ):

the user selects a time window from the pair of long-time-series human health data and air quality index data; the number of samples in the time window is obtained through the adaptive multi-level mutual information calculation method, and the data coverage rate of the time window over the total time series is obtained according to the data coverage rate formula; the user then adjusts the mutual information threshold (σ) and reruns the multi-level mutual information calculation method until a satisfactory data coverage rate is approached, thereby obtaining the ideal mutual information threshold (σ).
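The adjustment of σ toward a satisfactory data coverage rate can be illustrated with a small search over candidate thresholds. This is a sketch under stated assumptions only: the coverage rate is assumed here to be the share of samples of the total time series that fall inside selected time windows, and `coverage_rate`, `tune_threshold` and the callable `covered_at` (which stands for rerunning the multi-level calculation with a given σ) are hypothetical names not taken from the patent.

```python
from typing import Callable, Iterable, Tuple


def coverage_rate(covered_samples: int, total_samples: int) -> float:
    """Assumed coverage definition: share of the series inside selected windows."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return covered_samples / total_samples


def tune_threshold(candidates: Iterable[float], target: float, total_samples: int,
                   covered_at: Callable[[float], int]) -> Tuple[float, float]:
    """Return (sigma, coverage) for the candidate whose coverage is closest to target.

    covered_at(sigma) is a hypothetical callable that reruns the multi-level
    calculation with threshold sigma and returns the number of covered samples.
    """
    best = None
    for sigma in candidates:
        cov = coverage_rate(covered_at(sigma), total_samples)
        if best is None or abs(cov - target) < abs(best[1] - target):
            best = (sigma, cov)
    if best is None:
        raise ValueError("no candidate thresholds given")
    return best
```

In practice the user would inspect the returned coverage and refine the candidate list, which corresponds to the interactive adjustment described above.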

Example 2

As shown in fig. 1, in this embodiment, the human health data collected over a long period (several years) are blood oxygen saturation index data, and the air quality index data collected over a long period are inhalable suspended particulate matter concentration data. The filtering method for data correlation between human health data and air quality index data based on adaptive multi-level mutual information calculation according to the present invention is used to effectively filter out valuable correlated data, and includes the following steps:

Step 1: collecting human health data over a long period (years) and air quality index data over a long period; based on actual requirements, extracting one index (the blood oxygen saturation index) from the collected human health data and one index (the inhalable suspended particulate matter concentration) from the collected air quality index data in advance, and taking the set of blood oxygen saturation index and inhalable suspended particulate matter concentration data as the input data; wherein the human health data are accurate to the minute and comprise indexes such as heartbeat, blood oxygen saturation and blood pressure, and the air quality index data are accurate to the minute and comprise indexes such as inhalable suspended particulate matter concentration, smoke concentration and nitrogen dioxide concentration.

Step 2: presetting the mutual information threshold for the blood oxygen saturation index and inhalable suspended particulate matter concentration data. The first preferred scheme is the two-step normalization method:

for a data time window ω_{X,Y} = {(x_1, y_1), ..., (x_n, y_n)} of n data point samples obtained by sliding sampling, the entropy of the time window is first normalized by the maximum possible entropy to obtain the normalized entropy $\hat{H}_\omega$, expressed as:

$$\hat{H}_\omega = \frac{H_\omega}{\max(H_\omega)} = \frac{H_\omega}{\log(n)}$$

wherein H_ω is the entropy value of the time window; max(H_ω) is the maximum possible entropy of the time window; and log(n) is the entropy value when the data point samples in the time window are uniformly distributed (note: the entropy is maximal when the samples X_ω and Y_ω in the time window are uniformly distributed);

secondly, the mutual information value is normalized by the entropy value of the time window to obtain the normalized mutual information value $\hat{I}_\omega$, expressed as:

$$\hat{I}_\omega = \frac{I_\omega}{H_\omega}$$

wherein I_ω is the mutual information value of the window, and H_ω is the entropy value of the window;

(1) if $\hat{H}_\omega \ge \sigma_H$, the time window is selected;

(2) for a time window selected in step (1), if $\hat{I}_\omega \ge \sigma_I$ or [alternative condition; formula missing in the source text], the time window is taken as a time window satisfying the condition;

wherein σ_H is the threshold for the normalized entropy and σ_I is the threshold for the normalized mutual information value. Typically the two thresholds are set to σ_H = σ_I = 0.2, so that time windows without valuable correlation between the blood oxygen saturation index and the inhalable suspended particulate matter concentration data are not selected because the threshold is set too low, and time windows with valuable correlation are not missed because the threshold is set too high.
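The two-step check can be illustrated as follows. This is a minimal sketch, not the patented implementation: it assumes that the normalized entropy is the window entropy divided by log(n), that the normalized mutual information is the window mutual information divided by the window entropy, and that both are compared with their thresholds using "greater than or equal to". The function name `two_step_filter` and its parameters are hypothetical.

```python
import math


def two_step_filter(mi: float, entropy: float, n: int,
                    sigma_h: float = 0.2, sigma_i: float = 0.2) -> bool:
    """Return True if a window passes both normalization steps.

    Assumptions: normalized entropy = entropy / log(n); normalized mutual
    information = mi / entropy; both compared with '>=' to their thresholds.
    """
    if n < 2 or entropy <= 0.0:
        return False  # degenerate window: nothing to normalize
    h_norm = entropy / math.log(n)
    if h_norm < sigma_h:
        return False  # step 1: entropy too low, window not selected
    i_norm = mi / entropy
    return i_norm >= sigma_i  # step 2: normalized mutual information check
```

Because both quantities are normalized, the same default of 0.2 can be applied without knowing the scale of the input indexes.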

The second preferred scheme is the data coverage rate method:

the user selects a time window from the long-time-series blood oxygen saturation index and inhalable suspended particulate matter concentration data; the number of samples in the time window is obtained through the adaptive multi-level mutual information calculation method, and the data coverage rate of the time window over the total time series is obtained according to the data coverage rate formula; the user then adjusts the mutual information threshold (σ) and reruns the multi-level mutual information calculation method until a satisfactory data coverage rate is approached, thereby obtaining the ideal mutual information threshold (σ).

Step 3: selecting the maximum time window granularity and the minimum time window granularity for time window sampling of the set of blood oxygen saturation index and inhalable suspended particulate matter concentration data, and the sliding step length of the time window; inputting the set of blood oxygen saturation index data and inhalable suspended particulate matter concentration data and, starting from the maximum time window granularity level, sampling data points of the set (X, Y) over the long time series through a time window at the current-level time window granularity; calculating the mutual information value of all sampled data points in the time window by the KSG estimation method; comparing the obtained mutual information value with the pre-selected mutual information threshold, storing the data points that meet the mutual information threshold condition and removing them from the current time series, and storing the data points that do not meet the mutual information threshold condition in the filtered data list.

At the same level of time window granularity, the size of the time window is fixed, and the time window ω_i is the segment of data from a starting point s_i to an end point e_i, where s_i and e_i are the start and end timestamps of the window. As shown in fig. 2, the window ω_1 = [s_1, e_1] is the first time window at the leftmost side of the level, and the mutual information value σ_1 of ω_1 is calculated; assuming that σ_1 does not meet the preset mutual information threshold screening condition, the sequence [s_1, e_1] is stored in the residual time window list. The time window ω_1 is then slid along the time series by the preset sliding step length (the distance from s_1 to s_2) to obtain the time window ω_2 = [s_2, e_2], and the mutual information value σ_2 of ω_2 is calculated; assuming that σ_2, the mutual information value of all sampled data points in that time window, meets the preset mutual information threshold screening condition, the sequence [s_2, e_2] is stored in the list of time windows meeting the threshold condition and removed from the current time series, and the residual time window list is updated from [s_1, e_1] to [s_1, s_2]. The window continues to slide to a third time window position, e.g. ω_3 = [s_3, e_3], as shown in the figure, and the above process is repeated until the sliding time window has covered the entire time series.
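The single-level scan just described can be sketched as follows, for illustration only. The names `scan_level` and `window_mi`, the use of half-open sample-index intervals, and the rule that scanning resumes at the end of an accepted window are assumptions made for this example; `window_mi` stands for any estimator of the mutual information of a window, such as the KSG estimation method.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # [start, end) in sample indices (assumed convention)


def scan_level(start: int, end: int, size: int, step: int, sigma: float,
               window_mi: Callable[[int, int], float]
               ) -> Tuple[List[Interval], List[Interval]]:
    """Slide a fixed-size window over [start, end) and split it by threshold.

    Returns (accepted, residual): windows whose mutual information is greater
    than sigma, and the stretches of the series left uncovered by them.
    Assumption: after a window is accepted, scanning resumes at its end.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    accepted: List[Interval] = []
    s = start
    while s + size <= end:
        if window_mi(s, s + size) > sigma:
            accepted.append((s, s + size))
            s += size  # accepted data is removed from the series
        else:
            s += step  # slide by the preset step
    # Residual = everything in [start, end) not covered by an accepted window.
    residual: List[Interval] = []
    cursor = start
    for a, b in accepted:
        if cursor < a:
            residual.append((cursor, a))
        cursor = b
    if cursor < end:
        residual.append((cursor, end))
    return accepted, residual
```

With a window size of 4 and a step of 2, a failing first window [0, 4) followed by a passing second window [2, 6) leaves [0, 2) in the residual list, which matches the update from [s_1, e_1] to [s_1, s_2] described above.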

The steps of calculating the mutual information value of window ω_i by the KSG estimation method are as follows: for each sampled data point p_i = (x_i, y_i), the k nearest neighbors of the sampled data point i are searched by an enhanced grid-assisted algorithm; for a given data point p_i = (x_i, y_i), the distance to the grid region boundary of the sampled data point i is set as d_i = (d_x, d_y), and the changes to p_i caused by adding a new data point or removing an old data point within (x_i ± d_x, y_i ± d_y) (changes in the k nearest neighbors and changes in the number of edge region data points) are tracked, with the affected points added to the affected data point array; if the added new data point or removed old data point lies within (x_i ± d_x, y_i ± d_y) and causes the k nearest neighbors of p_i to change, new k nearest neighbors are searched and the number of edge region data points is updated, completing the update of the k nearest neighbors; if the added new data point or removed old data point does not lie within (x_i ± d_x, y_i ± d_y), so that the k nearest neighbors of p_i remain unchanged, the edge region data points of p_i are counted again, completing the update of the number of edge region data points (n_x, n_y); and when the current time window contains data points of the time series that were contained in the previous time window, the mutual information value calculation is carried out only on the data points newly added to the current time window compared with the previous time window.
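The case distinction for an added or removed data point can be illustrated by a small classification routine. This is a hedged sketch: the function name `classify_update` and its return labels are hypothetical, and the third outcome ("none", for a point outside both edge strips) is an assumption added for completeness rather than something stated in the text.

```python
from typing import Tuple

Point = Tuple[float, float]


def classify_update(p: Point, d: Point, q: Point) -> str:
    """Classify how adding or removing point q affects survey point p.

    p = (x_i, y_i); d = (d_x, d_y) are the half-widths of p's grid region.
    Returns:
      "knn"   - q lies inside the rectangle, so the k nearest neighbors of p
                must be searched again and the edge counts updated;
      "count" - q lies only in an edge strip, so only n_x or n_y changes;
      "none"  - q does not affect p (assumed extra case).
    """
    in_x = abs(q[0] - p[0]) <= d[0]
    in_y = abs(q[1] - p[1]) <= d[1]
    if in_x and in_y:
        return "knn"
    if in_x or in_y:
        return "count"
    return "none"
```

Only points classified as "knn" trigger a new neighbor search, which is where the incremental scheme saves work compared with recomputing every neighborhood for each window.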

As shown in FIG. 3 and FIG. 4, assume that a set of data consists of seven data points p_0, ..., p_6, that k = 2, and that p_0 is the survey point. For p_0, p_1 and p_2 are the nearest neighbors, and the rectangular region whose boundary lies at distance (d_x, d_y) from p_0 is the affected region. For p_0, the data points in the edge region of width d_x include (p_1, p_2, p_4), so the edge region count is n_x = 3, and the data points in the edge region of width d_y include (p_1, p_2, p_3), so the edge region count is n_y = 3; the hatched portion in fig. 3 is the affected edge region. FIG. 4 shows the scene of fig. 3 after a new data point p_7 is added. Under the enhanced grid-assisted design of the present invention, the addition of data point p_7 brings two changes: first, the k nearest neighbors of p_0 change and need to be searched again; secondly, the data points in the edge region are counted again. If the process involves an overlapping region (the time window overlaps the previous time window), the k nearest neighbor calculation is needed only for the data points of the newly added part. The mutual information value of all sampled data points in the time window is calculated as:

$$I(X, Y) = \psi(k) - \frac{1}{k} - \langle \psi(n_x) + \psi(n_y) \rangle + \psi(N)$$

wherein I(X, Y) is the mutual information value of all sampled data points in the time window, ψ is the digamma function, k is the number of nearest neighbors, (n_x, n_y) are the numbers of edge region data points within the distance d in each dimension, N is the total number of sampled data points, and ⟨ψ(n_x) + ψ(n_y)⟩ is the mean of ψ(n_x) + ψ(n_y) over all sampled data points.
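For reference, a brute-force version of the estimator can be written in a few lines. The sketch below assumes the rectangle variant of the KSG estimator, I(X, Y) = ψ(k) − 1/k − ⟨ψ(n_x) + ψ(n_y)⟩ + ψ(N), with neighbors found under the maximum norm; it performs a full search for every point and therefore does not include the grid-assisted or incremental optimizations of the invention. The names `digamma_int` and `ksg_mi` are hypothetical.

```python
from typing import Sequence

EULER_GAMMA = 0.5772156649015329


def digamma_int(n: int) -> float:
    """Digamma function at a positive integer: psi(n) = -gamma + sum_{j<n} 1/j."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def ksg_mi(xs: Sequence[float], ys: Sequence[float], k: int = 3) -> float:
    """Brute-force KSG estimate (rectangle variant, assumed) in nats."""
    n = len(xs)
    if n != len(ys) or not 1 <= k < n:
        raise ValueError("need equal-length inputs and 1 <= k < n")
    total = 0.0
    for i in range(n):
        # Per-dimension distances from point i to every other point.
        diffs = [(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
                 for j in range(n) if j != i]
        # k nearest neighbors under the maximum norm.
        nearest = sorted(diffs, key=lambda d: max(d))[:k]
        d_x = max(dx for dx, _ in nearest)
        d_y = max(dy for _, dy in nearest)
        # Edge-region counts: points within d_x (resp. d_y) in one dimension.
        n_x = sum(1 for dx, _ in diffs if dx <= d_x)
        n_y = sum(1 for _, dy in diffs if dy <= d_y)
        total += digamma_int(n_x) + digamma_int(n_y)
    return digamma_int(k) - 1.0 / k - total / n + digamma_int(n)
```

Since every argument of ψ here is a positive integer, the harmonic-sum form of the digamma function is sufficient and no external library is needed. The cost is quadratic in the number of points per window, which is the overhead the grid-assisted search is meant to reduce.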

Step 4: the time window slides and scans over the time series of the input data set according to the sliding step length of the time window, and the above process is repeated until the sliding scan of the time window over the entire time series corresponding to the input data set at the current-level time window granularity is completed, yielding all data points that meet the mutual information threshold condition at the current-level time window granularity and the filtered data list of the current-level time window granularity; this completes the calculation at the current-level time window granularity, and the filtered data list at the current-level time window granularity sinks to the next-level time window granularity. The filtered data list is taken as the input data and the next-level time window granularity as the current time window granularity, and scanning is performed at the finer time window granularity: at the new level, the time window sliding scan, the KSG estimation calculation, and the comparison and filtering against the mutual information threshold (σ) are performed again; time windows meeting the threshold condition are inserted into the list of time windows satisfying the condition, and time windows not meeting the threshold condition are updated in the filtered residual time window list. After the sliding scan of the data of this level is finished, the updated filtered residual time window list of this level continues to sink, and the next level of the cycle is performed. The cycle stops when the current time window granularity is reduced to the minimum time window granularity preset by the user, or when the filtered residual time window list is empty.

Step 5: outputting, as the set of time windows having correlation, all data points of blood oxygen saturation and corresponding inhalable suspended particulate matter concentration that meet the mutual information threshold condition at each level of time window granularity. The output set of correlated time windows is generally processed and ordered according to the adaptive multi-level mutual information calculation method and output as the time window set result; the invention also supports sorting the time window set by mutual information value, so that the time windows are listed from top to bottom starting with the strongest correlation, and outputting the sorted set as the time window set result.

It should be understood by those skilled in the art that, although the blood oxygen saturation index and the inhalable suspended particulate matter concentration are selected here as the parameters for data correlation filtering, the method is applicable to human health data including indexes such as heartbeat, blood oxygen saturation and blood pressure, and to air quality index data including indexes such as inhalable suspended particulate matter concentration, smoke concentration and nitrogen dioxide concentration, which ensures the generality of the application. The invention ensures efficient data processing by using the adaptive hierarchical model. The invention uses an efficient KSG estimation method with an improved grid-assisted algorithm for the k-nearest-neighbor search and with incremental calculation, which effectively improves the efficiency of calculating the mutual information value. The invention uses a mutual information threshold comparison method based on normalized entropy and normalized mutual information values, which effectively lowers the knowledge the user needs about the input data in order to set the parameters.

The above embodiments are merely illustrative of the technical idea of the present invention, and the scope of the present invention should not be limited thereto, and any modification made based on the technical idea of the present invention falls within the scope of the present invention.

Claims

1. A filtering method for data correlation between human health data and air quality indexes based on an adaptive multi-level mutual information algorithm model is characterized by comprising the following steps of:

S2, calculating with the adaptive multi-level mutual information algorithm model: sampling, through a time window, the data of the time series corresponding to the input data set at multiple levels of time window granularity; calculating the mutual information value of all sampled data points in the time window; comparing the mutual information value with the mutual information threshold to obtain the data points that meet the mutual information threshold condition at that time window granularity level; sinking the filtered data list that does not meet the mutual information threshold condition at that level to the next-level time window granularity; and repeating the above process at the reduced granularity until the time window granularity is reduced to the minimum time window granularity or the filtered data list is empty; wherein the data meeting the mutual information threshold condition are data with strong correlation, namely data whose mutual information value is greater than the mutual information threshold, and the data not meeting the mutual information threshold condition are data without strong correlation;

2. The method for filtering data correlation between human health data and air quality index based on adaptive multi-level mutual information algorithm model according to claim 1, wherein the maximum time window granularity, the minimum time window granularity and the sliding step length of the time window used for sampling the input data set in the adaptive multi-level mutual information algorithm model are pre-selected before the input data set is input into the adaptive multi-level mutual information algorithm model.

3. The filtering method for data correlation between human health data and air quality index based on adaptive multi-level mutual information algorithm model according to claim 1, wherein the step S2 specifically comprises:

S21, starting from the maximum time window granularity level, sampling data points of the long time series corresponding to the input data set through a time window at the current-level time window granularity, and calculating the mutual information value of all sampled data points in the time window by the KSG estimation method; comparing the obtained mutual information value with the pre-selected mutual information threshold, storing the data points that meet the mutual information threshold condition and removing them from the current time series, and storing the data points that do not meet the mutual information threshold condition in the filtered data list; wherein the step of calculating the mutual information value of all sampled data points in the window by the KSG estimation method comprises: for each sampled data point p_i = (x_i, y_i), searching the k nearest neighbors of the sampled data point i by an enhanced grid-assisted algorithm; in the grid-assisted algorithm, for a given data point p_i = (x_i, y_i), the distance to the grid region boundary of the sampled data point i is set as d_i = (d_x, d_y), and the changes to p_i caused by adding a new data point or removing an old data point within (x_i ± d_x, y_i ± d_y) (changes in the k nearest neighbors and changes in the number of edge region data points) are tracked, with the affected points added to the affected data point array; if the added new data point or removed old data point lies within (x_i ± d_x, y_i ± d_y) and causes the k nearest neighbors of p_i to change, new k nearest neighbors are searched and the number of edge region data points is updated, completing the update of the k nearest neighbors; if the added new data point or removed old data point does not lie within (x_i ± d_x, y_i ± d_y), so that the k nearest neighbors of p_i remain unchanged, the edge region data points of p_i are counted again, completing the update of the number of edge region data points (n_x, n_y); when the current time window contains data points of the time series that were contained in the previous time window, the mutual information value calculation is carried out only on the data points newly added to the current time window compared with the previous time window;

S22, sliding the time window over the time series of the input data set according to the sliding step length of the time window, and repeating the above process until the sliding scan of the time window over the entire time series corresponding to the input data set at the current-level time window granularity is completed, thereby obtaining all data points that meet the mutual information threshold condition at the current-level time window granularity and the filtered data list of the current-level time window granularity, and completing the calculation at the current-level time window granularity;

4. The method for filtering data correlation between human health data and air quality index based on adaptive multi-level mutual information algorithm model as claimed in claim 3, wherein in step S23, the next-level time window granularity is smaller than the current-level time window granularity, and the filtered data list is subjected to sliding scanning at the finer, next-level time window granularity; the finer time window granularity is the new window size obtained by subtracting the preset sliding step length from the time window granularity of the current sliding scan.

5. The method for filtering data correlation between human health data and air quality index based on adaptive multi-level mutual information algorithm model as claimed in claim 1, wherein the mutual information threshold is the minimum non-negative value indicating that the data are correlated, and is pre-selected by a two-step normalization method or a data coverage rate method; the two-step normalization method for selecting the mutual information threshold comprises: for a time window ω_{X,Y} = {(x_1, y_1), ..., (x_n, y_n)} of n pairs of data point samples obtained by sliding sampling, first normalizing the entropy of the time window by the maximum possible entropy to obtain the normalized entropy $\hat{H}_\omega$, expressed as:

$$\hat{H}_\omega = \frac{H_\omega}{\max(H_\omega)} = \frac{H_\omega}{\log(n)}$$

secondly, normalizing the mutual information value by the entropy value of the time window to obtain the normalized mutual information value $\hat{I}_\omega$, expressed as:

$$\hat{I}_\omega = \frac{I_\omega}{H_\omega}$$

wherein I_ω is the mutual information value of all sampled data points in the time window, and H_ω is the entropy value of the time window;

(1) if $\hat{H}_\omega \ge \sigma_H$, the time window is selected;

(2) for a time window selected in step (1), if $\hat{I}_\omega \ge \sigma_I$ or [alternative condition; formula missing in the source text], the time window is selected as a time window satisfying the condition, and the mutual information value of all sampled data points in the time window satisfying the condition is taken as the mutual information threshold;

wherein σ_H is the threshold for the normalized entropy and σ_I is the threshold for the normalized mutual information value, with σ_H = σ_I = 0.2;

the data coverage rate method for calculating the mutual information threshold comprises: the mutual information threshold (σ) is obtained using the data coverage rate method:

6. The filtering method for data correlation between human health data and air quality index based on adaptive multi-level mutual information algorithm model according to claim 1, wherein the index data selected from the human health data is any one of heartbeat, blood oxygen saturation and blood pressure; the index data selected from the air quality index data is any one of inhalable suspended particulate matter concentration, smoke concentration and nitrogen dioxide concentration; and the input data set is a combination of any one index of the selected human health data and any one index of the selected air quality index data.