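WO2021169088A1 - Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records - Google Patents

Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records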

Info

Publication number
WO2021169088A1
WO2021169088A1 PCT/CN2020/096484 CN2020096484W WO2021169088A1 WO 2021169088 A1 WO2021169088 A1 WO 2021169088A1 CN 2020096484 W CN2020096484 W CN 2020096484W WO 2021169088 A1 WO2021169088 A1 WO 2021169088A1
Authority
WO
WIPO (PCT)
Prior art keywords
nearest neighbor
granularity
electronic health
subpopulation
super
Prior art date
Application number
PCT/CN2020/096484
Other languages
English (en)
French (fr)
Inventor
丁卫平
孙颖
李铭
鞠恒荣
冯志豪
曹金鑫
张毅
任龙杰
丁帅荣
陈森博
万杰
赵理莉
Original Assignee
南通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南通大学
Priority to AU2020331559A
Publication of WO2021169088A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Definitions

  • The present invention relates to the field of intelligent processing of medical information, and in particular to a nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records.
  • Electronic health records are electronic personal health histories that are formed when people engage in medical and health-related activities and that are worth preserving for future reference. After years of development, China has accumulated a large amount of medical and health data in the field of electronic health records.
  • Using artificial intelligence methods to automatically discover hidden medical patterns in this rich electronic health record data is of great significance and value for disease prevention, control and treatment.
  • However, because large-scale electronic health record data are highly incomplete and fuzzy, the application of traditional artificial intelligence, machine learning and data mining algorithms is greatly restricted.
  • Traditional data mining algorithms generally require training samples that do not contain large amounts of missing information, that is, they require complete data. Records with missing information are mostly deleted outright, and the data types handled are mostly symbolic or numerical; fuzzy data are converted into numerical data before processing.
  • However, the data in large-scale electronic health records are often highly incomplete, and a considerable proportion of the data in established electronic health records is missing.
  • In addition, the values of some attribute columns in electronic health record data are written in descriptive language and are strongly fuzzy. Converting all fuzzy data directly into numerical or symbolic data may lose a large amount of electronic health record information and may even affect subsequent intelligent assisted diagnosis decisions.
  • Multi-granularity computing is one of the strategies that humans usually adopt when solving problems, and it is an important manifestation of human cognitive ability.
  • Multi-granularity data modeling analyzes complex data by obtaining information granule sets and multiple granular structures, extracting usable knowledge from them and forming effective decision schemes. If the modeling uses only one granular structure, it is called single-granularity data modeling; if it uses several, it is called multi-granularity data modeling. Multi-granularity data analysis can examine a problem from multiple angles and levels and so arrive at more reasonable and satisfactory solutions. As one of the important characteristics of human cognition, multi-granularity plays an important role in data mining and knowledge discovery for complex data. Therefore, in the context of medical big-data applications, proposing an effective multi-granularity collaborative knowledge reduction method for the mixed incomplete and fuzzy data in large-scale electronic health records is of significant value for large-scale electronic health record decision support analysis.
  • The purpose of the present invention is to disclose a nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records that reduces execution time, improves the accuracy of the reduction, lowers its complexity cost on the Spark cloud computing platform, and lays a good foundation for intelligent services such as electronic health record feature selection, rule mining and clinical decision support.
  • The invention discloses a nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records, which includes the following steps:
  • The specific steps of step B are as follows:
  • The shared nearest-neighbor vector is used to represent the set of nearest-neighbor radii in layer d_i as:
  • tf(R_j) is the frequency of occurrence of the nearest-neighbor radius R_j in layer d_i
  • df(R_j) is the layer frequency of the weight vector w_j at the nearest-neighbor radius R_j
  • corr(f_i, f_j) denotes the inner product of the two feature vectors f_i and f_j
  • Df(R_i R_j) is the total number of nearest-neighbor vectors that contain the nearest-neighbor radii R_i and R_j
  • ξ_i is the number of super elites Super-Elitist_i in the i-th nearest-neighbor radius used for knowledge reduction of the i-th electronic health record data subset.
  • The specific steps of step C are as follows:
  • ||Gp_i|| is the cardinality of the super-elite matrix of Granu-Subpopulation_i, and the trust degree between the nearest-neighbor radii R_i and R_j is evaluated at the k-th iteration;
  • Compared with the prior art, the present invention has the following advantages:
  • The present invention supports parallel collaborative knowledge reduction of large-scale electronic health records across multiple nodes. Super elites carry out knowledge reduction within their own multi-granularity subpopulations, which greatly reduces execution time and improves the accuracy of collaborative knowledge reduction of large-scale electronic health records.
  • The proposed nearest-neighbor multi-granularity profit method partitions and stores large-scale electronic health records in multiple evolutionary subpopulations Granu-Subpopulation_i, which lowers the complexity cost of knowledge reduction of large-scale electronic health records on the Spark cloud computing platform and lays a good foundation for intelligent services such as electronic health record feature selection, rule mining and clinical decision support.
  • The present invention efficiently obtains the collaborative knowledge reduction set of incomplete and fuzzy data in large-scale electronic health records, which is of great significance and value for large-scale electronic health record decision support analysis.
  • Figure 1 is the overall flow chart of the system
  • Figure 2 is a diagram of the dynamic execution process of the nearest neighbor multi-granularity profit model
  • The present invention is not limited to the embodiments shown here, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
  • The above embodiments are used to illustrate the implementation method and device structure of the present invention, but the invention is not limited to them; that is, the invention does not have to rely on the above methods and structures in order to be implemented.
  • any improvement to the present invention, equivalent replacement of the selected implementation method of the present invention, addition of steps, selection of specific methods, etc. fall within the scope of protection and disclosure of the present invention.
  • the present invention is not limited to the above-mentioned embodiments, and all the ways to achieve the objects of the present invention by adopting structures and methods similar to those of the present invention fall within the protection scope of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

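A nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records. First, the large-scale electronic health record dataset is split into different multi-granularity evolutionary subpopulations on the Spark cloud platform. Next, a nearest-neighbor multi-granularity profit model is built, and collaborative nearest-neighbor vectors are constructed within the nearest-neighbor radii. Then the shared nearest-neighbor profit weights of the super elites and their weight profit vectors are computed, and an adaptive dynamic adjustment strategy is applied to the super-elite weight profit matrix. Finally, the collaborative knowledge reduction set of the large-scale electronic health record data and its core attributes are obtained, and the knowledge reduction set is stored on the Spark cloud platform. The method efficiently obtains knowledge reduction sets for incomplete and fuzzy data in large-scale electronic health records, which is of significant value for electronic health record decision support analysis.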

Description

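Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records
Technical field:
The present invention relates to the field of intelligent processing of medical information, and specifically to a nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records.
Background:
The "Healthy China 2020" strategic plan states: "China shall establish a relatively complete basic medical and health care system covering urban and rural residents, achieve the goal of basic medical and health services for everyone, promote equal utilization of health services, and substantially raise the health level of the whole population; in health informatization, a system for the use and management of electronic health records covering urban and rural residents shall be established."
Electronic health records are electronic personal health histories that are formed when people engage in medical and health-related activities and that are worth preserving for future reference. After years of development, China has accumulated a large amount of medical and health data in the field of electronic health records. Using artificial intelligence methods to automatically discover hidden medical patterns in this rich data is of great significance and value for disease prevention, control and treatment. However, because large-scale electronic health record data are highly incomplete and fuzzy, the application of traditional artificial intelligence, machine learning and data mining algorithms is greatly restricted.
Traditional data mining algorithms generally require training samples that do not contain large amounts of missing information, that is, they require complete data. Records with missing information are mostly deleted outright, and the data types handled are mostly symbolic or numerical; fuzzy data are converted into numerical data before processing. However, the data in large-scale electronic health records are often highly incomplete, and a considerable proportion of the data in established electronic health records is missing. In addition, the values of some attribute columns are written in descriptive language and are strongly fuzzy. Converting all fuzzy data directly into numerical or symbolic data may lose a large amount of electronic health record information and may even affect subsequent intelligent assisted diagnosis decisions.
Therefore, extending data mining methods tailored to the characteristics of large-scale electronic health records, putting intelligent decision-support systems for electronic health records into practical use, and fully extracting the associations between diseases or physical signs are of great significance for large-scale electronic health record decision support analysis and for providing personalized, collaborative and knowledge-based electronic health record big-data services.
Multi-granularity computing is one of the strategies that humans usually adopt when solving problems, and it is an important manifestation of human cognitive ability. Multi-granularity data modeling analyzes complex data by obtaining information granule sets and multiple granular structures, extracting usable knowledge from them and forming effective decision schemes. If the modeling uses only one granular structure, it is called single-granularity data modeling; if it uses several, it is called multi-granularity data modeling. Multi-granularity data analysis can examine a problem from multiple angles and levels and so arrive at more reasonable and satisfactory solutions. As one of the important characteristics of human cognition, multi-granularity plays an important role in data mining and knowledge discovery for complex data. Therefore, in the context of medical big-data applications, proposing an effective multi-granularity collaborative knowledge reduction method for the mixed incomplete and fuzzy data in large-scale electronic health records is of significant value for large-scale electronic health record decision support analysis.
Summary of the invention:
The purpose of the present invention is to disclose a nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records that reduces execution time, improves the accuracy of the reduction, lowers its complexity cost on the Spark cloud computing platform, and lays a good foundation for intelligent services such as electronic health record feature selection, rule mining and clinical decision support.
The present invention discloses a nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records, comprising the following steps: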
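A. On the Spark big-data cloud platform, split the large-scale electronic health record dataset into different multi-granularity evolutionary subpopulations Granu-Subpopulation_i, i = 1, 2, …, N, where N is the total number of multi-granularity evolutionary subpopulations. The knowledge reduction task for the whole dataset is thereby decomposed into collaborative knowledge reduction tasks for multiple parallel multi-granularity evolutionary subpopulations, and the candidate equivalence classes of the electronic health record data assigned to each subpopulation are computed separately;
B. Design a nearest-neighbor multi-granularity profit model in which the i-th multi-granularity evolutionary subpopulation Granu-Subpopulation_i is used for knowledge reduction of the i-th data subset of the large-scale electronic health records. Within Granu-Subpopulation_i, select by fitness the super elite Super-Elitist_i with the largest fitness value and the ordinary elite Ordinary-Elitist_i with the smallest fitness value, compute the similarity Sim(m,n) of the shared nearest-neighborhood vectors and the shared nearest-neighbor profit vector ζ(e), and construct collaborative nearest-neighbor vectors in layer d_i of the nearest-neighbor radius;
C. Build the multi-granularity elite matrix Gp_i, compute the nearest-neighbor multi-granularity profit weight of the elite matrix Gp_i in the multi-granularity subpopulation Granu-Subpopulation_i, obtain the corresponding weight profit matrix Γ(e), and apply the adaptive dynamic adjustment strategy for the super-elite weight profit matrix to obtain the profit weight of each super elite within its own multi-granularity subpopulation
Figure PCTCN2020096484-appb-000001
which is then assigned to the super elite Super-Elitist_i in each multi-granularity subpopulation Granu-Subpopulation_i that performs collaborative knowledge reduction on a subset of the large-scale electronic health record data;
D. Store the set of multi-granularity profit weights of all super elites
Figure PCTCN2020096484-appb-000002
then use the discernibility matrix formula of rough set theory to compute the collaborative knowledge reduction set of each electronic health record data subset and its core attributes, so that the large-scale electronic health record dataset is correctly classified into the knowledge rule classes of the decision attribute;
E. Compare the accuracy EHR of the collaborative knowledge reduction set obtained above with the preset accuracy value λ. If EHR ≥ λ, output the optimal collaborative knowledge reduction set of the large-scale electronic health records. Otherwise, repeat steps C and D until the collaborative knowledge reduction accuracy satisfies EHR ≥ λ;
F. Obtain the collaborative knowledge reduction set of the large-scale electronic health record data and its core attributes, and store the resulting knowledge reduction sets on the Spark cloud platform, providing an important basis for intelligent assisted diagnosis in large-scale electronic health record decision support analysis.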
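Step A can be illustrated with a small sketch. It is not the patented implementation: it runs on plain Python lists rather than on Spark, the round-robin split rule is an assumption (the text does not say how records are assigned to subpopulations), and the record fields `id`, `fever` and `cough` are hypothetical. It only shows what "candidate equivalence classes per data subset" means: records in a subset are grouped when they agree on every condition attribute.

```python
from collections import defaultdict

def split_into_subpopulations(records, n_subpops):
    """Deal records round-robin into n_subpops data subsets (assumed split rule)."""
    subsets = [[] for _ in range(n_subpops)]
    for idx, rec in enumerate(records):
        subsets[idx % n_subpops].append(rec)
    return subsets

def equivalence_classes(subset, attributes):
    """Group record ids whose values agree on every listed condition attribute."""
    classes = defaultdict(list)
    for rec in subset:
        key = tuple(rec[a] for a in attributes)
        classes[key].append(rec["id"])
    return dict(classes)

# Hypothetical toy records; real electronic health records would have many more attributes.
records = [
    {"id": 0, "fever": "high", "cough": "yes"},
    {"id": 1, "fever": "low", "cough": "no"},
    {"id": 2, "fever": "high", "cough": "yes"},
    {"id": 3, "fever": "low", "cough": "yes"},
    {"id": 4, "fever": "high", "cough": "yes"},
]
subsets = split_into_subpopulations(records, 2)
candidate_classes = [equivalence_classes(s, ["fever", "cough"]) for s in subsets]
```

Each entry of `candidate_classes` belongs to one data subset, so the per-subset work is independent and could be distributed across workers.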
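A further improvement of the present invention is that the specific steps of step B are as follows:
a. Use the shared nearest-neighborhood vector to represent the set of nearest-neighbor radii in layer d_i as:
d_i = {w_1, w_2, ..., w_j, ..., w_m},
w_j = (1 + log tf(R_j)) * log(1 + n / df(R_j)),
where tf(R_j) is the frequency of occurrence of the nearest-neighborhood radius R_j in layer d_i, and df(R_j) is the layer frequency of the weight vector w_j at the nearest-neighborhood radius R_j;
b. Construct an N_i × N_i matrix C_i, where N_i is the number of nearest-neighborhood radii in layer d_i; the shared weight C_i(i,j) between the nearest-neighbor radii R_i and R_j is defined as:
C_i(i,j) = corr(f_i, f_j),
where f_i and f_j are the feature vectors corresponding to the nearest-neighbor radii R_i and R_j respectively, and corr(f_i, f_j) denotes the inner product of the two feature vectors f_i and f_j;
c. In layer d_i of the nearest-neighbor radius, construct four overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000003
Figure PCTCN2020096484-appb-000004
and decompose each of them into four sub-vectors as follows:
Figure PCTCN2020096484-appb-000005
Figure PCTCN2020096484-appb-000006
d. In layer d_i, compute the shared neighborhood of the overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000007
Figure PCTCN2020096484-appb-000008
as
Figure PCTCN2020096484-appb-000009
where
Figure PCTCN2020096484-appb-000010
Figure PCTCN2020096484-appb-000011
are the nearest-neighborhood sets corresponding to the overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000012
Figure PCTCN2020096484-appb-000013
respectively;
e. Compute the similarity Sim(m,n) of the shared-nearest-neighborhood overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000014
Figure PCTCN2020096484-appb-000015
using the following formula:
Figure PCTCN2020096484-appb-000016
f. Compute the shared nearest-neighbor profit vector ζ(e) using the following formula:
Figure PCTCN2020096484-appb-000017
g. Compute the adaptive profit compensation weight f_ij between the nearest-neighbor radii R_i and R_j as follows:
f_ij = Df(R_i R_j) / df(R_j),
where Df(R_i R_j) is the total number of nearest-neighborhood vectors that contain the nearest-neighborhood radii R_i and R_j, and df(R_j) is the layer frequency of the weight vector w_j at the nearest-neighborhood radius R_j;
h. In layer d_i of the nearest-neighbor radius, construct the collaborative nearest-neighbor vectors f_m, f_n, f_p, f_t as follows:
Figure PCTCN2020096484-appb-000018
Figure PCTCN2020096484-appb-000019
where ξ_i is the number of super elites Super-Elitist_i in the i-th nearest-neighbor radius used for knowledge reduction of the i-th electronic health record data subset.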
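The three formulas of step B that survive as text (the radius weight w_j in sub-step a, the shared-weight matrix C_i in sub-step b, and the compensation weight in sub-step g) can be sketched as follows. This is only an illustration under stated assumptions: the text does not define n or the base of the logarithm, so n is treated here as a caller-supplied total count and the natural logarithm is used; the function names and the sample numbers are hypothetical. The formulas given only as figures are not reproduced.

```python
import math

def radius_weight(tf, df, n):
    """w_j = (1 + log tf(R_j)) * log(1 + n / df(R_j)); natural log assumed, tf and df > 0."""
    return (1.0 + math.log(tf)) * math.log(1.0 + n / df)

def shared_weight_matrix(features):
    """C_i(i, j) = inner product of the feature vectors f_i and f_j."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]

def compensation_weight(joint_count, df_j):
    """Df(R_i R_j) / df(R_j): joint count of R_i and R_j over the layer frequency of R_j."""
    return joint_count / df_j

# Hypothetical frequencies for three radii; n = 4 is an assumed total count.
tf = [3, 1, 5]
df = [2, 1, 4]
n = 4
weights = [radius_weight(t, d, n) for t, d in zip(tf, df)]

# Hypothetical two-dimensional feature vectors, one per radius.
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
C = shared_weight_matrix(features)
```

The weight has the same shape as a sublinear TF-IDF score: a radius that occurs often within a layer but in few layers overall receives a larger weight.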
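A further improvement of the present invention is that the specific steps of step C are as follows:
a. In the i-th multi-granularity evolutionary subpopulation Granu-Subpopulation_i, represent the nearest-neighbor radius matrix as two tensors
Figure PCTCN2020096484-appb-000020
Figure PCTCN2020096484-appb-000021
and merge them into the super-elite matrix set Gp_i of the multi-granularity subpopulation Granu-Subpopulation_i, where i = 1, 2, …, N;
b. Compute the average shared similarity between adjacent tensors in the super-elite matrix using the following formula:
Figure PCTCN2020096484-appb-000022
where
Figure PCTCN2020096484-appb-000023
denotes the similarity between the adjacent tensors
Figure PCTCN2020096484-appb-000024
Figure PCTCN2020096484-appb-000025
c. Compute the nearest-neighbor multi-granularity profit weight of the super-elite matrix Gp_i in the multi-granularity subpopulation Granu-Subpopulation_i using the following formula:
Figure PCTCN2020096484-appb-000026
where
Figure PCTCN2020096484-appb-000027
||Gp_i|| denotes the cardinality of the super-elite matrix of the i-th multi-granularity subpopulation
Granu-Subpopulation_i, and
Figure PCTCN2020096484-appb-000028
is the trust degree between the nearest-neighbor radii R_i and R_j at the k-th iteration;
d. Construct the multi-granularity chromosome of the subpopulation Granu-Subpopulation_i, which contains m super elites; the corresponding weight profit matrix Γ(e) is defined as:
Figure PCTCN2020096484-appb-000029
e. Update the weight of the super elite Super-Elitist_i. During collaborative knowledge reduction of the large-scale electronic health record data subsets, if for the super elite
Figure PCTCN2020096484-appb-000030
in the multi-granularity subpopulation Granu-Subpopulation_i the matrix cardinality
Figure PCTCN2020096484-appb-000031
is greater than
Figure PCTCN2020096484-appb-000032
where N is the total number of multi-granularity evolutionary subpopulations, then
the super-elite weight
Figure PCTCN2020096484-appb-000033
is increased accordingly, using the following adaptive dynamic adjustment formula:
Figure PCTCN2020096484-appb-000034
where ||Γ(e)|| is the cardinality of the weight profit matrix Γ(e), and η_i is the dynamic weight parameter controlling the super elite Super-Elitist_i, defined as:
Figure PCTCN2020096484-appb-000035
where
Figure PCTCN2020096484-appb-000036
is the fitness of the i-th super elite Super-Elitist_i, and
Figure PCTCN2020096484-appb-000037
is the fitness of the multi-granularity subpopulation Granu-Subpopulation_i to which the i-th super elite Super-Elitist_i belongs;
f. Normalize the profit weight
Figure PCTCN2020096484-appb-000038
of the super elite Super-Elitist_i to obtain its normalized profit weight
Figure PCTCN2020096484-appb-000039
Figure PCTCN2020096484-appb-000040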
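Sub-step f normalizes the super-elite profit weights, but the normalization formula itself is given only as a figure. The sketch below therefore assumes the most common choice, dividing each weight by the sum so that the weights add up to one; this is an assumption for illustration, not the formula of the patent, and the function name is hypothetical.

```python
def normalize_profit_weights(weights):
    """Scale non-negative weights so they sum to 1 (assumed sum-to-one normalization)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("profit weights must not all be zero")
    return [w / total for w in weights]

# Hypothetical raw profit weights of three super elites.
normalized = normalize_profit_weights([2.0, 1.0, 1.0])
```

With this choice the relative ordering of the super elites is preserved, and the normalized values can be compared across subpopulations of different sizes.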
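Compared with the prior art, the present invention has the following advantages:
1) The present invention supports parallel collaborative knowledge reduction of large-scale electronic health records across multiple nodes; super elites carry out knowledge reduction within their own multi-granularity subpopulations, which greatly reduces execution time and improves the accuracy of collaborative knowledge reduction of large-scale electronic health records.
2) The proposed nearest-neighbor multi-granularity profit method partitions and stores large-scale electronic health records in multiple evolutionary subpopulations Granu-Subpopulation_i, which lowers the complexity cost of knowledge reduction of large-scale electronic health records on the Spark cloud computing platform and lays a good foundation for intelligent services such as electronic health record feature selection, rule mining and clinical decision support.
3) The present invention efficiently obtains the collaborative knowledge reduction set of incomplete and fuzzy data in large-scale electronic health records, which is of great significance and value for large-scale electronic health record decision support analysis.
Brief description of the drawings:
Figure 1 is the overall flow chart of the system;
Figure 2 is a diagram of the dynamic execution process of the nearest-neighbor multi-granularity profit model;
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention.
As shown in Figures 1 and 2, the present invention discloses a nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records, comprising the following steps: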
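A. On the Spark big-data cloud platform, split the large-scale electronic health record dataset into different multi-granularity evolutionary subpopulations Granu-Subpopulation_i, i = 1, 2, …, N, where N is the total number of multi-granularity evolutionary subpopulations. The knowledge reduction task for the whole dataset is thereby decomposed into collaborative knowledge reduction tasks for multiple parallel multi-granularity evolutionary subpopulations, and the candidate equivalence classes of the electronic health record data assigned to each subpopulation are computed separately;
B. Design a nearest-neighbor multi-granularity profit model in which the i-th multi-granularity evolutionary subpopulation Granu-Subpopulation_i is used for knowledge reduction of the i-th data subset of the large-scale electronic health records. Within Granu-Subpopulation_i, select by fitness the super elite Super-Elitist_i with the largest fitness value and the ordinary elite Ordinary-Elitist_i with the smallest fitness value, compute the similarity Sim(m,n) of the shared nearest-neighborhood vectors and the shared nearest-neighbor profit vector ζ(e), and construct collaborative nearest-neighbor vectors in layer d_i of the nearest-neighbor radius;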
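The specific steps of step B are as follows:
a. Use the shared nearest-neighborhood vector to represent the set of nearest-neighbor radii in layer d_i as:
d_i = {w_1, w_2, ..., w_j, ..., w_m},
w_j = (1 + log tf(R_j)) * log(1 + n / df(R_j)),
where tf(R_j) is the frequency of occurrence of the nearest-neighborhood radius R_j in layer d_i, and df(R_j) is the layer frequency of the weight vector w_j at the nearest-neighborhood radius R_j;
b. Construct an N_i × N_i matrix C_i, where N_i is the number of nearest-neighborhood radii in layer d_i; the shared weight C_i(i,j) between the nearest-neighbor radii R_i and R_j is defined as:
C_i(i,j) = corr(f_i, f_j),
where f_i and f_j are the feature vectors corresponding to the nearest-neighbor radii R_i and R_j respectively, and corr(f_i, f_j) denotes the inner product of the two feature vectors f_i and f_j;
c. In layer d_i of the nearest-neighbor radius, construct four overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000041
Figure PCTCN2020096484-appb-000042
and decompose each of them into four sub-vectors as follows:
Figure PCTCN2020096484-appb-000043
Figure PCTCN2020096484-appb-000044
d. In layer d_i, compute the shared neighborhood of the overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000045
Figure PCTCN2020096484-appb-000046
as
Figure PCTCN2020096484-appb-000047
where
Figure PCTCN2020096484-appb-000048
Figure PCTCN2020096484-appb-000049
are the nearest-neighborhood sets corresponding to the overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000050
Figure PCTCN2020096484-appb-000051
respectively;
e. Compute the similarity Sim(m,n) of the shared-nearest-neighborhood overlapping neighborhood vectors
Figure PCTCN2020096484-appb-000052
Figure PCTCN2020096484-appb-000053
using the following formula:
Figure PCTCN2020096484-appb-000054
f. Compute the shared nearest-neighbor profit vector ζ(e) using the following formula:
Figure PCTCN2020096484-appb-000055
g. Compute the adaptive profit compensation weight f_ij between the nearest-neighbor radii R_i and R_j as follows:
f_ij = Df(R_i R_j) / df(R_j),
where Df(R_i R_j) is the total number of nearest-neighborhood vectors that contain the nearest-neighborhood radii R_i and R_j, and df(R_j) is the layer frequency of the weight vector w_j at the nearest-neighborhood radius R_j;
h. In layer d_i of the nearest-neighbor radius, construct the collaborative nearest-neighbor vectors f_m, f_n, f_p, f_t as follows:
Figure PCTCN2020096484-appb-000056
Figure PCTCN2020096484-appb-000057
where ξ_i is the number of super elites Super-Elitist_i in the i-th nearest-neighbor radius used for knowledge reduction of the i-th electronic health record data subset.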
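C. Build the multi-granularity elite matrix Gp_i, compute the nearest-neighbor multi-granularity profit weight of the elite matrix Gp_i in the multi-granularity subpopulation Granu-Subpopulation_i, obtain the corresponding weight profit matrix Γ(e), and apply the adaptive dynamic adjustment strategy for the super-elite weight profit matrix to obtain the profit weight of each super elite within its own multi-granularity subpopulation
Figure PCTCN2020096484-appb-000058
which is then assigned to the super elite Super-Elitist_i in each multi-granularity subpopulation Granu-Subpopulation_i that performs collaborative knowledge reduction on a subset of the large-scale electronic health record data;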
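The specific steps of step C are as follows:
a. In the i-th multi-granularity evolutionary subpopulation Granu-Subpopulation_i, represent the nearest-neighbor radius matrix as two tensors
Figure PCTCN2020096484-appb-000059
Figure PCTCN2020096484-appb-000060
and merge them into the super-elite matrix set Gp_i of the multi-granularity subpopulation Granu-Subpopulation_i, where i = 1, 2, …, N;
b. Compute the average shared similarity between adjacent tensors in the super-elite matrix using the following formula:
Figure PCTCN2020096484-appb-000061
where
Figure PCTCN2020096484-appb-000062
denotes the similarity between the adjacent tensors
Figure PCTCN2020096484-appb-000063
Figure PCTCN2020096484-appb-000064
c. Compute the nearest-neighbor multi-granularity profit weight of the super-elite matrix Gp_i in the multi-granularity subpopulation Granu-Subpopulation_i using the following formula:
Figure PCTCN2020096484-appb-000065
where
Figure PCTCN2020096484-appb-000066
||Gp_i|| denotes the cardinality of the super-elite matrix of the i-th multi-granularity subpopulation
Granu-Subpopulation_i, and
Figure PCTCN2020096484-appb-000067
is the trust degree between the nearest-neighbor radii R_i and R_j at the k-th iteration;
d. Construct the multi-granularity chromosome of the subpopulation Granu-Subpopulation_i, which contains m super elites; the corresponding weight profit matrix Γ(e) is defined as:
Figure PCTCN2020096484-appb-000068
e. Update the weight of the super elite Super-Elitist_i. During collaborative knowledge reduction of the large-scale electronic health record data subsets, if for the super elite
Figure PCTCN2020096484-appb-000069
in the multi-granularity subpopulation Granu-Subpopulation_i the matrix cardinality ||Gp_i|| is greater than
Figure PCTCN2020096484-appb-000070
where N is the total number of multi-granularity evolutionary subpopulations, then the super-elite weight
Figure PCTCN2020096484-appb-000071
is increased accordingly, using the following adaptive dynamic adjustment formula:
Figure PCTCN2020096484-appb-000072
where ||Γ(e)|| is the cardinality of the weight profit matrix Γ(e), and η_i is the dynamic weight parameter controlling the super elite Super-Elitist_i, defined as:
Figure PCTCN2020096484-appb-000073
where
Figure PCTCN2020096484-appb-000074
is the fitness of the i-th super elite Super-Elitist_i, and
Figure PCTCN2020096484-appb-000075
is the fitness of the multi-granularity subpopulation Granu-Subpopulation_i to which the i-th super elite Super-Elitist_i belongs;
f. Normalize the profit weight
Figure PCTCN2020096484-appb-000076
of the super elite Super-Elitist_i to obtain its normalized profit weight
Figure PCTCN2020096484-appb-000077
Figure PCTCN2020096484-appb-000078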
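D. Store the set of multi-granularity profit weights of all super elites
Figure PCTCN2020096484-appb-000079
then use the discernibility matrix formula of rough set theory to compute the collaborative knowledge reduction set of each electronic health record data subset and its core attributes, so that the large-scale electronic health record dataset is correctly classified into the knowledge rule classes of the decision attribute;
E. Compare the accuracy EHR of the collaborative knowledge reduction set obtained above with the preset accuracy value λ. If EHR ≥ λ, output the optimal collaborative knowledge reduction set of the large-scale electronic health records. Otherwise, repeat steps C and D until the collaborative knowledge reduction accuracy satisfies EHR ≥ λ;
F. Obtain the collaborative knowledge reduction set of the large-scale electronic health record data and its core attributes, and store the resulting knowledge reduction sets on the Spark cloud platform, providing an important basis for intelligent assisted diagnosis in large-scale electronic health record decision support analysis.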
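Step D relies on the discernibility matrix of rough set theory to obtain a reduction set and its core attributes. The following sketch shows the textbook construction on a toy decision table; it is a generic illustration, not the weighted collaborative search of the patent. The greedy extension of the core is a simple heuristic chosen for brevity, the table contents and attribute names are hypothetical, and missing or fuzzy values are not handled.

```python
from itertools import combinations

def discernibility_matrix(rows, cond_attrs, decision):
    """For each pair of records with different decisions, collect the attributes that differ."""
    entries = []
    for x, y in combinations(rows, 2):
        if x[decision] != y[decision]:
            diff = frozenset(a for a in cond_attrs if x[a] != y[a])
            if diff:
                entries.append(diff)
    return entries

def core_attributes(entries):
    """Core = attributes that occur as singleton entries of the discernibility matrix."""
    return {next(iter(e)) for e in entries if len(e) == 1}

def greedy_reduct(entries, core):
    """Extend the core until every matrix entry contains a chosen attribute (heuristic)."""
    reduct = set(core)
    remaining = [e for e in entries if not (e & reduct)]
    while remaining:
        counts = {}
        for e in remaining:
            for a in e:
                counts[a] = counts.get(a, 0) + 1
        best = max(sorted(counts), key=counts.get)  # most frequent attribute, ties by name
        reduct.add(best)
        remaining = [e for e in remaining if best not in e]
    return reduct

# Hypothetical decision table: three condition attributes and the decision attribute "flu".
rows = [
    {"fever": "high", "cough": "yes", "age": "old", "flu": "yes"},
    {"fever": "high", "cough": "no", "age": "old", "flu": "no"},
    {"fever": "low", "cough": "yes", "age": "young", "flu": "no"},
    {"fever": "low", "cough": "no", "age": "young", "flu": "no"},
]
entries = discernibility_matrix(rows, ["fever", "cough", "age"], "flu")
core = core_attributes(entries)
reduct = greedy_reduct(entries, core)
```

In this table `cough` is the only attribute separating the first two records, so it belongs to the core; one further attribute is needed to separate the first and third records. In the method described above, a loop like the one in step E would then compare the accuracy of the resulting reduction set with λ and repeat steps C and D if it falls short.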
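The present invention supports parallel collaborative knowledge reduction of large-scale electronic health records across multiple nodes; super elites carry out knowledge reduction within their own multi-granularity subpopulations, which greatly reduces execution time and improves the accuracy of collaborative knowledge reduction of large-scale electronic health records.
The proposed nearest-neighbor multi-granularity profit method partitions and stores large-scale electronic health records in multiple evolutionary subpopulations Granu-Subpopulation_i, which lowers the complexity cost of knowledge reduction of large-scale electronic health records on the Spark cloud computing platform and lays a good foundation for intelligent services such as electronic health record feature selection, rule mining and clinical decision support. It efficiently obtains the knowledge reduction set of incomplete and fuzzy data in large-scale electronic health records, which is of great significance and value for large-scale electronic health record decision support analysis. The present invention is not limited to the embodiments shown here, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The applicant further declares that the above embodiments are used to illustrate the implementation method and device structure of the present invention, but the invention is not limited to them; that is, the invention does not have to rely on the above methods and structures in order to be implemented. Those skilled in the art should understand that any improvement to the invention, equivalent substitution of the chosen implementation methods, addition of steps, choice of specific modes, and the like all fall within the scope of protection and disclosure of the present invention.
The present invention is not limited to the above embodiments; all approaches that achieve the object of the invention using structures and methods similar to those of the invention fall within its scope of protection.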

Claims (3)

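  1. A nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records, characterized in that the specific steps are as follows:
    A. On the Spark big-data cloud platform, split the large-scale electronic health record dataset into different multi-granularity evolutionary subpopulations Granu-Subpopulation_i, i = 1, 2, …, N, where N is the total number of multi-granularity evolutionary subpopulations. The knowledge reduction task for the whole dataset is thereby decomposed into collaborative knowledge reduction tasks for multiple parallel multi-granularity evolutionary subpopulations, and the candidate equivalence classes of the electronic health record data assigned to each subpopulation are computed separately;
    B. Design a nearest-neighbor multi-granularity profit model in which the i-th multi-granularity evolutionary subpopulation Granu-Subpopulation_i is used for knowledge reduction of the i-th data subset of the large-scale electronic health records. Within Granu-Subpopulation_i, select by fitness the super elite Super-Elitist_i with the largest fitness value and the ordinary elite Ordinary-Elitist_i with the smallest fitness value, compute the similarity Sim(m,n) of the shared nearest-neighborhood vectors and the shared nearest-neighbor profit vector ζ(e), and construct collaborative nearest-neighbor vectors in layer d_i of the nearest-neighbor radius;
    C. Build the multi-granularity elite matrix Gp_i, compute the nearest-neighbor multi-granularity profit weight of the elite matrix Gp_i in the multi-granularity subpopulation Granu-Subpopulation_i, obtain the corresponding weight profit matrix Γ(e), and apply the adaptive dynamic adjustment strategy for the super-elite weight profit matrix to obtain the profit weight of each super elite within its own multi-granularity subpopulation
    Figure PCTCN2020096484-appb-100001
    which is then assigned to the super elite Super-Elitist_i in each multi-granularity subpopulation Granu-Subpopulation_i that performs collaborative knowledge reduction on a subset of the large-scale electronic health record data;
    D. Store the set of multi-granularity profit weights of all super elites
    Figure PCTCN2020096484-appb-100002
    then use the discernibility matrix formula of rough set theory to compute the collaborative knowledge reduction set of each electronic health record data subset and its core attributes, so that the large-scale electronic health record dataset is correctly classified into the knowledge rule classes of the decision attribute;
    E. Compare the accuracy EHR of the collaborative knowledge reduction set obtained above with the preset accuracy value λ. If EHR ≥ λ, output the optimal collaborative knowledge reduction set of the large-scale electronic health records. Otherwise, repeat steps C and D until the collaborative knowledge reduction accuracy satisfies EHR ≥ λ;
    F. Obtain the collaborative knowledge reduction set of the large-scale electronic health record data and its core attributes, and store the resulting knowledge reduction sets on the Spark cloud platform, providing an important basis for intelligent assisted diagnosis in large-scale electronic health record decision support analysis.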
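  2. The nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records according to claim 1, characterized in that the specific steps of step B are as follows:
    a. Use the shared nearest-neighborhood vector to represent the set of nearest-neighbor radii in layer d_i as:
    d_i = {w_1, w_2, ..., w_j, ..., w_m},
    w_j = (1 + log tf(R_j)) * log(1 + n / df(R_j)),
    where tf(R_j) is the frequency of occurrence of the nearest-neighborhood radius R_j in layer d_i, and df(R_j) is the layer frequency of the weight vector w_j at the nearest-neighborhood radius R_j;
    b. Construct an N_i × N_i matrix C_i, where N_i is the number of nearest-neighborhood radii in layer d_i; the shared weight C_i(i,j) between the nearest-neighbor radii R_i and R_j is defined as:
    C_i(i,j) = corr(f_i, f_j),
    where f_i and f_j are the feature vectors corresponding to the nearest-neighbor radii R_i and R_j respectively, and corr(f_i, f_j) denotes the inner product of the two feature vectors f_i and f_j;
    c. In layer d_i of the nearest-neighbor radius, construct four overlapping neighborhood vectors
    Figure PCTCN2020096484-appb-100003
    Figure PCTCN2020096484-appb-100004
    and decompose each of them into four sub-vectors as follows:
    Figure PCTCN2020096484-appb-100005
    Figure PCTCN2020096484-appb-100006
    d. In layer d_i, compute the shared neighborhood of the overlapping neighborhood vectors
    Figure PCTCN2020096484-appb-100007
    Figure PCTCN2020096484-appb-100008
    as
    Figure PCTCN2020096484-appb-100009
    where
    Figure PCTCN2020096484-appb-100010
    Figure PCTCN2020096484-appb-100011
    are the nearest-neighborhood sets corresponding to the overlapping neighborhood vectors
    Figure PCTCN2020096484-appb-100012
    Figure PCTCN2020096484-appb-100013
    respectively;
    e. Compute the similarity Sim(m,n) of the shared-nearest-neighborhood overlapping neighborhood vectors
    Figure PCTCN2020096484-appb-100014
    Figure PCTCN2020096484-appb-100015
    using the following formula:
    Figure PCTCN2020096484-appb-100016
    f. Compute the shared nearest-neighbor profit vector ζ(e) using the following formula:
    Figure PCTCN2020096484-appb-100017
    g. Compute the adaptive profit compensation weight f_ij between the nearest-neighbor radii R_i and R_j as follows:
    f_ij = Df(R_i R_j) / df(R_j),
    where Df(R_i R_j) is the total number of nearest-neighborhood vectors that contain the nearest-neighborhood radii R_i and R_j, and df(R_j) is the layer frequency of the weight vector w_j at the nearest-neighborhood radius R_j;
    h. In layer d_i of the nearest-neighbor radius, construct the collaborative nearest-neighbor vectors f_m, f_n, f_p, f_t as follows:
    Figure PCTCN2020096484-appb-100018
    Figure PCTCN2020096484-appb-100019
    where ξ_i is the number of super elites Super-Elitist_i in the i-th nearest-neighbor radius used for knowledge reduction of the i-th electronic health record data subset.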
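  3. The nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records according to claim 1, characterized in that the specific steps of step C are as follows:
    a. In the i-th multi-granularity evolutionary subpopulation Granu-Subpopulation_i, represent the nearest-neighbor radius matrix as two tensors
    Figure PCTCN2020096484-appb-100020
    Figure PCTCN2020096484-appb-100021
    and merge them into the super-elite matrix set Gp_i of the multi-granularity subpopulation Granu-Subpopulation_i, where i = 1, 2, …, N;
    b. Compute the average shared similarity between adjacent tensors in the super-elite matrix using the following formula:
    Figure PCTCN2020096484-appb-100022
    where
    Figure PCTCN2020096484-appb-100023
    denotes the similarity between the adjacent tensors
    Figure PCTCN2020096484-appb-100024
    Figure PCTCN2020096484-appb-100025
    c. Compute the nearest-neighbor multi-granularity profit weight of the super-elite matrix Gp_i in the multi-granularity subpopulation Granu-Subpopulation_i using the following formula:
    Figure PCTCN2020096484-appb-100026
    where
    Figure PCTCN2020096484-appb-100027
    ||Gp_i|| denotes the cardinality of the super-elite matrix of the i-th multi-granularity subpopulation Granu-Subpopulation_i, and
    Figure PCTCN2020096484-appb-100028
    is the trust degree between the nearest-neighbor radii R_i and R_j at the k-th iteration;
    d. Construct the multi-granularity chromosome of the subpopulation Granu-Subpopulation_i, which contains m super elites; the corresponding weight profit matrix Γ(e) is defined as:
    Figure PCTCN2020096484-appb-100029
    e. Update the weight of the super elite Super-Elitist_i. During collaborative knowledge reduction of the large-scale electronic health record data subsets, if for the super elite
    Figure PCTCN2020096484-appb-100030
    in the multi-granularity subpopulation Granu-Subpopulation_i the matrix cardinality ||Gp_i|| is greater than
    Figure PCTCN2020096484-appb-100031
    where N is the total number of multi-granularity evolutionary subpopulations, then the super-elite weight
    Figure PCTCN2020096484-appb-100032
    is increased accordingly, using the following adaptive dynamic adjustment formula:
    Figure PCTCN2020096484-appb-100033
    where ||Γ(e)|| is the cardinality of the weight profit matrix Γ(e), and η_i is the dynamic weight parameter controlling the super elite Super-Elitist_i, defined as:
    Figure PCTCN2020096484-appb-100034
    where
    Figure PCTCN2020096484-appb-100035
    is the fitness of the i-th super elite Super-Elitist_i, and
    Figure PCTCN2020096484-appb-100036
    is the fitness of the multi-granularity subpopulation Granu-Subpopulation_i to which the i-th super elite Super-Elitist_i belongs;
    f. Normalize the profit weight
    Figure PCTCN2020096484-appb-100037
    of the super elite Super-Elitist_i to obtain its normalized profit weight
    Figure PCTCN2020096484-appb-100038
    Figure PCTCN2020096484-appb-100039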
PCT/CN2020/096484 2020-02-25 2020-06-17 Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records WO2021169088A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2020331559A AU2020331559A1 (en) 2020-02-25 2020-06-17 Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010117158.2 2020-02-25
CN202010117158.2A CN111354427B (zh) 2020-02-25 2020-02-25 Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records

Publications (1)

Publication Number Publication Date
WO2021169088A1 true WO2021169088A1 (zh) 2021-09-02

Family

ID=71195847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/096484 WO2021169088A1 (zh) 2020-02-25 2020-06-17 Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records

Country Status (3)

Country Link
CN (1) CN111354427B (zh)
AU (1) AU2020331559A1 (zh)
WO (1) WO2021169088A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114023063A (zh) * 2021-11-02 2022-02-08 大连理工大学 Collaborative decision-making method for intelligent transportation systems based on cognitive networks

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110178964A1 (en) * 2010-01-21 2011-07-21 National Cheng Kung University Recommendation System Using Rough-Set and Multiple Features Mining Integrally and Method Thereof
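CN103838972A (zh) * 2014-03-13 2014-06-04 南通大学 Quantum cooperative game implementation method for attribute reduction of MRI medical records
CN104915430A (zh) * 2015-06-15 2015-09-16 南京邮电大学 MapReduce-based rule acquisition method for constrained-relation rough sets
CN107256342A (zh) * 2017-06-15 2017-10-17 南通大学 Multi-population cooperative entropy cascade method for evaluating knowledge reduction effectiveness of electronic medical records
CN108986872A (zh) * 2018-06-21 2018-12-11 南通大学 Multi-granularity attribute weight Spark method for big-data electronic medical record reduction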

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6263334B1 (en) * 1998-11-11 2001-07-17 Microsoft Corporation Density-based indexing method for efficient execution of high dimensional nearest-neighbor queries on large databases
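CN104933156A (zh) * 2015-06-25 2015-09-23 西安理工大学 Collaborative filtering method based on shared nearest-neighbor clustering
CN108447534A (zh) * 2018-05-18 2018-08-24 灵玖中科软件(北京)有限公司 NLP-based data quality management method for electronic medical records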

Also Published As

Publication number Publication date
AU2020331559A1 (en) 2021-09-09
CN111354427B (zh) 2022-04-29
CN111354427A (zh) 2020-06-30

Similar Documents

Publication Publication Date Title
Razi et al. A comparative predictive analysis of neural networks (NNs), nonlinear regression and classification and regression tree (CART) models
Guo et al. Breaking the curse of space explosion: Towards efficient nas with curriculum search
Rahman et al. Discretization of continuous attributes through low frequency numerical values and attribute interdependency
CN109902192B (zh) Remote sensing image retrieval method, system, device and medium based on unsupervised deep regression
Hu et al. A niching backtracking search algorithm with adaptive local search for multimodal multiobjective optimization
CN113693563A (zh) Brain functional network classification method based on hypergraph attention network
Biswas et al. Hybrid expert system using case based reasoning and neural network for classification
Bouchachia et al. Towards incremental fuzzy classifiers
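WO2021169088A1 (zh) Nearest-neighbor multi-granularity profit method for collaborative knowledge reduction of large-scale electronic health records
WO2021082444A1 (zh) Multi-granularity Spark super-trust fuzzy method for large-scale brain medical record segmentation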
Zhang et al. An enhanced grey wolf optimizer boosted machine learning prediction model for patient-flow prediction
Hu et al. Differential evolution based on network structure for feature selection
Jain Introduction to data mining techniques
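JP7207128B2 (ja) Prediction system, prediction method, and prediction program
CN108446740B (zh) Multi-layer consistent collaborative method for feature extraction from brain imaging medical records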
Hong et al. A novel and efficient neuro-fuzzy classifier for medical diagnosis
Tarle et al. Improved artificial neural network for dimension reduction in medical data classification
Eick et al. Learning Bayesian classification rules through genetic algorithms
Farhadi et al. Leveraging Meta-Learning To Improve Unsupervised Domain Adaptation
Chen et al. Intelligent Fuzzy Optimization Algorithm for Data Set Information Clustering Patterns Based on Data Mining and IoT
CN116718198B (zh) Path planning method and system for UAV swarms based on temporal knowledge graph
Mostofi et al. Data mining and diagnosis of heart diseases: a hybrid approach to the b-mine algorithm and association rules
Dong et al. Applications in Various Decision Problems
Vivek et al. Novel Machine Learning-based Soil Characteristic Analysis
Huang et al. A revised MCDM approach for determining criteria weights: the combination of Bayesian BWM and fuzzy DEMATEL

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2020331559

Country of ref document: AU

Date of ref document: 20200617

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20922369

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20922369

Country of ref document: EP

Kind code of ref document: A1