CN107562800B

CN107562800B - SFp-Link-based semi-structured data frequent pattern mining method

Info

Publication number: CN107562800B
Application number: CN201710664740.9A
Authority: CN
Inventors: 蔡庆玲; 邓少风; 吕律; 李海良
Original assignee: National Sun Yat Sen University
Current assignee: National Sun Yat Sen University
Priority date: 2017-08-07
Filing date: 2017-08-07
Publication date: 2020-06-05
Anticipated expiration: 2037-08-07
Also published as: CN107562800A

Abstract

The invention discloses an SFp-Link-based frequent pattern mining method for semi-structured data. A semi-structured data frequent pattern linked list SFp-Link is established for the semi-structured data, and frequent pattern mining is carried out on the basis of this linked list, so that frequent item sets in the semi-structured data can be effectively extracted according to the mining purpose. When the linked list SFp-Link is established, the mined sample database needs to be scanned only once: a sample item set is stored only when its combination of contained items is scanned for the first time, and when a sample item set with the same combination of items is scanned again, only the corresponding sample frequency is incremented. The method therefore has the advantages of low storage consumption, short mining time and high mining efficiency.

Description

SFp-Link-based semi-structured data frequent pattern mining method

Technical Field

The invention relates to a frequent pattern mining method of semi-structured data based on SFp-Link, belonging to the technical field of data mining.

Background

In the medical field, a large amount of diagnostic data is unstructured or semi-structured. Its attributes are very diverse, often numbering in the hundreds or even thousands (e.g., genes), while the number of samples is very small. Converting such semi-structured data into structured data inevitably produces a very sparse matrix. However, because rare diseases are particularly in need of research, the inherent correlations within such scarce information cannot be ignored, and an effective technique for extracting the relevant information is all the more necessary. Most existing algorithms, such as the Apriori algorithm and the FP-tree algorithm, are aimed at structured data and cannot effectively solve the problem of extracting frequent patterns, associations and correlations from unstructured or semi-structured data. In addition, these algorithms need to read the sample database multiple times, which often results in high space and time complexity.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: an SFp-Link-based semi-structured data frequent pattern mining method is provided.

The technical scheme adopted by the invention is as follows:

a SFp-Link-based semi-structured data frequent pattern mining method is characterized in that: the method for mining the frequent pattern of the semi-structured data comprises the following steps:

step one, carrying out data preprocessing on a mined sample database, namely:

extracting a sample item set of each piece of semi-structured data in the mined sample database, wherein the sample item set is a set of effective data related to a mining purpose in corresponding semi-structured data, and each effective data contained in the sample item set is an item of the sample item set;

step two, scanning all sample item sets of the mined sample database; during the scan, storing each sample item set whose combination of contained items is scanned for the first time and recording it as a stored sample item set; counting, for each stored sample item set, the number of identical sample item sets in the mined sample database and the number of sample item sets in the mined sample database that are its proper subsets; and thereby establishing the following semi-structured data frequent pattern linked list SFp-Link:

the semi-structured data frequent pattern linked list SFp-Link is composed of an item set linked list header qSetHead and item set linked lists of m levels:

the item set linked list header qSetHead is a pointer array consisting of m pointers, and the i-th pointer in the pointer array points to the item set linked list qSetLink_i of level i, wherein i is an integer, 1 ≤ i ≤ m, m is the maximum of the lengths of all sample item sets of the mined sample database, and the length of a sample item set is the number of items contained in the sample item set;

the item set linked list qSetLink_i of level i consists of N_i item set linked list nodes SpcNode, wherein i is an integer, 1 ≤ i ≤ m, and N_i is the number of sample item sets of length i in the mined sample database;

the j-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i consists of a sample item set address qSet_ij, a sample frequency sCnt_ij, a support frequency tCnt_ij and a linked list pointer link_ij, wherein i and j are integers, 1 ≤ i ≤ m and 1 ≤ j ≤ N_i;

the sample item set address qSet_ij is the storage address of the stored sample item set that is identical to a scanned sample item set, wherein the scanned sample item set is the j-th sample item set of length i scanned during the scan of the mined sample database;

the sample frequency sCnt_ij is: among all sample item sets of the mined sample database, the number of sample item sets that are identical to the stored sample item set stored at the sample item set address qSet_ij;

the support frequency tCnt_ij is: among all sample item sets of the mined sample database, the number of sample item sets that are proper subsets of the stored sample item set stored at the sample item set address qSet_ij;

the linked list pointer link_ij is a pointer to the (j+1)-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i, wherein when j = N_i the linked list pointer link_ij is null;

step three, performing frequent pattern mining on the semi-structured data in the mined sample database based on the semi-structured data frequent pattern linked list SFp-Link, namely: setting a support frequency threshold s_min according to the mining purpose, and scanning the item set linked lists of the m levels one by one from the highest level to the lowest level to extract the stored sample item sets whose support frequency tCnt_ij meets the support frequency threshold s_min; the extracted stored sample item sets are the frequent item sets.

Compared with the prior art, the invention has the following beneficial effects:

the invention establishes a semi-structured data frequent pattern linked list SFp-Link for semi-structured data and carries out frequent pattern mining based on this linked list, so that frequent item sets in the semi-structured data can be effectively extracted according to the mining purpose. This solves the problem in the prior art that converting semi-structured data into structured data generates an extremely sparse matrix (because the semi-structured data to which the invention applies is characterized by a small number of samples that each contain many items);

when the semi-structured data frequent pattern linked list SFp-Link is established, the mined sample database needs to be scanned only once: a sample item set is stored only when its combination of contained items is scanned for the first time, and when a sample item set with the same combination of items is scanned again, only the corresponding sample frequency is incremented. The method therefore has the advantages of low storage consumption, short mining time and high mining efficiency.

Drawings

The invention is described in further detail below with reference to the following figures and specific examples:

FIG. 1 is a block flow diagram of a semi-structured data mining method of the present invention;

FIG. 2 is a schematic diagram of a semi-structured data frequent pattern linked list SFp-Link in the present invention;

FIG. 3 is a diagram of a jth item set linked list node SpcNode of the ith unidirectional linked list in the present invention.

Detailed Description

As shown in FIGS. 1 to 3, the invention discloses a frequent pattern mining method of semi-structured data based on SFp-Link, which comprises the following steps:

step one, carrying out data preprocessing on a mined sample database, namely:

extracting a sample item set of each piece of semi-structured data in the mined sample database, wherein the sample item set is the set of valid data related to the mining purpose in the corresponding semi-structured data, and each piece of valid data contained in the sample item set is an item of the sample item set. For example, suppose the content of one piece of semi-structured data is: the patient was admitted to this ward due to "dry mouth, polydipsia, polyuria, weight loss for 4 years, red swelling of the left foot, and ulceration for 3 weeks". The application-oriented keywords "dry mouth", "polydipsia", "polyuria", "weight loss for 4 years", "red swelling of the left foot" and "ulceration for 3 weeks" are all valid data related to the mining purpose and constitute the sample item set of this piece of semi-structured data, that is, the sample item set contains 6 items. The auxiliary text "the patient was admitted to this ward due to" is invalid data irrelevant to the mining purpose.
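The preprocessing in step one can be illustrated with a minimal sketch. The patent does not specify how valid data is recognized; the keyword vocabulary `VOCABULARY`, the function name `extract_item_set` and the substring matching used here are all assumptions made purely for illustration.

```python
def extract_item_set(record, vocabulary):
    """Return the vocabulary terms that occur in a free-text record.

    Assumption: valid data is recognized by simple substring matching
    against a known keyword vocabulary; the patent leaves this open.
    """
    return frozenset(term for term in vocabulary if term in record)


# Hypothetical vocabulary of application-oriented keywords.
VOCABULARY = {
    "dry mouth",
    "polydipsia",
    "polyuria",
    "weight loss for 4 years",
    "red swelling of the left foot",
    "ulceration for 3 weeks",
}

record = (
    "The patient was admitted to this ward due to dry mouth, polydipsia, "
    "polyuria, weight loss for 4 years, red swelling of the left foot, "
    "and ulceration for 3 weeks."
)

# The auxiliary text is ignored; only the matched keywords become items.
item_set = extract_item_set(record, VOCABULARY)
print(len(item_set))
```

Using a `frozenset` makes the order of the items irrelevant, which matches the notion of an item combination used in step two.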

Step two, scanning all sample item sets of the mined sample database; during the scan, storing each sample item set whose combination of contained items is scanned for the first time and recording it as a stored sample item set; counting, for each stored sample item set, the number of identical sample item sets in the mined sample database and the number of sample item sets in the mined sample database that are its proper subsets; and thereby establishing the following semi-structured data frequent pattern linked list SFp-Link.

Referring to FIG. 2, the semi-structured data frequent pattern linked list SFp-Link is composed of an item set linked list header qSetHead and item set linked lists of m levels.

The item set linked list header qSetHead is a pointer array consisting of m pointers, and the i-th pointer in the pointer array points to the item set linked list qSetLink_i of level i, wherein i is an integer, 1 ≤ i ≤ m, m is the maximum of the lengths of all sample item sets of the mined sample database, and the length of a sample item set is the number of items contained in the sample item set.

The item set linked list qSetLink_i of level i consists of N_i item set linked list nodes SpcNode, wherein i is an integer, 1 ≤ i ≤ m, and N_i is the number of sample item sets of length i in the mined sample database. For example, referring to FIG. 2, in the item set linked list header qSetHead, the item set linked list qSetLink_1 pointed to by the 1st pointer contains N_1 = o item set linked list nodes SpcNode, the item set linked list qSetLink_2 pointed to by the 2nd pointer contains N_2 = p item set linked list nodes SpcNode, the item set linked list qSetLink_i pointed to by the i-th pointer contains N_i = q item set linked list nodes SpcNode, and the item set linked list qSetLink_m pointed to by the m-th pointer contains N_m = r item set linked list nodes SpcNode.
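As a small illustration of the quantities m and N_i defined above, the following sketch derives them from a list of sample item sets. The function name `level_sizes` and the toy data are hypothetical, and N_i is taken literally as the number of sample item sets of length i.

```python
from collections import Counter


def level_sizes(sample_item_sets):
    """Return (m, [N_1, ..., N_m]) for a collection of sample item sets."""
    counts = Counter(len(s) for s in sample_item_sets)
    m = max(counts, default=0)          # longest sample item set
    return m, [counts.get(i, 0) for i in range(1, m + 1)]


# Toy data: one set of length 1, two of length 2, one of length 4.
m, n = level_sizes([{"a"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c", "d"}])
print(m, n)
```

A level with no sample item sets (level 3 in the toy data) simply has N_i = 0, so its pointer in qSetHead would have nothing to point to.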

The j-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i consists of a sample item set address qSet_ij, a sample frequency sCnt_ij, a support frequency tCnt_ij and a linked list pointer link_ij, wherein i and j are integers, 1 ≤ i ≤ m and 1 ≤ j ≤ N_i.

The sample item set address qSet_ij is the storage address of the stored sample item set that is identical to a scanned sample item set, wherein the scanned sample item set is the j-th sample item set of length i scanned during the scan of the mined sample database. For example, for the item set linked list node SpcNode circled by the oval dashed frame in FIG. 2, the sample item set corresponding to its sample item set address qSet_ij has length i and was the 2nd sample item set of that length to be scanned.

The sample frequency sCnt_ij is: among all sample item sets of the mined sample database, the number of sample item sets that are identical to the stored sample item set stored at the sample item set address qSet_ij.

The support frequency tCnt_ij is: among all sample item sets of the mined sample database, the number of sample item sets that are proper subsets of the stored sample item set stored at the sample item set address qSet_ij.

The linked list pointer link_ij is a pointer to the (j+1)-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i. The N_i-th item set linked list node SpcNode of the item set linked list qSetLink_i has no next node, so when j = N_i the linked list pointer link_ij is null.
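The node layout described above can be sketched as follows. This is only an illustrative rendering: Python has no raw addresses, so the field `q_set` holds the item set itself in place of the address qSet_ij, `None` plays the role of the null pointer, and the helper names `make_head` and `append_node` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpcNode:
    q_set: frozenset                    # stands in for the address qSet_ij
    s_cnt: int = 0                      # sample frequency sCnt_ij
    t_cnt: int = 0                      # support frequency tCnt_ij
    link: Optional["SpcNode"] = None    # link_ij; None for the last node


def make_head(m):
    """qSetHead: m pointers, where index i - 1 refers to level i."""
    return [None] * m


def append_node(q_set_head, node):
    """Append a node to the tail of the level given by its item set length."""
    level = len(node.q_set)
    current = q_set_head[level - 1]
    if current is None:                 # first node of this level
        q_set_head[level - 1] = node
        return
    while current.link is not None:     # walk to the current tail
        current = current.link
    current.link = node


head = make_head(3)
append_node(head, SpcNode(frozenset({"a", "b"}), s_cnt=1))
append_node(head, SpcNode(frozenset({"b", "c"}), s_cnt=1))
print(head[1].link.q_set)
```

Appending at the tail keeps the nodes of each level in scan order, so the j-th node of level i corresponds to the j-th node added for that length.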

For example, when a sample item set containing "dry mouth, polydipsia, polyuria, weight loss for 4 years, red swelling of the left foot, and ulceration for 3 weeks" is scanned for the first time, it is stored as a stored sample item set, and the sample frequency sCnt_ij of that stored sample item set is incremented by 1. When a sample item set containing "dry mouth, polydipsia, polyuria, weight loss for 4 years, red swelling of the left foot, and ulceration for 3 weeks" is scanned a second time, it does not need to be stored again; only the sample frequency sCnt_ij of the stored sample item set needs to be incremented by 1.
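The construction in step two might be sketched as below, under several loudly stated assumptions. Plain Python lists stand in for the linked lists, a dictionary `stored` (not part of the patent) is used to find an already stored item set, and one node is kept per distinct item combination. The support frequency is computed under the literal reading of the text, namely the number of sample item sets that are proper subsets of the stored set; the translated wording is ambiguous, and replacing `<` with `>` would give the alternative reading that counts proper supersets.

```python
def build_sfp_link(sample_item_sets):
    """Build a list-based stand-in for SFp-Link in one scan of the samples."""
    levels = []     # levels[i - 1] stands in for qSetLink_i
    stored = {}     # hypothetical lookup table: item set -> its node
    for items in sample_item_sets:              # the single database scan
        q_set = frozenset(items)
        if not q_set:
            continue                            # skip records with no items
        node = stored.get(q_set)
        if node is None:                        # combination seen first time
            node = {"qSet": q_set, "sCnt": 0, "tCnt": 0}
            stored[q_set] = node
            while len(levels) < len(q_set):     # grow qSetHead as needed
                levels.append([])
            levels[len(q_set) - 1].append(node)
        node["sCnt"] += 1                       # repeats only bump sCnt
    # Assumed (literal) reading of tCnt: samples that are proper subsets.
    for node in stored.values():
        node["tCnt"] = sum(
            other["sCnt"]
            for other in stored.values()
            if other["qSet"] < node["qSet"]     # '<' is the proper-subset test
        )
    return levels


samples = [
    {"dry mouth", "polydipsia", "polyuria"},
    {"dry mouth", "polydipsia"},
    {"dry mouth", "polydipsia", "polyuria"},    # repeat: stored only once
    {"polyuria"},
]
levels = build_sfp_link(samples)
for level in levels:
    for node in level:
        print(sorted(node["qSet"]), node["sCnt"], node["tCnt"])
```

The second pass runs over the stored nodes only, not over the sample database, so the database itself is still read a single time.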

Step three, performing frequent pattern mining on the semi-structured data in the mined sample database based on the semi-structured data frequent pattern linked list SFp-Link, namely: setting a support frequency threshold s_min according to the mining purpose, and scanning the item set linked lists of the m levels one by one from the highest level to the lowest level to extract the stored sample item sets whose support frequency tCnt_ij meets the support frequency threshold s_min; the extracted stored sample item sets are the frequent item sets.
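The extraction in step three can be sketched as a scan over the levels from m down to 1. The list-of-dicts representation, the function name `mine_frequent` and the hand-written counts are assumptions for illustration, and "meets the threshold" is interpreted here as `tCnt >= s_min`, which the translated text does not state explicitly.

```python
def mine_frequent(levels, s_min):
    """Return stored item sets whose tCnt meets s_min, highest level first."""
    frequent = []
    for level in reversed(levels):          # level m down to level 1
        for node in level:
            if node["tCnt"] >= s_min:       # assumption: "meets" means >=
                frequent.append(node["qSet"])
    return frequent


# Hand-built toy structure; levels[i - 1] stands in for qSetLink_i.
levels = [
    [{"qSet": frozenset({"polyuria"}), "sCnt": 1, "tCnt": 0}],
    [{"qSet": frozenset({"dry mouth", "polydipsia"}), "sCnt": 1, "tCnt": 0}],
    [{"qSet": frozenset({"dry mouth", "polydipsia", "polyuria"}),
      "sCnt": 2, "tCnt": 2}],
]

result = mine_frequent(levels, s_min=2)
print(result)
```

Because the scan starts at the highest level, longer frequent item sets appear first in the result.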

The present invention is not limited to the above embodiments, and various other equivalent modifications, substitutions and alterations can be made without departing from the basic technical concept of the invention as described above, according to the common technical knowledge and conventional means in the field.

Claims

1. A SFp-Link-based semi-structured data frequent pattern mining method is characterized in that: the method for mining the frequent pattern of the semi-structured data comprises the following steps:

step one, carrying out data preprocessing on a mined sample database, namely: