CN107562800B - SFp-Link-based semi-structured data frequent pattern mining method - Google Patents

Info

Publication number
CN107562800B
CN107562800B CN201710664740.9A CN201710664740A CN107562800B CN 107562800 B CN107562800 B CN 107562800B CN 201710664740 A CN201710664740 A CN 201710664740A CN 107562800 B CN107562800 B CN 107562800B
Authority
CN
China
Prior art keywords
sample
item set
linked list
item
semi
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710664740.9A
Other languages
Chinese (zh)
Other versions
CN107562800A (en)
Inventor
蔡庆玲
邓少风
吕律
李海良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-08-07
Filing date
2017-08-07
Publication date
2020-06-05
Application filed by National Sun Yat Sen University
Priority to CN201710664740.9A
Publication of CN107562800A
Application granted
Publication of CN107562800B

Abstract

The invention discloses an SFp-Link-based frequent pattern mining method for semi-structured data. A semi-structured data frequent pattern linked list, SFp-Link, is established for the semi-structured data, and frequent pattern mining is carried out on the basis of this linked list, so that frequent item sets in the semi-structured data can be effectively extracted according to the mining purpose. When SFp-Link is established, the mined sample database needs to be scanned only once: a sample item set is stored only the first time its combination of items is scanned, and when a sample item set with the same combination of items is scanned again, only the corresponding sample frequency is incremented by one. The method therefore has the advantages of low storage consumption, short mining time and high mining efficiency.

Description

SFp-Link-based semi-structured data frequent pattern mining method
Technical Field
The invention relates to a frequent pattern mining method of semi-structured data based on SFp-Link, belonging to the technical field of data mining.
Background
In the medical field, a large amount of diagnostic data is unstructured or semi-structured. Its attributes are highly diverse, often numbering in the hundreds or even thousands (e.g., genes), while the samples are very few. If such semi-structured data is converted into structured data, a very sparse matrix inevitably results. However, because rare diseases are especially in need of research, the inherent correlations within such rare information cannot be ignored, and an effective technique for extracting the relevant information is all the more necessary. Most existing algorithms, such as the Apriori algorithm and the FP-tree algorithm, are aimed at structured data and cannot effectively solve the problem of extracting frequent patterns, associations and correlations from unstructured or semi-structured data. In addition, these algorithms need to read the sample database multiple times, which often leads to high space and time complexity.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: an SFp-Link-based semi-structured data frequent pattern mining method is provided.
The technical scheme adopted by the invention is as follows:
An SFp-Link-based semi-structured data frequent pattern mining method, characterized in that the method comprises the following steps:
Step one, performing data preprocessing on the mined sample database, namely:
extracting a sample item set from each piece of semi-structured data in the mined sample database, wherein the sample item set is the set of valid data in the corresponding semi-structured data that is relevant to the mining purpose, and each piece of valid data contained in the sample item set is an item of the sample item set;
Step two, scanning all sample item sets of the mined sample database; during the scan, storing each sample item set whose combination of items is scanned for the first time and recording it as a stored sample item set; counting, for each stored sample item set, the number of identical sample item sets in the mined sample database; and counting, for each stored sample item set, the number of its proper subsets in the mined sample database, so as to establish the following semi-structured data frequent pattern linked list SFp-Link:
the semi-structured data frequent pattern linked list SFp-Link consists of an item set linked list header qSetHead and item set linked lists of m levels;
the item set linked list header qSetHead is a pointer array consisting of m pointers, and the i-th pointer in the pointer array points to the item set linked list qSetLink_i of level i, wherein i is an integer with 1 ≤ i ≤ m, m is the maximum length among all sample item sets of the mined sample database, and the length of a sample item set is the number of items it contains;
the item set linked list qSetLink_i of level i consists of N_i item set linked list nodes SpcNode, wherein i is an integer with 1 ≤ i ≤ m, and N_i is the number of sample item sets of length i in the mined sample database;
the j-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i consists of a sample item set address qSet_ij, a sample frequency sCnt_ij, a support frequency tCnt_ij and a linked list pointer link_ij, wherein i is an integer with 1 ≤ i ≤ m, and j is an integer with 1 ≤ j ≤ N_i;
the sample item set address qSet_ij is the storage address of the stored sample item set that is identical to a scanned sample item set, wherein the scanned sample item set is the j-th sample item set of length i scanned during the scan of the mined sample database;
the sample frequency sCnt_ij is the number of sample item sets, among all sample item sets of the mined sample database, that are identical to the stored sample item set stored at the sample item set address qSet_ij;
the support frequency tCnt_ij is the number of sample item sets, among all sample item sets of the mined sample database, that are proper subsets of the stored sample item set stored at the sample item set address qSet_ij;
the linked list pointer link_ij is a pointer to the (j+1)-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i, wherein link_ij is null when j = N_i;
Step three, performing frequent pattern mining on the semi-structured data in the mined sample database based on the semi-structured data frequent pattern linked list SFp-Link, namely: setting a support frequency threshold s_min according to the mining purpose, and scanning the item set linked lists of the m levels one by one from the highest level to the lowest level to extract the stored sample item sets whose support frequency tCnt_ij is at or above the support frequency threshold s_min; the extracted stored sample item sets are the frequent item sets.
Compared with the prior art, the invention has the following beneficial effects:
First, the invention establishes a semi-structured data frequent pattern linked list SFp-Link for the semi-structured data and carries out frequent pattern mining on the basis of this linked list, so that frequent item sets in the semi-structured data can be effectively extracted according to the mining purpose. This avoids the problem in the prior art that converting semi-structured data into structured data produces an extremely sparse matrix (because the semi-structured data to which the invention is applied is characterized by few samples that each contain many items).
Second, when the semi-structured data frequent pattern linked list SFp-Link is established, the mined sample database needs to be scanned only once: a sample item set is stored only the first time its combination of items is scanned, and when a sample item set with the same combination of items is scanned again, only the corresponding sample frequency is incremented by one. The method therefore has the advantages of low storage consumption, short mining time and high mining efficiency.
Drawings
The invention is described in further detail below with reference to the following figures and specific examples:
FIG. 1 is a block flow diagram of a semi-structured data mining method of the present invention;
FIG. 2 is a schematic diagram of a semi-structured data frequent pattern linked list SFp-Link in the present invention;
FIG. 3 is a diagram of a jth item set linked list node SpcNode of the ith unidirectional linked list in the present invention.
Detailed Description
As shown in FIGS. 1 to 3, the invention discloses a frequent pattern mining method of semi-structured data based on SFp-Link, which comprises the following steps:
Step one, performing data preprocessing on the mined sample database, namely:
extracting a sample item set from each piece of semi-structured data in the mined sample database, wherein the sample item set is the set of valid data in the corresponding semi-structured data that is relevant to the mining purpose, and each piece of valid data contained in the sample item set is an item of the sample item set. For example, if the content of one piece of semi-structured data is "the patient was admitted to this ward due to dry mouth, polydipsia, polyuria, weight loss for 4 years, red swelling of the left foot, and ulceration for 3 weeks", then the application-oriented keywords "dry mouth", "polydipsia", "polyuria", "weight loss for 4 years", "red swelling of the left foot" and "ulceration for 3 weeks" are all valid data relevant to the mining purpose and constitute the sample item set of this piece of semi-structured data, that is, the sample item set contains 6 items; the auxiliary text "the patient" and "was admitted to this ward" is invalid data irrelevant to the mining purpose.
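As a minimal sketch of step one, the extraction of a sample item set could look as follows in Python. The text does not specify how valid data is recognized, so the keyword vocabulary, the verbatim substring matching and the names `extract_item_set` and `VOCABULARY` are all assumptions made for illustration only.

```python
def extract_item_set(record: str, vocabulary: set) -> frozenset:
    """Return the items (valid data) found in one semi-structured record.

    Assumption: an item is any vocabulary phrase that occurs verbatim in the
    record text; everything else is treated as invalid auxiliary text.
    """
    text = record.lower()
    return frozenset(term for term in vocabulary if term.lower() in text)


# Hypothetical vocabulary chosen for the mining purpose.
VOCABULARY = {
    "dry mouth",
    "polydipsia",
    "polyuria",
    "weight loss for 4 years",
    "red swelling of the left foot",
    "ulceration for 3 weeks",
}

record = (
    "The patient was admitted to this ward due to dry mouth, polydipsia, "
    "polyuria, weight loss for 4 years, red swelling of the left foot, "
    "and ulceration for 3 weeks."
)
items = extract_item_set(record, VOCABULARY)
```

A frozenset is used so that two records with the same combination of items compare equal regardless of item order, which is what the later duplicate check relies on.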
Step two, scanning all sample item sets of the mined sample database; during the scan, storing each sample item set whose combination of items is scanned for the first time and recording it as a stored sample item set; counting, for each stored sample item set, the number of identical sample item sets in the mined sample database; and counting, for each stored sample item set, the number of its proper subsets in the mined sample database, so as to establish the following semi-structured data frequent pattern linked list SFp-Link.
Referring to FIG. 2, the semi-structured data frequent pattern linked list SFp-Link consists of an item set linked list header qSetHead and item set linked lists of m levels.
The item set linked list header qSetHead is a pointer array consisting of m pointers, and the i-th pointer in the pointer array points to the item set linked list qSetLink_i of level i, wherein i is an integer with 1 ≤ i ≤ m, m is the maximum length among all sample item sets of the mined sample database, and the length of a sample item set is the number of items it contains.
The item set linked list qSetLink_i of level i consists of N_i item set linked list nodes SpcNode, wherein i is an integer with 1 ≤ i ≤ m, and N_i is the number of sample item sets of length i in the mined sample database. For example, referring to FIG. 2, in the item set linked list header qSetHead, the item set linked list qSetLink_1 pointed to by the 1st pointer contains N_1 = o item set linked list nodes SpcNode, the item set linked list qSetLink_2 pointed to by the 2nd pointer contains N_2 = p nodes, the item set linked list qSetLink_i pointed to by the i-th pointer contains N_i = q nodes, and the item set linked list qSetLink_m pointed to by the m-th pointer contains N_m = r nodes.
The j-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i consists of a sample item set address qSet_ij, a sample frequency sCnt_ij, a support frequency tCnt_ij and a linked list pointer link_ij, wherein i is an integer with 1 ≤ i ≤ m, and j is an integer with 1 ≤ j ≤ N_i.
The sample item set address qSet_ij is the storage address of the stored sample item set that is identical to a scanned sample item set, wherein the scanned sample item set is the j-th sample item set of length i scanned during the scan of the mined sample database. For example, for the item set linked list node SpcNode circled by the oval dashed frame in FIG. 2, the sample item set corresponding to its sample item set address qSet_ij has length i and was the 2nd of that length to be scanned.
The sample frequency sCnt_ij is the number of sample item sets, among all sample item sets of the mined sample database, that are identical to the stored sample item set stored at the sample item set address qSet_ij.
The support frequency tCnt_ij is the number of sample item sets, among all sample item sets of the mined sample database, that are proper subsets of the stored sample item set stored at the sample item set address qSet_ij.
The linked list pointer link_ij is a pointer to the (j+1)-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i. The N_i-th item set linked list node SpcNode of qSetLink_i has no next node, so the linked list pointer link_ij is null when j = N_i.
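The structure described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the field names qSet, sCnt, tCnt and link and the header name qSetHead follow the text, while the class name `SFpLink` and the helper methods `append` and `level` are hypothetical. A Python object reference stands in for the storage address qSet_ij.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpcNode:
    """One item set linked list node (field names follow the text)."""
    qSet: frozenset                   # reference standing in for the stored set's address
    sCnt: int = 0                     # sample frequency
    tCnt: int = 0                     # support frequency
    link: Optional["SpcNode"] = None  # next node of the same level; None for the last node


class SFpLink:
    """Header array qSetHead plus m singly linked lists, one per item set length."""

    def __init__(self, m: int):
        self.m = m
        # qSetHead[i - 1] is the head of qSetLink_i (levels are 1-based in the text).
        self.qSetHead = [None] * m

    def append(self, node: SpcNode) -> None:
        """Append a node at the tail of the list whose level equals len(node.qSet)."""
        i = len(node.qSet)
        current = self.qSetHead[i - 1]
        if current is None:
            self.qSetHead[i - 1] = node
            return
        while current.link is not None:
            current = current.link
        current.link = node

    def level(self, i: int):
        """Yield the nodes of qSetLink_i in the order they were appended."""
        node = self.qSetHead[i - 1]
        while node is not None:
            yield node
            node = node.link


# Usage: three item sets, two of length 1 and one of length 3, so m = 3.
sfp = SFpLink(m=3)
sfp.append(SpcNode(frozenset({"a"}), sCnt=1))
sfp.append(SpcNode(frozenset({"b"}), sCnt=1))
sfp.append(SpcNode(frozenset({"a", "b", "c"}), sCnt=1))
```

Level 2 stays empty in this usage, which corresponds to a null entry in the pointer array.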
For example, when a sample item set containing "dry mouth, polydipsia, polyuria, weight loss for 4 years, red swelling of the left foot, and ulceration for 3 weeks" is scanned for the first time, it is stored as a stored sample item set and the sample frequency sCnt_ij of that stored sample item set is incremented by 1. When a sample item set containing the same items is scanned a second time, it does not need to be stored again; only the sample frequency sCnt_ij of the stored sample item set is incremented by 1.
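The single-scan behaviour in this example can be sketched as below. Several simplifications are assumptions of this sketch and are not stated in the text: each level is a plain Python list rather than a chain of linked nodes, one node is kept per distinct item combination, a dictionary `index` is used to find an already stored set, and the support frequency is computed from the stored sets after the scan by reading its definition literally (sample item sets that are proper subsets of the stored set, counted with multiplicity).

```python
def build_sfp_link(sample_item_sets):
    """Build a simplified SFp-Link in one pass over the sample item sets.

    Returns a dict mapping level i (item set length) to a list of nodes in
    first-seen order; each node is a dict with keys qSet, sCnt and tCnt.
    """
    levels = {}  # level i -> list of nodes, in the order first scanned
    index = {}   # frozenset -> node; hypothetical helper for duplicate lookup

    for items in sample_item_sets:
        key = frozenset(items)
        node = index.get(key)
        if node is None:
            # First time this item combination is scanned: store it.
            node = {"qSet": key, "sCnt": 0, "tCnt": 0}
            index[key] = node
            levels.setdefault(len(key), []).append(node)
        # Every occurrence, first or repeated, adds 1 to the sample frequency.
        node["sCnt"] += 1

    # Support frequency under the literal reading (assumption): number of
    # sample item sets that are proper subsets of the stored set.
    nodes = list(index.values())
    for node in nodes:
        node["tCnt"] = sum(o["sCnt"] for o in nodes if o["qSet"] < node["qSet"])
    return levels


samples = [
    {"dry mouth", "polydipsia"},
    {"dry mouth"},
    {"dry mouth", "polydipsia"},  # repeated combination: not stored again
    {"polydipsia"},
]
levels = build_sfp_link(samples)
```

In this run the two-item combination is stored once with a sample frequency of 2, and the database itself is read only once; the support frequencies are derived from the stored sets without a second scan.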
Step three, performing frequent pattern mining on the semi-structured data in the mined sample database based on the semi-structured data frequent pattern linked list SFp-Link, namely: setting a support frequency threshold s_min according to the mining purpose, and scanning the item set linked lists of the m levels one by one from the highest level to the lowest level to extract the stored sample item sets whose support frequency tCnt_ij is at or above the support frequency threshold s_min; the extracted stored sample item sets are the frequent item sets.
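A minimal sketch of step three follows, assuming the levels are held as a dict from item set length to a list of nodes and that "at the threshold" means a support frequency greater than or equal to s_min. The function name `mine_frequent` and the hand-filled example data are hypothetical.

```python
def mine_frequent(levels, s_min):
    """Scan levels from longest to shortest and return the frequent item sets."""
    frequent = []
    for i in sorted(levels, reverse=True):  # high level to low level
        for node in levels[i]:              # nodes in scan order
            if node["tCnt"] >= s_min:       # assumed comparison: at or above s_min
                frequent.append(node["qSet"])
    return frequent


# Hypothetical, hand-filled levels: one sample each of {a}, {b}, {a, b}, {a, b, c}.
levels = {
    1: [{"qSet": frozenset({"a"}), "tCnt": 0},
        {"qSet": frozenset({"b"}), "tCnt": 0}],
    2: [{"qSet": frozenset({"a", "b"}), "tCnt": 2}],
    3: [{"qSet": frozenset({"a", "b", "c"}), "tCnt": 3}],
}
result = mine_frequent(levels, s_min=2)
```

Because the levels are visited from the highest to the lowest, longer frequent item sets appear first in the result.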
The present invention is not limited to the above embodiments, and various other equivalent modifications, substitutions and alterations can be made without departing from the basic technical concept of the invention as described above, according to the common technical knowledge and conventional means in the field.

Claims (1)

1. An SFp-Link-based semi-structured data frequent pattern mining method, characterized in that the method comprises the following steps:
Step one, performing data preprocessing on the mined sample database, namely:
extracting a sample item set from each piece of semi-structured data in the mined sample database, wherein the sample item set is the set of valid data in the corresponding semi-structured data that is relevant to the mining purpose, and each piece of valid data contained in the sample item set is an item of the sample item set;
Step two, scanning all sample item sets of the mined sample database; during the scan, storing each sample item set whose combination of items is scanned for the first time and recording it as a stored sample item set; counting, for each stored sample item set, the number of identical sample item sets in the mined sample database; and counting, for each stored sample item set, the number of its proper subsets in the mined sample database, so as to establish the following semi-structured data frequent pattern linked list SFp-Link:
the semi-structured data frequent pattern linked list SFp-Link consists of an item set linked list header qSetHead and item set linked lists of m levels;
the item set linked list header qSetHead is a pointer array consisting of m pointers, and the i-th pointer in the pointer array points to the item set linked list qSetLink_i of level i, wherein i is an integer with 1 ≤ i ≤ m, m is the maximum length among all sample item sets of the mined sample database, and the length of a sample item set is the number of items it contains;
the item set linked list qSetLink_i of level i consists of N_i item set linked list nodes SpcNode, wherein i is an integer with 1 ≤ i ≤ m, and N_i is the number of sample item sets of length i in the mined sample database;
the j-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i consists of a sample item set address qSet_ij, a sample frequency sCnt_ij, a support frequency tCnt_ij and a linked list pointer link_ij, wherein i is an integer with 1 ≤ i ≤ m, and j is an integer with 1 ≤ j ≤ N_i;
the sample item set address qSet_ij is the storage address of the stored sample item set that is identical to a scanned sample item set, wherein the scanned sample item set is the j-th sample item set of length i scanned during the scan of the mined sample database;
the sample frequency sCnt_ij is the number of sample item sets, among all sample item sets of the mined sample database, that are identical to the stored sample item set stored at the sample item set address qSet_ij;
the support frequency tCnt_ij is the number of sample item sets, among all sample item sets of the mined sample database, that are proper subsets of the stored sample item set stored at the sample item set address qSet_ij;
the linked list pointer link_ij is a pointer to the (j+1)-th item set linked list node SpcNode of the item set linked list qSetLink_i of level i, wherein link_ij is null when j = N_i;
Step three, performing frequent pattern mining on the semi-structured data in the mined sample database based on the semi-structured data frequent pattern linked list SFp-Link, namely: setting a support frequency threshold s_min according to the mining purpose, and scanning the item set linked lists of the m levels one by one from the highest level to the lowest level to extract the stored sample item sets whose support frequency tCnt_ij is at or above the support frequency threshold s_min; the extracted stored sample item sets are the frequent item sets.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710664740.9A CN107562800B (en) 2017-08-07 2017-08-07 SFp-Link-based semi-structured data frequent pattern mining method

Publications (2)

Publication Number Publication Date
CN107562800A (en) 2018-01-09
CN107562800B (en) 2020-06-05

Family

ID=60975022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710664740.9A Active CN107562800B (en) 2017-08-07 2017-08-07 SFp-Link-based semi-structured data frequent pattern mining method

Country Status (1)

Country Link
CN (1) CN107562800B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102163218A (en) * 2011-03-28 2011-08-24 武汉大学 Graph-index-based graph database keyword vicinity searching method
CN103955542A (en) * 2014-05-20 2014-07-30 广西教育学院 Method of item-all-weighted positive or negative association model mining between text terms and mining system applied to method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160371368A1 (en) * 2015-06-17 2016-12-22 Qualcomm Incorporated Facilitating searches in a semi-structured database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant