Summary of the invention
Embodiments of the present invention provide a frequent pattern mining method and device that reduce the space consumed during frequent pattern mining and allow mining to be restricted to frequent patterns of a meaningful length for different application scenarios, thereby effectively reducing the time and resource consumption of frequent pattern mining.
In a first aspect, a frequent pattern mining method is provided, the method comprising the following steps:
converting each token (segmented word) in a structured data set into a corresponding code, forming a one-to-one mapping between each token and its code;
combining any N of the codes to obtain several first candidate combinations, and screening the first candidate combinations that satisfy a first predetermined condition to obtain several second candidate combinations; wherein N is a positive integer greater than or equal to 2, and a first candidate combination satisfying the first predetermined condition is one in which the lengths of the tokens corresponding to all of its codes are within a predetermined length range;
screening the second candidate combinations that satisfy a second predetermined condition to obtain several target frequent-itemset code combinations; wherein a second candidate combination satisfying the second predetermined condition is one whose support is within a predetermined support range;
obtaining, according to the one-to-one mapping between tokens and codes, the tokens corresponding to each target frequent-itemset code combination, thereby obtaining the frequent itemset corresponding to each target frequent-itemset code combination.
With reference to the first aspect, in a first possible implementation, the method further comprises the following steps:
forming, according to the data source file of each token in the structured data set and the code corresponding to that token, a mapping between codes and data source file identifiers; wherein each data source file has a unique identifier;
determining, according to the mapping between codes and identifiers, the intersection of the identifiers corresponding to the codes in each target frequent-itemset code combination, to obtain the set of identifiers corresponding to each target frequent-itemset code combination;
determining, according to each set of identifiers, the set of source files corresponding to each target frequent-itemset code combination.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation, the method further comprises a step of forming the structured data set, which includes the following sub-steps:
performing word segmentation on the input text to obtain a token data set containing several tokens;
removing the stop words from the token data set using a stop-word set;
removing repeated tokens from the token data set, keeping only one instance of each;
separating the tokens in the token data set to obtain a structured data set that satisfies a predetermined structure.
With reference to the first possible implementation of the first aspect, in a third possible implementation, before obtaining the tokens corresponding to each target frequent-itemset code combination, the method further comprises the following step:
constructing an FP-Tree from the target frequent-itemset code combinations.
With reference to the first aspect, in a fourth possible implementation, the method determines the support of a second candidate combination using the following steps:
determining each code in the current second candidate combination;
screening, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the data source files in which the tokens corresponding to all codes of the current second candidate combination occur together, and counting the data source files so obtained to get the co-occurrence file count;
calculating, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the sum of the numbers of data source files corresponding to the codes in each second candidate combination, to get the source file count of each second candidate combination;
summing all the source file counts to get the total source file count;
calculating the quotient of the co-occurrence file count and the total source file count to get the support of the current second candidate combination.
With reference to the first aspect, in a fifth possible implementation, the method further comprises a step of setting the predetermined support range and the predetermined length range.
In a second aspect, a frequent pattern mining device is provided, the device comprising:
a code conversion module, configured to convert each token in a structured data set into a corresponding code, forming a one-to-one mapping between each token and its code;
a first screening module, configured to combine any N of the codes to obtain several first candidate combinations, and to screen the first candidate combinations that satisfy a first predetermined condition to obtain several second candidate combinations; wherein N is a positive integer greater than or equal to 2, and a first candidate combination satisfying the first predetermined condition is one in which the lengths of the tokens corresponding to all of its codes are within a predetermined length range;
a second screening module, configured to screen the second candidate combinations that satisfy a second predetermined condition to obtain several target frequent-itemset code combinations; wherein a second candidate combination satisfying the second predetermined condition is one whose support is within a predetermined support range;
a frequent itemset determining module, configured to obtain, according to the one-to-one mapping between tokens and codes, the tokens corresponding to each target frequent-itemset code combination, thereby obtaining the frequent itemset corresponding to each target frequent-itemset code combination.
With reference to the second aspect, in a first possible implementation, the device further comprises:
a data source tracing module, configured to form, according to the data source file of each token in the structured data set and the code corresponding to that token, a mapping between codes and data source file identifiers; wherein each data source file has a unique identifier;
a source file obtaining module, configured to determine, according to the mapping between codes and identifiers, the intersection of the identifiers corresponding to the codes in each target frequent-itemset code combination, to obtain the set of identifiers corresponding to each target frequent-itemset code combination;
a source file determining module, configured to determine, according to each set of identifiers, the set of source files corresponding to each target frequent-itemset code combination.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation, the device further comprises a data processing module, configured to preprocess the input file to obtain the structured data set;
the data processing module includes:
a word segmentation submodule, configured to perform word segmentation on the input text to obtain a token data set containing several tokens;
a stop-word processing submodule, configured to remove the stop words from the token data set using a stop-word set;
a deduplication submodule, configured to remove repeated tokens from the token data set, keeping only one instance of each;
a token separation submodule, configured to separate the tokens in the token data set to obtain a structured data set that satisfies a predetermined structure.
With reference to the second aspect, in a third possible implementation, the device further comprises a support determining module, configured to determine the support of each second candidate combination;
the support determining module includes:
a code determining module, configured to determine each code in the current second candidate combination;
a co-occurrence file count determining module, configured to screen, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the data source files in which the tokens corresponding to all codes of the current second candidate combination occur together, and to count the data source files so obtained to get the co-occurrence file count;
a source file count determining module, configured to calculate, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the sum of the numbers of data source files corresponding to the codes in each second candidate combination, to get the source file count of each second candidate combination;
a total source file count determining module, configured to sum all the source file counts to get the total source file count;
a support determining module, configured to calculate the quotient of the co-occurrence file count and the total source file count to get the support of the current second candidate combination.
In the above technical solution of the embodiments of the present invention, each token is first converted into a corresponding code, and screening is then performed on the codes to obtain the target frequent-itemset code combinations, where each target frequent-itemset code combination contains the codes corresponding to the words of a frequent itemset. Constructing the FP-Tree or mining frequent itemsets from these code combinations, rather than directly from the words themselves, effectively reduces the space consumed during frequent pattern mining. In addition, the technical solution uses the predetermined length range to screen the codes corresponding to frequent itemsets, so that mining can be restricted to frequent patterns of a meaningful length for different application scenarios. This effectively reduces the time and resource consumption of frequent pattern mining and improves the suitability of the technical solution for engineering applications.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
A frequent pattern mining method, as shown in Fig. 1, comprises the following steps:
110, convert each token (segmented word) in the structured data set into a corresponding code, forming a one-to-one mapping between each token and its code;
In this step, the way any two adjacent tokens in the structured data set are joined satisfies a predetermined format; for example, any two adjacent tokens are joined by a space. As another example, the data format of the structured data set is one that can be recognized by the corresponding algorithm, such as the FPGrowth algorithm, i.e., the data format of the structured data set meets the FPGrowth algorithm's requirements on data format. The data in the structured data set can be regarded as a series of serialized data;
In this step, each token is converted into one corresponding code, and tokens and codes are in one-to-one correspondence. A token may be represented by a code of any form, and any method capable of generating a unique code for each token may be used; this embodiment places no limitation on the specific form of the code or on the code conversion technique that generates it. As a simple example, the code of the first token in the structured data set may be set to 0, the code of the second token to 1, and so on, so that the code of the N-th token is set to N-1 (N greater than 2), thereby converting tokens into codes;
In this step, the mapping between tokens and codes can be automatically written to a file and saved to disk;
120, combine any N of the codes to obtain several first candidate combinations, and screen the first candidate combinations that satisfy the first predetermined condition to obtain several second candidate combinations; wherein N is a positive integer greater than or equal to 2, and a first candidate combination satisfying the first predetermined condition is one in which the lengths of the tokens corresponding to all of its codes are within the predetermined length range, i.e., the second candidate combinations are the first candidate combinations in which the lengths of the tokens corresponding to all codes are within the predetermined length range;
In this step, the predetermined length range is set in advance and can take different values depending on what the practical application scenario requires of the mined frequent patterns, for example 2-4, 3-5, or 6-8. In addition, a step of setting the predetermined length range may be performed before this step;
In this step, a first candidate combination corresponds to a possible frequent itemset. A frequent itemset is a set of tokens that occur together, so at least two tokens must co-occur; a first candidate combination must therefore contain two or more codes;
In this step, the predetermined length range is used to screen the codes corresponding to frequent itemsets, so that mining can be restricted to frequent patterns of a meaningful length for different application scenarios. This effectively reduces the time and resource consumption of frequent pattern mining and improves the suitability of the technical solution for engineering applications;
130, screen the second candidate combinations that satisfy the second predetermined condition to obtain several target frequent-itemset code combinations; wherein a second candidate combination satisfying the second predetermined condition is one whose support is within the predetermined support range, i.e., the target frequent-itemset code combinations are the second candidate combinations whose support is within the predetermined support range;
In this step, the predetermined support range is set in advance and can take different values depending on what the practical application scenario requires of the mined frequent patterns, for example 0.3-0.5, 0.85-0.95, 0.8-0.9, or greater than 0.75. Here the maximum support is 1 and the minimum support is 0. In addition, a step of setting the predetermined support range may be performed before this step;
In this step, the support indicates how frequently several tokens appear together in the same data source file (i.e., the co-occurrence frequency). This step uses the co-occurrence frequency as the screening condition to select the target frequent-itemset code combinations that meet the support requirement, so that the frequency with which the tokens of a target frequent-itemset code combination appear in the same data source file reaches a predetermined requirement; that is, the tokens corresponding to a target frequent-itemset code combination form a frequent itemset that meets the predetermined requirement;
Here, after the target frequent-itemset code combinations have been obtained by screening, they can be used directly to build an FP-Tree or to mine frequent itemsets. For example, the FPGrowth algorithm of an open-source algorithm library can be used to build an FP-Tree from the target frequent-itemset code combinations and recursively mine frequent patterns. This avoids building the FP-Tree directly from words and effectively reduces the space consumed during frequent pattern mining;
140, obtain, according to the one-to-one mapping between tokens and codes, the tokens corresponding to each target frequent-itemset code combination, thereby obtaining the frequent itemset corresponding to each target frequent-itemset code combination;
In this step, the codes are converted back into the corresponding tokens using the code conversion technique that corresponds to step 110, which yields the frequent itemsets. Likewise, this step places no limitation on the specific code conversion technique used.
In this embodiment, each token is first converted into a corresponding code, and screening is then performed on the codes to obtain the target frequent-itemset code combinations, where each target frequent-itemset code combination contains the codes corresponding to the words of a frequent itemset. The FP-Tree is then constructed, or frequent itemsets are mined, from these code combinations rather than directly from the words, which effectively reduces the space consumed during frequent pattern mining. In addition, this embodiment uses the predetermined length range to screen the codes corresponding to frequent itemsets, so that mining can be restricted to frequent patterns of a meaningful length for different application scenarios. This effectively reduces the time and resource consumption of frequent pattern mining and improves the suitability of the technical solution for engineering applications.
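Steps 110 to 140 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, the first predetermined condition is read literally as "every token in the combination has a length within the range", and `toy_support` is a conventional transaction-fraction stand-in, not the support computation that the text defines in a later embodiment.

```python
from itertools import combinations


def encode_tokens(transactions):
    """Step 110: give each distinct token a unique integer code (0, 1, 2, ...)."""
    token_to_code = {}
    for tokens in transactions:
        for token in tokens:
            token_to_code.setdefault(token, len(token_to_code))
    code_to_token = {code: token for token, code in token_to_code.items()}
    return token_to_code, code_to_token


def first_screening(code_to_token, n, min_len, max_len):
    """Step 120: build all N-code combinations and keep those whose tokens
    all have a length inside [min_len, max_len] (assumed reading)."""
    kept = []
    for combo in combinations(sorted(code_to_token), n):
        if all(min_len <= len(code_to_token[c]) <= max_len for c in combo):
            kept.append(combo)
    return kept


def second_screening(candidates, support_fn, min_sup, max_sup):
    """Step 130: keep combinations whose support lies inside [min_sup, max_sup]."""
    return [c for c in candidates if min_sup <= support_fn(c) <= max_sup]


def decode(combos, code_to_token):
    """Step 140: map each code combination back to its tokens."""
    return [[code_to_token[c] for c in combo] for combo in combos]


# Hypothetical toy data: each inner list is the token set of one source file.
transactions = [["data", "mining", "tree"], ["data", "tree", "ai"], ["mining", "data"]]
token_to_code, code_to_token = encode_tokens(transactions)
second = first_screening(code_to_token, n=2, min_len=4, max_len=6)


def toy_support(combo):
    # Stand-in support: fraction of transactions containing every token of combo.
    wanted = {code_to_token[c] for c in combo}
    return sum(wanted <= set(t) for t in transactions) / len(transactions)


targets = second_screening(second, toy_support, min_sup=0.5, max_sup=1.0)
frequent_itemsets = decode(targets, code_to_token)
```

All screening works on small integers; the token strings are only touched when codes are assigned and when the final combinations are decoded, which is the source of the space saving the embodiment describes.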
In one embodiment, as shown in Fig. 2, the frequent pattern mining method further comprises the following steps:
210, form, according to the data source file of each token in the structured data set and the code corresponding to that token, a mapping between codes and data source file identifiers; wherein each data source file has a unique identifier;
In this step, the mapping between codes and identifiers can, like the mapping between tokens and codes, be automatically written to a file and saved to disk;
In this step, the mapping between codes and identifiers may be one-to-many, i.e., the token corresponding to a given code may appear in multiple data source files;
220, determine, according to the mapping between codes and identifiers, the intersection of the identifiers corresponding to the codes in each target frequent-itemset code combination, to obtain the set of identifiers corresponding to each target frequent-itemset code combination;
In this step, for the set of identifiers corresponding to a target frequent-itemset code combination, the data source file denoted by each identifier in the set contains the tokens corresponding to all codes of that target frequent-itemset code combination;
230, determine, according to each set of identifiers and the mapping between data source files and identifiers, the set of data source files corresponding to each set of identifiers, i.e., the set of data source files corresponding to each target frequent-itemset code combination.
This embodiment introduces a tag-tracing technique into the code conversion process: the identifiers of the data source files corresponding to every code are cached, so that frequent patterns can be automatically traced back to their source data.
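Steps 210 to 230 amount to building an inverted index from codes to file identifiers and intersecting its entries. The following Python sketch shows one way this could look; the function names, the integer identifiers, and the file names are all hypothetical.

```python
def build_code_to_file_ids(files, token_to_code):
    """Step 210: `files` maps a file identifier to the tokens found in that file.
    Returns a one-to-many mapping: code -> set of file identifiers."""
    code_to_ids = {}
    for file_id, tokens in files.items():
        for token in tokens:
            code_to_ids.setdefault(token_to_code[token], set()).add(file_id)
    return code_to_ids


def trace_sources(combo, code_to_ids, id_to_path):
    """Steps 220-230: intersect the identifier sets of all codes in the
    combination, then resolve the identifiers to source files."""
    ids = set.intersection(*(code_to_ids[code] for code in combo))
    return sorted(id_to_path[i] for i in ids)


# Hypothetical toy data.
token_to_code = {"data": 0, "mining": 1, "tree": 2}
files = {1: ["data", "mining"], 2: ["data", "tree"], 3: ["mining", "data", "tree"]}
id_to_path = {1: "a.txt", 2: "b.txt", 3: "c.txt"}

code_to_ids = build_code_to_file_ids(files, token_to_code)
sources_01 = trace_sources((0, 1), code_to_ids, id_to_path)
sources_12 = trace_sources((1, 2), code_to_ids, id_to_path)
```

Because the index is keyed by code, tracing a frequent itemset back to its files needs only set intersections and never rescans the original text.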
In one embodiment, as shown in Fig. 3, the frequent pattern mining method further comprises a step of forming the structured data set, which includes the following sub-steps:
310, perform word segmentation on the input text to obtain a token data set containing several tokens;
In this step, any word segmentation method may be used to segment the input text; here, a token may consist of several characters that together can serve as one word;
320, remove the stop words from the token data set using a stop-word set;
In this step, an existing stop-word set is used to remove stop words that carry no meaning;
330, remove repeated tokens from the token data set, keeping only one instance of each;
In this step, any deduplication method may be used to remove the repeated tokens from the token data set;
340, separate the tokens in the token data set to obtain a structured data set that satisfies the predetermined structure;
In this step, separating the tokens includes setting a separator between two adjacent tokens, or setting the way two adjacent tokens are joined, for example joining two adjacent tokens with a space;
This step may also include filtering out symbols.
This embodiment adds a data processing flow before frequent pattern mining, including automatic word segmentation, stop-word filtering, and automatic token separation, so that the method of the preceding embodiment directly receives serialized data that satisfies the predetermined structure. This improves the suitability of the frequent pattern mining method itself for engineering applications.
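The preprocessing sub-steps 310 to 340 can be sketched in Python as follows. This is only an illustration: whitespace splitting stands in for a real word segmentation tool (which would be needed for unsegmented text such as Chinese), and the function name, punctuation set, and stop-word set are assumptions.

```python
def preprocess(text, stop_words, separator=" "):
    """Turn raw text into one space-separated record of unique tokens."""
    # 310: segment the text (whitespace split as a stand-in tokenizer).
    tokens = text.split()
    # Optional symbol filtering: strip surrounding punctuation from each token.
    tokens = [t.strip(".,;:!?\"'()") for t in tokens]
    # 320: drop empty strings and stop words.
    tokens = [t for t in tokens if t and t.lower() not in stop_words]
    # 330: remove repeated tokens, keeping the first occurrence of each.
    unique = list(dict.fromkeys(tokens))
    # 340: join adjacent tokens with the chosen separator.
    return separator.join(unique)


record = preprocess("the data mining tree, the data tree.", stop_words={"the"})
```

`dict.fromkeys` is used for deduplication because it preserves first-occurrence order, which keeps the output deterministic.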
In one embodiment, the frequent pattern mining method determines the support of a second candidate combination using the following steps:
410, determine each code in the current second candidate combination;
In this step, the second candidate combination contains two or more codes;
420, screen, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the data source files in which the tokens corresponding to all codes of the current second candidate combination occur together, and count the data source files so obtained to get the co-occurrence file count (i.e., the number of data source files in which the tokens corresponding to all codes of the second candidate combination occur together);
430, calculate, according to the mapping between codes and identifiers, the source file count of each second candidate combination; wherein the source file count is the sum of the numbers of data source files corresponding to the codes in that second candidate combination;
440, sum all the source file counts to get the total source file count;
In this step, the total source file count is the sum of the source file counts of all second candidate combinations;
450, calculate the quotient of the co-occurrence file count and the total source file count to get the support of the current second candidate combination.
This embodiment determines the support of the second candidate combinations from the co-occurrence frequency of their tokens.
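Steps 410 to 450 can be written out as a short Python sketch that follows the text literally. Note the assumption being made explicit: the denominator is the sum of the per-combination source file counts over all second candidate combinations, as the steps state, which is not the same as the number of distinct files used in the conventional definition of support. The function name and data are hypothetical.

```python
def support_values(candidates, code_to_ids):
    """Compute the support of every second candidate combination.

    candidates:  list of code tuples (each with two or more codes)
    code_to_ids: mapping code -> set of data source file identifiers
    """
    # 430: per-combination source file count = sum of per-code file counts.
    source_counts = {
        combo: sum(len(code_to_ids[code]) for code in combo)
        for combo in candidates
    }
    # 440: total source file count over all second candidate combinations.
    total = sum(source_counts.values())

    supports = {}
    for combo in candidates:
        # 420: files in which all tokens of the combination occur together.
        cooccurrence = len(set.intersection(*(code_to_ids[c] for c in combo)))
        # 450: quotient of co-occurrence file count and total source file count.
        supports[combo] = cooccurrence / total if total else 0.0
    return supports


# Hypothetical toy index: code -> identifiers of the files containing its token.
code_to_ids = {0: {1, 2, 3}, 1: {1, 3}, 2: {2, 3}}
supports = support_values([(0, 1), (1, 2)], code_to_ids)
```

In this toy case the source file counts are 5 and 4, the total is 9, and the co-occurrence counts are 2 and 1, so the supports are 2/9 and 1/9. Values computed this way shrink as the candidate list grows, so the predetermined support range would have to be chosen with that in mind.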
After receiving serialized data that satisfies the predetermined structure, the frequent pattern mining method of the above embodiments uses a code conversion technique to map the structured, separated text one-to-one onto corresponding codes, i.e., it converts the serialized data satisfying the predetermined structure into ordered, unique codes. The codes are then used to build the FP-Tree or mine frequent itemsets during frequent pattern mining, which avoids the large space overhead caused by extensive text operations. In addition, drawing on practical engineering experience, the above embodiments use the predetermined length range and the predetermined support range during mining to screen frequent itemsets of practical significance, and trace their data source files. The result is an improved frequent pattern mining method that allows the pattern length and support thresholds to be customized and controlled externally, can trace the data sources of frequent itemsets, and is better adapted to engineering use.
The frequent pattern mining method of the present invention is described in detail below by way of another specific embodiment.
This embodiment combines the frequent pattern mining method of the present invention with FPGrowth to mine frequent patterns, and specifically comprises the following steps:
Step 1: data preprocessing;
Specifically, an open-source word segmentation tool is first used to segment the input text; a default stop-word vocabulary is then used to filter the stop words out of the segmentation result; finally, according to the data format required by FPGrowth, the segmented and stop-word-filtered result is automatically separated and assembled into the structured data set (i.e., the separated tokens are arranged into a structured data set satisfying the predetermined structure).
Step 2: convert each token into a corresponding code, establish the mapping between each token and its code, and establish the mapping between codes and the identifiers of the data source files of the tokens; both mappings are then automatically written to files and saved to disk.
Step 3: mine frequent patterns under the predetermined conditions;
Specifically, this embodiment modifies the source code of the FPGrowth algorithm so that the algorithm accepts the codes produced by the above preprocessing together with the configured predetermined support range and predetermined length range; the predetermined length range is bounded by a minimum pattern length and a maximum pattern length, and the predetermined support range is bounded by a minimum support and a maximum support;
That is, when frequent patterns are connected during the screening of frequent itemsets, it is judged whether the support of the frequent itemset corresponding to the current code combination and the length of each token therein satisfy the parameter conditions received by the algorithm (i.e., whether they fall within the predetermined support range and the predetermined length range). If so, the corresponding tokens are connected and the current frequent pattern (i.e., the current frequent itemset) is recorded; otherwise, the algorithm proceeds to judge whether the words of the next code combination constitute a frequent itemset, and to connect them where they do. Proceeding in this way, the frequent itemsets that satisfy the predetermined conditions (and thus have business value) are mined.
Step 4: convert the codes back into the corresponding tokens, and trace the data sources;
Specifically,
the mapping between tokens and codes and the mapping between codes and data source file identifiers are automatically loaded from disk;
each frequent itemset is traversed and processed as follows: the codes in the current frequent itemset are extracted; according to the mapping between codes and data source file identifiers, the set of identifiers corresponding to each code is extracted and the intersection of these sets is taken, giving the set of data source files of the current frequent itemset; according to the mapping between tokens and codes, the token corresponding to each code is extracted, giving the tokens that the current frequent itemset actually contains and completing the conversion from codes back to tokens;
Finally, the improved FPGrowth algorithm outputs the frequent itemsets expressed as tokens, the support value of each frequent itemset, and the set of data source files corresponding to each frequent itemset.
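The parameter check described in Step 3 can be sketched as a small Python predicate. This is a hypothetical illustration rather than the modified FPGrowth source: the class and function names are invented, and "pattern length" is read here as the number of items in a pattern, following the "minimum pattern length and maximum pattern length" wording of Step 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningLimits:
    """Externally configurable bounds passed to the mining routine."""
    min_pattern_len: int
    max_pattern_len: int
    min_support: float
    max_support: float

    def accepts(self, pattern, support):
        # A pattern is recorded only if both its length and its support
        # fall inside the configured ranges.
        return (self.min_pattern_len <= len(pattern) <= self.max_pattern_len
                and self.min_support <= support <= self.max_support)


def filter_patterns(patterns_with_support, limits):
    """Keep the (pattern, support) pairs that satisfy the limits."""
    return [(p, s) for p, s in patterns_with_support if limits.accepts(p, s)]


# Hypothetical mined code patterns with their support values.
mined = [((0, 1), 0.6), ((0, 1, 2), 0.4), ((0, 1, 2, 3, 4), 0.9), ((2, 3), 0.2)]
limits = MiningLimits(min_pattern_len=2, max_pattern_len=3,
                      min_support=0.3, max_support=1.0)
kept = filter_patterns(mined, limits)
```

Applying the check while patterns are being connected, as Step 3 describes, rather than after mining, is what saves time and memory, because out-of-range patterns are never extended.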
The frequent pattern mining method of this embodiment significantly reduces space overhead by using the code conversion technique. It modifies the FPGrowth source code to implement a conditional frequent pattern mining strategy that mines frequent itemsets of high business value, effectively reducing the time and resource consumption of frequent pattern mining. It also uses the data source tracing scheme to trace the data source files of frequent itemsets, and adds a data preprocessing step that improves the suitability of the improved FPGrowth algorithm for industrial application.
Corresponding to the above frequent pattern mining method, an embodiment of the present invention also provides a frequent pattern mining device. As shown in Fig. 4, the device includes:
a code conversion module, configured to convert each token in the structured data set into a corresponding code, forming a one-to-one mapping between each token and its code;
a first screening module, configured to combine any N of the codes to obtain several first candidate combinations, and to screen the first candidate combinations that satisfy the first predetermined condition to obtain several second candidate combinations; wherein N is a positive integer greater than or equal to 2, and a first candidate combination satisfying the first predetermined condition is one in which the lengths of the tokens corresponding to all of its codes are all within the predetermined length range;
a second screening module, configured to screen the second candidate combinations that satisfy the second predetermined condition to obtain several target frequent-itemset code combinations; wherein a second candidate combination satisfying the second predetermined condition is one whose support is within the predetermined support range;
a frequent itemset determining module, configured to obtain, according to the one-to-one mapping between tokens and codes, the tokens corresponding to each target frequent-itemset code combination, thereby obtaining the frequent itemset corresponding to each target frequent-itemset code combination.
In one embodiment, as shown in Fig. 5, the frequent pattern mining device further includes:
a data source tracing module, configured to form, according to the data source file of each token in the structured data set and the code corresponding to that token, a mapping between codes and data source file identifiers; wherein each data source file has a unique identifier;
a source file obtaining module, configured to determine, according to the mapping between codes and identifiers, the intersection of the identifiers corresponding to the codes in each target frequent-itemset code combination, to obtain the set of identifiers corresponding to each target frequent-itemset code combination;
a source file determining module, configured to determine, according to each set of identifiers, the set of source files corresponding to each target frequent-itemset code combination.
In one embodiment, as shown in Fig. 5, the frequent pattern mining device further includes a data processing module, configured to preprocess the input file to obtain the structured data set;
the data processing module includes:
a word segmentation submodule, configured to perform word segmentation on the input text to obtain a token data set containing several tokens;
a stop-word processing submodule, configured to remove the stop words from the token data set using a stop-word set;
a deduplication submodule, configured to remove repeated tokens from the token data set, keeping only one instance of each;
a token separation submodule, configured to separate the tokens in the token data set to obtain a structured data set that satisfies the predetermined structure.
In one embodiment, the frequent pattern mining device further includes a support determining module, configured to determine the support of each second candidate combination;
the support determining module includes:
a code determining module, configured to determine each code in the current second candidate combination;
a co-occurrence file count determining module, configured to screen, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the data source files in which the tokens corresponding to all codes of the current second candidate combination occur together, and to count the data source files so obtained to get the co-occurrence file count;
a source file count determining module, configured to calculate, according to the mapping between codes and identifiers, the source file count of each second candidate combination; wherein the source file count is the sum of the numbers of data source files corresponding to the codes in that second candidate combination;
a total source file count determining module, configured to sum all the source file counts to get the total source file count;
a support determining module, configured to calculate the quotient of the co-occurrence file count and the total source file count to get the support of the current second candidate combination.
The device in the above embodiments of the present invention is the product corresponding to the method in the above embodiments; each step of the method is carried out by a component or module of the device, so the identical parts are not described again.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.