CN108846083A - Frequent Pattern Mining method and device

Info

Publication number: CN108846083A
Application number: CN201810594153.1A
Authority: CN
Inventors: 李德彦; 晋耀红; 席丽娜
Original assignee: Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: Beijing Shenzhou Taiyue Software Co Ltd
Priority date: 2018-06-11
Filing date: 2018-06-11
Publication date: 2018-11-20
Anticipated expiration: 2038-06-11
Also published as: CN108846083B

Abstract

Embodiments of the invention provide a frequent pattern mining method and device. Each token is first converted to a corresponding code, and the codes are then screened to obtain target frequent item set code combinations, where each target frequent item set code combination contains the codes of the words in one frequent item set. The FP-Tree is constructed, or frequent item set mining is carried out, on these code combinations rather than directly on the words, which effectively reduces the space consumed during frequent pattern mining. In addition, the codes of the frequent item sets are screened against a predetermined length range, so that only frequent patterns of a meaningful length are mined for a given application scenario. This effectively reduces the time and resource consumption of frequent pattern mining and improves the engineering applicability of the technical solution.

Description

Frequent Pattern Mining method and device

Technical field

Embodiments of the present invention relate to the technical field of data processing, and more particularly to a frequent pattern mining method and device.

Background art

A frequent pattern is a pattern, such as an item set, a subsequence, or a substructure, that appears frequently in a data set. Here, a frequent item set, subsequence, or substructure refers to two or more words that appear together in the data set. Frequent patterns play a vital role in mining associations, correlations, and many other relationships among data. The study of frequent patterns also supports data classification, data clustering, and other data mining tasks. Frequent pattern mining has therefore become an important data mining task and a focus of data mining research.

Currently, frequent pattern mining relies mainly on two algorithms: the Apriori algorithm and the FPGrowth algorithm. During frequent pattern mining, the Apriori algorithm first generates candidate item sets and then matches and counts them to decide whether each candidate is a frequent item set. The algorithm can generate a large number of candidate item sets, which consumes a large amount of space, and matching against such a huge candidate set incurs a high cost in time and resources.

To avoid generating and matching a large volume of candidate item sets, the prior art proposes the FPGrowth algorithm. The algorithm directly receives a data set whose tokens have already been separated, constructs an FP-Tree from that data set, and then mines frequent patterns on the FP-Tree recursively. It can mine frequent patterns without generating the candidate sets described above, and so overcomes that defect of the Apriori algorithm. However, the FPGrowth algorithm takes the segmented and separated text directly as its input, so its space overhead is very large. Moreover, the FPGrowth algorithm starts mining at a frequent pattern length of 1, and the frequent pattern length keeps increasing until mining ends. Practical application scenarios are usually concerned only with frequent patterns of specific lengths, so mining in this way performs a large amount of meaningless work, wastes resources and time, and leaves the FPGrowth algorithm with weak engineering applicability.

In summary, how to reduce the space, time, and resource consumption of frequent pattern mining has become a technical problem that urgently needs to be solved.

Summary of the invention

Embodiments of the present invention provide a frequent pattern mining method and device that can reduce the space consumed during frequent pattern mining and can mine frequent patterns of a meaningful length for different application scenarios, thereby effectively reducing the time and resource consumption of frequent pattern mining.

In a first aspect, a frequent pattern mining method is provided. The method includes the following steps:

Converting each token in a structured data set to a corresponding code, forming a one-to-one mapping between each token and its code;

Combining any N of the codes to obtain several first candidate combinations, and screening the first candidate combinations that satisfy a first predetermined condition to obtain several second candidate combinations, where N is a positive integer greater than or equal to 2, and a first candidate combination satisfies the first predetermined condition if the length of the words corresponding to all of its codes is within a predetermined length range;

Screening the second candidate combinations that satisfy a second predetermined condition to obtain several target frequent item set code combinations, where a second candidate combination satisfies the second predetermined condition if its support is within a predetermined support range;

Obtaining, according to the one-to-one mapping between tokens and codes, the tokens corresponding to each target frequent item set code combination, thereby obtaining the frequent item set corresponding to each target frequent item set code combination.

With reference to the first aspect, in a first possible implementation, the method further includes the following steps:

Forming a mapping between codes and identifiers of data source files according to the data source file of each token in the structured data set and the code corresponding to that token, where each data source file has a unique identifier;

Determining, according to the mapping between codes and identifiers, the intersection of the identifiers corresponding to the codes in each target frequent item set code combination, to obtain the set of identifiers corresponding to each target frequent item set code combination;

Determining, according to each set of identifiers, the set of source files corresponding to each target frequent item set code combination.

With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation, the method further includes a step of forming the structured data set, which includes the following sub-steps:

Performing word segmentation on the input text to obtain a token data set containing several tokens;

Removing the stop words from the token data set by using a stop-word set;

Removing repeated tokens from the token data set so that only one copy of each token is retained;

Separating the tokens in the token data set to obtain a structured data set that satisfies a predetermined structure.

With reference to the first possible implementation of the first aspect, in a third possible implementation, before the tokens corresponding to each target frequent item set code combination are obtained, the method further includes the following step:

Constructing an FP-Tree from the target frequent item set code combinations.

With reference to the first aspect, in a fourth possible implementation, the method determines the support of a second candidate combination by the following steps:

Determining each code in the current second candidate combination;

Screening, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the data source files in which the tokens corresponding to all codes in the current second candidate combination occur together, and counting the screened data source files to obtain a co-occurrence file count;

Calculating, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the sum of the numbers of data source files corresponding to the codes in each second candidate combination, to obtain the source file count of each second candidate combination;

Calculating the sum of all source file counts to obtain a total source file count;

Calculating the quotient of the co-occurrence file count and the total source file count to obtain the support of the current second candidate combination.

With reference to the first aspect, in a fifth possible implementation, the method further includes a step of setting the predetermined support range and the predetermined length range.

In a second aspect, a frequent pattern mining device is provided. The device includes:

A code conversion module, configured to convert each token in a structured data set to a corresponding code, forming a one-to-one mapping between each token and its code;

A first screening module, configured to combine any N of the codes to obtain several first candidate combinations, and to screen the first candidate combinations that satisfy a first predetermined condition to obtain several second candidate combinations, where N is a positive integer greater than or equal to 2, and a first candidate combination satisfies the first predetermined condition if the length of the words corresponding to all of its codes is within a predetermined length range;

A second screening module, configured to screen the second candidate combinations that satisfy a second predetermined condition to obtain several target frequent item set code combinations, where a second candidate combination satisfies the second predetermined condition if its support is within a predetermined support range;

A frequent item set determining module, configured to obtain, according to the one-to-one mapping between tokens and codes, the tokens corresponding to each target frequent item set code combination, thereby obtaining the frequent item set corresponding to each target frequent item set code combination.

With reference to the second aspect, in a first possible implementation, the device further includes:

A data source tracing module, configured to form a mapping between codes and identifiers of data source files according to the data source file of each token in the structured data set and the code corresponding to that token, where each data source file has a unique identifier;

A source file obtaining module, configured to determine, according to the mapping between codes and identifiers, the intersection of the identifiers corresponding to the codes in each target frequent item set code combination, to obtain the set of identifiers corresponding to each target frequent item set code combination;

A source file determining module, configured to determine, according to each set of identifiers, the set of source files corresponding to each target frequent item set code combination.

With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation, the device further includes a data processing module, configured to preprocess the input file to obtain the structured data set;

The data processing module includes:

A word segmentation submodule, configured to perform word segmentation on the input text to obtain a token data set containing several tokens;

A stop-word processing submodule, configured to remove the stop words from the token data set by using a stop-word set;

A deduplication submodule, configured to remove repeated tokens from the token data set so that only one copy of each token is retained;

A token separation submodule, configured to separate the tokens in the token data set to obtain a structured data set that satisfies a predetermined structure.

With reference to the second aspect, in a third possible implementation, the device further includes a support determining module, configured to determine the support of each second candidate combination;

The support determining module includes:

A code determining module, configured to determine each code in the current second candidate combination;

A co-occurrence file count determining module, configured to screen, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the data source files in which the tokens corresponding to all codes in the current second candidate combination occur together, and to count the screened data source files to obtain a co-occurrence file count;

A source file count determining module, configured to calculate, according to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, the sum of the numbers of data source files corresponding to the codes in each second candidate combination, to obtain the source file count of each second candidate combination;

A total source file count determining module, configured to calculate the sum of all source file counts to obtain a total source file count;

A support determining module, configured to calculate the quotient of the co-occurrence file count and the total source file count to obtain the support of the current second candidate combination.

In the above technical solution of the embodiments of the present invention, each token is first converted to a corresponding code, and the codes are then screened to obtain target frequent item set code combinations, where each target frequent item set code combination contains the codes of the words in one frequent item set. The FP-Tree is constructed, or frequent item set mining is carried out, on these code combinations rather than directly on the words, which effectively reduces the space consumed during frequent pattern mining. In addition, the codes of the frequent item sets are screened against a predetermined length range, so that frequent patterns of a meaningful length can be mined for different application scenarios. This effectively reduces the time and resource consumption of frequent pattern mining and improves the engineering applicability of the technical solution.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 schematically shows a flow chart of a frequent pattern mining method according to an embodiment of the present invention.

Fig. 2 schematically shows a flow chart of a frequent pattern mining method according to another embodiment of the present invention.

Fig. 3 schematically shows a flow chart of a frequent pattern mining method according to yet another embodiment of the present invention.

Fig. 4 schematically shows a block diagram of a frequent pattern mining device according to an embodiment of the present invention.

Fig. 5 schematically shows a block diagram of a frequent pattern mining device according to another embodiment of the present invention.

Fig. 6 schematically shows a block diagram of a frequent pattern mining device according to yet another embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

A frequent pattern mining method, as shown in Fig. 1, includes the following steps:

110. Convert each token in the structured data set to a corresponding code, forming a one-to-one mapping between each token and its code.

In this step, the connection between any two adjacent tokens in the structured data set follows a predetermined format; for example, any two adjacent tokens are joined by a space. As another example, the data format of the structured data set is one that a given algorithm can recognize, such as the FPGrowth algorithm; that is, the data format of the structured data set meets the FPGrowth algorithm's requirements on data format. The data in the structured data set can be regarded as a series of serialized data.

In this step, each token is converted to one corresponding code, and tokens and codes are in one-to-one correspondence. Any form of code can be used to represent a token, and any method that can generate unique codes can be used to generate a unique code for each token; this embodiment does not limit the specific form of the code or the code conversion technique used to generate it. As a simple example, the code of the first token in the structured data set can be set to 0, the code of the second token to 1, and so on, so that the code of the N-th token is N-1, where N is greater than 2. This completes the conversion from tokens to codes.

In this step, the mapping between tokens and codes can be written to a file automatically and saved to disk.
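The conversion in step 110 can be sketched as follows. This is a minimal illustration, assuming sequential integer codes assigned in order of first appearance and a JSON file as the on-disk format; the function names, the sample tokens, and the file name `token_to_code.json` are hypothetical, since the embodiment does not limit the form of the code or the storage format.

```python
import json
import os
import tempfile


def build_code_map(transactions):
    """Assign each distinct token a unique integer code (0, 1, 2, ...)
    in order of first appearance."""
    token_to_code = {}
    for tokens in transactions:
        for token in tokens:
            if token not in token_to_code:
                token_to_code[token] = len(token_to_code)
    return token_to_code


def save_code_map(token_to_code, path):
    """Persist the token-to-code mapping to disk (JSON is an assumption)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(token_to_code, fh, ensure_ascii=False)


def load_code_map(path):
    """Reload the mapping written by save_code_map."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical structured data set: one list of tokens per source file.
transactions = [["network", "fault", "router"], ["router", "restart"]]
token_to_code = build_code_map(transactions)
encoded = [[token_to_code[t] for t in tokens] for tokens in transactions]

map_path = os.path.join(tempfile.gettempdir(), "token_to_code.json")
save_code_map(token_to_code, map_path)
```

Later stages then operate on `encoded` (small integers) instead of the original strings, which is where the space saving described in this embodiment would come from.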

120. Combine any N of the codes to obtain several first candidate combinations, and screen the first candidate combinations that satisfy the first predetermined condition to obtain several second candidate combinations. Here N is a positive integer greater than or equal to 2, and a first candidate combination satisfies the first predetermined condition if the length of the tokens corresponding to all of its codes is within the predetermined length range; in other words, the second candidate combinations are those first candidate combinations for which the length of the tokens corresponding to all codes is within the predetermined length range.

In this step, the predetermined length range is set in advance and can take different values depending on what the application scenario requires of the mined frequent patterns, for example 2-4, 3-5, or 6-8. A step of setting the predetermined length range may also be performed before this step.

In this step, each first candidate combination corresponds to a possible frequent item set. A frequent item set is a set of tokens that occur together, so at least two tokens must co-occur, and a first candidate combination must therefore contain two or more codes.

In this step, screening the codes of the frequent item sets against the predetermined length range makes it possible to mine frequent patterns of a meaningful length for different application scenarios, which effectively reduces the time and resource consumption of frequent pattern mining and improves the engineering applicability of the technical solution.
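A small sketch of step 120 follows. It assumes that the "length" being screened is the number of codes in a combination (the pattern length), which is one reading of the text and is consistent with example ranges such as 2-4. The function name `second_candidates` is hypothetical. Instead of generating every combination and filtering afterwards, the sketch only enumerates sizes inside the range, which yields the same result.

```python
from itertools import combinations


def second_candidates(codes, min_len, max_len):
    """Yield every combination of distinct codes whose size lies in
    [min_len, max_len]. Sizes below 2 are rejected because a frequent
    item set needs at least two co-occurring tokens."""
    if min_len < 2:
        raise ValueError("a candidate combination needs at least two codes")
    ordered = sorted(set(codes))
    for size in range(min_len, min(max_len, len(ordered)) + 1):
        yield from combinations(ordered, size)


# Four codes, predetermined length range 2-3.
candidates = list(second_candidates([0, 1, 2, 3], min_len=2, max_len=3))
```

With four codes and a range of 2-3, this produces the six pairs followed by the four triples; the single 4-code combination is never generated.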

130. Screen the second candidate combinations that satisfy the second predetermined condition to obtain several target frequent item set code combinations. A second candidate combination satisfies the second predetermined condition if its support is within the predetermined support range; in other words, the target frequent item set code combinations are those second candidate combinations whose support is within the predetermined support range.

In this step, the predetermined support range is set in advance and can take different values depending on what the application scenario requires of the mined frequent patterns, for example 0.3-0.5, 0.85-0.95, 0.8-0.9, or greater than 0.75. The maximum support is 1 and the minimum support is 0. A step of setting the predetermined support range may also be performed before this step.

In this step, the support indicates how frequently several tokens appear together in the same data source file (that is, their co-occurrence frequency). This step uses the co-occurrence frequency as the condition for screening target frequent item set code combinations that meet the support requirement. The tokens corresponding to a target frequent item set code combination therefore co-occur in the same data source file at a frequency that meets the predetermined requirement; that is, they form a frequent item set that meets the predetermined requirement.

After the target frequent item set code combinations have been screened, they can be used directly to build an FP-Tree or to carry out frequent item set mining. For example, the FPGrowth algorithm of an open-source algorithm library can be used to construct the FP-Tree from the target frequent item set code combinations and mine frequent patterns recursively. This avoids constructing the FP-Tree directly from words and effectively reduces the space consumed during frequent pattern mining.
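The screening in step 130 can be illustrated with the sketch below. It assumes the support of each second candidate combination has already been computed and is supplied as a dictionary; the function name `filter_by_support` and the sample support values are made up for illustration.

```python
def filter_by_support(supports, min_support, max_support):
    """Keep the code combinations whose support lies in
    [min_support, max_support]; both bounds must be within [0, 1]."""
    if not 0.0 <= min_support <= max_support <= 1.0:
        raise ValueError("support bounds must satisfy 0 <= min <= max <= 1")
    return [
        combo
        for combo, support in supports.items()
        if min_support <= support <= max_support
    ]


# Hypothetical supports for four second candidate combinations.
supports = {(0, 1): 0.40, (0, 2): 0.10, (1, 2): 0.90, (0, 1, 2): 0.35}
targets = filter_by_support(supports, min_support=0.3, max_support=0.5)
```

Using a closed range rather than only a minimum threshold mirrors the text, which allows ranges such as 0.3-0.5 as well as one-sided conditions such as "greater than 0.75" (expressible here as `min_support=0.75, max_support=1.0`, up to the boundary value).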

140. According to the one-to-one mapping between tokens and codes, obtain the tokens corresponding to each target frequent item set code combination, thereby obtaining the frequent item set corresponding to each target frequent item set code combination.

In this step, the codes are converted back to the corresponding tokens by the code conversion technique that corresponds to step 110, which yields the frequent item sets. As in step 110, this step does not limit the specific code conversion technique used.
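Step 140 amounts to inverting the mapping built in step 110. The following sketch assumes the mapping is held as a Python dictionary from token to integer code; the function name `decode_itemsets` and the sample data are hypothetical.

```python
def decode_itemsets(coded_itemsets, token_to_code):
    """Translate code combinations back to tokens using the inverse
    of the one-to-one token-to-code mapping."""
    code_to_token = {code: token for token, code in token_to_code.items()}
    if len(code_to_token) != len(token_to_code):
        # Two tokens sharing one code would make decoding ambiguous.
        raise ValueError("mapping is not one-to-one")
    return [
        tuple(code_to_token[code] for code in itemset)
        for itemset in coded_itemsets
    ]


token_to_code = {"network": 0, "fault": 1, "router": 2}
frequent_itemsets = decode_itemsets([(0, 1), (0, 1, 2)], token_to_code)
```

The one-to-one check matters because decoding is only well defined when no two tokens share a code.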

In this embodiment, each token is first converted to a corresponding code, and the codes are then screened to obtain target frequent item set code combinations, where each target frequent item set code combination contains the codes of the words in one frequent item set. The FP-Tree is then constructed, or frequent item set mining is carried out, on these code combinations rather than directly on the words, which effectively reduces the space consumed during frequent pattern mining. In addition, the technical solution of this embodiment screens the codes of the frequent item sets against a predetermined length range, so that frequent patterns of a meaningful length can be mined for different application scenarios. This effectively reduces the time and resource consumption of frequent pattern mining and improves the engineering applicability of the technical solution.

In one embodiment, as shown in Fig. 2, the frequent pattern mining method further includes the following steps:

210. Form a mapping between codes and identifiers of data source files according to the data source file of each token in the structured data set and the code corresponding to that token. Each data source file has a unique identifier.

In this step, the mapping between codes and identifiers can be one-to-many; that is, the token corresponding to one code may appear in multiple data source files.

220. According to the mapping between codes and identifiers, determine the intersection of the identifiers corresponding to the codes in each target frequent item set code combination, to obtain the set of identifiers corresponding to each target frequent item set code combination.

In this step, for the set of identifiers corresponding to a target frequent item set code combination, the data source file of every identifier in the set contains the tokens corresponding to all codes in that target frequent item set code combination.

230. According to each set of identifiers and the mapping between data source files and identifiers, determine the set of data source files corresponding to each set of identifiers, that is, the set of data source files corresponding to each target frequent item set code combination.
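Steps 210 to 230 can be sketched as a set intersection over a code-to-identifier index. The identifiers (`d1`, `d2`, `d3`), the file names, and the function names below are hypothetical; the sketch assumes each source file's tokens have already been encoded.

```python
from collections import defaultdict


def build_code_to_files(encoded_files):
    """Step 210: map each code to the set of identifiers of the data
    source files in which its token occurs (a one-to-many mapping)."""
    code_to_files = defaultdict(set)
    for file_id, codes in encoded_files.items():
        for code in codes:
            code_to_files[code].add(file_id)
    return dict(code_to_files)


def trace_sources(itemset, code_to_files, id_to_path):
    """Steps 220-230: intersect the identifier sets of all codes in the
    combination, then resolve the identifiers to source files."""
    ids = set.intersection(*(code_to_files.get(code, set()) for code in itemset))
    return ids, sorted(id_to_path[i] for i in ids)


encoded_files = {"d1": [0, 1, 2], "d2": [0, 1], "d3": [2]}
id_to_path = {"d1": "a.txt", "d2": "b.txt", "d3": "c.txt"}
code_to_files = build_code_to_files(encoded_files)

ids_01, files_01 = trace_sources((0, 1), code_to_files, id_to_path)
ids_012, files_012 = trace_sources((0, 1, 2), code_to_files, id_to_path)
```

Every identifier left in the intersection denotes a file that contains all tokens of the combination, which is the property described for step 220.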

This embodiment introduces a tag tracing technique into the code conversion process: the identifiers of the data source files corresponding to all codes are cached, which lets frequent patterns be traced back to their source data automatically.

In one embodiment, as shown in Fig. 3, the frequent pattern mining method further includes a step of forming the structured data set, which includes the following sub-steps:

310. Perform word segmentation on the input text to obtain a token data set containing several tokens.

In this step, any word segmentation method can be used to segment the input file. Here, a token may consist of several characters that together can be treated as one word.

320. Remove the stop words from the token data set by using a stop-word set.

In this step, an existing stop-word set is used to remove meaningless stop words.

330. Remove repeated tokens from the token data set so that only one copy of each token is retained.

In this step, any deduplication method can be used to remove the repeated tokens from the token data set.

340. Separate the tokens in the token data set to obtain a structured data set that satisfies the predetermined structure.

In this step, separating the tokens includes setting the separator between two adjacent tokens, or setting the form of connection between two adjacent tokens; for example, two adjacent tokens are joined by a space.

This step may also include filtering out symbols.
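The sub-steps 310 to 340 can be sketched as a short pipeline. This is only an illustration: a regular-expression tokenizer stands in for a real word segmentation tool (which Chinese text would require), the stop-word set is a made-up sample, and the function name `preprocess` is hypothetical.

```python
import re

# Hypothetical stop-word set; a real deployment would load an existing list.
STOP_WORDS = {"the", "a", "is", "of"}


def preprocess(text, stop_words=STOP_WORDS, separator=" "):
    """Turn raw text into one separator-joined line of unique tokens."""
    # Step 310: segment the text. \w+ also drops punctuation and symbols.
    tokens = re.findall(r"\w+", text.lower())
    # Step 320: remove stop words.
    tokens = [t for t in tokens if t not in stop_words]
    # Step 330: remove repeated tokens, keeping the first occurrence.
    tokens = list(dict.fromkeys(tokens))
    # Step 340: join adjacent tokens with the chosen separator.
    return separator.join(tokens)


line = preprocess("The router is down; the router restart failed.")
```

`dict.fromkeys` is used for deduplication because it preserves the order of first appearance, so the output line stays deterministic.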

This embodiment adds a data processing flow before frequent pattern mining, including automatic word segmentation, stop-word filtering, and automatic token separation, so that the preceding embodiment can directly receive serialized data that satisfies the predetermined structure. This improves the engineering applicability of the frequent pattern mining method itself.

In one embodiment, the frequent pattern mining method determines the support of a second candidate combination by the following steps:

410. Determine each code in the current second candidate combination.

In this step, a second candidate combination contains two or more codes.

420. According to the one-to-one mapping between tokens and codes and the mapping between codes and identifiers, screen the data source files in which the tokens corresponding to all codes in the current second candidate combination occur together, and count the screened data source files to obtain the co-occurrence file count (that is, the number of data source files in which the tokens corresponding to all codes in the second candidate combination occur together).

430. According to the mapping between codes and identifiers, calculate the source file count of each second candidate combination. The source file count is the sum of the numbers of data source files corresponding to the codes in that second candidate combination.

440. Calculate the sum of all source file counts to obtain the total source file count.

In this step, the total source file count is the sum of the source file counts of all second candidate combinations.

450. Calculate the quotient of the co-occurrence file count and the total source file count to obtain the support of the current second candidate combination.
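The computation in steps 410 to 450 can be sketched as below. The sketch follows the text literally: the denominator is the sum, over all second candidate combinations, of the per-code file counts, so a file is counted more than once when it contains several codes. This differs from the more common definition of support (co-occurrence count divided by the number of distinct transactions), and whether the literal reading is what the embodiment intends is an assumption. The function name and the sample data are hypothetical.

```python
def supports_as_described(candidates, code_to_files):
    """Compute the support of every second candidate combination
    following steps 410-450 as written."""
    # Step 430: per-candidate source file count = sum of the file
    # counts of its codes.
    source_counts = {
        combo: sum(len(code_to_files.get(code, set())) for code in combo)
        for combo in candidates
    }
    # Step 440: total source file count over all candidates.
    total = sum(source_counts.values())

    result = {}
    for combo in candidates:
        # Steps 410-420: files in which all codes of the combo co-occur.
        co_files = set.intersection(
            *(code_to_files.get(code, set()) for code in combo)
        )
        # Step 450: quotient of co-occurrence count and total count.
        result[combo] = len(co_files) / total if total else 0.0
    return result


code_to_files = {0: {"d1", "d2"}, 1: {"d1", "d2"}, 2: {"d1", "d3"}}
candidates = [(0, 1), (0, 2), (1, 2)]
supports = supports_as_described(candidates, code_to_files)
```

In this example each candidate has a source file count of 4, the total is 12, and the pair (0, 1) co-occurs in two files, giving a support of 2/12. Because the denominator grows with the number of candidates, supports computed this way depend on the candidate set as a whole, which should be kept in mind when choosing the predetermined support range.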

This embodiment determines the support of the second candidate combinations from the co-occurrence frequency of the tokens.

After receiving serialized data that satisfies the predetermined structure, the frequent pattern mining method of the above embodiments uses a code conversion technique to map the structured, separated text one-to-one onto corresponding codes; that is, it converts the serialized data into ordered, unique codes. During frequent pattern mining, the FP-Tree is then constructed, or frequent item set mining is carried out, on the codes, which avoids the huge space overhead caused by a large number of text operations. Drawing on practical engineering experience, the above embodiments also use the predetermined length range and the predetermined support range during mining to screen for frequent item sets of practical significance, and they trace the data source files. The result is an improved frequent pattern mining method that lets the pattern length and support thresholds be customized and controlled externally and that can trace the data sources of the frequent item sets, which improves the engineering adaptability of the frequent pattern mining method.

The frequent pattern mining method of the present invention is described in detail below through another specific embodiment.

This embodiment combines the frequent pattern mining method of the present invention with FPGrowth to mine frequent patterns, and specifically includes the following steps:

Step 1: data prediction；

Specifically, the segmenting word to input text is realized using open source participle tool first；Deactivating using default later Word lexical set filters the stop words in segmenting word result；The data format required later according to FPGrowth, on automatic separation State participle and remove stop words as a result, simultaneously structured data sets (i.e. by separate participle after each participle be set as meeting The structured data sets of predetermined structure).

Step 2: convert each segmented word into a corresponding code, establish the mapping between each segmented word and its code, and establish the mapping between each code and the identifiers of the data source files of the corresponding segmented word; both mappings are then written out automatically as files and saved to disk.
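A minimal sketch of Step 2, assuming integer codes assigned in first-seen order and JSON as the on-disk format; the text specifies neither, and the function names, file names, and sample data are hypothetical.

```python
import json
import os
import tempfile


def build_mappings(structured, sep=","):
    """Assign each distinct word an integer code and record its source files."""
    word_to_code = {}
    code_to_files = {}
    for file_id in sorted(structured):
        for word in structured[file_id].split(sep):
            # Codes follow first-seen order, so they are ordered and unique.
            code = word_to_code.setdefault(word, len(word_to_code))
            code_to_files.setdefault(code, set()).add(file_id)
    return word_to_code, code_to_files


def save_mappings(word_to_code, code_to_files, directory):
    """Persist both mappings as JSON files (sets are stored as sorted lists)."""
    path = os.path.join(directory, "word_to_code.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(word_to_code, f, ensure_ascii=False)
    path = os.path.join(directory, "code_to_files.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({str(c): sorted(ids) for c, ids in code_to_files.items()}, f)


structured = {
    "doc1": "mining,frequent,patterns",
    "doc2": "frequent,patterns,itemsets",
}
word_to_code, code_to_files = build_mappings(structured)
with tempfile.TemporaryDirectory() as tmp:
    save_mappings(word_to_code, code_to_files, tmp)
    saved_files = sorted(os.listdir(tmp))
```

Here `frequent` receives code 1 and maps to both `doc1` and `doc2`, while `mining` (code 0) maps only to `doc1`. Later stages work on the small integer codes rather than on the text itself, which is where the space saving claimed by the embodiment comes from.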

Step 3: mine frequent patterns under predetermined conditions.

Specifically, this embodiment modifies the FPGrowth source code so that the algorithm accepts the codes produced by the preprocessing above, together with a configured predetermined support range and predetermined length range. The predetermined length range is bounded by a minimum pattern length and a maximum pattern length, and the predetermined support range is bounded by a minimum support and a maximum support.

That is, when frequent patterns are joined while screening frequent itemsets, the algorithm checks whether the support of the frequent itemset corresponding to the current code combination, and the length of each of its segmented words, satisfy the parameters the algorithm has received (that is, whether they fall within the predetermined length range and the predetermined support range). If they do, the corresponding segmented words are joined and the current frequent pattern (the current frequent itemset) is recorded; otherwise, the algorithm moves on to judge and join the words of the next code combination. Repeating this process mines the frequent itemsets that satisfy the predetermined conditions (those with business value).
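The two-stage screening in Step 3 can be illustrated with a small stand-alone sketch. It is not the modified FPGrowth code, which the text does not show. It assumes that the length condition applies to the character length of each word, and it takes the support calculation as a caller-supplied function; the placeholder `support_of` below (the fraction of files in which all words co-occur) and all names are hypothetical.

```python
from itertools import combinations


def screen_combinations(code_to_word, support_of, n, length_range, support_range):
    """Return code combinations of size n that pass both predetermined conditions."""
    min_len, max_len = length_range
    min_sup, max_sup = support_range
    targets = []
    for combo in combinations(sorted(code_to_word), n):
        # First condition: every word in the combination has a length in range.
        if not all(min_len <= len(code_to_word[c]) <= max_len for c in combo):
            continue
        # Second condition: the combination's support lies in range.
        support = support_of(combo)
        if min_sup <= support <= max_sup:
            targets.append((combo, support))
    return targets


code_to_word = {0: "mining", 1: "frequent", 2: "patterns", 3: "itemsets"}
code_to_files = {0: {"doc1"}, 1: {"doc1", "doc2"}, 2: {"doc1", "doc2"}, 3: {"doc2"}}
TOTAL_FILES = 2


def support_of(combo):
    """Placeholder support: share of files that contain every word of combo."""
    shared = set.intersection(*(code_to_files[c] for c in combo))
    return len(shared) / TOTAL_FILES


targets = screen_combinations(
    code_to_word, support_of, n=2, length_range=(7, 8), support_range=(0.6, 1.0)
)
```

With these inputs, `mining` (6 characters) fails the length condition, and of the remaining pairs only codes `(1, 2)` reach the minimum support, so `targets` is `[((1, 2), 1.0)]`.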

Step 4: convert the codes back into the corresponding segmented words and track their data sources.

Specifically:

The mapping between segmented words and codes, and the mapping between codes and data source file identifiers, are automatically loaded from disk.

Each frequent itemset is traversed and processed as follows: the codes in the current frequent itemset are extracted; using the mapping between codes and data source file identifiers, the set of identifiers for each code is retrieved and the intersection of these sets is taken, giving the set of data source files of the current frequent itemset; using the mapping between segmented words and codes, the segmented word for each code is retrieved, giving the segmented words that make up the current frequent itemset and completing the conversion from codes back to segmented words.

Finally, the improved FPGrowth algorithm outputs the frequent itemsets expressed as segmented words, the support value of each frequent itemset, and the set of data source files corresponding to each frequent itemset.
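The decoding and source tracking of Step 4 can be sketched as below, assuming in-memory dictionaries stand in for the two mappings loaded from disk. The function name `trace_itemsets` and the sample data are hypothetical.

```python
def trace_itemsets(target_combos, word_to_code, code_to_files):
    """Decode each code combination and intersect its source file identifiers."""
    # Invert the one-to-one word/code mapping to convert codes back to words.
    code_to_word = {code: word for word, code in word_to_code.items()}
    results = []
    for combo, support in target_combos:
        # Files that contain every word of the itemset: intersection of ID sets.
        files = set.intersection(*(set(code_to_files[c]) for c in combo))
        words = [code_to_word[c] for c in combo]
        results.append({"itemset": words, "support": support, "files": sorted(files)})
    return results


word_to_code = {"mining": 0, "frequent": 1, "patterns": 2, "itemsets": 3}
code_to_files = {0: {"doc1"}, 1: {"doc1", "doc2"}, 2: {"doc1", "doc2"}, 3: {"doc2"}}
report = trace_itemsets([((1, 2), 1.0), ((1, 3), 0.5)], word_to_code, code_to_files)
```

For the combination `(1, 2)` this yields the itemset `["frequent", "patterns"]` with source files `["doc1", "doc2"]`; for `(1, 3)` the intersection leaves only `["doc2"]`.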

The frequent pattern mining method of this embodiment significantly reduces space overhead through code conversion. By modifying the FPGrowth source code, it implements a conditional frequent pattern mining strategy that mines frequent itemsets of high business value and effectively reduces the time and resource consumption of frequent pattern mining. A data source tracking scheme tracks the data source files of the frequent itemsets, and the added data preprocessing step improves the industrial applicability of the improved FPGrowth algorithm.

Corresponding to the frequent pattern mining method above, an embodiment of the invention also provides a frequent pattern mining device. As shown in Figure 4, the device includes:

A first screening module, configured to combine any N of the codes to obtain a plurality of first candidate combinations, and to screen the first candidate combinations that satisfy a first predetermined condition to obtain a plurality of second candidate combinations; where N is a positive integer greater than or equal to 2, and a first candidate combination satisfies the first predetermined condition when the lengths of the segmented words corresponding to all of its codes fall within the predetermined length range;

A second screening module, configured to screen the second candidate combinations that satisfy a second predetermined condition to obtain a plurality of target frequent itemset code combinations; where a second candidate combination satisfies the second predetermined condition when its support falls within the predetermined support range;

A frequent itemset determination module, configured to obtain, from the one-to-one mapping between segmented words and codes, the segmented words corresponding to each target frequent itemset code combination, thereby obtaining the frequent itemset corresponding to each target frequent itemset code combination.

In one embodiment, as shown in Figure 5, the frequent pattern mining device further includes:

A data source tracking module, configured to form, from the data source file and the code of each segmented word in the structured data set, a mapping between codes and data source file identifiers; where each data source file has a unique identifier;

A source file acquisition module, configured to determine, from the mapping between codes and identifiers, the intersection of the identifiers corresponding to the codes in each target frequent itemset code combination, thereby obtaining the set of identifiers corresponding to each target frequent itemset code combination;

A source file determination module, configured to determine, from each set of identifiers, the set of source files corresponding to each target frequent itemset code combination.

In one embodiment, as shown in Figure 5, the frequent pattern mining device further includes a data processing module, configured to preprocess the input file to obtain the structured data set.

The data processing module includes:

A stop word processing submodule, configured to remove stop words from the segmented word data set using a stop word list;

A deduplication submodule, configured to remove repeated segmented words from the segmented word data set, keeping only one instance of each;

A delimiting submodule, configured to delimit the segmented words in the segmented word data set to obtain a structured data set that satisfies the predetermined structure.

In one embodiment, the frequent pattern mining device further includes a support determination module, configured to determine the support of each second candidate combination.

The support determination module includes:

A code determination module, configured to determine each code in the current second candidate combination;

A co-occurrence file count determination module, configured to screen, using the one-to-one mapping between segmented words and codes and the mapping between codes and identifiers, the data source files in which the segmented words corresponding to all codes in the current second candidate combination occur together, and to count the data source files so obtained, giving the co-occurrence file count;

A source file count determination module, configured to calculate, from the mapping between codes and identifiers, the source file count of each second candidate combination; where the source file count is the sum of the numbers of data source files corresponding to the codes in that second candidate combination;

A total source file count determination module, configured to sum all source file counts to obtain the total source file count;

A support determination module, configured to calculate the quotient of the co-occurrence file count and the total source file count to obtain the support of the current second candidate combination.
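The support calculation performed by these modules can be sketched as follows. This follows a literal reading of the module descriptions, in which the total source file count is summed over all second candidate combinations; that reading is an interpretation, and the function name `compute_supports` and the sample data are hypothetical.

```python
def compute_supports(second_candidates, code_to_files):
    """Compute the support of every second candidate combination."""
    # Source file count of a combination: sum of its per-code file counts.
    source_counts = {
        combo: sum(len(code_to_files[c]) for c in combo)
        for combo in second_candidates
    }
    # Total source file count: sum over all second candidate combinations.
    total = sum(source_counts.values())
    if total == 0:
        return {}  # no source files recorded, so no support can be computed
    supports = {}
    for combo in second_candidates:
        # Co-occurrence file count: files in which all words occur together.
        shared = set.intersection(*(code_to_files[c] for c in combo))
        supports[combo] = len(shared) / total
    return supports


code_to_files = {1: {"doc1", "doc2"}, 2: {"doc1", "doc2"}, 3: {"doc2"}}
supports = compute_supports([(1, 2), (1, 3), (2, 3)], code_to_files)
```

In this example the source file counts are 4, 3, and 3, so the total is 10; combination `(1, 2)` co-occurs in two files and gets support 0.2, while the other two combinations each get 0.1.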

The device in the above embodiments of the invention is the product corresponding to the method in the above embodiments. Each step of the method is carried out by a component or module of the device, so the parts they have in common are not repeated here.

The above is merely a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. Therefore, the scope of protection of the invention shall be determined by the scope of protection of the claims.

Claims

1. A frequent pattern mining method, characterized in that the method comprises the following steps:

converting each segmented word in a structured data set into a corresponding code, forming a one-to-one mapping between the segmented words and the codes;

combining any N of the codes to obtain a plurality of first candidate combinations, and screening the first candidate combinations that satisfy a first predetermined condition to obtain a plurality of second candidate combinations; where N is a positive integer greater than or equal to 2, and a first candidate combination satisfies the first predetermined condition when the lengths of the segmented words corresponding to all of its codes fall within a predetermined length range;

screening the second candidate combinations that satisfy a second predetermined condition to obtain a plurality of target frequent itemset code combinations; where a second candidate combination satisfies the second predetermined condition when its support falls within a predetermined support range;

obtaining, from the one-to-one mapping between the segmented words and the codes, the segmented words corresponding to each target frequent itemset code combination, thereby obtaining the frequent itemset corresponding to each target frequent itemset code combination.

2. The method according to claim 1, characterized in that the method further comprises the following steps:

forming, from the data source file and the code of each segmented word in the structured data set, a mapping between the codes and identifiers of the data source files; where each data source file has a unique identifier;

determining, from the mapping between the codes and the identifiers, the intersection of the identifiers corresponding to the codes in each target frequent itemset code combination, thereby obtaining the set of identifiers corresponding to each target frequent itemset code combination;

determining, from each set of identifiers, the set of data source files corresponding to each target frequent itemset code combination.

3. The method according to claim 1 or 2, characterized in that the method further comprises a step of forming the structured data set, the step comprising the following sub-step:

delimiting the segmented words in the segmented word data set to obtain a structured data set that satisfies a predetermined structure.

4. The method according to claim 2, characterized in that, before obtaining the segmented words corresponding to each target frequent itemset code combination, the method further comprises the following step:

constructing an FP-Tree using the target frequent itemset code combinations.

5. The method according to claim 1, characterized in that the method determines the support of the second candidate combinations using the following steps:

determining each code in the current second candidate combination;

screening, using the one-to-one mapping between the segmented words and the codes and the mapping between the codes and the identifiers, the data source files in which the segmented words corresponding to all codes in the current second candidate combination occur together, and counting the data source files so obtained to obtain a co-occurrence file count;

calculating, from the mapping between codes and identifiers, the source file count of each second candidate combination; where the source file count is the sum of the numbers of data source files corresponding to the codes in that second candidate combination;

calculating the quotient of the co-occurrence file count and the total source file count to obtain the support of the current second candidate combination.

6. The method according to claim 1, characterized in that the method further comprises a step of setting the predetermined support range and the predetermined length range.

7. A frequent pattern mining device, characterized in that the device comprises:

a code conversion module, configured to convert each segmented word in a structured data set into a corresponding code, forming a one-to-one mapping between the segmented words and the codes;

a first screening module, configured to combine any N of the codes to obtain a plurality of first candidate combinations, and to screen the first candidate combinations that satisfy a first predetermined condition to obtain a plurality of second candidate combinations; where N is a positive integer greater than or equal to 2, and a first candidate combination satisfies the first predetermined condition when the lengths of the segmented words corresponding to all of its codes fall within a predetermined length range;

a second screening module, configured to screen the second candidate combinations that satisfy a second predetermined condition to obtain a plurality of target frequent itemset code combinations; where a second candidate combination satisfies the second predetermined condition when its support falls within a predetermined support range;

a frequent itemset determination module, configured to obtain, from the one-to-one mapping between the segmented words and the codes, the segmented words corresponding to each target frequent itemset code combination, thereby obtaining the frequent itemset corresponding to each target frequent itemset code combination.

8. The device according to claim 7, characterized in that the device further comprises:

a data source tracking module, configured to form, from the data source file and the code of each segmented word in the structured data set, a mapping between the codes and identifiers of the data source files; where each data source file has a unique identifier;

a source file acquisition module, configured to determine, from the mapping between the codes and the identifiers, the intersection of the identifiers corresponding to the codes in each target frequent itemset code combination, thereby obtaining the set of identifiers corresponding to each target frequent itemset code combination;

a source file determination module, configured to determine, from each set of identifiers, the set of source files corresponding to each target frequent itemset code combination.

9. The device according to claim 7 or 8, characterized in that the device further comprises a data processing module, configured to preprocess an input file to obtain the structured data set;

the data processing module comprises:

a word segmentation submodule, configured to segment the input text to obtain a segmented word data set containing a plurality of segmented words;

a delimiting submodule, configured to delimit the segmented words in the segmented word data set to obtain a structured data set that satisfies a predetermined structure.

10. The device according to claim 7, characterized in that the device further comprises a support determination module, configured to determine the support of each second candidate combination;

the support determination module comprises:

a co-occurrence file count determination module, configured to screen, using the one-to-one mapping between the segmented words and the codes and the mapping between the codes and the identifiers, the data source files in which the segmented words corresponding to all codes in the current second candidate combination occur together, and to count the data source files so obtained to obtain a co-occurrence file count;

a source file count determination module, configured to calculate, from the mapping between codes and identifiers, the source file count of each second candidate combination; where the source file count is the sum of the numbers of data source files corresponding to the codes in that second candidate combination;

a support determination module, configured to calculate the quotient of the co-occurrence file count and the total source file count to obtain the support of the current second candidate combination.