CN106599163A - Data mining method and device for big data

Info

Publication number: CN106599163A
Application number: CN201611123018.6A
Authority: CN
Inventors: 刘春明
Original assignee: Shanghai Cloud Letter To Mdt Infotech Ltd
Current assignee: Shanghai Cloud Letter To Mdt Infotech Ltd
Priority date: 2016-12-08
Filing date: 2016-12-08
Publication date: 2017-04-26
Anticipated expiration: 2036-12-08
Also published as: CN106599163B

Abstract

The invention provides a data mining method for big data. The method comprises the following steps: performing word segmentation on each statement in text database contents; identifying whether characters, words and word groups belong to entities after the word segmentation; then performing semantic annotation analysis on the characters, words and word groups after the word segmentation; performing syntactic analysis on the text database contents; generating a complete structured database according to a syntactic analysis result; dividing the complete structured database into different sub-databases; and selecting corresponding sub-databases, combination of the sub-databases or the complete structured database to perform mining analysis according to a specific mining target. By adoption of the method provided by the invention, the data mining efficiency can be improved. The invention further provides a data mining device for big data.

Description

A data mining method and device for big data

Technical field

The present invention relates to the technical field of computer information processing, and in particular to a data mining method and device for big data.

Background art

At present, as computer networks are applied ever more widely and the types of business in different fields grow ever richer, it is becoming increasingly important to mine objects of different categories effectively from massive data records, so that different processing schemes can be applied to objects of different categories.

However, existing technical solutions have the following problem: because the entire database must be processed during mining, the time required is long and data mining is inefficient.

Summary of the invention

The technical problem to be solved by the present invention is to provide a data mining method for big data that improves the efficiency of data mining.

To achieve the above object, according to one aspect of the invention, a data mining method for big data is provided, comprising the following steps:

Step 101: Perform word segmentation on each sentence in the text database content;

Step 102: Identify whether the characters, words and phrases obtained from the word segmentation in step 101 belong to entities;

Step 103: Perform semantic annotation analysis on the characters, words and phrases obtained from the word segmentation in step 101;

Step 104: Perform syntactic analysis on the text database content;

Step 105: Generate a complete structured database according to the syntactic analysis result;

Step 106: Divide the complete structured database into different sub-databases;

Step 107: According to a specific mining target, select the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.

Preferably, in step 103, after semantic annotation, the words that have undergone entity recognition are counted and classified, and the sentence is marked with classification labels.

Further, potential mining targets may be taken into account during classification labelling, and the number of classification labels per sentence is limited.

Preferably, in step 105, a complete structured database with a fixed sentence structure is generated, and when the complete structured database is generated, the classification labels of each sentence are saved and the classification labels are counted.

Preferably, in step 106, the complete structured database is divided into different sub-databases according to the statistics of the sentence classification labels or according to commonly used mining targets, and each sub-database is given an index based on the sentence classification labels or on the mining targets.

Further, when the sub-databases are divided, sentences with similar labels are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, wherein:

The formula for the similarity between sentences is:

or:

where the former formula is suited to preliminary estimation on large-scale data; sim() is the similarity function; d1 and d2 are sentences; α is the granularity of the classification labels; L(d1) is the number of classification labels of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification labels common to sentence d1 and sentence d2; and n1 and n2 are adjustable coefficients whose values are greater than 0.

The formula for the similarity between a sentence and a sub-database is:

or:

where the former formula is suited to preliminary estimation on large-scale data; D is a sub-database; L(d1 ∩ D) is the number of classification labels of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0.

The formula for the similarity between sub-databases is:

or:

where the former formula is suited to preliminary estimation on large-scale data; L(D1) is the number of index entries of sub-database D1; L(D1 ∩ D2) is the number of index entries common to sub-databases D1 and D2; and n5 and n6 are adjustable coefficients whose values are greater than 0.

Preferably, in step 107, depending on the mining target, a different sub-database, a combination of sub-databases, or the complete structured database is selected for mining analysis.

According to another aspect of the present invention, a data mining device for big data is provided, including:

a word segmentation module, for performing word segmentation on each sentence in the text database content;

a word entity recognition module, for identifying whether the characters, words and phrases obtained from word segmentation belong to entities;

a semantic annotation module, for performing semantic annotation analysis on the characters, words and phrases obtained from word segmentation;

a syntactic analysis module, for performing syntactic analysis on the text database content;

a database generation module, for generating a complete structured database according to the syntactic analysis result;

a database division module, for dividing the complete structured database into different sub-databases;

a data mining module, for selecting, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.

Preferably, the semantic annotation module is configured to count and classify, after semantic annotation, the words that have undergone entity recognition, and to mark the sentence with classification labels.

Preferably, the database generation module is configured to generate a complete structured database with a fixed sentence structure and, when generating the complete structured database, to save the classification labels of each sentence and count the classification labels.

Preferably, the database division module is configured to divide the complete structured database into different sub-databases according to the statistics of the sentence classification labels or according to commonly used mining targets, and to give each sub-database an index based on the sentence classification labels or on the mining targets. When the sub-databases are divided, sentences with similar labels are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, wherein:

The formula for the similarity between sentences is:

or:

where the former formula is suited to preliminary estimation on large-scale data; sim() is the similarity function; d1 and d2 are sentences; α is the granularity of the classification labels; L(d1) is the number of classification labels of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification labels common to sentence d1 and sentence d2; and n1 and n2 are adjustable coefficients whose values are greater than 0;

or:

where the former formula is suited to preliminary estimation on large-scale data; D is a sub-database; L(d1 ∩ D) is the number of classification labels of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0;

The formula for the similarity between sub-databases is:

or:

Preferably, the data mining module is configured to select, depending on the mining target, a different sub-database, a combination of sub-databases, or the complete structured database for mining analysis.

Description of the drawings

Fig. 1 is a flow chart of a data mining method for big data according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a data mining device for big data according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

Fig. 1 is a flow chart of a data mining method for big data according to an embodiment of the present invention.

In step 101, word segmentation is performed on each sentence in the text database content.

In step 102, it is identified whether the characters, words and phrases obtained from the word segmentation in step 101 belong to entities.

In step 103, semantic annotation analysis is performed on the characters, words and phrases obtained from the word segmentation in step 101.

After semantic annotation, the words that have undergone entity recognition are counted and classified. Classification is by the physical category to which the nouns (objects and the like) in a sentence belong, for example vehicles or electronic products, and each sentence in the text database is marked with its classification labels. In one embodiment of the invention, the classification labels of 4 sentences are, respectively:

Sentence 1：A, B, C, D；

Sentence 2：A, B, C, E；

Sentence 3：A, F, G, H；

Sentence 4：A, F, I, J.
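The labelling in step 103 can be sketched in Python as follows. This is only an illustration under stated assumptions: the category lexicon, the function name `label_sentence` and the `max_labels` limit are hypothetical and do not come from the source, which states only that entity words are counted, classified by physical category, and that the number of labels per sentence is limited.

```python
from collections import Counter

# Hypothetical lexicon mapping entity nouns to physical categories
# (illustrative only; the source does not define one).
CATEGORY_LEXICON = {
    "car": "vehicle",
    "train": "vehicle",
    "phone": "electronic product",
    "television": "electronic product",
}

def label_sentence(entity_words, max_labels=4):
    """Return at most max_labels category labels, most frequent first."""
    counts = Counter(
        CATEGORY_LEXICON[word] for word in entity_words if word in CATEGORY_LEXICON
    )
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:max_labels]]

labels = label_sentence(["car", "phone", "train", "road"])
```

Words that are not in the lexicon (here "road") are simply ignored, and the limit on labels keeps each sentence's label set small.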

In step 104, syntactic analysis is performed on the text database content.

In step 105, a complete structured database is generated according to the syntactic analysis result.

In one embodiment, a complete structured database with a fixed sentence structure is generated. A fixed sentence structure means that all sentences are reorganised into a fixed structure, for example arranged in the order subject, predicate, object, attribute, adverbial, complement, with any component missing from a sentence filled with empty content. When the complete structured database is generated, the classification labels of each sentence are saved and the labels are counted. In one embodiment of the invention, all 4 sentences contain classification label A, and classification labels B, C and F are each contained in 2 sentences.
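A minimal Python sketch of this step is given below. It assumes that the syntactic analysis yields a dictionary of sentence components; the slot names, the `to_fixed_record` helper and the sample parsed sentence are hypothetical. The label sets are those of the four example sentences.

```python
from collections import Counter

# Fixed component order described in the text.
SLOTS = ("subject", "predicate", "object", "attribute", "adverbial", "complement")

def to_fixed_record(parsed, labels):
    """Arrange parsed components in the fixed order; missing ones become ''."""
    record = {slot: parsed.get(slot, "") for slot in SLOTS}
    record["labels"] = list(labels)  # classification labels are saved with the sentence
    return record

# Classification labels of the four example sentences.
sentence_labels = {
    1: ["A", "B", "C", "D"],
    2: ["A", "B", "C", "E"],
    3: ["A", "F", "G", "H"],
    4: ["A", "F", "I", "J"],
}

# Count how many sentences carry each label.
label_counts = Counter(
    label for labels in sentence_labels.values() for label in labels
)

# Hypothetical parsed sentence with no attribute, adverbial or complement.
record = to_fixed_record(
    {"subject": "the user", "predicate": "bought", "object": "a phone"},
    ["A", "B"],
)
```

The counts reproduce the statistics stated in the embodiment: label A occurs in 4 sentences, and B, C and F in 2 sentences each.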

In step 106, the complete structured database is divided into different sub-databases.

In one embodiment, the complete structured database is divided into different sub-databases according to the statistics of the sentence classification labels or according to commonly used mining targets, and each sub-database is given an index based on the sentence classification labels or on the mining targets. When the sub-databases are divided, sentences with higher similarity are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, wherein:

The formula for the similarity between sentences is:

or:

The formula for the similarity between sub-databases is:

or:

In one embodiment of the invention, the classification labels have only one granularity, which is set to 1. Granularity expresses how coarse the classification is in a sentence classification label or a sub-database index: for example, the granularity of "electronic products" is coarser than that of "household appliances", and the granularity of "household appliances" is coarser than that of "televisions". A coarser granularity means that a sentence classification label covers more, and the similarity calculated by the formula is correspondingly higher. When the sentence classification labels or sub-database indexes have several granularities, the similarity must be calculated using classification labels or indexes of the same granularity.
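As an illustration of the similarity calculation, the following Python sketch implements the sentence-to-sentence formula in the form it takes in the claims and applies it to the four example sentences. It assumes a single granularity (so α does not appear as a parameter) and sets both adjustable coefficients to 1; the function and parameter names are illustrative and are not part of the source.

```python
import math

def similarity(labels_a, labels_b, n_root=1.0, n_exp=1.0):
    """Similarity of two label sets, following the form given in the claims.

    n_root and n_exp stand for the adjustable coefficients (n1, n2), both > 0.
    """
    a, b = set(labels_a), set(labels_b)
    if not a:
        return 0.0
    shared = len(a & b)   # L(d1 ∩ d2)
    size = len(a)         # L(d1); the text takes L(d1) equal to L(d2)
    ratio = shared / (2 * size - shared)
    return ratio ** (1.0 / n_root) * math.sin(
        math.pi / 2 * (shared / size) ** n_exp
    )

s1 = ["A", "B", "C", "D"]
s2 = ["A", "B", "C", "E"]
s3 = ["A", "F", "G", "H"]
s4 = ["A", "F", "I", "J"]

sim_12 = similarity(s1, s2)
sim_34 = similarity(s3, s4)
sim_13 = similarity(s1, s3)
```

With these assumed coefficients, sentences 1 and 2 score about 0.55, sentences 3 and 4 about 0.24, and sentences 1 and 3 about 0.05, so sentences 1 and 2 group together, as do sentences 3 and 4.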

Because the similarity between sentence 1 and sentence 2 is higher, and the similarity between sentence 3 and sentence 4 is higher, sentences 1 and 2 are initially placed in the same sub-database D1, and sentences 3 and 4 are placed in another sub-database D2. The index of each of the 2 sub-databases can be determined by taking the N classification labels with the highest frequency; in one embodiment of the invention, the top 3 classification labels are taken as the index. The classification labels of sub-database D1 are therefore {A, B, C}, and the classification labels of sub-database D2 are {A, F, G} (where G is selected by alphabetical order).
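The index selection described here (the N most frequent labels, with ties broken alphabetically) can be sketched in Python as follows. The function name `build_index` is illustrative; the label sets are those of the example sentences.

```python
from collections import Counter

def build_index(member_labels, top_n=3):
    """Return the top_n most frequent labels among a sub-database's sentences.

    Ties in frequency are broken alphabetically, as in the example.
    """
    counts = Counter(label for labels in member_labels for label in labels)
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    return ranked[:top_n]

# D1 holds sentences 1 and 2; D2 holds sentences 3 and 4.
d1_index = build_index([["A", "B", "C", "D"], ["A", "B", "C", "E"]])
d2_index = build_index([["A", "F", "G", "H"], ["A", "F", "I", "J"]])
```

In D2, labels G, H, I and J each occur once, so the alphabetical tie-break is what selects G as the third index entry.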

At this point, the similarity between the 2 sub-databases is:

When a new sentence 5 is added (its labels are B, C, E, F), the similarity between sentence 5 and each sub-database is calculated, and the sentence is placed in the sub-database with the higher similarity.

In this case, sim(sentence 5, D1, 1) > sim(sentence 5, D2, 1), so sentence 5 and its classification labels are placed in sub-database D1 in the prescribed structure (that is, the sentence is stored in the fixed structure used in the sub-database).
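The assignment of a new sentence can be sketched in Python as below. The sketch uses the sentence-to-sub-database formula in the form it takes in the claims, with both adjustable coefficients assumed to be 1 and a single granularity; the names `similarity` and `assign` are illustrative.

```python
import math

def similarity(labels, index, n_root=1.0, n_exp=1.0):
    """Sentence-to-sub-database similarity, following the form in the claims."""
    own = set(labels)
    if not own:
        return 0.0
    shared = len(own & set(index))  # L(d1 ∩ D)
    size = len(own)                 # L(d1)
    ratio = shared / (2 * size - shared)
    return ratio ** (1.0 / n_root) * math.sin(
        math.pi / 2 * (shared / size) ** n_exp
    )

def assign(labels, indexes):
    """Return the name of the sub-database whose index is most similar."""
    return max(indexes, key=lambda name: similarity(labels, indexes[name]))

indexes = {"D1": ["A", "B", "C"], "D2": ["A", "F", "G"]}
sentence_5 = ["B", "C", "E", "F"]

best = assign(sentence_5, indexes)
sim_d1 = similarity(sentence_5, indexes["D1"])
sim_d2 = similarity(sentence_5, indexes["D2"])
```

Sentence 5 shares two labels (B, C) with the index of D1 and one label (F) with the index of D2, so under these assumptions it scores about 0.24 against D1 and about 0.05 against D2 and is assigned to D1.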

In another embodiment of the present invention, the same 4 sentences and classification labels are used:

Sentence 1：A, B, C, D；

Sentence 2：A, B, C, E；

Sentence 3：A, F, G, H；

Sentence 4：A, F, I, J.

One classification of commonly used mining targets is D3 {A, B, C} and D4 {E, F, G}, where D3 and D4 are the sub-databases to be generated and filled, and {A, B, C} and {E, F, G} are their respective indexes, composed of commonly used mining targets. The similarity between each sentence and each sub-database is calculated:

When the filling similarity threshold (when the similarity between a sentence and a sub-database exceeds this value, the sentence and its classification labels are added to the sub-database in the prescribed structure) is 0, sub-database D3 contains sentences 1, 2, 3 and 4, 4 sentences in total, together with their classification labels, and sub-database D4 contains sentences 2, 3 and 4, 3 sentences in total, together with their classification labels. When the filling similarity threshold is 0.5, sub-database D3 contains sentences 1 and 2, 2 sentences in total, together with their classification labels, and sub-database D4 contains only sentence 3, 1 sentence in total, together with its classification labels.
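Threshold-based filling can be sketched in Python as follows, for a threshold of 0. The similarity function follows the form given in the claims with both adjustable coefficients assumed to be 1; the function names are illustrative. The outcome at a threshold of 0.5 depends on the formula variant and coefficient values used, which this text does not state, so it is not reproduced here.

```python
import math

def similarity(labels, index, n_root=1.0, n_exp=1.0):
    """Sentence-to-sub-database similarity, following the form in the claims."""
    own = set(labels)
    if not own:
        return 0.0
    shared = len(own & set(index))  # L(d1 ∩ D)
    size = len(own)                 # L(d1)
    ratio = shared / (2 * size - shared)
    return ratio ** (1.0 / n_root) * math.sin(
        math.pi / 2 * (shared / size) ** n_exp
    )

def fill(sentences, index, threshold):
    """Return the ids of sentences whose similarity to index exceeds threshold."""
    return [
        sid for sid, labels in sentences.items()
        if similarity(labels, index) > threshold
    ]

sentences = {
    1: ["A", "B", "C", "D"],
    2: ["A", "B", "C", "E"],
    3: ["A", "F", "G", "H"],
    4: ["A", "F", "I", "J"],
}

d3_members = fill(sentences, ["A", "B", "C"], threshold=0.0)
d4_members = fill(sentences, ["E", "F", "G"], threshold=0.0)
```

At a threshold of 0, a sentence is admitted whenever it shares at least one label with the index, so sentence 1, which has none of E, F or G, is the only one left out of D4.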

In step 107, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database is selected for mining analysis.

In one embodiment of the invention, when the mining target has characteristic B, mining analysis is performed using the sentence structures and classification labels in sub-database D1; when the mining target has characteristic A, mining analysis is performed using the sentence structures and classification labels in sub-database D1 and sub-database D2.
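The selection in step 107 can be sketched in Python as follows. Matching a target against the sub-database indexes, and falling back to the complete structured database when no index matches, are assumptions made for this illustration; the source states only that a sub-database, a combination of sub-databases, or the complete database is chosen according to the target.

```python
def select_subdatabases(target_labels, indexes):
    """Return the names of sub-databases whose index contains a target label."""
    targets = set(target_labels)
    chosen = [name for name, index in indexes.items() if targets & set(index)]
    # Assumed fallback: use the complete structured database if nothing matches.
    return chosen or ["complete"]

indexes = {"D1": ["A", "B", "C"], "D2": ["A", "F", "G"]}

for_b = select_subdatabases(["B"], indexes)
for_a = select_subdatabases(["A"], indexes)
for_z = select_subdatabases(["Z"], indexes)
```

A target with characteristic B selects only D1, while a target with characteristic A selects the combination of D1 and D2, matching the embodiment.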

Fig. 2 is a schematic diagram of a data mining device for big data according to an embodiment of the present invention. The device includes:

a word segmentation module 201, for performing word segmentation on each sentence in the text database content;

a word entity recognition module 202, for identifying whether the characters, words and phrases obtained from word segmentation belong to entities;

a semantic annotation module 203, for performing semantic annotation analysis on the characters, words and phrases obtained from word segmentation;

a syntactic analysis module 204, for performing syntactic analysis on the text database content;

a database generation module 205, for generating a complete structured database according to the syntactic analysis result;

a database division module 206, for dividing the complete structured database into different sub-databases;

a data mining module 207, for selecting, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.

Preferably, the semantic annotation module 203 is configured to count and classify, after semantic annotation, the words that have undergone entity recognition, and to mark the sentence with classification labels.

Preferably, the database generation module 205 is configured to generate a complete structured database with a fixed sentence structure and, when generating the complete structured database, to save the classification labels of each sentence and count the classification labels.

Preferably, the database division module 206 is configured to divide the complete structured database into different sub-databases according to the statistics of the sentence classification labels or according to commonly used mining targets, and to give each sub-database an index based on the sentence classification labels or on the mining targets. When the sub-databases are divided, sentences with similar labels are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, wherein:

The formula for the similarity between sentences is:

or:

The formula for the similarity between sub-databases is:

or:

Preferably, the data mining module 207 is configured to select, depending on the mining target, a different sub-database, a combination of sub-databases, or the complete structured database for mining analysis.

Taking the above preferred embodiments of the present invention as guidance, those of ordinary skill in the art can, on the basis of the above description, make various changes and modifications without departing from the technical idea of the invention. The technical scope of the invention is not limited to the content of the description and must be determined according to the scope of the claims.

Claims

1. A data mining method for big data, characterised in that it comprises the following steps:

Step 104: Perform syntactic analysis on the text database content;

Step 107: According to a specific mining target, select the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.

2. The method according to claim 1, characterised in that, in step 103,

after semantic annotation, the words that have undergone entity recognition are counted and classified, and the sentence is marked with classification labels.

3. The method according to claim 1, characterised in that, in step 105,

a complete structured database with a fixed sentence structure is generated, and when the complete structured database is generated, the classification labels of each sentence are saved and the classification labels are counted.

4. The method according to claim 1, characterised in that, in step 106,

the complete structured database is divided into different sub-databases according to the statistics of the sentence classification labels or according to commonly used mining targets, and each sub-database is given an index based on the sentence classification labels or on the mining targets; when the sub-databases are divided, sentences with similar labels are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, wherein:

The formula for the similarity between sentences is:

$$\mathrm{sim}(d_1, d_2, \alpha) = \sqrt[n_1]{\frac{L(d_1 \cap d_2)}{2 \times L(d_1) - L(d_1 \cap d_2)}} \, \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap d_2)}{L(d_1)}\right)^{n_2}\right]$$

where sim() is the similarity function; d1 and d2 are sentences; α is the granularity of the classification labels; L(d1) is the number of classification labels of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification labels common to sentence d1 and sentence d2; and n1 and n2 are adjustable coefficients whose values are greater than 0;

The formula for the similarity between a sentence and a sub-database is:

$$\mathrm{sim}(d_1, D, \alpha) = \sqrt[n_3]{\frac{L(d_1 \cap D)}{2 \times L(d_1) - L(d_1 \cap D)}} \, \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap D)}{L(d_1)}\right)^{n_4}\right]$$

where D is a sub-database; L(d1 ∩ D) is the number of classification labels of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0;

The formula for the similarity between sub-databases is:

$$\mathrm{sim}(D_1, D_2, \alpha) = \sqrt[n_5]{\frac{L(D_1 \cap D_2)}{2 \times L(D_1) - L(D_1 \cap D_2)}} \, \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(D_1 \cap D_2)}{L(D_1)}\right)^{n_6}\right]$$

where L(D1) is the number of index entries of sub-database D1; L(D1 ∩ D2) is the number of index entries common to sub-databases D1 and D2; and n5 and n6 are adjustable coefficients whose values are greater than 0.

5. The method according to claim 1, characterised in that, in step 107,

depending on the mining target, a different sub-database, a combination of sub-databases, or the complete structured database is selected for mining analysis.

6. A data mining device for big data, characterised in that it includes:

a data mining module, for selecting, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.

7. The device according to claim 6, characterised in that:

the semantic annotation module is configured to count and classify, after semantic annotation, the words that have undergone entity recognition, and to mark the sentence with classification labels.

8. The device according to claim 6, characterised in that:

the database generation module is configured to generate a complete structured database with a fixed sentence structure and, when generating the complete structured database, to save the classification labels of each sentence and count the classification labels.

9. The device according to claim 6, characterised in that:

the database division module is configured to divide the complete structured database into different sub-databases according to the statistics of the sentence classification labels or according to commonly used mining targets, and to give each sub-database an index based on the sentence classification labels or on the mining targets; when the sub-databases are divided, sentences with similar labels are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, wherein:

The formula for the similarity between sentences is:

$$\mathrm{sim}(d_1, d_2, \alpha) = \sqrt[n_1]{\frac{L(d_1 \cap d_2)}{2 \times L(d_1) - L(d_1 \cap d_2)}} \, \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap d_2)}{L(d_1)}\right)^{n_2}\right]$$

The formula for the similarity between a sentence and a sub-database is:

$$\mathrm{sim}(d_1, D, \alpha) = \sqrt[n_3]{\frac{L(d_1 \cap D)}{2 \times L(d_1) - L(d_1 \cap D)}} \, \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap D)}{L(d_1)}\right)^{n_4}\right]$$

The formula for the similarity between sub-databases is:

$$\mathrm{sim}(D_1, D_2, \alpha) = \sqrt[n_5]{\frac{L(D_1 \cap D_2)}{2 \times L(D_1) - L(D_1 \cap D_2)}} \, \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(D_1 \cap D_2)}{L(D_1)}\right)^{n_6}\right]$$

10. The device according to claim 6, characterised in that:

the data mining module is configured to select, depending on the mining target, a different sub-database, a combination of sub-databases, or the complete structured database for mining analysis.