CN106599163A - Data mining method and device for big data - Google Patents

Publication number
CN106599163A
Authority
CN (China)
Legal status
Granted
Application number
CN201611123018.6A
Other languages
Chinese (zh)
Other versions
CN106599163B (en)
Inventor
刘春明
Current Assignee
Shanghai Cloud Letter To Mdt Infotech Ltd
Original Assignee
Shanghai Cloud Letter To Mdt Infotech Ltd
Priority date
2016-12-08
Filing date
2016-12-08
Publication date
2017-04-26
Application filed by Shanghai Cloud Letter To Mdt Infotech Ltd
Priority to CN201611123018.6A
Publication of CN106599163A
Application granted
Publication of CN106599163B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a data mining method for big data. The method comprises the following steps: performing word segmentation on each statement in text database contents; identifying whether characters, words and word groups belong to entities after the word segmentation; then performing semantic annotation analysis on the characters, words and word groups after the word segmentation; performing syntactic analysis on the text database contents; generating a complete structured database according to a syntactic analysis result; dividing the complete structured database into different sub-databases; and selecting corresponding sub-databases, combination of the sub-databases or the complete structured database to perform mining analysis according to a specific mining target. By adoption of the method provided by the invention, the data mining efficiency can be improved. The invention further provides a data mining device for big data.

Description

A data mining method and device for big data
Technical field
The present invention relates to the technical field of computer information processing, and in particular to a data mining method and device for big data.
Background art
At present, as computer network applications become increasingly widespread and the types of services in different fields become increasingly rich, it is becoming more and more important to effectively mine different categories of objects from massive data records, so that different processing schemes can be applied to different categories of objects.
However, existing technical solutions have the following problem: because the entire database must be processed during mining, the time required is long and data mining is inefficient.
Summary of the invention
The technical problem to be solved by the present invention is to provide a data mining method for big data that improves the efficiency of data mining.
To achieve the above object, according to one aspect of the present invention, a data mining method for big data is provided, comprising the following steps:
Step 101: perform word segmentation on each sentence in the text database content;
Step 102: identify whether the characters, words and phrases obtained by the word segmentation in step 101 belong to entities;
Step 103: perform semantic annotation analysis on the characters, words and phrases obtained by the word segmentation in step 101;
Step 104: perform syntactic analysis on the text database content;
Step 105: generate a complete structured database according to the syntactic analysis result;
Step 106: divide the complete structured database into different sub-databases;
Step 107: according to a specific mining target, select the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.
Preferably, in step 103, after semantic annotation, the words that have undergone entity recognition are counted and classified, and the sentence is marked with the resulting classification tags.
Further, potential mining targets may be taken into account during classification tagging, and the number of classification tags per sentence is limited.
Preferably, in step 105, a complete structured database with a fixed sentence structure is generated; when the complete structured database is generated, the classification tags of each sentence are saved and the classification tags are counted.
Preferably, in step 106, the complete structured database is divided into different sub-databases according to the statistics of the sentence classification tags or according to commonly used mining targets, and each sub-database is given an index that is based on the sentence classification tags or on the mining targets.
Further, when dividing into sub-databases, sentences with similar tags are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, where:
The similarity between sentences is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; sim() is the similarity calculation function; d1 and d2 are sentences; α is the granularity of the classification tags; L(d1) is the number of classification tags of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification tags that sentence d1 and sentence d2 have in common; and n1 and n2 are adjustable coefficients whose values are greater than 0.
The similarity between a sentence and a sub-database is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; D is a sub-database; L(d1 ∩ D) is the number of classification tags of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0.
The similarity between sub-databases is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; L(D1) is the number of index entries of sub-database D1; L(D1 ∩ D2) is the number of index entries that sub-databases D1 and D2 have in common; and n5 and n6 are adjustable coefficients whose values are greater than 0.
Preferably, in step 107, different sub-databases, combinations of sub-databases, or the complete structured database are selected for mining analysis according to the mining target.
According to another aspect of the present invention, a data mining device for big data is provided, comprising:
a word segmentation module, configured to perform word segmentation on each sentence in the text database content;
a word entity recognition module, configured to identify whether the characters, words and phrases obtained by word segmentation belong to entities;
a semantic annotation module, configured to perform semantic annotation analysis on the characters, words and phrases obtained by word segmentation;
a syntactic analysis module, configured to perform syntactic analysis on the text database content;
a database generation module, configured to generate a complete structured database according to the syntactic analysis result;
a database division module, configured to divide the complete structured database into different sub-databases;
a data mining module, configured to select, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.
Preferably, the semantic annotation module is configured to count and classify the words that have undergone entity recognition after semantic annotation, and to mark the sentence with the resulting classification tags.
Preferably, the database generation module is configured to generate a complete structured database with a fixed sentence structure and, when generating the complete structured database, to save the classification tags of each sentence and count the classification tags.
Preferably, the database division module is configured to divide the complete structured database into different sub-databases according to the statistics of the sentence classification tags or according to commonly used mining targets, and to give each sub-database an index that is based on the sentence classification tags or on the mining targets. When dividing into sub-databases, sentences with similar tags are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, where:
The similarity between sentences is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; sim() is the similarity calculation function; d1 and d2 are sentences; α is the granularity of the classification tags; L(d1) is the number of classification tags of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification tags that sentence d1 and sentence d2 have in common; and n1 and n2 are adjustable coefficients whose values are greater than 0;
The similarity between a sentence and a sub-database is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; D is a sub-database; L(d1 ∩ D) is the number of classification tags of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0;
The similarity between sub-databases is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; L(D1) is the number of index entries of sub-database D1; L(D1 ∩ D2) is the number of index entries that sub-databases D1 and D2 have in common; and n5 and n6 are adjustable coefficients whose values are greater than 0.
Preferably, the data mining module is configured to select different sub-databases, combinations of sub-databases, or the complete structured database for mining analysis according to the mining target.
Brief description of the drawings
Fig. 1 is a flow chart of a data mining method for big data according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a data mining device for big data according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
Fig. 1 is a flow chart of a data mining method for big data according to an embodiment of the present invention.
In step 101, word segmentation is performed on each sentence in the text database content.
In step 102, it is identified whether the characters, words and phrases obtained by the word segmentation in step 101 belong to entities.
In step 103, semantic annotation analysis is performed on the characters, words and phrases obtained by the word segmentation in step 101.
After semantic annotation, the words that have undergone entity recognition are counted and classified. Classification follows the physical category to which the nouns (objects, etc.) in the sentence belong, for example vehicles or electronic products, and the sentences of the text database are marked with the classification tags. In one embodiment of the invention, the classification tags of 4 sentences are:
Sentence 1:A, B, C, D;
Sentence 2:A, B, C, E;
Sentence 3:A, F, G, H;
Sentence 4:A, F, I, J.
In step 104, syntactic analysis is performed on the text database content.
In step 105, a complete structured database is generated according to the syntactic analysis result.
In one embodiment, a complete structured database with a fixed sentence structure is generated. A fixed sentence structure means that every sentence is reorganized into the same fixed structure, for example arranged in the order subject, predicate, object, attributive, adverbial, complement, with any component missing from the sentence filled with empty content. When the complete structured database is generated, the classification tags of each sentence are saved and the classification tags are counted. In one embodiment of the invention, all 4 sentences contain classification tag A, and classification tags B, C and F are each contained in 2 sentences.
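The fixed sentence structure and the tag statistics described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `StructuredSentence`, the function `tag_statistics`, and the placeholder component values are all hypothetical, and only the slot order and the four tag sets come from the text.

```python
from collections import Counter
from dataclasses import dataclass, field

# Slot order taken from the text; components missing from a sentence are stored as "".
SLOTS = ("subject", "predicate", "object", "attributive", "adverbial", "complement")

@dataclass
class StructuredSentence:
    sentence_id: int
    components: dict   # parsed components, possibly incomplete (placeholder values here)
    tags: tuple        # classification tags saved with the sentence
    row: tuple = field(init=False)

    def __post_init__(self):
        # Every row gets the same shape, with empty content for absent components.
        self.row = tuple(self.components.get(slot, "") for slot in SLOTS)

def tag_statistics(sentences):
    """Count how many sentences carry each classification tag."""
    counts = Counter()
    for sentence in sentences:
        counts.update(set(sentence.tags))
    return counts

# The four example sentences; the component strings are made-up placeholders.
database = [
    StructuredSentence(1, {"subject": "s1", "predicate": "p1"}, ("A", "B", "C", "D")),
    StructuredSentence(2, {"subject": "s2", "object": "o2"}, ("A", "B", "C", "E")),
    StructuredSentence(3, {"predicate": "p3"}, ("A", "F", "G", "H")),
    StructuredSentence(4, {"subject": "s4"}, ("A", "F", "I", "J")),
]
stats = tag_statistics(database)
print(stats["A"], stats["B"], stats["C"], stats["F"])
```

Storing every sentence with the same slot layout is what allows a sub-database to hold rows from different sentences side by side.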
In step 106, the complete structured database is divided into different sub-databases.
In one embodiment, the complete structured database is divided into different sub-databases according to the statistics of the sentence classification tags or according to commonly used mining targets, and each sub-database is given an index that is based on the sentence classification tags or on the mining targets. When dividing into sub-databases, sentences with higher similarity are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, where:
The similarity between sentences is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; sim() is the similarity calculation function; d1 and d2 are sentences; α is the granularity of the classification tags; L(d1) is the number of classification tags of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification tags that sentence d1 and sentence d2 have in common; and n1 and n2 are adjustable coefficients whose values are greater than 0;
The similarity between a sentence and a sub-database is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; D is a sub-database; L(d1 ∩ D) is the number of classification tags of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0;
The similarity between sub-databases is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; L(D1) is the number of index entries of sub-database D1; L(D1 ∩ D2) is the number of index entries that sub-databases D1 and D2 have in common; and n5 and n6 are adjustable coefficients whose values are greater than 0.
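The formulas themselves do not survive in this text of the description, so the following sketch rests on an assumption: that the two alternatives are a ratio form, L(a ∩ b) / (2 × L(a) − L(a ∩ b)), and a sine form, n · sin[π/2 · (L(a ∩ b) / L(a))^m], which are the two terms visible in the formula printed in the claims. All function names are hypothetical, and the granularity α is not modelled here (tags are assumed to be at a single granularity).

```python
import math

def sim_ratio(common, total):
    """Assumed ratio form: common / (2 * total - common)."""
    return common / (2 * total - common)

def sim_sine(common, total, n_scale=1.0, n_exp=1.0):
    """Assumed sine form: n_scale * sin(pi/2 * (common / total) ** n_exp)."""
    return n_scale * math.sin(math.pi / 2 * (common / total) ** n_exp)

def sentence_similarity(tags1, tags2, form=sim_ratio, **coeffs):
    """Similarity between two sentences with the same number of tags."""
    if len(tags1) != len(tags2):
        raise ValueError("the text requires L(d1) == L(d2)")
    common = len(set(tags1) & set(tags2))
    return form(common, len(tags1), **coeffs)

def sentence_db_similarity(tags, index, form=sim_ratio, **coeffs):
    """Similarity between a sentence and a sub-database index."""
    common = len(set(tags) & set(index))
    return form(common, len(tags), **coeffs)

def db_similarity(index1, index2, form=sim_ratio, **coeffs):
    """Similarity between two sub-database indexes."""
    common = len(set(index1) & set(index2))
    return form(common, len(index1), **coeffs)

s1, s2 = ["A", "B", "C", "D"], ["A", "B", "C", "E"]
print(sentence_similarity(s1, s2))                  # ratio form
print(sentence_similarity(s1, s2, form=sim_sine))   # sine form with both coefficients 1
```

Under this reading the ratio form needs only one division per pair, which fits the remark that the former formula suits a preliminary estimate on large-scale data.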
In one embodiment of the invention, the classification tags have only 1 granularity level, which is set to 1. Granularity expresses how coarse the classification of the sentence classification tags or the sub-database indexes is: for example, the granularity of electronic products is coarser than that of household appliances, which in turn is coarser than that of TV. The coarser the granularity, the wider the coverage of a sentence classification tag and the higher the similarity calculated by the formulas. When sentence classification tags or sub-database indexes are divided into multiple granularities, similarity must be calculated using classification tags or indexes of the same granularity.
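As an illustration of comparing tags at the same granularity, the sketch below maps every tag to its ancestor at a chosen level before counting common tags. The hierarchy, the numeric level encoding (0 is coarsest) and the function names are assumptions made for this example; the text does not specify how granularity is encoded.

```python
# Hypothetical tag hierarchy: each tag points to the next coarser tag.
PARENT = {
    "TV": "household appliance",
    "refrigerator": "household appliance",
    "household appliance": "electronic product",
}

def level(tag):
    """Depth of a tag in the hierarchy; 0 is the coarsest level."""
    depth = 0
    while tag in PARENT:
        tag = PARENT[tag]
        depth += 1
    return depth

def coarsen(tag, target_level):
    """Replace a tag by its ancestor at target_level; coarser tags are kept as they are."""
    while level(tag) > target_level:
        tag = PARENT[tag]
    return tag

def common_tags(tags1, tags2, target_level):
    """Number of tags two sentences share once both are brought to the same level."""
    a = {coarsen(tag, target_level) for tag in tags1}
    b = {coarsen(tag, target_level) for tag in tags2}
    return len(a & b)

fine = common_tags(["TV"], ["refrigerator"], 2)    # compared as TV vs refrigerator
coarse = common_tags(["TV"], ["refrigerator"], 1)  # both become household appliance
print(fine, coarse)
```

The example shows the effect stated above: at the coarser level the two sentences share a tag, so any similarity based on the count of common tags comes out higher.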
Because the similarity between sentence 1 and sentence 2 is relatively high, and the similarity between sentence 3 and sentence 4 is relatively high, sentence 1 and sentence 2 are initially placed in the same sub-database D1, and sentence 3 and sentence 4 are placed in another sub-database D2. The index of each of the 2 sub-databases can be determined by taking the N classification tags with the highest frequency; in one embodiment of the invention, the top 3 classification tags are taken as the index. The classification tags of sub-database D1 are therefore {A, B, C}, and the classification tags of sub-database D2 are {A, F, G} (where G is selected by alphabetical order).
At this point, the similarity between the 2 sub-databases is:
When a new sentence 5 is added (with tags B, C, E, F), the similarity between sentence 5 and each sub-database is calculated, and the sentence is placed in the sub-database with the higher similarity.
Here, sim(sentence 5, D1, 1) > sim(sentence 5, D2, 1), so sentence 5 and its classification tags are placed in sub-database D1 in the prescribed structure (the sentence is stored in the structure that is fixed in the sub-database).
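The index selection and the placement of sentence 5 can be sketched as below. The top-N rule with alphabetical tie-breaking follows the text; the similarity used is the ratio form L(d ∩ D) / (2 × L(d) − L(d ∩ D)), which is an assumption because the description's formulas are not reproduced here. Function names are hypothetical.

```python
from collections import Counter

def build_index(tag_lists, n=3):
    """Take the n most frequent tags; ties are broken alphabetically."""
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    ranked = sorted(counts, key=lambda tag: (-counts[tag], tag))
    return ranked[:n]

def sim_ratio(tags, index):
    """Assumed ratio form of the sentence-to-sub-database similarity."""
    common = len(set(tags) & set(index))
    return common / (2 * len(tags) - common)

d1_index = build_index([["A", "B", "C", "D"], ["A", "B", "C", "E"]])  # sentences 1 and 2
d2_index = build_index([["A", "F", "G", "H"], ["A", "F", "I", "J"]])  # sentences 3 and 4

sentence5 = ["B", "C", "E", "F"]
scores = {"D1": sim_ratio(sentence5, d1_index), "D2": sim_ratio(sentence5, d2_index)}
target = max(scores, key=scores.get)  # sub-database with the higher similarity
print(d1_index, d2_index, target)
```

Sentence 5 shares two tags with the index of D1 and one with the index of D2, so any similarity that grows with the number of common tags places it in D1.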
In another embodiment of the present invention, the classification tags of the 4 sentences are again:
Sentence 1: A, B, C, D;
Sentence 2: A, B, C, E;
Sentence 3: A, F, G, H;
Sentence 4: A, F, I, J.
One classification of commonly used mining targets is D3 {A, B, C} and D4 {E, F, G}, where D3 and D4 are the sub-databases to be filled, and {A, B, C} and {E, F, G} are their respective indexes, composed of commonly used mining targets. The similarity between each sentence and each sub-database is calculated:
When the filling similarity threshold is 0 (when the similarity between a sentence and a sub-database is greater than this value, the sentence and its classification tags are added to the sub-database in the prescribed structure), sub-database D3 contains sentence 1, sentence 2, sentence 3 and sentence 4, 4 sentences in total, with their classification tags, and sub-database D4 contains sentence 2, sentence 3 and sentence 4, 3 sentences in total, with their classification tags. When the filling similarity threshold is 0.5, sub-database D3 contains sentence 1 and sentence 2, 2 sentences in total, with their classification tags, and sub-database D4 contains sentence 3, 1 sentence in total, with its classification tags.
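The threshold-based filling can be checked with a short sketch. It assumes the sine form n · sin[π/2 · (L(d ∩ D) / L(d))^m] with both coefficients set to 1; this choice is an assumption, since the computed similarity values are not reproduced in this text. With these settings the sketch yields the memberships listed above for both thresholds. Names such as `fill` are hypothetical.

```python
import math

SENTENCES = {
    1: {"A", "B", "C", "D"},
    2: {"A", "B", "C", "E"},
    3: {"A", "F", "G", "H"},
    4: {"A", "F", "I", "J"},
}
# Indexes built from commonly used mining targets, as in the example.
INDEXES = {"D3": {"A", "B", "C"}, "D4": {"E", "F", "G"}}

def sim_sine(tags, index, n_scale=1.0, n_exp=1.0):
    """Assumed sine form of the sentence-to-sub-database similarity."""
    common = len(tags & index)
    return n_scale * math.sin(math.pi / 2 * (common / len(tags)) ** n_exp)

def fill(threshold):
    """Add a sentence to every sub-database whose similarity exceeds the threshold."""
    return {
        name: [sid for sid, tags in SENTENCES.items() if sim_sine(tags, index) > threshold]
        for name, index in INDEXES.items()
    }

print(fill(0))
print(fill(0.5))
```

Note that a sentence can land in more than one sub-database, as sentences 2, 3 and 4 do at threshold 0, so filling by mining target is not a strict partition.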
In step 107, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database is selected for mining analysis.
In one embodiment of the invention, when the mining target has feature B, the sentence structures and classification tags in sub-database D1 are used for mining analysis; when the mining target has feature A, the sentence structures and classification tags in both sub-database D1 and sub-database D2 are used for mining analysis.
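A minimal sketch of this selection step follows. It picks every sub-database whose index contains a feature of the mining target; falling back to the complete structured database when no index matches is an assumption of this sketch, as are the function and constant names.

```python
# Sub-database indexes from the first embodiment.
INDEXES = {"D1": {"A", "B", "C"}, "D2": {"A", "F", "G"}}
COMPLETE = "complete structured database"  # hypothetical label for the fallback

def select_sub_databases(target_features, indexes=INDEXES):
    """Return the sub-databases whose index shares a feature with the mining target."""
    selected = [name for name, index in indexes.items() if set(target_features) & index]
    # Assumed fallback: mine the complete database when no sub-database matches.
    return selected or [COMPLETE]

print(select_sub_databases({"B"}))
print(select_sub_databases({"A"}))
print(select_sub_databases({"Z"}))
```

Restricting the analysis to the matching sub-databases is where the efficiency gain claimed by the method comes from, since the complete database is only touched when nothing narrower applies.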
Fig. 2 is a schematic diagram of a data mining device for big data according to an embodiment of the present invention.
According to another aspect of the present invention, a data mining device for big data is provided, comprising:
a word segmentation module 201, configured to perform word segmentation on each sentence in the text database content;
a word entity recognition module 202, configured to identify whether the characters, words and phrases obtained by word segmentation belong to entities;
a semantic annotation module 203, configured to perform semantic annotation analysis on the characters, words and phrases obtained by word segmentation;
a syntactic analysis module 204, configured to perform syntactic analysis on the text database content;
a database generation module 205, configured to generate a complete structured database according to the syntactic analysis result;
a database division module 206, configured to divide the complete structured database into different sub-databases;
a data mining module 207, configured to select, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.
Preferably, the semantic annotation module 203 is configured to count and classify the words that have undergone entity recognition after semantic annotation, and to mark the sentence with the resulting classification tags.
Preferably, the database generation module 205 is configured to generate a complete structured database with a fixed sentence structure and, when generating the complete structured database, to save the classification tags of each sentence and count the classification tags.
Preferably, the database division module 206 is configured to divide the complete structured database into different sub-databases according to the statistics of the sentence classification tags or according to commonly used mining targets, and to give each sub-database an index that is based on the sentence classification tags or on the mining targets. When dividing into sub-databases, sentences with similar tags are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, where:
The similarity between sentences is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; sim() is the similarity calculation function; d1 and d2 are sentences; α is the granularity of the classification tags; L(d1) is the number of classification tags of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification tags that sentence d1 and sentence d2 have in common; and n1 and n2 are adjustable coefficients whose values are greater than 0;
The similarity between a sentence and a sub-database is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; D is a sub-database; L(d1 ∩ D) is the number of classification tags of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0;
The similarity between sub-databases is calculated by the formula:
or:
where the former formula is suitable for preliminary estimation on large-scale data; L(D1) is the number of index entries of sub-database D1; L(D1 ∩ D2) is the number of index entries that sub-databases D1 and D2 have in common; and n5 and n6 are adjustable coefficients whose values are greater than 0.
Preferably, the data mining module 207 is configured to select different sub-databases, combinations of sub-databases, or the complete structured database for mining analysis according to the mining target.
Taking the above preferred embodiments of the present invention as inspiration, and on the basis of the above description, persons of ordinary skill in the art can make various changes and modifications without departing from the technical idea of this invention. The technical scope of this invention is not limited to the content of the description; its technical scope must be determined according to the scope of the claims.

Claims (10)

1. A data mining method for big data, characterized by comprising the following steps:
Step 101: perform word segmentation on each sentence in the text database content;
Step 102: identify whether the characters, words and phrases obtained by the word segmentation in step 101 belong to entities;
Step 103: perform semantic annotation analysis on the characters, words and phrases obtained by the word segmentation in step 101;
Step 104: perform syntactic analysis on the text database content;
Step 105: generate a complete structured database according to the syntactic analysis result;
Step 106: divide the complete structured database into different sub-databases;
Step 107: according to a specific mining target, select the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.
2. The method according to claim 1, characterized in that, in step 103,
after semantic annotation, the words that have undergone entity recognition are counted and classified, and the sentence is marked with the resulting classification tags.
3. The method according to claim 1, characterized in that, in step 105,
a complete structured database with a fixed sentence structure is generated, and when the complete structured database is generated, the classification tags of each sentence are saved and the classification tags are counted.
4. The method according to claim 1, characterized in that, in step 106,
the complete structured database is divided into different sub-databases according to the statistics of the sentence classification tags or according to commonly used mining targets, and each sub-database is given an index that is based on the sentence classification tags or on the mining targets; when dividing into sub-databases, sentences with similar tags are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, where:
The similarity between sentences is calculated by the formula:
$$\mathrm{sim}(d_1, d_2, \alpha) = \frac{L(d_1 \cap d_2)}{2 \times L(d_1) - L(d_1 \cap d_2)} \, n_1 \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap d_2)}{L(d_1)}\right)^{n_2}\right]$$
where sim() is the similarity calculation function; d1 and d2 are sentences; α is the granularity of the classification tags; L(d1) is the number of classification tags of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification tags that sentence d1 and sentence d2 have in common; and n1 and n2 are adjustable coefficients whose values are greater than 0;
The similarity between a sentence and a sub-database is calculated by the formula:
$$\mathrm{sim}(d_1, D, \alpha) = \frac{L(d_1 \cap D)}{2 \times L(d_1) - L(d_1 \cap D)} \, n_3 \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap D)}{L(d_1)}\right)^{n_4}\right]$$
where D is a sub-database; L(d1 ∩ D) is the number of classification tags of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0;
The similarity between sub-databases is calculated by the formula:
$$\mathrm{sim}(D_1, D_2, \alpha) = \frac{L(D_1 \cap D_2)}{2 \times L(D_1) - L(D_1 \cap D_2)} \, n_5 \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(D_1 \cap D_2)}{L(D_1)}\right)^{n_6}\right]$$
where L(D1) is the number of index entries of sub-database D1; L(D1 ∩ D2) is the number of index entries that sub-databases D1 and D2 have in common; and n5 and n6 are adjustable coefficients whose values are greater than 0.
5. The method according to claim 1, characterized in that, in step 107,
different sub-databases, combinations of sub-databases, or the complete structured database are selected for mining analysis according to the mining target.
6. A data mining device for big data, characterized by comprising:
a word segmentation module, configured to perform word segmentation on each sentence in the text database content;
a word entity recognition module, configured to identify whether the characters, words and phrases obtained by word segmentation belong to entities;
a semantic annotation module, configured to perform semantic annotation analysis on the characters, words and phrases obtained by word segmentation;
a syntactic analysis module, configured to perform syntactic analysis on the text database content;
a database generation module, configured to generate a complete structured database according to the syntactic analysis result;
a database division module, configured to divide the complete structured database into different sub-databases;
a data mining module, configured to select, according to a specific mining target, the corresponding sub-database, a combination of sub-databases, or the complete structured database for mining analysis.
7. The device according to claim 6, characterized in that:
the semantic annotation module is configured to count and classify the words that have undergone entity recognition after semantic annotation, and to mark the sentence with the resulting classification tags.
8. The device according to claim 6, characterized in that:
the database generation module is configured to generate a complete structured database with a fixed sentence structure and, when generating the complete structured database, to save the classification tags of each sentence and count the classification tags.
9. The device according to claim 6, characterized in that:
the database division module is configured to divide the complete structured database into different sub-databases according to the statistics of the sentence classification tags or according to commonly used mining targets, and to give each sub-database an index that is based on the sentence classification tags or on the mining targets; when dividing into sub-databases, sentences with similar tags are placed in the same sub-database, and the similarity between different sub-databases is kept as small as possible, where:
The similarity between sentences is calculated by the formula:
$$\mathrm{sim}(d_1, d_2, \alpha) = \frac{L(d_1 \cap d_2)}{2 \times L(d_1) - L(d_1 \cap d_2)} \, n_1 \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap d_2)}{L(d_1)}\right)^{n_2}\right]$$
where sim() is the similarity calculation function; d1 and d2 are sentences; α is the granularity of the classification tags; L(d1) is the number of classification tags of sentence d1 in the structured database, and its value is equal to L(d2); L(d1 ∩ d2) is the number of classification tags that sentence d1 and sentence d2 have in common; and n1 and n2 are adjustable coefficients whose values are greater than 0;
The similarity between a sentence and a sub-database is calculated by the formula:
$$\mathrm{sim}(d_1, D, \alpha) = \frac{L(d_1 \cap D)}{2 \times L(d_1) - L(d_1 \cap D)} \, n_3 \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(d_1 \cap D)}{L(d_1)}\right)^{n_4}\right]$$
where D is a sub-database; L(d1 ∩ D) is the number of classification tags of sentence d1 that are contained in the index of sub-database D; and n3 and n4 are adjustable coefficients whose values are greater than 0;
The similarity between sub-databases is calculated by the formula:
$$\mathrm{sim}(D_1, D_2, \alpha) = \frac{L(D_1 \cap D_2)}{2 \times L(D_1) - L(D_1 \cap D_2)} \, n_5 \sin\left[\frac{\pi}{2} \cdot \left(\frac{L(D_1 \cap D_2)}{L(D_1)}\right)^{n_6}\right]$$
Wherein, numbers of the L (D1) for the index in subdata base D1, L (D1 ∩ D2) are indexed for subdata base D1 and D2 identical Number, n5 and n6 be scalable coefficient, its value be more than 0.
10. device according to claim 6, it is characterised in that:
Data-mining module, for according to the difference for excavating target, selecting the combination or complete of different subdata base, subdata bases Whole structured database carries out mining analysis.
CN201611123018.6A 2016-12-08 2016-12-08 Data mining method and device for big data Active CN106599163B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611123018.6A CN106599163B (en) 2016-12-08 2016-12-08 Data mining method and device for big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611123018.6A CN106599163B (en) 2016-12-08 2016-12-08 Data mining method and device for big data

Publications (2)

Publication Number Publication Date
CN106599163A (en) 2017-04-26
CN106599163B (en) 2019-11-22

Family

ID=58598579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611123018.6A Active CN106599163B (en) 2016-12-08 2016-12-08 Data mining method and device for big data

Country Status (1)

Country Link
CN (1) CN106599163B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043812A (en) * 2009-10-13 2011-05-04 Peking University Method and system for retrieving medical information
US20150324349A1 (en) * 2014-05-12 2015-11-12 Google Inc. Automated reading comprehension
CN105117388A (en) * 2015-09-21 2015-12-02 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Intelligent robot interaction system
CN105260178A (en) * 2015-09-21 2016-01-20 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Intelligent cloud service application development method and system
CN105302859A (en) * 2015-09-21 2016-02-03 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Intelligent interaction system based on Internet
CN105528410A (en) * 2015-12-05 2016-04-27 Zhejiang University Method for summarizing and classifying online comments about hospitals

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
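Zheng Meiyu: "Research on ontology-based two-level automatic classification of Chinese blogs" (基于本体的中文博客二级自动分类研究), Information Science (《情报科学》) *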

Also Published As

Publication number Publication date
CN106599163B (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN103617157B (en) Based on semantic Text similarity computing method
CN103049435B (en) Text fine granularity sentiment analysis method and device
CN103678564B (en) Internet product research system based on data mining
Sebastiani Classification of text, automatic
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
CN103279478B (en) A kind of based on distributed mutual information file characteristics extracting method
CN104166651A (en) Data searching method and device based on integration of data objects in same classes
CN104111933A (en) Method and device for acquiring business object label and building training model
CN102436480B (en) Incidence relation excavation method for text-oriented knowledge unit
CN106547875A (en) A kind of online incident detection method of the microblogging based on sentiment analysis and label
CN102262653A (en) Label recommendation method and system based on user motivation orientation
CN107239564A (en) A kind of text label based on supervision topic model recommends method
CN105488029A (en) KNN based evidence taking method for instant communication tool of intelligent mobile phone
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
CN106445914A (en) Microblog emotion classifier establishing method and device
CN105868387A (en) Method for outlier data mining based on parallel computation
Sara-Meshkizadeh et al. Webpage classification based on compound of using HTML features & URL features and features of sibling pages
CN107368610B (en) Full-text-based large text CRF and rule classification method and system
CN101470699A (en) Information extraction model training apparatus, information extraction apparatus and information extraction system and method thereof
CN106372123B (en) Tag-based related content recommendation method and system
CN112270189A (en) Question type analysis node generation method, question type analysis node generation system and storage medium
Sun Research on product attribute extraction and classification method for online review
CN106599163A (en) Data mining method and device for big data
CN113268614B (en) Label system updating method and device, electronic equipment and readable storage medium
CN104281695A (en) Combination theory based quasi natural language semantic information extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant