CN110110082A - Multi-source heterogeneous data fusion optimization method - Google Patents

Multi-source heterogeneous data fusion optimization method

Info

Publication number
CN110110082A
Authority
CN
China
Prior art keywords
short text
matching
matched
source heterogeneous
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910294678.8A
Other languages
Chinese (zh)
Inventor
黄红梅
何卓华
谢新屋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-04-12
Filing date
2019-04-12
Publication date
2019-08-09
Application filed by Individual
Priority to CN201910294678.8A
Publication of CN110110082A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-source heterogeneous data fusion optimization method comprising the following steps: A) extracting and analyzing data instances, categories and attributes, and establishing a dictionary and a short text library; B) obtaining multi-source heterogeneous data from the Internet; C) normalizing the multi-source heterogeneous data to generate short texts, where each short text consists of multiple words and the normalization includes word segmentation and stop-word removal; D) taking a generated short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain a short text matching result; E) fusing the data according to the short text matching result, establishing a big data content model, and obtaining a data fusion result; F) evaluating the data fusion result to obtain an evaluation result, which is one of excellent, good, medium and poor. The invention can build a high-quality big data knowledge base with strong completeness, accuracy and consistency.

Description

Multi-source heterogeneous data fusion optimization method
Technical field
The present invention relates to the field of data fusion, and in particular to a multi-source heterogeneous data fusion optimization method.
Background art
Multi-source data fusion refers to techniques that use suitable means to integrate all of the information obtained through investigation and analysis, evaluate that information in a unified way, and finally produce unified information. The aim is to bring together different kinds of data, draw on the characteristics of each data source, and extract from them unified information that is better and richer than what any single source provides. Some existing multi-source data fusion techniques fuse multi-source heterogeneous data by preprocessing the data and matching text, but they cannot yet build a knowledge base with strong completeness, accuracy and consistency.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above drawbacks of the prior art, to provide a multi-source heterogeneous data fusion optimization method that can build a high-quality big data knowledge base with strong completeness, accuracy and consistency.
The technical solution adopted by the present invention to solve this technical problem is a multi-source heterogeneous data fusion optimization method comprising the following steps:
A) extracting and analyzing data instances, categories and attributes, and establishing a dictionary and a short text library;
B) obtaining multi-source heterogeneous data from the Internet;
C) normalizing the multi-source heterogeneous data to generate short texts, where each short text consists of multiple words and the normalization includes word segmentation and stop-word removal;
D) taking the short text as the short text to be matched, and matching it against the short texts stored in the short text library to obtain a short text matching result;
E) fusing the data according to the short text matching result, establishing a big data content model, and obtaining a data fusion result;
F) evaluating the data fusion result to obtain an evaluation result, which is one of excellent, good, medium and poor.
In the multi-source heterogeneous data fusion optimization method of the present invention, step D) further comprises:
D1) calculating the character matching factor between the short text to be matched and a short text in the short text library;
D2) calculating the word matching factor between the short text to be matched and the short text in the short text library;
D3) matching the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculating the short text matching factor.
In the multi-source heterogeneous data fusion optimization method of the present invention, the character matching factor is calculated using the following formula:
where F1 denotes the character matching factor, c1 the number of characters in the short text to be matched, c2 the number of characters in the short text from the short text library, p the number of matching characters, and h the number of transpositions.
In the multi-source heterogeneous data fusion optimization method of the present invention, the word matching factor is calculated using the following formula:
where F2 denotes the word matching factor, n the dimension of the higher-dimensional short text vector, and σ a correction factor, σ ∈ [0.9, 1.3], used to correct the error introduced by adding words; Ai is the i-th word in the short text to be matched, and Bi is the i-th word in the short text from the short text library.
In the multi-source heterogeneous data fusion optimization method of the present invention, the short text matching factor is calculated using the following formula:
where Y denotes the short text matching factor. A matching threshold Y0 is set: if Y ≥ Y0, the short text to be matched matches the short text in the short text library; if Y < Y0, the two do not match.
In the multi-source heterogeneous data fusion optimization method of the present invention, step E) is specifically: the acquired product names, attribute names and attribute values form a set L = {l1, l2, ..., lm}, where m is the number of elements in the set; the short text matching factor between li and lj is calculated for i, j = 1, 2, ..., m, and a matching degree matrix is generated from the short text matching factors:
where Z denotes the matching degree matrix and Y(li, lj) denotes the short text matching factor between li and lj, i, j = 1, 2, ..., m.
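The matrix formula itself is not reproduced in this text. From the definitions of Z, Y(li, lj) and m given here, it presumably has the following form; this is a reconstruction from those definitions, not the original figure:

```latex
Z =
\begin{bmatrix}
Y(l_1, l_1) & Y(l_1, l_2) & \cdots & Y(l_1, l_m) \\
Y(l_2, l_1) & Y(l_2, l_2) & \cdots & Y(l_2, l_m) \\
\vdots      & \vdots      & \ddots & \vdots      \\
Y(l_m, l_1) & Y(l_m, l_2) & \cdots & Y(l_m, l_m)
\end{bmatrix}
```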
In the multi-source heterogeneous data fusion optimization method of the present invention, if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0, and elements whose matching degree is greater than the matching threshold are merged; for two elements that exceed the matching threshold, the one with the larger matching degree is output as the fusion result.
Implementing the multi-source heterogeneous data fusion optimization method of the present invention has the following beneficial effects: data instances, categories and attributes are extracted and analyzed to establish a dictionary and a short text library; multi-source heterogeneous data are obtained from the Internet; the data are normalized to generate short texts; each short text is taken as the short text to be matched and matched against the short texts stored in the short text library to obtain a short text matching result; the data are fused according to the matching result, a big data content model is established and a data fusion result is obtained; and the data fusion result is evaluated to obtain an evaluation result. The invention thus achieves the fusion of multi-source heterogeneous data and can build a high-quality big data knowledge base with strong completeness, accuracy and consistency.
Brief description of the drawings
In order to explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of one embodiment of the multi-source heterogeneous data fusion optimization method of the present invention;
Fig. 2 is a detailed flowchart, in that embodiment, of taking the short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain a short text matching result.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
In this embodiment of the multi-source heterogeneous data fusion optimization method, the flowchart of the method is shown in Fig. 1. As shown in Fig. 1, the method comprises the following steps:
Step S01, extract and analyze data instances, categories and attributes, and establish a dictionary and a short text library: in this step, data instances, categories and attributes are extracted and analyzed, and a dictionary and a short text library are established on a cloud server. The dictionary stores a massive number of words, the short text library stores a massive number of short texts, and each short text consists of several words.
Step S02, obtain multi-source heterogeneous data from the Internet: in this step, multi-source heterogeneous data are obtained from the Internet.
Step S03, normalize the multi-source heterogeneous data to generate short texts: in this step, the multi-source heterogeneous data are normalized to reduce the ambiguity of the heterogeneous data, and short texts are generated. Each short text consists of multiple words, and the normalization includes word segmentation and stop-word removal.
Step S04, take the short text as the short text to be matched and match it against the short texts stored in the short text library to obtain a short text matching result: in this step, the generated short text is taken as the short text to be matched and compared with the short texts stored in the short text library, yielding the short text matching result.
Step S05, fuse the data according to the short text matching result, establish a big data content model, and obtain a data fusion result: in this step, the data are fused according to the above short text matching result and a big data content model is established (it may be a model from the prior art or an original model), finally yielding a high-quality data fusion result.
Step S06, evaluate the data fusion result to obtain an evaluation result: in this step, the data fusion result is evaluated to obtain an evaluation result, which falls into one of four grades: excellent, good, medium and poor. The invention achieves the fusion of multi-source heterogeneous data and can build a high-quality big data knowledge base with strong completeness, accuracy and consistency.
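The normalization in step S03 (word segmentation followed by stop-word removal) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `normalize`, the stop-word list and the regular-expression tokenizer are all assumptions, and Chinese input would need a dedicated word segmenter in place of the regular expression.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "for", "with"}


def normalize(raw: str, stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Turn one raw record into a short text, represented as a list of words."""
    # Segmentation stand-in: lowercase, then keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    # Stop-word removal.
    return [t for t in tokens if t not in stop_words]


print(normalize("The Battery Capacity of the Phone is 4000 mAh"))
# ['battery', 'capacity', 'phone', '4000', 'mah']
```

Representing a short text as a list of words keeps the output directly usable by the character-level and word-level comparisons of step S04.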
In this embodiment, step S04 can be further refined; the refined flowchart is shown in Fig. 2. As shown in Fig. 2, step S04 further comprises the following steps:
Step S41, calculate the character matching factor between the short text to be matched and a short text in the short text library: in this step, the character matching factor between the short text to be matched and the short text in the short text library is calculated. Specifically, the character matching factor is calculated using the following formula:
where F1 denotes the character matching factor, c1 the number of characters in the short text to be matched, c2 the number of characters in the short text from the short text library, p the number of matching characters, and h the number of transpositions. The number of transpositions equals half the number of matching characters that appear in a different order. The larger the character matching factor, the higher the degree of match between the two texts. This step takes the character as the basic unit and, by determining the matching characters and the number of transpositions, computes the character matching factor accurately, laying the foundation for the subsequent short text matching.
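The formula for F1 is not reproduced in this text. The quantities it defines (c1, c2, p, and h as half the number of out-of-order matches) coincide with those of the standard Jaro similarity, so the sketch below assumes F1 = (p/c1 + p/c2 + (p - h)/p) / 3. The sliding window that decides which characters count as matched comes from the usual Jaro definition and is likewise an assumption here; the function name is illustrative.

```python
def character_matching_factor(a: str, b: str) -> float:
    """Jaro similarity, assumed here as the form of F1 (not confirmed by the source)."""
    c1, c2 = len(a), len(b)
    if c1 == 0 or c2 == 0:
        return 0.0
    # Two equal characters count as matched only if their positions
    # differ by at most this window (standard Jaro convention).
    window = max(max(c1, c2) // 2 - 1, 0)
    a_used = [False] * c1
    b_used = [False] * c2
    p = 0  # number of matching characters
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(c2, i + window + 1)
        for j in range(lo, hi):
            if not b_used[j] and b[j] == ch:
                a_used[i] = b_used[j] = True
                p += 1
                break
    if p == 0:
        return 0.0
    # h = half the number of matched characters that appear in a different order.
    a_matched = [a[i] for i in range(c1) if a_used[i]]
    b_matched = [b[j] for j in range(c2) if b_used[j]]
    h = sum(x != y for x, y in zip(a_matched, b_matched)) / 2
    return (p / c1 + p / c2 + (p - h) / p) / 3


print(round(character_matching_factor("MARTHA", "MARHTA"), 4))  # 0.9444
```

In the example, all six characters match and the swapped "TH"/"HT" pair gives one transposition, so the factor is (1 + 1 + 5/6) / 3.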
Step S42, calculate the word matching factor between the short text to be matched and the short text in the short text library: in this step, the word matching factor between the short text to be matched and the short text in the short text library is calculated. Specifically, the short text to be matched, A, and the short text in the short text library, B, are each regarded as a vector composed of words; Ai is the i-th word in the short text to be matched and Bi is the i-th word in the short text from the library. If A and B contain different numbers of words, words are first added to the short text with the lower vector dimension; the added words are selected at random from the preset dictionary so that its dimension equals that of the higher-dimensional text. The word matching factor is then calculated using the following formula:
where F2 denotes the word matching factor, n the dimension of the higher-dimensional short text vector, and σ a correction factor, σ ∈ [0.9, 1.3], used to correct the error introduced by adding words; Ai is the i-th word in the short text to be matched, and Bi is the i-th word in the short text from the short text library. The larger the word matching factor, the higher the degree of match between the two texts.
This step takes the word as the basic unit, converts the short texts into vectors of the same dimension, and computes the word matching factor accurately, laying the foundation for the subsequent short text matching.
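The padding operation described in step S42 can be sketched as follows. The formula for F2 is not reproduced in this text, so the scoring rule used here, σ times the fraction of positions where Ai equals Bi, is purely an assumption chosen to exercise the padding and the correction factor; the function names and the exact-equality comparison are illustrative as well.

```python
import random


def pad_with_dictionary(words, target_len, dictionary, rng):
    """Append randomly chosen dictionary words until the list has target_len items."""
    padded = list(words)
    while len(padded) < target_len:
        padded.append(rng.choice(dictionary))
    return padded


def word_matching_factor(a, b, dictionary, sigma=1.0, rng=None):
    """Assumed form: F2 = sigma * (number of positions i with A_i == B_i) / n."""
    if not 0.9 <= sigma <= 1.3:
        raise ValueError("sigma is expected to lie in [0.9, 1.3]")
    rng = rng or random.Random()
    n = max(len(a), len(b))  # dimension of the higher-dimensional vector
    if n == 0:
        return 0.0
    # Bring the shorter text up to n words, as step S42 describes.
    a = pad_with_dictionary(a, n, dictionary, rng)
    b = pad_with_dictionary(b, n, dictionary, rng)
    agree = sum(x == y for x, y in zip(a, b))
    return sigma * agree / n


print(word_matching_factor(["red", "phone", "case"], ["red", "phone"], ["zzz"]))
```

Because padding words are drawn at random, repeated calls can differ whenever the dictionary contains words that also occur in the longer text; passing a seeded `random.Random` makes a run reproducible.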
Step S43, match the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculate the short text matching factor: in this step, the short text to be matched is matched against the short text in the short text library according to the character matching factor and the word matching factor, and the short text matching factor is calculated.
Specifically, the short text matching factor is calculated using the following formula:
where Y denotes the short text matching factor. A matching threshold Y0 is set: if Y ≥ Y0, the short text to be matched matches the short text in the short text library; if Y < Y0, the two do not match. This step considers both the part-of-speech similarity and the semantic similarity of the short texts, which improves matching accuracy and lays the foundation for the subsequent data fusion.
In this embodiment, when data fusion is performed, the acquired product names, attribute names and attribute values form a set L = {l1, l2, ..., lm}, where m is the number of elements in the set; the short text matching factor between li and lj is calculated for i, j = 1, 2, ..., m, and a matching degree matrix is generated from the short text matching factors:
where Z denotes the matching degree matrix and Y(li, lj) denotes the short text matching factor between li and lj, i, j = 1, 2, ..., m.
In this embodiment, if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0, and elements whose matching degree is greater than the matching threshold are merged; for two elements that exceed the matching threshold, the one with the larger matching degree is output as the fusion result. This embodiment generates the matching degree matrix from product names, attribute names and attribute values and fuses the data according to that matrix, obtaining accurate fusion results.
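The matrix construction and thresholding described above can be sketched as follows. The short text matching factor is passed in as a function `y`, since its formula is not reproduced in this text. The source does not say how groups of more than two matching elements are resolved, so the grouping by connected components and the choice of representative (the member with the largest total matching degree within its group) are assumptions, as are all function names.

```python
def matching_degree_matrix(items, y, y0):
    """Z[i][j] = Y(l_i, l_j), with entries below the threshold y0 recorded as 0."""
    z = [[y(a, b) for b in items] for a in items]
    return [[v if v >= y0 else 0.0 for v in row] for row in z]


def fuse(items, z):
    """Merge elements linked by a non-zero entry; keep one representative per group."""
    m = len(items)
    parent = list(range(m))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(m):
        for j in range(i + 1, m):
            if z[i][j] > 0 or z[j][i] > 0:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(m):
        groups.setdefault(find(i), []).append(i)

    # Assumption: the representative is the member with the largest total
    # matching degree to the other members of its group.
    result = []
    for members in groups.values():
        best = max(members, key=lambda i: sum(z[i][j] for j in members if j != i))
        result.append(items[best])
    return result


# Stand-in matching factor for the example only: case-insensitive equality.
y = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
items = ["Screen size", "screen size", "Battery"]
print(fuse(items, matching_degree_matrix(items, y, y0=0.8)))
# ['Screen size', 'Battery']
```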
When the data fusion result is evaluated, specifically: the product names, attribute names and attribute values form the set L = {l1, l2, ..., lm}, the elements in set L are merged, and the merged result is output. The evaluation of the data fusion result is divided into four grades: excellent (90 points or above), good (70 points or above and below 90), medium (60 points or above and below 70) and poor (below 60 points).
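The four grades can be expressed as a simple score-to-grade mapping. The source does not describe how the score itself is computed, so the sketch below covers only the mapping; treating each lower bound as inclusive is an assumption, and the function name is illustrative.

```python
def grade(score: float) -> str:
    """Map a fusion-quality score to one of the four grades."""
    if score >= 90:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 60:
        return "medium"
    return "poor"


for s in (95, 82, 65, 40):
    print(s, grade(s))
```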
In short, the invention achieves the fusion of multi-source heterogeneous data and can build a knowledge base with strong completeness, accuracy and consistency.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A multi-source heterogeneous data fusion optimization method, characterized by comprising the following steps:
A) extracting and analyzing data instances, categories and attributes, and establishing a dictionary and a short text library;
B) obtaining multi-source heterogeneous data from the Internet;
C) normalizing the multi-source heterogeneous data to generate short texts, where each short text consists of multiple words and the normalization includes word segmentation and stop-word removal;
D) taking the short text as the short text to be matched, and matching it against the short texts stored in the short text library to obtain a short text matching result;
E) fusing the data according to the short text matching result, establishing a big data content model, and obtaining a data fusion result;
F) evaluating the data fusion result to obtain an evaluation result, which is one of excellent, good, medium and poor.
2. The multi-source heterogeneous data fusion optimization method according to claim 1, characterized in that step D) further comprises:
D1) calculating the character matching factor between the short text to be matched and a short text in the short text library;
D2) calculating the word matching factor between the short text to be matched and the short text in the short text library;
D3) matching the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculating the short text matching factor.
3. The multi-source heterogeneous data fusion optimization method according to claim 2, characterized in that the character matching factor is calculated using the following formula:
where F1 denotes the character matching factor, c1 the number of characters in the short text to be matched, c2 the number of characters in the short text from the short text library, p the number of matching characters, and h the number of transpositions.
4. The multi-source heterogeneous data fusion optimization method according to claim 3, characterized in that the word matching factor is calculated using the following formula:
where F2 denotes the word matching factor, n the dimension of the higher-dimensional short text vector, and σ a correction factor, σ ∈ [0.9, 1.3], used to correct the error introduced by adding words; Ai is the i-th word in the short text to be matched, and Bi is the i-th word in the short text from the short text library.
5. The multi-source heterogeneous data fusion optimization method according to claim 4, characterized in that the short text matching factor is calculated using the following formula:
where Y denotes the short text matching factor; a matching threshold Y0 is set, and if Y ≥ Y0, the short text to be matched matches the short text in the short text library, while if Y < Y0, the two do not match.
6. The multi-source heterogeneous data fusion optimization method according to claim 5, characterized in that step E) is specifically: the acquired product names, attribute names and attribute values form a set L = {l1, l2, ..., lm}, where m is the number of elements in the set; the short text matching factor between li and lj is calculated for i, j = 1, 2, ..., m, and a matching degree matrix is generated from the short text matching factors:
where Z denotes the matching degree matrix and Y(li, lj) denotes the short text matching factor between li and lj, i, j = 1, 2, ..., m.
7. The multi-source heterogeneous data fusion optimization method according to claim 6, characterized in that if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0, and elements whose matching degree is greater than the matching threshold are merged; for two elements that exceed the matching threshold, the one with the larger matching degree is output as the fusion result.
CN201910294678.8A 2019-04-12 2019-04-12 Multi-source heterogeneous data fusion optimization method Pending CN110110082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910294678.8A CN110110082A (en) 2019-04-12 2019-04-12 Multi-source heterogeneous data fusion optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910294678.8A CN110110082A (en) 2019-04-12 2019-04-12 Multi-source heterogeneous data fusion optimization method

Publications (1)

Publication Number Publication Date
CN110110082A true CN110110082A (en) 2019-08-09

Family

ID=67483866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910294678.8A Pending CN110110082A (en) 2019-04-12 2019-04-12 Multi-source heterogeneous data fusion optimization method

Country Status (1)

Country Link
CN (1) CN110110082A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146644A (en) * 2018-09-05 2019-01-04 广州小楠科技有限公司 A kind of e-commerce system
CN109189886A (en) * 2018-09-05 2019-01-11 广州小楠科技有限公司 A kind of intelligent video recommender system
CN109255049A (en) * 2018-09-05 2019-01-22 广州小楠科技有限公司 A kind of wisdom music recommender system
CN109308311A (en) * 2018-09-05 2019-02-05 广州小楠科技有限公司 A kind of multi-source heterogeneous data fusion system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767325A (en) * 2020-09-03 2020-10-13 国网浙江省电力有限公司营销服务中心 Multi-source data deep fusion method based on deep learning
CN113065000A (en) * 2021-03-29 2021-07-02 泰瑞数创科技(北京)有限公司 Multisource heterogeneous data fusion method based on geographic entity
CN113065000B (en) * 2021-03-29 2021-10-22 泰瑞数创科技(北京)有限公司 Multisource heterogeneous data fusion method based on geographic entity

Similar Documents

Publication Publication Date Title
CN106156365B (en) A kind of generation method and device of knowledge mapping
CN106649487B (en) Image retrieval method based on interest target
CN111026842A (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN109783657A (en) Multistep based on limited text space is from attention cross-media retrieval method and system
CN105631468B (en) A kind of picture based on RNN describes automatic generation method
CN110008335A (en) The method and device of natural language processing
GB2574087A (en) Compositing aware digital image search
CN111626362B (en) Image processing method, device, computer equipment and storage medium
CN109145152A (en) A kind of self-adapting intelligent generation image-text video breviary drawing method based on query word
CN107967258B (en) Method and system for emotion analysis of text information
WO2015021937A1 (en) Method and device for user recommendation
CN108959531A (en) Information search method, device, equipment and storage medium
CN109711465A (en) Image method for generating captions based on MLL and ASCA-FR
CN111143617A (en) Automatic generation method and system for picture or video text description
CN110287314B (en) Long text reliability assessment method and system based on unsupervised clustering
CN110245228A (en) The method and apparatus for determining text categories
CN103425686B (en) A kind of information issuing method and device
CN107609055B (en) Text image multi-modal retrieval method based on deep layer topic model
CN109992788A (en) Depth text matching technique and device based on unregistered word processing
Hani et al. Image caption generation using a deep architecture
CN106202206A (en) A kind of source code searching functions method based on software cluster
Prata et al. Social data analysis of Brazilian's mood from Twitter
CN110110082A (en) Multi-source heterogeneous data fusion optimization method
CN108470026A (en) The sentence trunk method for extracting content and device of headline
CN112749330A (en) Information pushing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190809