CN110110082A - Multi-source heterogeneous data fusion optimization method - Google Patents
Multi-source heterogeneous data fusion optimization method
- Publication number
- CN110110082A (application CN201910294678.8A)
- Authority
- CN
- China
- Prior art keywords
- short text
- matching
- matched
- source heterogeneous
- library
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-source heterogeneous data fusion optimization method comprising the following steps: A) extracting and analyzing data instances, categories and attributes, and establishing a dictionary and a short text library; B) obtaining multi-source heterogeneous data from the internet; C) normalizing the multi-source heterogeneous data to generate short texts, where each short text consists of multiple words and normalization includes word segmentation and stop-word removal; D) taking a generated short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain a short text matching result; E) fusing the data according to the short text matching result and establishing a big data content model to obtain a data fusion result; F) evaluating the data fusion result to obtain an evaluation result, which is one of excellent, good, medium and poor. The invention can establish a high-quality big data knowledge base with strong completeness, accuracy and consistency.
Description
Technical field
The present invention relates to the field of data fusion, and in particular to a multi-source heterogeneous data fusion optimization method.
Background technique
Multi-source data fusion refers to the technology of integrating all the information obtained through investigation and analysis by appropriate means, evaluating that information in a unified way, and finally obtaining unified information. Its purpose is to integrate many different kinds of data, draw on the characteristics of the different data sources, and extract from them unified information that is better and richer than what any single data source provides. Some existing multi-source data fusion techniques fuse multi-source heterogeneous data by preprocessing the data and performing text matching, but they cannot yet establish a knowledge base with strong completeness, accuracy and consistency.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above drawbacks of the prior art, to provide a multi-source heterogeneous data fusion optimization method capable of establishing a high-quality big data knowledge base with strong completeness, accuracy and consistency.
The technical solution adopted by the present invention to solve this technical problem is a multi-source heterogeneous data fusion optimization method comprising the following steps:
A) extracting and analyzing data instances, categories and attributes, and establishing a dictionary and a short text library;
B) obtaining multi-source heterogeneous data from the internet;
C) normalizing the multi-source heterogeneous data to generate short texts, where each short text consists of multiple words and the normalization includes word segmentation and stop-word removal;
D) taking the short text as the short text to be matched, and matching it against the short texts stored in the short text library to obtain a short text matching result;
E) fusing the data according to the short text matching result and establishing a big data content model to obtain a data fusion result;
F) evaluating the data fusion result to obtain an evaluation result, which is one of excellent, good, medium and poor.
In the multi-source heterogeneous data fusion optimization method of the present invention, step D) further comprises:
D1) calculating the character matching factor between the short text to be matched and a short text in the short text library;
D2) calculating the word matching factor between the short text to be matched and the short text in the short text library;
D3) matching the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculating the short text matching factor.
In the multi-source heterogeneous data fusion optimization method of the present invention, the character matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where F_1 denotes the character matching factor, c_1 denotes the number of characters in the short text to be matched, c_2 denotes the number of characters in the short text from the short text library, p denotes the number of matched characters, and h denotes the number of transpositions.
In the multi-source heterogeneous data fusion optimization method of the present invention, the word matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where F_2 denotes the word matching factor, n denotes the dimension of the higher-dimensional short text vector, σ denotes a correction factor, σ ∈ [0.9, 1.3], used to correct the error introduced by adding words, A_i is the i-th word in the short text to be matched, and B_i is the i-th word in the short text from the short text library.
In the multi-source heterogeneous data fusion optimization method of the present invention, the short text matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where Y denotes the short text matching factor. A matching threshold Y_0 is set: if Y ≥ Y_0, the short text to be matched matches the short text in the short text library; if Y < Y_0, the two short texts do not match.
In the multi-source heterogeneous data fusion optimization method of the present invention, step E) is specifically: the acquired product names, attribute names and attribute values form a set L = {l_1, l_2, ..., l_m}, where m denotes the number of elements in the set; the short text matching factor between l_i and l_j is calculated for i, j = 1, 2, ..., m; and a matching degree matrix is generated from the short text matching factors:
$$Z = \begin{bmatrix} Y(l_1, l_1) & \cdots & Y(l_1, l_m) \\ \vdots & \ddots & \vdots \\ Y(l_m, l_1) & \cdots & Y(l_m, l_m) \end{bmatrix}$$
where Z denotes the matching degree matrix and Y(l_i, l_j) denotes the short text matching factor between l_i and l_j, i, j = 1, 2, ..., m.
In the multi-source heterogeneous data fusion optimization method of the present invention, if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0. Elements whose matching degree is greater than the matching threshold are fused; for two elements above the matching threshold, the element with the larger matching degree is output as the fusion result.
Implementing the multi-source heterogeneous data fusion optimization method of the present invention has the following beneficial effects. Data instances, categories and attributes are extracted and analyzed to establish a dictionary and a short text library; multi-source heterogeneous data are obtained from the internet; the multi-source heterogeneous data are normalized to generate short texts; each short text is taken as the short text to be matched and matched against the short texts stored in the short text library to obtain a short text matching result; the data are fused according to the short text matching result and a big data content model is established to obtain a data fusion result; and the data fusion result is evaluated to obtain an evaluation result. The invention thereby achieves the fusion of multi-source heterogeneous data and can establish a high-quality big data knowledge base with strong completeness, accuracy and consistency.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of one embodiment of the multi-source heterogeneous data fusion optimization method of the present invention;
Fig. 2 is a detailed flow chart, in that embodiment, of taking the short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain the short text matching result.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In this embodiment of the multi-source heterogeneous data fusion optimization method of the present invention, the flow chart of the method is shown in Fig. 1. As shown in Fig. 1, the method includes the following steps:
Step S01: extract and analyze data instances, categories and attributes, and establish a dictionary and a short text library. In this step, data instances, categories and attributes are extracted and analyzed, and a dictionary and a short text library are established on a cloud server. The dictionary stores a massive number of words, and the short text library stores a massive number of short texts, each consisting of several words.
Step S02: obtain multi-source heterogeneous data from the internet.
Step S03: normalize the multi-source heterogeneous data to generate short texts. In this step, the multi-source heterogeneous data are normalized to reduce the ambiguity of the heterogeneous data, and short texts are generated. Each short text consists of multiple words, and normalization includes word segmentation and stop-word removal.
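The normalization in step S03 can be illustrated with a minimal sketch. It assumes whitespace- and punctuation-delimited text and a tiny stop-word list; the names `STOP_WORDS` and `normalize` are hypothetical, and the patent names neither a segmenter nor a stop-word list. Chinese source text would need a dedicated word segmenter in place of the regular expression used here.

```python
import re

# Hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "for", "with"}


def normalize(raw: str, stop_words=STOP_WORDS) -> list[str]:
    """Turn one raw record into a short text, i.e. a list of words."""
    # Segment: lowercase the input, then extract runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    # Remove stop words while keeping the original word order.
    return [t for t in tokens if t not in stop_words]


short_text = normalize("The Case for an iPhone 11, with Screen Protector")
```

Here `short_text` is the word list that later steps would treat as the short text to be matched.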
Step S04: take the short text as the short text to be matched, and match it against the short texts stored in the short text library to obtain a short text matching result. In this step, the generated short text is taken as the short text to be matched and compared with the short texts stored in the short text library, which yields the short text matching result.
Step S05: fuse the data according to the short text matching result and establish a big data content model to obtain a data fusion result. In this step, the data are fused according to the above short text matching result and a big data content model is established (it may be a model from the prior art or an original model), finally yielding a high-quality data fusion result.
Step S06: evaluate the data fusion result to obtain an evaluation result. The evaluation result is one of four grades: excellent, good, medium and poor. The present invention achieves the fusion of multi-source heterogeneous data and can establish a high-quality big data knowledge base with strong completeness, accuracy and consistency.
In this embodiment, step S04 can be further refined; the refined flow chart is shown in Fig. 2. As shown in Fig. 2, step S04 further includes the following steps:
Step S41: calculate the character matching factor between the short text to be matched and a short text in the short text library. Specifically, the character matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where F_1 denotes the character matching factor, c_1 denotes the number of characters in the short text to be matched, c_2 denotes the number of characters in the short text from the short text library, p denotes the number of matched characters, and h denotes the number of transpositions. The number of transpositions equals half the number of matched characters that appear in a different order. The larger the character matching factor, the higher the degree of matching between the two texts. Taking the character as the basic unit, this step determines the matched characters and the number of transpositions in order to calculate the character matching factor, laying the foundation for the subsequent short text matching.
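The quantities defined for step S41 (matched characters p, transpositions h equal to half of the out-of-order matches, and the two lengths c_1 and c_2) coincide with those of the Jaro string similarity. The sketch below therefore assumes that F_1 takes the Jaro form (p/c_1 + p/c_2 + (p - h)/p) / 3. That formula, the matching window borrowed from Jaro, and the function name `char_match_factor` are all assumptions, because the formula itself is not present in this text.

```python
def char_match_factor(a: str, b: str) -> float:
    """Jaro-style similarity; ASSUMED form (p/c1 + p/c2 + (p - h)/p) / 3."""
    c1, c2 = len(a), len(b)
    if c1 == 0 or c2 == 0:
        return 0.0
    # Assumption taken from Jaro: two equal characters only count as matched
    # if their positions differ by at most this window.
    window = max(max(c1, c2) // 2 - 1, 0)
    a_used = [False] * c1
    b_used = [False] * c2
    p = 0  # number of matched characters
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(c2, i + window + 1)
        for j in range(lo, hi):
            if not b_used[j] and b[j] == ch:
                a_used[i] = b_used[j] = True
                p += 1
                break
    if p == 0:
        return 0.0
    # h is half the number of matched characters that appear in a different order.
    a_matched = [a[i] for i in range(c1) if a_used[i]]
    b_matched = [b[j] for j in range(c2) if b_used[j]]
    h = sum(x != y for x, y in zip(a_matched, b_matched)) / 2
    return (p / c1 + p / c2 + (p - h) / p) / 3
```

Under these assumptions, `char_match_factor("MARTHA", "MARHTA")` has p = 6 and h = 1 and evaluates to 17/18, about 0.944, while identical strings score 1.0.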
Step S42: calculate the word matching factor between the short text to be matched and a short text in the short text library. Specifically, for the short text to be matched A and a short text B in the short text library, each short text is regarded as a vector of words, where A_i is the i-th word in the short text to be matched and B_i is the i-th word in the short text from the short text library. If A and B contain different numbers of words, words are first added to the short text with the lower vector dimension; the added words are selected at random from the pre-established dictionary until its dimension equals that of the higher-dimensional text. The word matching factor is then calculated using the following formula:
[formula not reproduced in the source text]
where F_2 denotes the word matching factor, n denotes the dimension of the higher-dimensional short text vector, σ denotes a correction factor, σ ∈ [0.9, 1.3], used to correct the error introduced by adding words, A_i is the i-th word in the short text to be matched, and B_i is the i-th word in the short text from the short text library. The larger the word matching factor, the higher the degree of matching between the two texts.
Taking the word as the basic unit, this step converts the short texts into vectors of the same dimension in order to calculate the word matching factor, laying the foundation for the subsequent short text matching.
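The padding operation described in step S42 can be sketched as follows. The function name `pad_to_same_length` and the optional `rng` parameter are hypothetical, introduced only so the example is reproducible. The sketch stops at padding, because the formula that turns the padded vectors into F_2 is not present in this text.

```python
import random


def pad_to_same_length(a, b, dictionary, rng=None):
    """Pad the shorter word list with words drawn at random from the dictionary.

    Returns the two equal-length lists and their common dimension n.
    """
    rng = rng or random.Random()
    a, b = list(a), list(b)  # copy so the caller's lists are not modified
    n = max(len(a), len(b))  # dimension of the higher-dimensional short text
    for words in (a, b):
        while len(words) < n:
            words.append(rng.choice(dictionary))
    return a, b, n


padded_a, padded_b, n = pad_to_same_length(
    ["red", "phone"],
    ["red", "phone", "case", "64gb"],
    dictionary=["cover", "cable", "charger"],
    rng=random.Random(0),
)
```

Because the added words are random, they can lower or raise the score by chance, which is presumably why the source introduces the correction factor σ.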
Step S43: match the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculate the short text matching factor.
Specifically, the short text matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where Y denotes the short text matching factor. A matching threshold Y_0 is set: if Y ≥ Y_0, the short text to be matched matches the short text in the short text library; if Y < Y_0, the two short texts do not match. This step considers both the lexical similarity and the semantic similarity of the short texts, which improves matching accuracy and lays the foundation for the subsequent data fusion.
In this embodiment, when data fusion is performed, the acquired product names, attribute names and attribute values form a set L = {l_1, l_2, ..., l_m}, where m denotes the number of elements in the set. The short text matching factor between l_i and l_j is calculated for i, j = 1, 2, ..., m, and a matching degree matrix is generated from the short text matching factors:
$$Z = \begin{bmatrix} Y(l_1, l_1) & \cdots & Y(l_1, l_m) \\ \vdots & \ddots & \vdots \\ Y(l_m, l_1) & \cdots & Y(l_m, l_m) \end{bmatrix}$$
where Z denotes the matching degree matrix and Y(l_i, l_j) denotes the short text matching factor between l_i and l_j, i, j = 1, 2, ..., m.
In this embodiment, if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0. Elements whose matching degree is greater than the matching threshold are fused; for two elements above the matching threshold, the element with the larger matching degree is output as the fusion result. This embodiment generates the matching degree matrix from the product names, attribute names and attribute values, and fuses the data according to the matrix, obtaining accurate fusion results.
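The matrix construction and thresholding described above can be sketched as follows. The scorer is passed in as a function because the formula for Y is not present in this text; `word_jaccard` is only a stand-in, not the patent's matching factor. The names `matching_matrix` and `fusion_candidates` are hypothetical, and returning the above-threshold pairs ordered by matching degree is one possible reading of the fusion rule.

```python
def matching_matrix(items, score, threshold):
    """Build Z with Z[i][j] = Y(l_i, l_j); entries below the threshold become 0."""
    m = len(items)
    z = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            y = score(items[i], items[j])
            z[i][j] = y if y >= threshold else 0.0
    return z


def fusion_candidates(z, threshold):
    """Pairs (i, j, Z[i][j]) with i < j above the threshold, best match first."""
    m = len(z)
    pairs = [(i, j, z[i][j])
             for i in range(m) for j in range(i + 1, m)
             if z[i][j] > threshold]
    return sorted(pairs, key=lambda t: t[2], reverse=True)


def word_jaccard(a, b):
    """Stand-in scorer (NOT the patent's Y): Jaccard overlap of the word sets."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


items = ["iphone 11 case", "case iphone 11", "usb cable"]
Z = matching_matrix(items, word_jaccard, threshold=0.8)
candidates = fusion_candidates(Z, threshold=0.8)
```

With this stand-in scorer the first two items contain the same words, so they form the only fusion candidate, and every entry involving "usb cable" other than its diagonal is zeroed.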
When the data fusion result is evaluated, specifically: the product names, attribute names and attribute values form the set L = {l_1, l_2, ..., l_m}, the elements in set L are merged, and the merged result is output. The evaluation of the data fusion result is divided into four grades: excellent (90 points or more), good (from 70 to 90 points), medium (from 60 to 70 points) and poor (below 60 points).
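The four grades can be expressed as a small lookup. The source leaves the handling of the boundary scores 60, 70 and 90 ambiguous, so this sketch assumes that each boundary belongs to the higher grade; the function name `grade` is hypothetical, and how the score itself is computed is not stated in the source.

```python
def grade(score: float) -> str:
    """Map a fusion-quality score to one of the four grades.

    Assumption: a score exactly on a boundary (60, 70, 90) gets the higher grade.
    """
    if score >= 90:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 60:
        return "medium"
    return "poor"


grades = [grade(s) for s in (95, 82, 65, 40)]
```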
In short, the present invention achieves the fusion of multi-source heterogeneous data and can establish a knowledge base with strong completeness, accuracy and consistency.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (7)
1. A multi-source heterogeneous data fusion optimization method, characterized by comprising the following steps:
A) extracting and analyzing data instances, categories and attributes, and establishing a dictionary and a short text library;
B) obtaining multi-source heterogeneous data from the internet;
C) normalizing the multi-source heterogeneous data to generate short texts, where each short text consists of multiple words and the normalization includes word segmentation and stop-word removal;
D) taking the short text as the short text to be matched, and matching it against the short texts stored in the short text library to obtain a short text matching result;
E) fusing the data according to the short text matching result and establishing a big data content model to obtain a data fusion result;
F) evaluating the data fusion result to obtain an evaluation result, which is one of excellent, good, medium and poor.
2. The multi-source heterogeneous data fusion optimization method according to claim 1, characterized in that step D) further comprises:
D1) calculating the character matching factor between the short text to be matched and a short text in the short text library;
D2) calculating the word matching factor between the short text to be matched and the short text in the short text library;
D3) matching the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculating the short text matching factor.
3. The multi-source heterogeneous data fusion optimization method according to claim 2, characterized in that the character matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where F_1 denotes the character matching factor, c_1 denotes the number of characters in the short text to be matched, c_2 denotes the number of characters in the short text from the short text library, p denotes the number of matched characters, and h denotes the number of transpositions.
4. The multi-source heterogeneous data fusion optimization method according to claim 3, characterized in that the word matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where F_2 denotes the word matching factor, n denotes the dimension of the higher-dimensional short text vector, σ denotes a correction factor, σ ∈ [0.9, 1.3], used to correct the error introduced by adding words, A_i is the i-th word in the short text to be matched, and B_i is the i-th word in the short text from the short text library.
5. The multi-source heterogeneous data fusion optimization method according to claim 4, characterized in that the short text matching factor is calculated using the following formula:
[formula not reproduced in the source text]
where Y denotes the short text matching factor; a matching threshold Y_0 is set; if Y ≥ Y_0, the short text to be matched matches the short text in the short text library; if Y < Y_0, the two short texts do not match.
6. The multi-source heterogeneous data fusion optimization method according to claim 5, characterized in that step E) is specifically: the acquired product names, attribute names and attribute values form a set L = {l_1, l_2, ..., l_m}, where m denotes the number of elements in the set; the short text matching factor between l_i and l_j is calculated for i, j = 1, 2, ..., m; and a matching degree matrix is generated from the short text matching factors:
$$Z = \begin{bmatrix} Y(l_1, l_1) & \cdots & Y(l_1, l_m) \\ \vdots & \ddots & \vdots \\ Y(l_m, l_1) & \cdots & Y(l_m, l_m) \end{bmatrix}$$
where Z denotes the matching degree matrix and Y(l_i, l_j) denotes the short text matching factor between l_i and l_j, i, j = 1, 2, ..., m.
7. The multi-source heterogeneous data fusion optimization method according to claim 6, characterized in that if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0; elements whose matching degree is greater than the matching threshold are fused; and for two elements above the matching threshold, the element with the larger matching degree is output as the fusion result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910294678.8A CN110110082A (en) | 2019-04-12 | 2019-04-12 | Multi-source heterogeneous data fusion optimization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110110082A true CN110110082A (en) | 2019-08-09 |
Family
ID=67483866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910294678.8A Pending CN110110082A (en) | 2019-04-12 | 2019-04-12 | Multi-source heterogeneous data fusion optimization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110110082A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109146644A (en) * | 2018-09-05 | 2019-01-04 | 广州小楠科技有限公司 | A kind of e-commerce system |
CN109189886A (en) * | 2018-09-05 | 2019-01-11 | 广州小楠科技有限公司 | A kind of intelligent video recommender system |
CN109255049A (en) * | 2018-09-05 | 2019-01-22 | 广州小楠科技有限公司 | A kind of wisdom music recommender system |
CN109308311A (en) * | 2018-09-05 | 2019-02-05 | 广州小楠科技有限公司 | A kind of multi-source heterogeneous data fusion system |
-
2019
- 2019-04-12 CN CN201910294678.8A patent/CN110110082A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111767325A (en) * | 2020-09-03 | 2020-10-13 | 国网浙江省电力有限公司营销服务中心 | Multi-source data deep fusion method based on deep learning |
CN113065000A (en) * | 2021-03-29 | 2021-07-02 | 泰瑞数创科技(北京)有限公司 | Multisource heterogeneous data fusion method based on geographic entity |
CN113065000B (en) * | 2021-03-29 | 2021-10-22 | 泰瑞数创科技(北京)有限公司 | Multisource heterogeneous data fusion method based on geographic entity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190809 |