CN110110082A - Multi-source heterogeneous data fusion optimization method

Info

Publication number: CN110110082A
Application number: CN201910294678.8A
Authority: CN
Inventors: 黄红梅; 何卓华; 谢新屋
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-04-12
Filing date: 2019-04-12
Publication date: 2019-08-09

Abstract

The invention discloses a multi-source heterogeneous data fusion optimization method comprising the following steps: A) extracting and analyzing data instances, categories, and attributes, and establishing a dictionary and a short text library; B) obtaining multi-source heterogeneous data from the Internet; C) normalizing the multi-source heterogeneous data to generate a short text, where the short text consists of multiple words and normalization includes word segmentation and stop-word removal; D) taking the short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain a short text matching result; E) fusing the data according to the short text matching result and establishing a big data content model to obtain a data fusion result; F) evaluating the data fusion result to obtain an evaluation result, which is one of excellent, good, medium, and poor. The invention can establish a high-quality big data knowledge base with strong completeness, accuracy, and consistency.

Description

Multi-source heterogeneous data fusion optimization method

Technical field

The present invention relates to the field of data fusion, and in particular to a multi-source heterogeneous data fusion optimization method.

Background art

Multi-source data fusion technology integrates, by appropriate means, all the information obtained through investigation and analysis, evaluates that information in a unified way, and finally yields unified information. Its purpose is to integrate different kinds of data, draw on the characteristics of the different data sources, and extract from them unified information that is better and richer than a single data source provides. Some existing multi-source data fusion techniques fuse multi-source heterogeneous data by preprocessing the data and matching text, but they cannot yet establish a knowledge base with strong completeness, accuracy, and consistency.

Summary of the invention

The technical problem to be solved by the present invention is, in view of the above drawback of the prior art, to provide a multi-source heterogeneous data fusion optimization method that can establish a high-quality big data knowledge base with strong completeness, accuracy, and consistency.

The technical solution adopted by the present invention to solve this technical problem is to construct a multi-source heterogeneous data fusion optimization method comprising the following steps:

A) extracting and analyzing data instances, categories, and attributes, and establishing a dictionary and a short text library;

B) obtaining multi-source heterogeneous data from the Internet;

C) normalizing the multi-source heterogeneous data to generate a short text, wherein the short text consists of multiple words and the normalization includes word segmentation and stop-word removal;

D) taking the short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain a short text matching result;

E) fusing the data according to the short text matching result and establishing a big data content model to obtain a data fusion result;

F) evaluating the data fusion result to obtain an evaluation result, the evaluation result being one of excellent, good, medium, and poor.

In the multi-source heterogeneous data fusion optimization method of the present invention, step D) further comprises:

D1) calculating the character matching factor between the short text to be matched and a short text in the short text library;

D2) calculating the word matching factor between the short text to be matched and the short text in the short text library;

D3) matching the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculating the short text matching factor.

In the multi-source heterogeneous data fusion optimization method of the present invention, the character matching factor is calculated using the following formula:

Here, F₁ denotes the character matching factor, c₁ denotes the number of characters in the short text to be matched, c₂ denotes the number of characters in the short text from the short text library, p denotes the number of matching characters, and h denotes the number of transpositions.

In the multi-source heterogeneous data fusion optimization method of the present invention, the word matching factor is calculated using the following formula:

Here, F₂ denotes the word matching factor, n denotes the dimension of the higher-dimensional short text vector, σ denotes a correction factor with σ ∈ [0.9, 1.3] that corrects the error introduced by adding words, A_i is the i-th word in the short text to be matched, and B_i is the i-th word in the short text from the short text library.

In the multi-source heterogeneous data fusion optimization method of the present invention, the short text matching factor is calculated using the following formula:

Here, Y denotes the short text matching factor. A matching threshold Y₀ is set: if Y ≥ Y₀, the short text to be matched matches the short text in the short text library; if Y < Y₀, the two do not match.

In the multi-source heterogeneous data fusion optimization method of the present invention, step E) specifically comprises: forming the set L = {l₁, l₂, ..., l_m} from the acquired product names, attribute names, and attribute values, where m is the number of elements in the set; calculating the short text matching factor between l_i and l_j for i, j = 1, 2, ..., m; and generating a matching degree matrix from the short text matching factors:

Here, Z denotes the matching degree matrix and Y(l_i, l_j) denotes the short text matching factor between l_i and l_j, for i, j = 1, 2, ..., m.
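
The matrix itself does not appear in this text. Judging only from the definitions of Z and Y(l_i, l_j) given here, it is presumably the m × m array of pairwise short text matching factors; the layout below is a reconstruction under that assumption, not the formula as printed in the publication.

```latex
Z =
\begin{pmatrix}
Y(l_1, l_1) & Y(l_1, l_2) & \cdots & Y(l_1, l_m) \\
Y(l_2, l_1) & Y(l_2, l_2) & \cdots & Y(l_2, l_m) \\
\vdots      & \vdots      & \ddots & \vdots      \\
Y(l_m, l_1) & Y(l_m, l_2) & \cdots & Y(l_m, l_m)
\end{pmatrix}
```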

In the multi-source heterogeneous data fusion optimization method of the present invention, if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0; elements whose matching degree is greater than the matching threshold are fused, and of two elements that exceed the matching threshold, the one with the larger matching degree is output as the fusion result.

Implementing the multi-source heterogeneous data fusion optimization method of the present invention has the following beneficial effects. Data instances, categories, and attributes are extracted and analyzed to establish a dictionary and a short text library; multi-source heterogeneous data is obtained from the Internet; the multi-source heterogeneous data is normalized to generate a short text; the short text is taken as the short text to be matched and matched against the short texts stored in the short text library to obtain a short text matching result; the data is fused according to the short text matching result and a big data content model is established to obtain a data fusion result; and the data fusion result is evaluated to obtain an evaluation result. The present invention thereby achieves fusion of multi-source heterogeneous data and can establish a high-quality big data knowledge base with strong completeness, accuracy, and consistency.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of one embodiment of the multi-source heterogeneous data fusion optimization method of the present invention;

Fig. 2 is a detailed flowchart, in that embodiment, of taking the short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain the short text matching result.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of those embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

In this embodiment of the multi-source heterogeneous data fusion optimization method of the present invention, the flowchart of the method is shown in Fig. 1. As shown in Fig. 1, the method comprises the following steps:

Step S01: extract and analyze data instances, categories, and attributes, and establish a dictionary and a short text library. In this step, data instances, categories, and attributes are extracted and analyzed, and the dictionary and the short text library are established on a cloud server. The dictionary stores a massive number of words, the short text library stores a massive number of short texts, and each short text consists of several words.

Step S02: obtain multi-source heterogeneous data from the Internet. In this step, multi-source heterogeneous data is obtained from the Internet.

Step S03: normalize the multi-source heterogeneous data to generate a short text. In this step, the multi-source heterogeneous data is normalized to reduce the ambiguity of the heterogeneous data, and a short text is generated. The short text consists of multiple words, and normalization includes word segmentation and stop-word removal.

Step S04: take the short text as the short text to be matched and match it against the short texts stored in the short text library to obtain a short text matching result. In this step, the generated short text is taken as the short text to be matched and is matched against the short texts stored in the short text library; that is, the short text to be matched is compared with the stored short texts, yielding the short text matching result.

Step S05: fuse the data according to the short text matching result and establish a big data content model to obtain a data fusion result. In this step, the data is fused according to the above short text matching result and a big data content model is established (this may be a model from the prior art or an original model), finally yielding a high-quality data fusion result.

Step S06: evaluate the data fusion result to obtain an evaluation result. In this step, the data fusion result is evaluated to obtain an evaluation result, which falls into one of four grades: excellent, good, medium, and poor. The present invention achieves fusion of multi-source heterogeneous data and can establish a high-quality big data knowledge base with strong completeness, accuracy, and consistency.
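
The normalization in step S03 (word segmentation followed by stop-word removal) can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list `STOP_WORDS` and the function name `normalize` are hypothetical, and a regular-expression tokenizer stands in for whatever word segmenter is actually used. Chinese-language sources would need a dedicated segmenter instead.

```python
import re

# Hypothetical stop-word list; the source does not say which list is used.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "for", "with"})


def normalize(raw, stop_words=STOP_WORDS):
    """Turn a raw string into a short text, i.e. a list of words."""
    # Segmentation stand-in: lower-case the text and split on non-word characters.
    tokens = re.findall(r"\w+", raw.lower())
    # Stop-word removal.
    return [t for t in tokens if t not in stop_words]
```

For instance, `normalize("The Price of an Apple iPhone")` yields the short text `["price", "apple", "iphone"]`.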

In this embodiment, step S04 can be further refined; the refined flowchart is shown in Fig. 2. As shown in Fig. 2, step S04 further comprises the following steps:

Step S41: calculate the character matching factor between the short text to be matched and a short text in the short text library. Specifically, the character matching factor is calculated using the following formula:

Here, F₁ denotes the character matching factor, c₁ denotes the number of characters in the short text to be matched, c₂ denotes the number of characters in the short text from the short text library, p denotes the number of matching characters, and h denotes the number of transpositions. The number of transpositions equals half the number of matching characters that appear in a different order. The larger the character matching factor, the higher the degree of match between the texts. This step takes the character as the basic unit and, by determining the matching characters and the number of transpositions, calculates the character matching factor accurately, laying the foundation for the subsequent short text matching.
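
The formula for the character matching factor does not appear in this text. The quantities defined for it (the two character counts, the number of matching characters, and a transposition count equal to half the out-of-order matches) are the same ones used by the well-known Jaro similarity, so the sketch below assumes F₁ is the Jaro similarity. That identification, the function name `jaro_similarity`, and the standard Jaro matching window are assumptions, not something the source states.

```python
def jaro_similarity(a, b):
    """Jaro similarity of two strings, as an assumed stand-in for F1."""
    c1, c2 = len(a), len(b)
    if c1 == 0 and c2 == 0:
        return 1.0
    if c1 == 0 or c2 == 0:
        return 0.0
    # Standard Jaro window: characters match only if their positions are this close.
    window = max(max(c1, c2) // 2 - 1, 0)
    a_used = [False] * c1
    b_used = [False] * c2
    p = 0  # number of matching characters
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(c2, i + window + 1)):
            if not b_used[j] and b[j] == ch:
                a_used[i] = b_used[j] = True
                p += 1
                break
    if p == 0:
        return 0.0
    # Matched characters of each string, in their original order.
    a_matched = [ch for ch, used in zip(a, a_used) if used]
    b_matched = [ch for ch, used in zip(b, b_used) if used]
    # h: half the number of matched characters that appear in a different order.
    h = sum(x != y for x, y in zip(a_matched, b_matched)) / 2
    return (p / c1 + p / c2 + (p - h) / p) / 3
```

With the classic pair "MARTHA" and "MARHTA", all six characters match and two of them are out of order, so h = 1 and the result is (1 + 1 + 5/6) / 3 = 17/18.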

Step S42: calculate the word matching factor between the short text to be matched and the short text in the short text library. Specifically, for a short text to be matched A and a short text B in the short text library, each short text is treated as a vector composed of words, where A_i is the i-th word in the short text to be matched and B_i is the i-th word in the short text from the short text library. If A and B contain different numbers of words, words are first added to the short text with the lower vector dimension. The added words are selected at random from the preset dictionary until its dimension equals that of the higher-dimensional text. The word matching factor is then calculated using the following formula:

Here, F₂ denotes the word matching factor, n denotes the dimension of the higher-dimensional short text vector, σ denotes a correction factor with σ ∈ [0.9, 1.3] that corrects the error introduced by adding words, A_i is the i-th word in the short text to be matched, and B_i is the i-th word in the short text from the short text library. The larger the word matching factor, the higher the degree of match between the texts.

This step takes the word as the basic unit, converts each short text into a vector, and brings the texts to be compared to vectors of the same dimension, so that the word matching factor is calculated accurately, laying the foundation for the subsequent short text matching.
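
The padding that precedes the word matching factor can be sketched as below. Only the padding is shown, because the formula for F₂ does not appear in this text. The function name `pad_to_same_dimension` and the optional `rng` parameter are hypothetical, and appending the random words at the end of the shorter text is an assumption, since the source does not say where they are inserted.

```python
import random


def pad_to_same_dimension(a, b, dictionary, rng=None):
    """Pad the shorter word list with random dictionary words until both have n words."""
    rng = rng or random.Random()
    a, b = list(a), list(b)  # copy so the callers' lists are left untouched
    n = max(len(a), len(b))  # dimension of the higher-dimensional short text vector
    for words in (a, b):
        if len(words) < n and not dictionary:
            raise ValueError("dictionary must not be empty when padding is needed")
        while len(words) < n:
            # Assumption: added words go at the end of the shorter text.
            words.append(rng.choice(dictionary))
    return a, b
```

Passing a seeded `random.Random` makes the padding reproducible, which helps when comparing results for different values of σ.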

Step S43: match the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculate the short text matching factor.

Specifically, the short text matching factor is calculated using the following formula:

Here, Y denotes the short text matching factor. A matching threshold Y₀ is set: if Y ≥ Y₀, the short text to be matched matches the short text in the short text library; if Y < Y₀, the two do not match. This step considers both the part-of-speech similarity and the semantic similarity of the short texts, which improves matching accuracy and lays the foundation for the subsequent data fusion.

In this embodiment, when data fusion is performed, the acquired product names, attribute names, and attribute values form the set L = {l₁, l₂, ..., l_m}, where m is the number of elements in the set. The short text matching factor between l_i and l_j is calculated for i, j = 1, 2, ..., m, and a matching degree matrix is generated from the short text matching factors:

In this embodiment, if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0; elements whose matching degree is greater than the matching threshold are fused, and of two elements that exceed the matching threshold, the one with the larger matching degree is output as the fusion result. This embodiment generates the matching degree matrix from the product names, attribute names, and attribute values and fuses the data according to that matrix, obtaining accurate fusion results.
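
One possible reading of this fusion rule is sketched below. The source does not spell out how more than two mutually matching elements are handled or how "the larger matching degree" is measured for a single element, so three choices here are assumptions: matching elements are grouped transitively, a value exactly equal to the threshold counts as a match (following the Y ≥ Y₀ rule), and each group is represented by the member with the highest total matching degree to the rest of its group. The function `fuse` and its `match` parameter (any callable returning the short text matching factor of two elements) are hypothetical names.

```python
def fuse(elements, match, threshold):
    """Build the thresholded matching degree matrix and fuse matching elements."""
    m = len(elements)
    # Matching degree matrix; values below the threshold are recorded as 0.
    z = [[match(elements[i], elements[j]) for j in range(m)] for i in range(m)]
    z = [[v if v >= threshold else 0.0 for v in row] for row in z]

    # Group elements that match, directly or through a chain (union-find).
    parent = list(range(m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(m):
        for j in range(i + 1, m):
            if z[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(m):
        groups.setdefault(find(i), []).append(i)

    # Assumed rule: keep the member with the largest total matching degree in its group.
    fused = []
    for members in groups.values():
        best = max(members, key=lambda i: sum(z[i][j] for j in members if j != i))
        fused.append(elements[best])
    return z, fused
```

When two members tie, `max` keeps the one that appears first in the input, so the output is deterministic for a given input order.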

When the data fusion result is evaluated, the product names, attribute names, and attribute values form the set L = {l₁, l₂, ..., l_m}, the elements in set L are merged, and the merged result is output. The evaluation of the data fusion result is divided into four grades: excellent (90 points or more), good (70 points or more and below 90), medium (60 points or more and below 70), and poor (below 60 points).

In short, the present invention achieves fusion of multi-source heterogeneous data and can establish a knowledge base with strong completeness, accuracy, and consistency.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
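
The mapping from a score to one of the four grades can be sketched as follows. The source does not say how the score itself is computed, so the function only covers the grading. Treating each lower bound as inclusive (a score of exactly 90 is "excellent", exactly 70 is "good", exactly 60 is "medium") is an assumption about how the boundaries are meant to be read, and the name `grade` is hypothetical.

```python
def grade(score):
    """Map a fusion quality score to excellent / good / medium / poor."""
    # Assumption: lower bounds are inclusive, upper bounds exclusive.
    if score >= 90:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 60:
        return "medium"
    return "poor"
```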

Claims

1. A multi-source heterogeneous data fusion optimization method, characterized by comprising the following steps:

B) obtaining multi-source heterogeneous data from the Internet;

C) normalizing the multi-source heterogeneous data to generate a short text, wherein the short text consists of multiple words and the normalization includes word segmentation and stop-word removal;

D) taking the short text as the short text to be matched and matching it against the short texts stored in the short text library to obtain a short text matching result;

E) fusing the data according to the short text matching result and establishing a big data content model to obtain a data fusion result;

F) evaluating the data fusion result to obtain an evaluation result, the evaluation result being one of excellent, good, medium, and poor.

2. The multi-source heterogeneous data fusion optimization method according to claim 1, characterized in that step D) further comprises:

D3) matching the short text to be matched against the short text in the short text library according to the character matching factor and the word matching factor, and calculating the short text matching factor.

3. The multi-source heterogeneous data fusion optimization method according to claim 2, characterized in that the character matching factor is calculated using the following formula:

Here, F₁ denotes the character matching factor, c₁ denotes the number of characters in the short text to be matched, c₂ denotes the number of characters in the short text from the short text library, p denotes the number of matching characters, and h denotes the number of transpositions.

4. The multi-source heterogeneous data fusion optimization method according to claim 3, characterized in that the word matching factor is calculated using the following formula:

Here, F₂ denotes the word matching factor, n denotes the dimension of the higher-dimensional short text vector, σ denotes a correction factor with σ ∈ [0.9, 1.3] that corrects the error introduced by adding words, A_i is the i-th word in the short text to be matched, and B_i is the i-th word in the short text from the short text library.

5. The multi-source heterogeneous data fusion optimization method according to claim 4, characterized in that the short text matching factor is calculated using the following formula:

Here, Y denotes the short text matching factor. A matching threshold Y₀ is set: if Y ≥ Y₀, the short text to be matched matches the short text in the short text library; if Y < Y₀, the two do not match.

6. The multi-source heterogeneous data fusion optimization method according to claim 5, characterized in that step E) specifically comprises: forming the set L = {l₁, l₂, ..., l_m} from the acquired product names, attribute names, and attribute values, where m is the number of elements in the set; calculating the short text matching factor between l_i and l_j for i, j = 1, 2, ..., m; and generating a matching degree matrix from the short text matching factors:

Here, Z denotes the matching degree matrix and Y(l_i, l_j) denotes the short text matching factor between l_i and l_j, for i, j = 1, 2, ..., m.

7. The multi-source heterogeneous data fusion optimization method according to claim 6, characterized in that if the value of an element in the matching degree matrix is less than the matching threshold, it is recorded as 0; elements whose matching degree is greater than the matching threshold are fused, and of two elements that exceed the matching threshold, the one with the larger matching degree is output as the fusion result.