CN107918657A - Data source matching method and device - Google Patents


Info

Publication number
CN107918657A
Authority
CN
China
Prior art keywords
entry
data
data source
basic
meta-information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711159895.3A
Other languages
Chinese (zh)
Other versions
CN107918657B (en)
Inventor
王聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201711159895.3A priority Critical patent/CN107918657B/en
Publication of CN107918657A publication Critical patent/CN107918657A/en
Application granted granted Critical
Publication of CN107918657B publication Critical patent/CN107918657B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2457Query processing with adaptation to user needs
    • G06F16/24573Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a data source matching method and device that improve the efficiency and accuracy of video data source fusion and run automatically, without manual participation. In the method, meta-information of a first entry is obtained from a first data source and meta-information of a second entry is obtained from a second data source, where the meta-information includes multiple data attributes of the corresponding entry. An entry matching decision model is used to look up, in a basic entry repository, the basic entry that matches the meta-information of the first entry and the basic entry that matches the meta-information of the second entry; the entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data. When the first entry and the second entry match the same basic entry in the basic entry repository, the first data source and the second data source are associated with that same basic entry.

Description

A data source matching method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a data source matching method and device.
Background
To make it easy for users to choose a video data source that meets their needs, video links from multiple video data sources need to be aggregated, which gave rise to video data source fusion methods.
In the prior art, two video data source fusion methods are in common use: 1) manual addition, in which a large number of editors make judgments by hand and multiple video links are aggregated according to those judgments; 2) keyword matching, in which, for example, multiple video links with the same title are aggregated by matching on title content.
Each of these prior-art methods has its own drawbacks:
1) Manual addition requires a large editorial staff and continuous operation. Major video websites currently employ editing teams of nearly 100 people for video data source aggregation. The method is time-consuming and labor-intensive, and because editors understand video content differently, neither the efficiency nor the accuracy of video data source fusion can be guaranteed;
2) Keyword matching only works when titles are named in a highly standardized way, so its match success rate is low.
Summary of the invention
Embodiments of the present invention provide a data source matching method and device that improve the efficiency and accuracy of video data source fusion and run automatically, without manual participation.
To solve the above technical problem, embodiments of the present invention provide the following technical solutions:
In a first aspect, an embodiment of the present invention provides a data source matching method, including:
obtaining meta-information of a first entry from a first data source, and obtaining meta-information of a second entry from a second data source, the meta-information including multiple data attributes of the corresponding entry;
looking up, by means of an entry matching decision model, the basic entry in a basic entry repository that matches the meta-information of the first entry, and looking up, by means of the entry matching decision model, the basic entry in the basic entry repository that matches the meta-information of the second entry, where the entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, and the historical data source sample data includes the historical data attributes of each entry sample in multiple data sources;
when the first entry and the second entry match the same basic entry in the basic entry repository, associating the first data source and the second data source with that same basic entry in the basic entry repository.
In a second aspect, an embodiment of the present invention further provides a data source matching device, including:
a meta-information acquisition module, configured to obtain meta-information of a first entry from a first data source and meta-information of a second entry from a second data source, the meta-information including multiple items of data attribute information of the corresponding entry;
a model matching module, configured to look up, by means of an entry matching decision model, the basic entry in a basic entry repository that matches the meta-information of the first entry, and to look up, by means of the entry matching decision model, the basic entry in the basic entry repository that matches the meta-information of the second entry, where the entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, and the historical data source sample data includes the historical data attribute information of each entry sample in multiple data sources;
a data source aggregation module, configured to associate the first data source and the second data source with the same basic entry in the basic entry repository when the first entry and the second entry match that same basic entry.
In a third aspect, the application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods described in the above aspects.
As can be seen from the above technical solutions, embodiments of the present invention have the following advantages:
In embodiments of the present invention, meta-information of a first entry is first obtained from a first data source and meta-information of a second entry is obtained from a second data source. An entry matching decision model is then used to look up, in a basic entry repository, the basic entry that matches the meta-information of the first entry and the basic entry that matches the meta-information of the second entry. The entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, which includes the historical data attributes of each entry sample in multiple data sources. When the first entry and the second entry match the same basic entry in the basic entry repository, the first data source and the second data source are associated with that same basic entry. Because the entry matching decision model is used to match both the first entry and the second entry against the basic entry repository, the whole process can be completed automatically by machine learning, and each of the two entries can be matched precisely to a basic entry. When both entries match the same basic entry, the two data sources are associated with it, which achieves automatic aggregation of data sources on the basis of basic entries.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them.
Fig. 1 is a schematic flow block diagram of a data source matching method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of an application scenario of the data source matching method provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the generation process of the decision tree model provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of an application scenario of the data source aggregation approach provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of how a data source aggregation result provided by an embodiment of the present invention is displayed on a mobile terminal;
Fig. 6-a is a schematic structural diagram of a data source matching device provided by an embodiment of the present invention;
Fig. 6-b is a schematic structural diagram of another data source matching device provided by an embodiment of the present invention;
Fig. 6-c is a schematic structural diagram of another data source matching device provided by an embodiment of the present invention;
Fig. 6-d is a schematic structural diagram of a model matching module provided by an embodiment of the present invention;
Fig. 6-e is a schematic structural diagram of another model matching module provided by an embodiment of the present invention;
Fig. 6-f is a schematic structural diagram of another data source matching device provided by an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a server to which the data source matching method provided by an embodiment of the present invention is applied.
Detailed description
Embodiments of the present invention provide a data source matching method and device that improve the efficiency and accuracy of video data source fusion and run automatically, without manual participation.
To make the objects, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments fall within the scope of protection of the present invention.
The terms "including" and "having" in the description, the claims, and the drawings, and any variants of them, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units is not necessarily limited to those units, and may include other units that are not expressly listed or that are inherent to the process, method, product, or device.
Each is described in detail below.
The data source matching method provided by embodiments of the present invention is suitable for scenarios in which multiple data sources are fused, especially aggregation when these data sources provide the same item. The data source may be a video data source, a text data source, or an image data source; for example, a text data source may be a novel data source, and an image data source may be an animation data source. Referring to Fig. 1, the data source matching method provided by one embodiment of the present invention may include the following steps:
101. Obtain meta-information of a first entry from a first data source, and obtain meta-information of a second entry from a second data source, where the meta-information includes multiple data attributes of the corresponding entry.
In embodiments of the present invention, the data source matching device can obtain multiple data sources. Each data source contains electronic data, such as video, pictures, and text, and the different items of electronic data in a data source are identified by entries; for example, an entry may refer specifically to an electronic resource such as a film, a TV series, or a variety show. The matching device can obtain multiple data sources using crawler technology and then analyze the entries of each data source, for example by extracting the data attributes of multiple entries from each data source. The data attributes obtained constitute the meta-information of the entry. A data attribute of an entry is an attribute of the data content related to that entry, and may include, for example, the title of the electronic data, the content type, the data specification, and data field values. Taking a video data source as an example, the data attributes of a video entry may include the title, cast and crew, time, type, and playback link. Taking a text data source as an example, the data attributes of a text entry may include the title, novel author, time, type, novel storage address, and main characters.
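As a rough illustration of what the meta-information of a video entry might look like in code, the sketch below bundles the attributes named above into one record. The class name `EntryMeta`, its field names, and the sample values are assumptions made for this example; the patent does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EntryMeta:
    """Meta-information of one entry: a bundle of data attributes (hypothetical layout)."""
    source: str                      # data source the entry was crawled from
    title: str                       # title of the electronic data
    cast: Tuple[str, ...] = ()       # cast and crew
    year: Optional[int] = None       # time attribute, simplified to a year here
    genre: Optional[str] = None      # type
    play_url: Optional[str] = None   # playback link


# Two entries for what is plausibly the same film, crawled from two data sources.
first_entry = EntryMeta("source_a", "Example Film", ("Actor A",), 2017, "film",
                        "https://a.example/v/1")
second_entry = EntryMeta("source_b", "Example Film (HD)", ("Actor A",), 2017, "film",
                         "https://b.example/v/9")
```

The two titles differ even though the other attributes agree, which is the situation in which plain title matching fails and attribute-based matching becomes useful.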
In embodiments of the present invention, for ease of description, the matching of a first data source and a second data source by the data source matching device is used as the example below. This is not a limitation: the multiple data sources obtained by the matching device may include other data sources, such as a third data source and a fourth data source. Taking the processing of the first and second data sources as an example, the matching device obtains the meta-information of a first entry from the first data source and the meta-information of a second entry from the second data source. The meta-information of each entry includes multiple data attributes of the corresponding entry, and which data attributes are used can be determined by how the data source is implemented in the specific scenario.
102. Look up, by means of an entry matching decision model, the basic entry in a basic entry repository that matches the meta-information of the first entry, and look up, by means of the entry matching decision model, the basic entry in the basic entry repository that matches the meta-information of the second entry. The entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, and the historical data source sample data includes the historical data attributes of each entry sample in multiple data sources.
In embodiments of the present invention, step 101 yields the meta-information of the first entry and the meta-information of the second entry. The meta-information of the first entry includes multiple data attributes of the first entry, and the meta-information of the second entry includes multiple data attributes of the second entry, so these data attributes are the basis for matching the entries against the basic entry repository. To address the problem that entry keywords in data sources are not standardized, embodiments of the present invention use an entry matching decision model to match the meta-information in a data source against the basic entry repository. The entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, and is implemented with a decision tree. A decision tree algorithm is a method for approximating discrete-valued functions and is a typical classification method: the data is first processed, an inductive algorithm generates readable rules and a decision tree, and the decision tree is then used to analyze new data. In essence, a decision tree classifies data through a series of rules. In detail, embodiments of the present invention can use a decision tree algorithm to train on a training set to generate the decision tree, and use the decision tree algorithm on a test set to test and update the decision tree. Specifically, when training the decision tree model, the data can be divided into two parts: one part is used for training to generate the decision tree, and the other part is used as the test set to complete the whole test process. The decision tree algorithms include, but are not limited to, the ID3, C4.5, and C5.0 tree-generation algorithms and the random forest algorithm. Existing open-source tools, such as the Weka tool, can be used for training on the training set and for input and output.
In embodiments of the present invention, the entry matching decision model can be applied to the first entry and to the second entry separately. Matching against the basic entry repository by means of the entry matching decision model finds the basic entry that matches the meta-information of the first entry and the basic entry that matches the meta-information of the second entry. The basic entry repository stores a large number of basic entries, and each basic entry can include multiple data attributes. The basic entry repository is the database against which entries in data sources are matched. Taking a video data source as an example, the basic entry repository can be a media asset library that stores the meta-information of multiple video entries, where the meta-information of a video entry stored in the media asset library includes the title, cast and crew, time, type, playback link, and so on.
In some embodiments of the present invention, before step 102 looks up, by means of the entry matching decision model, the basic entry in the basic entry repository that matches the meta-information of the first entry, the data source matching method provided by embodiments of the present invention further includes the following steps:
A1. Partition the historical data source sample data by each data attribute in turn, obtaining the data partition result corresponding to each data attribute;
A2. Calculate the information gain of each data partition result, and select the data attribute corresponding to the partition result with the maximum information gain as the split point;
A3. Divide the historical data source sample data into two sample data subsets at the split point, recalculate the information gain of each data attribute for each sample data subset, and continue to divide the sample data subsets on the principle of maximum information gain until the sample data in a sample data subset all belong to the same class, then output the completed decision tree model.
Steps A1 to A3 describe in detail how the decision tree model is built. The historical data source sample data is obtained first. Taking decision tree training with the ID3 algorithm as an example, the training process is as follows: traverse all the historical data source sample data, treating each data attribute as one way of partitioning; calculate the information entropy of each partitioning; and select the partitioning with the maximum overall information gain as the split point. For example, the historical data source sample data is split on the data attribute with the maximum information gain into two sample data subsets N1 and N2, and N1 and N2 are split again in the same way until the entries under each sample data subset have the same attribute.
The tree-building process is illustrated next. The historical data source sample data is partitioned recursively on the principle of maximum information gain, with the following steps: partition the historical data source sample data by different data attributes, each partitioning yielding one data partition result; calculate the information gain of each data partition result; and select the data attribute corresponding to the partition result with the maximum information gain as the split point. The split point serves as the splitting attribute of the decision tree, and its values give the branches of the tree, dividing the data set into multiple subsets. The information gain of each attribute is recalculated for each subtree, and so on, until the sample data in a sample data subset all belong to the same class. Tree building then stops and the completed decision tree model is output.
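The split-point selection in steps A1 and A2 can be sketched as an entropy and information-gain calculation in the style of ID3. This is a minimal illustration, not the patent's implementation: the feature names `title_equal` and `year_equal` and the labels `same`/`different` are hypothetical pairwise-comparison features invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(samples, labels, attribute):
    """Entropy reduction obtained by partitioning the samples on one attribute."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(samples, labels, attributes):
    """Step A2: pick the attribute whose partition has the largest information gain."""
    return max(attributes, key=lambda a: information_gain(samples, labels, a))


# Toy historical samples (hypothetical features comparing an entry with a basic entry).
samples = [
    {"title_equal": True, "year_equal": True},
    {"title_equal": True, "year_equal": False},
    {"title_equal": False, "year_equal": True},
    {"title_equal": False, "year_equal": False},
]
labels = ["same", "same", "different", "different"]

split_attribute = best_split(samples, labels, ["title_equal", "year_equal"])
```

In this toy set `title_equal` separates the classes perfectly, so it is chosen as the split point. Step A3 would then repeat the same selection inside each resulting subset until every subset holds a single class.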
Further, in some embodiments of the present invention, after step A3 outputs the completed decision tree model, the data source matching method provided by embodiments of the present invention further includes:
B1. Verify the accuracy of the decision tree model using prior data source sample data.
The prior data source sample data can be used to verify the decision tree. It can be obtained by collecting historical sample data from multiple data sources. Verification with the prior data source sample data gives the constructed decision tree model higher matching accuracy, so that entries in data sources can be matched to basic entries more accurately.
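One simple way to read step B1 is as an accuracy check of the trained tree against labelled prior samples. The sketch below assumes that reading; `toy_tree` stands in for a trained decision tree, and the sample features and labels are invented for the example.

```python
def accuracy(predict, labelled_samples):
    """Fraction of labelled samples for which the model's prediction is correct."""
    if not labelled_samples:
        raise ValueError("no verification samples")
    correct = sum(1 for features, expected in labelled_samples
                  if predict(features) == expected)
    return correct / len(labelled_samples)


def toy_tree(features):
    """Stand-in for a trained decision tree (hypothetical single-split rule)."""
    return "same" if features["title_equal"] else "different"


# Hypothetical prior samples: (features, expected label).
prior_samples = [
    ({"title_equal": True}, "same"),
    ({"title_equal": False}, "different"),
    ({"title_equal": True}, "different"),   # a case the toy rule gets wrong
    ({"title_equal": False}, "different"),
]

verified_accuracy = accuracy(toy_tree, prior_samples)
```

A result below an acceptable level would suggest retraining with more samples or more attributes before the model is used for matching.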
In some embodiments of the present invention, looking up in step 102, by means of the entry matching decision model, the basic entry in the basic entry repository that matches the meta-information of the first entry includes:
C1. Calculate, by means of the entry matching decision model, the matching score between each basic entry in the basic entry repository and the meta-information of the first entry;
C2. From the multiple matching scores, select the basic entry with the maximum score as the basic entry that matches the meta-information of the first entry.
The entry matching decision model can be used to predict the matching score between an entry in a data source and a basic entry. For example, when the basic entry repository contains N basic entries, the entry matching decision model predicts the matching score between each of the N basic entries and the meta-information of the first entry, giving N matching scores. The basic entry with the largest of these N scores is selected as the basic entry that matches the meta-information of the first entry.
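Steps C1 and C2 amount to scoring every basic entry and taking the maximum. The sketch below shows that shape under stated assumptions: `toy_score` is a placeholder that counts equal attributes and is not the decision tree model itself, and the dictionary layout of entries is hypothetical.

```python
def best_match(entry_meta, basic_entries, score):
    """C1: score every basic entry; C2: return (basic entry, score) with the top score."""
    scored = [(score(entry_meta, basic), basic) for basic in basic_entries]
    if not scored:
        return None
    top_score, top_basic = max(scored, key=lambda pair: pair[0])
    return top_basic, top_score


def toy_score(entry_meta, basic):
    """Placeholder scorer: number of attributes on which the two records agree."""
    return sum(1 for key in ("title", "year", "genre")
               if entry_meta.get(key) == basic.get(key))


repository = [
    {"id": 1, "title": "Example Film", "year": 2017, "genre": "film"},
    {"id": 2, "title": "Other Show", "year": 2017, "genre": "series"},
]
first_entry_meta = {"title": "Example Film", "year": 2017, "genre": "film"}

matched_basic, matched_score = best_match(first_entry_meta, repository, toy_score)
```

Passing the scorer in as a parameter keeps the selection logic independent of how the score is produced, so a trained model could replace `toy_score` without changing `best_match`.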
Further, in some embodiments of the present invention, step C2 selects the basic entry with the maximum score from the multiple matching scores as the basic entry that matches the meta-information of the first entry. This applies when the calculated matching scores are all different. If more than one of the matching scores is the maximum, then in addition to steps C1 and C2, looking up in step 102 the basic entry in the basic entry repository that matches the meta-information of the first entry can further include:
C3. When at least two basic entries have the maximum score, obtain the weight corresponding to each data attribute;
C4. Calculate the total weight score of each of the at least two basic entries from the weights corresponding to the data attributes, where the total weight score is the sum, over the data attributes of a basic entry, of each data attribute multiplied by the weight of that data attribute;
C5. From the at least two total weight scores, select the basic entry with the largest score as the basic entry that matches the meta-information of the first entry.
When at least two basic entries have the maximum score, the basic entry that matches the first entry can no longer be determined by selecting the maximum in step C2, so steps C3 to C5 are performed. If the maximum scores are all the same, the weight of each data attribute is introduced and the entry with the highest weighted score is taken. For example, the meta-information of an entry may have eight data attributes: data attribute 1, data attribute 2, data attribute 3, data attribute 4, data attribute 5, data attribute 6, data attribute 7, and data attribute 8. A weight can be assigned to each of the eight data attributes, with the size of each weight determined by the application scenario. For example, the weights may be ordered as follows: data attribute 1 > data attribute 2 > data attribute 3 > data attribute 4 > data attribute 5 > data attribute 6 > data attribute 7 > data attribute 8. With a weight score set for each data attribute, total weight score = data attribute 1 * 10000000 + data attribute 2 * 1000000 + data attribute 3 * 100000 + data attribute 4 * 10000 + data attribute 5 * 1000 + data attribute 6 * 100 + data attribute 7 * 10 + data attribute 8 * 1. Finally, the decision tree scoring result and the weight score are combined, and the basic entry with the largest of the total weight scores is selected as the basic entry that best matches the first entry. Note that the number of weights and the score corresponding to each weight are only illustrative and can be reset as needed in practical applications.
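The tie-break in steps C3 to C5 can be sketched as a weighted sum. The weights below are illustrative descending powers of ten in the spirit of the example above, and the per-attribute values are assumed to be 1 when the attribute matched and 0 otherwise; the source does not define how an attribute value is computed, so that encoding is an assumption of this example.

```python
# Illustrative weights for eight data attributes, most important first (assumed values).
WEIGHTS = [10_000_000, 1_000_000, 100_000, 10_000, 1_000, 100, 10, 1]


def total_weight_score(attribute_values, weights=WEIGHTS):
    """C4: sum of each attribute value multiplied by that attribute's weight."""
    if len(attribute_values) != len(weights):
        raise ValueError("one value is required per weighted attribute")
    return sum(value * weight for value, weight in zip(attribute_values, weights))


def break_tie(tied_candidates):
    """C5: among tied basic entries, return the id with the largest total weight score."""
    return max(tied_candidates, key=lambda cid: total_weight_score(tied_candidates[cid]))


# Two basic entries that tied on the decision model's matching score (hypothetical data).
tied = {
    "basic_A": [1, 0, 1, 1, 1, 1, 1, 1],   # misses the second most important attribute
    "basic_B": [1, 1, 0, 0, 0, 0, 0, 0],   # matches only the two most important ones
}
winner = break_tie(tied)
```

With weights spaced by powers of ten and 0/1 values, a match on a more important attribute always outweighs any combination of less important ones, which is why `basic_B` wins here despite matching fewer attributes.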
In some embodiments of the present invention, looking up in step 102, by means of the entry matching decision model, the basic entry in the basic entry repository that matches the meta-information of the second entry includes:
D1. Calculate, by means of the entry matching decision model, the matching score between each basic entry in the basic entry repository and the meta-information of the second entry;
D2. From the multiple matching scores, select the basic entry with the maximum score as the basic entry that matches the meta-information of the second entry.
The matching process for the second entry is similar to the previous embodiment. For example, the entry matching decision model can predict the matching score between the second entry and a basic entry, and the result can be one of three outcomes: fusion, new, or suspicious. Fusion means that the entry matching decision model can accurately judge that the data belongs to the same entry; for example, when the matching score is 20 points or more, the matching result output is fusion. New means that the entry matching decision model can judge that the data does not belong to the same entry; for example, when the matching score is below 10 points, the matching result output is new. Suspicious means that the entry matching decision model cannot judge whether the data belongs to the same entry; for example, when the matching score is between 10 and 19 points, the matching result output is suspicious.
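The three outcomes can be expressed as a small threshold function. The thresholds 20 and 10 come from the example in the text; the function name and the string labels are choices made for this sketch.

```python
def classify_match(score, fusion_threshold=20, new_threshold=10):
    """Map a matching score to one of the three outcomes described above."""
    if score >= fusion_threshold:
        return "fusion"       # confidently the same entry
    if score < new_threshold:
        return "new"          # confidently a different entry
    return "suspicious"       # the model cannot decide either way


outcomes = [classify_match(s) for s in (25, 20, 19, 10, 9)]
```

Keeping the thresholds as parameters makes it easy to tune how many matches fall into the suspicious band.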
103rd, when first entry and second entry match identical basic entry from basic entry thesaurus, by One data source and the second data source are associated with the same basic entry in basic entry thesaurus.
In embodiments of the present invention, step 102 may use the entry matching decision model to match a basic entry for the first entry and, likewise, to match a basic entry for the second entry. If the data source matching device crawls multiple data sources, it may match a corresponding basic entry for the entries of each data source in the same way as for the first entry and the second entry. Next, the data source matching device examines the basic entry matched to each entry to determine whether entries from different data sources have matched the same basic entry. An entry identifier (IDentifier, ID) may be set for each basic entry in the basic entry repository, and whether two basic entries are the same can be determined from their entry IDs. When the first entry and the second entry match the same basic entry in the basic entry repository, this indicates that the first entry and the second entry have the same entry meta information. The first entry belongs to the first data source and the second entry belongs to the second data source, so the first data source and the second data source can both be associated with that same basic entry in the basic entry repository. For example, when a mobile terminal needs to display a basic entry, it can display the first data source and the second data source associated with that basic entry, so that the user can reach both data sources through the basic entry and then choose the next operation on either of them. Taking video data sources as an example, the user can see on the mobile terminal that the first video data source and the second video data source both provide playback for a given video entry, and can then click on either the first video data source or the second video data source.
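The association step described above can be sketched as a simple grouping by basic entry ID. This is only a minimal illustration, not the implementation of the embodiment: the function name `associate_sources`, the source names and the IDs `B100`/`B200` are hypothetical.

```python
from collections import defaultdict


def associate_sources(matches):
    """Group data sources under the basic entry ID that their entries matched.

    `matches` is an iterable of (data_source, basic_entry_id) pairs. A pair
    whose basic_entry_id is None (no basic entry was matched) is skipped.
    """
    by_entry = defaultdict(list)
    for source, entry_id in matches:
        if entry_id is not None and source not in by_entry[entry_id]:
            by_entry[entry_id].append(source)
    return dict(by_entry)


# Hypothetical matching results for four crawled data sources.
matches = [
    ("video_source_1", "B100"),
    ("video_source_2", "B100"),
    ("video_source_3", "B200"),
    ("video_source_4", None),
]
associations = associate_sources(matches)
print(associations["B100"])  # ['video_source_1', 'video_source_2']
```

A terminal that displays basic entry `B100` could then list both associated data sources for the user to choose from.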
Steps 102 and 103 describe in detail the case in which entries in the data sources are matched to basic entries in the basic entry repository. For entries in the data sources for which no basic entry is stored in the basic entry repository, embodiments of the present invention additionally provide clustering to complete the association between data sources. For example, in some embodiments of the present invention, the data source matching method further includes the following steps:
E1, when no basic entry is matched for the first entry and the second entry by the entry matching decision model, performing cluster analysis on the first entry and the second entry;
E2, when the first entry and the second entry are divided into the same category, associating the first data source and the second data source with that same category.
When no basic entry is matched for the first entry and the second entry by the entry matching decision model, the association between data sources can no longer be realized through that model, and cluster analysis can be performed on the first entry and the second entry instead. Cluster analysis is a statistical analysis method for studying the classification of entries. A cluster consists of a number of patterns, where a pattern is a vector of measurements, or a point in a multidimensional space. Cluster analysis is based on similarity: patterns within one cluster are more similar to each other than to patterns in a different cluster. Therefore, when cluster analysis divides the first entry and the second entry into the same category, the two entries can be assigned to the same kind of entry, and the first data source and the second data source are then associated with that same category. For example, when cluster analysis determines that the first entry and the second entry have the same reference relationship, they can be divided into the same category.
Further, in some embodiments of the present invention, performing cluster analysis on the first entry and the second entry in step E1 includes:
E11, obtaining the cross-included data sources of the first entry, and obtaining the cross-included data sources of the second entry;
E12, if the first entry and the second entry have an identical cross-included data source, determining that the first entry and the second entry are divided into the same category.
In cases where many data attributes in the meta information of different entries differ greatly but the entries are in fact the same entry, embodiments of the present invention may match by a clustering algorithm instead of the decision tree model. For example, taking video data sources: video data source websites cross-include each other's data sources. Entry A1 belongs to video data source 1 and entry A2 belongs to video data source 2; because their titles differ, the decision tree model cannot determine that they are the same entry. However, entry A1 and entry A2 have the same cross-included video data source, so it can be directly determined that the two entries are divided into the same category, and video data source 1 and video data source 2 can then be aggregated together.
As the above description of the embodiments shows, the meta information of a first entry is first obtained from a first data source, and the meta information of a second entry is obtained from a second data source. The entry matching decision model is then used to search the basic entry repository for the basic entry that matches the meta information of the first entry and for the basic entry that matches the meta information of the second entry. The entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, which includes the historical data attributes of each entry sample in multiple data sources. When the first entry and the second entry match the same basic entry in the basic entry repository, the first data source and the second data source are associated with that same basic entry. Because the entry matching decision model matches the first entry and the second entry against the basic entry repository separately, the whole process can be completed automatically by machine learning, and each entry can be matched accurately to a basic entry. Associating both data sources with the same basic entry thereby achieves automatic aggregation of data sources on the basis of basic entries.
To facilitate better understanding and implementation of the above schemes of the embodiments of the present invention, a corresponding application scenario is described in detail below.
Taking video data sources as an example, embodiments of the present invention can aggregate multiple video data sources and push them to the user through a unified video application. For example, the video application can aggregate entries from different video data sources into a media asset library. Here an entry refers specifically to one film, one TV series, or one variety show. The media asset library refers to the entry information stored in the server backend after video data sources are aggregated, including title, cast and crew, year, type, data source, and so on.
In embodiments of the present invention, the server can aggregate video data sources from across the whole network, which requires gathering the playback links for the same film and presenting them to the user. After videos from across the network are captured by a crawler, embodiments can use machine learning to analyze the large volume of data and classify it automatically. A small amount of manual labeling during operation continually optimizes the algorithm, which greatly improves matching accuracy and reduces manual operating cost. The algorithms used are a decision tree algorithm and a clustering algorithm. The main process is as follows. First, historical matching samples are learned, and the decision tree algorithm generates an entry matching decision model for automatic fusion. Fusion means merging different video data sources under one video entry. During operation, suspicious videos are labeled and the labeled samples are learned to further optimize the decision tree algorithm. Where video data sources cross-include each other's playback links, a clustering algorithm performs the fusion, covering scenarios that the decision tree algorithm does not cover.
Fig. 2 is a schematic diagram of an application scenario of the data source matching method provided by an embodiment of the present invention. The method mainly includes a decision tree algorithm, a clustering algorithm and machine learning. As shown in Fig. 2, the data flow processing includes: obtaining data sources through a crawler or an interface --> data preprocessing --> entry matching --> entry clustering --> merging and storage. Each process is described in detail next.
In embodiments of the present invention, the server can collect entries from multiple data sources, for example data source 1, data source 2, ..., data source N. After the entries are obtained, the data can be preprocessed, for example cleaned. Video websites define the meta information of a video differently. The meta information may include inherent attributes such as title, cast and crew, year, type and language; a title may carry a season number, a subtitle or a year, and cast and crew names may appear as synonyms or in a foreign language. For the subsequent entry matching to run efficiently and accurately, this key information must first be standardized and dirty data removed. Dirty data refers to data in a data source that falls outside a given range, is meaningless for the actual business, or has an illegal format, as well as non-standard encodings and ambiguous business logic in the data source. Cleaning such dirty data improves the efficiency of entry matching. In embodiments of the present invention, entry data can be preprocessed by regular-expression string replacement.
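As an illustration of preprocessing by regular-expression replacement, the following sketch separates a season number and a year from a raw title. The patterns, the lowercasing of the title and the function name `normalize_title` are assumptions made for this example; the text does not specify the actual replacement rules.

```python
import re

# Hypothetical patterns: a trailing "Season N" and a bracketed four-digit year.
_SEASON = re.compile(r"\s*\bseason\s*(\d+)\s*$", re.IGNORECASE)
_YEAR = re.compile(r"\s*[\(\[]\s*((?:19|20)\d{2})\s*[\)\]]")
_SPACES = re.compile(r"\s+")


def normalize_title(raw):
    """Return (clean_title, season, year); season and year are None if absent."""
    text = _SPACES.sub(" ", raw).strip()  # collapse runs of whitespace

    year = None
    found = _YEAR.search(text)
    if found:
        year = int(found.group(1))
        text = _YEAR.sub("", text).strip()

    season = None
    found = _SEASON.search(text)
    if found:
        season = int(found.group(1))
        text = _SEASON.sub("", text).strip()

    # Lowercasing is an assumption so that titles compare case-insensitively.
    return text.lower(), season, year


print(normalize_title("  Some Show   Season 2 (2016) "))  # ('some show', 2, 2016)
```

After such normalization, the title, season number and year can be compared as separate data attributes rather than as one free-form string.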
After data preprocessing is complete, matching is performed on the preprocessed data. Entry matching in embodiments of the present invention can use a decision tree model. First, existing empirical data is organized as in Table 1 below, where 1 means consistent and 0 means inconsistent; these values indicate whether the meta information of two entries agrees. "yes" means match, "no" means mismatch, and "maybe" means uncertain; match and mismatch indicate whether the two belong to the same entry.
Table 1 is the matching result of data attribute:
Next, decision tree training using the ID3 algorithm is described in detail as an example. The training process is as follows:
1) Traverse all the data, treating each data attribute (title, director, etc.) as one way of dividing the data.
2) Calculate the information entropy of each way of dividing.
3) Select the division with the largest overall information gain as the split point, where the largest information gain means the largest reduction in information entropy.
4) Split at the split point to obtain two nodes N1 and N2.
Here, the meta information attribute with the largest information gain is selected for splitting. For example, splitting by title divides the data into two classes: same title and different title.
5) Continue to perform steps 1 to 4 on N1 and N2 until the entities under each node all have the same attribute, where each node represents one category.
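Steps 1) to 3) can be sketched as follows. This is a minimal illustration of the standard ID3 split selection, not the code of the embodiment; the sample rows and labels are invented for the example and are not taken from Table 1.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels, attributes):
    """Return the attribute whose split gives the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Invented samples: 1 = attribute consistent, 0 = inconsistent.
rows = [
    {"title": 1, "director": 1, "year": 1},
    {"title": 1, "director": 0, "year": 1},
    {"title": 1, "director": 1, "year": 0},
    {"title": 0, "director": 1, "year": 1},
    {"title": 0, "director": 0, "year": 0},
    {"title": 0, "director": 1, "year": 0},
]
labels = ["yes", "yes", "yes", "no", "no", "no"]

print(best_split(rows, labels, ["title", "director", "year"]))  # title
```

In this toy data the title separates the two classes completely, so it has the largest gain and would be chosen as the first split point; steps 4) and 5) would then repeat the same selection on each resulting subset.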
Fig. 3 is a schematic diagram of the generation process of the decision tree model provided by an embodiment of the present invention. When the matching score is greater than or equal to 20, the matching result can be output as fusion. New means that the entry matching decision model can determine that the entries do not belong to the same entry; for example, when the matching score is less than 10, the matching result can be output as new. Suspicious means that the entry matching decision model cannot determine whether the entries belong to the same entry; for example, when the matching score is from 10 to 19, the matching result can be output as suspicious.
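The example thresholds above can be expressed as a small mapping from score to result. The function name `classify_score` and the result strings are chosen for this sketch only.

```python
def classify_score(score):
    """Map a matching score to a result using the example thresholds in the text."""
    if score >= 20:
        return "fusion"      # judged to be the same entry
    if score >= 10:
        return "suspicious"  # cannot be decided; a candidate for manual labeling
    return "new"             # judged not to be the same entry


for score in (24, 17, 9):
    print(score, classify_score(score))
```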
In the decision tree shown in Fig. 3, the title or alias is used as the first split point. When the title or alias is the same, the director or actor is used as the next split point. When the title or alias differs, the basic entry ID is used as the next split point, where an output score of 21 represents fusion and an output score of 6 represents new.
When the director or actor is the same, the year is used as the next split point. When the director or actor differs, the basic entry ID is used as the next split point.
When the year is the same, language and type are used as the next split point; when the year differs, the output score is 17, which represents suspicious. When the basic entry ID is the same, the output score is 20, which represents fusion; when the basic entry ID differs, whether the director and actor are empty is used as the next split point.
When language and type are the same, the type is used as the next split point; when language and type differ, the version is used as the next split point. When the director and actor are empty, year, language, region and type are used as the split point; when the director and actor are not empty, similarity of the synopsis is used as the split point.
When the type is the same, the output score is 24, which represents fusion; when the type differs, the season number is used as the next split point. When the version is the same, the output score is 19, which represents suspicious; when the version differs, the output score is 18, which represents suspicious. When year, language, region and type are the same, the output score is 22, which represents fusion; when they differ, the output score is 9, which represents new. When the synopsis is similar, the output score is 20, which represents fusion; when the synopsis is dissimilar, the output score is 7, which represents new.
When the season number is the same, the output score is 23, which represents fusion; when the season number differs, the output score is 16, which represents suspicious.
After the decision tree model is generated by the above method, as shown in Fig. 2, the data source is stored into the basic entry repository when an entry is matched. When the fields of an entry in a data source do not match, data sources are aggregated by the clustering algorithm. When a data source corresponds to multiple entries, adjustments can be made through the operation system; for example, staff can submit correction information for a field or data source through the operation system, and this information is then saved into the preprocessed data. After data sources are aggregated by the clustering algorithm and field aggregation of the entry information is complete, the data source is stored into the basic entry repository. During field aggregation, fine adjustments can be made through the operation system based on empirical values, for example by introducing a judgment on basic entry IDs that a third-party platform has already associated. For a crawled video website, for instance, some entries already carry a basic entry and can be matched directly, because the entries in the media asset library are also associated with basic entry IDs, and the decision tree result is scored accordingly.
Next, matching entries in a data source against the basic entries of the media asset library is illustrated. The matching process compares the crawled video data source one by one with the media asset library already present in the system and uses the decision tree model to find the best match in the media asset library. The media asset library provides the standard data of the basic entries, and the other video data sources need to be fused with its basic entries. The decision tree model can be used to match entries in the data source against basic entries in the media asset library. For example, 24 points is set as the highest score in the decision tree model; that is, the output of the decision tree model is converted into a score.
If scores are identical, a weight for each data attribute can also be introduced, and the entry with the highest weighted score is taken. The weight order is as follows: title > director > actor > year > season number (season, part) > type > region > language.
A weight score is set for each field and can be calculated by the following formula:
Total weight score = title * 10000000 + director * 1000000 + actor * 10000 + year * 10000 + season number (season, part) * 1000 + type * 100 + region * 10 + language * 1.
Finally, the decision tree score and the weight score are combined to obtain the best-matching entry.
It should be noted that, in embodiments of the present invention, the decision tree model first outputs the matching scores of all entries, and the entry with the highest decision tree score is taken as the basic entry matched to the entry in the data source. If more than one entry has the highest score, the entry with the highest total weight score is then taken as the basic entry matched to the entry in the data source.
Next, the entry clustering process in the embodiments of the present invention is described in detail. In real data there are many cases where the fields differ greatly but the entries are in fact the same entry. Such cases cannot be matched by the decision tree model, and a clustering algorithm is needed.
Fig. 4 is a schematic diagram of an application scenario of the data source aggregation approach provided by an embodiment of the present invention. Video data source websites have the characteristic of cross-including each other's data sources. Consider entry A1 of video source L4, entry A2 of video source L5, and entry A3 of the basic entry repository: because their titles differ, the decision tree model cannot determine that they are the same entry. However, entry A1 of video source L4 and entry A2 of video source L5 have the same data source L2 (the main check here is whether the playback addresses are the same), and entry A2 of video source L5 and entry A3 of the basic entry repository have the same video source L1. Clustering by data source therefore assigns entry A1 of video source L4 and entry A2 of video source L5 to entry A3 of the basic entry repository.
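The Fig. 4 scenario can be sketched as grouping entries that share a playback address, directly or through a chain of other entries. The text does not name a specific clustering algorithm; this sketch swaps in a union-find structure as one simple way to obtain such groups, and the URLs and the unrelated entry `B1` are hypothetical.

```python
def cluster_by_shared_links(entries):
    """Group entry names that share at least one playback URL, transitively.

    `entries` maps an entry name to the set of playback URLs it includes.
    Returns a list of sets of entry names.
    """
    parent = {name: name for name in entries}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # URL -> first entry observed with that URL
    for name, links in entries.items():
        for link in links:
            if link in first_seen:
                parent[find(name)] = find(first_seen[link])
            else:
                first_seen[link] = name

    clusters = {}
    for name in entries:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


entries = {
    "A1": {"https://l2.example/play/1"},                               # from source L4
    "A2": {"https://l2.example/play/1", "https://l1.example/play/9"},  # from source L5
    "A3": {"https://l1.example/play/9"},                               # basic entry
    "B1": {"https://l3.example/play/5"},                               # unrelated entry
}
clusters = cluster_by_shared_links(entries)
print(sorted(sorted(c) for c in clusters))  # [['A1', 'A2', 'A3'], ['B1']]
```

A1 and A3 share no address directly, but both are linked through A2, so all three fall into one group, which mirrors the outcome described for Fig. 4.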
Next, machine learning and model optimization in embodiments of the present invention are briefly described. The decision tree model is trained with existing data (i.e., historical data). When it cannot match all cases, clustering is performed by the clustering algorithm. In addition, corrections can be made through the operation system, and the decision tree model is periodically optimized and updated so that the matching success rate for subsequent new data is higher. Fig. 5 is a schematic diagram of how a data source aggregation result provided by an embodiment of the present invention is displayed on a mobile terminal. The user accesses a video application (APPlication, APP) on the mobile terminal. The mobile terminal detects the user's input instruction and, according to that instruction, sends a playback request to the server to request the corresponding video resource. After receiving the playback request, the server can use the entry matching decision model to look up the meta information of each entry in multiple data sources and associate entries from multiple data sources that have the same basic entry. For example, the server integrates the resources of the same TV series from multiple data sources and aggregates the playback links of the different video data sources, so that the same episode from different data sources is gathered together. After completing the integration of the video resources, the server sends the result to the mobile terminal, which displays to the user, through the video application, the same video resource from the different data sources. For example, as shown in Fig. 5, the mobile terminal can display three video data sources for the same video resource: video data source 1, video data source 2 and video data source 3. The user can choose which data source to play according to the actual situation, which also serves to promote the cooperating video websites. Fig. 5 is one example of an embodiment of the present invention and mainly emphasizes the effect of fusing different data sources: video data sources can be aggregated for each episode, and "31" here, for instance, contains the 31st episode from multiple video data sources.
As the foregoing illustration shows, the automatic matching of video data sources in embodiments of the present invention can be completed fully automatically, with very little demand on manpower. In addition, embodiments of the present invention aggregate video data sources from across the whole network and can ensure that video data sources with the same entry are grouped together, so that the aggregated video data sources can be pushed to the user, making it convenient for the user to select a video data source.
It should be noted that, for brevity, the foregoing method embodiments are each described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
To facilitate better implementation of the above schemes of the embodiments of the present invention, related devices for implementing them are also provided below.
Referring to Fig. 6-a, a data source matching device 600 provided by an embodiment of the present invention may include: a meta information acquisition module 601, a model matching module 602 and a data source aggregation module 603, wherein,
The meta information acquisition module 601 is configured to obtain the meta information of a first entry from a first data source and the meta information of a second entry from a second data source, the meta information including multiple data attributes of the respective entry;
The model matching module 602 is configured to search, by an entry matching decision model, the basic entry repository for the basic entry that matches the meta information of the first entry, and to search, by the entry matching decision model, the basic entry repository for the basic entry that matches the meta information of the second entry, the entry matching decision model being obtained by training and testing a decision tree model with historical data source sample data, and the historical data source sample data including the historical data attributes of each entry sample in multiple data sources;
The data source aggregation module 603 is configured to, when the first entry and the second entry match the same basic entry in the basic entry repository, associate the first data source and the second data source with that same basic entry in the basic entry repository.
In some embodiments of the present application, referring to Fig. 6-b, the data source matching device 600 further includes:
A sample data division module 604, configured to, before the model matching module 602 searches the basic entry repository by the entry matching decision model for the basic entry that matches the meta information of the first entry, divide the historical data source sample data according to each data attribute separately, obtaining a data division result for each data attribute;
A split point determination module 605, configured to calculate the information gain of each data division result and to select, as the split point, the data attribute whose data division result has the largest information gain;
A model training module 606, configured to divide the historical data source sample data into two sample data subsets at the split point, recalculate the information gain of each data attribute for each sample data subset, and continue dividing the sample data subsets by the largest information gain until the sample data in a sample data subset all belong to the same class, and then output the decision tree model obtained when division is complete.
In some embodiments of the present application, referring to Fig. 6-c, relative to Fig. 6-b, the data source matching device 600 further includes:
A model verification module 607, configured to verify the accuracy of the decision tree model using prior data source sample data after the model training module 606 outputs the decision tree model obtained when division is complete.
In some embodiments of the present application, referring to Fig. 6-d, the model matching module 602 includes:
A matching score calculation module 6021, configured to calculate, by the entry matching decision model, the matching score between each basic entry in the basic entry repository and the meta information of the first entry;
A basic entry selection module 6022, configured to select, from the multiple matching scores, the basic entry with the highest score as the basic entry that matches the meta information of the first entry.
In some embodiments of the present application, referring to Fig. 6-e, relative to Fig. 6-d, the model matching module 602 further includes a weight calculation module 6023, wherein,
The weight calculation module 6023 is configured to, when at least two basic entries have the highest score, obtain the weight corresponding to each data attribute, and calculate the total weight score of each of the at least two basic entries from those weights, the total weight score being the sum, over the data attributes of a basic entry, of each data attribute multiplied by its corresponding weight;
The basic entry selection module 6022 is further configured to select, from the at least two total weight scores, the basic entry with the highest score as the basic entry that matches the meta information of the first entry.
In some embodiments of the present application, referring to Fig. 6-f, relative to Fig. 6-a, the data source matching device 600 further includes a clustering module 608, wherein,
The clustering module 608 is configured to perform cluster analysis on the first entry and the second entry when no basic entry is matched for the first entry and the second entry by the entry matching decision model;
The data source aggregation module 603 is further configured to, when the first entry and the second entry are divided into the same category, associate the first data source and the second data source with that same category.
Further, in some embodiments of the present invention, the clustering module 608 is specifically configured to obtain the cross-included data sources of the first entry and the cross-included data sources of the second entry, and, if the first entry and the second entry have an identical cross-included data source, to determine that the first entry and the second entry are divided into the same category.
As the above description of the embodiments shows, the meta information of a first entry is first obtained from a first data source, and the meta information of a second entry is obtained from a second data source. The entry matching decision model is then used to search the basic entry repository for the basic entry that matches the meta information of the first entry and for the basic entry that matches the meta information of the second entry. The entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, which includes the historical data attributes of each entry sample in multiple data sources. When the first entry and the second entry match the same basic entry in the basic entry repository, the first data source and the second data source are associated with that same basic entry. Because the entry matching decision model matches the first entry and the second entry against the basic entry repository separately, the whole process can be completed automatically by machine learning, and each entry can be matched accurately to a basic entry. Associating both data sources with the same basic entry thereby achieves automatic aggregation of data sources on the basis of basic entries.
Fig. 7 is a schematic diagram of a server structure provided by an embodiment of the present invention. The server 1100 may vary considerably with configuration or performance and may include one or more central processing units (CPU) 1122 (for example, one or more processors), a memory 1132, and one or more storage media 1130 (for example, one or more mass storage devices) storing application programs 1142 or data 1144. The memory 1132 and the storage medium 1130 may be transient storage or persistent storage. The program stored in the storage medium 1130 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1122 may be configured to communicate with the storage medium 1130 and to execute, on the server 1100, the series of instruction operations in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps of the data source matching method performed by the server in the above embodiments can be based on the server structure shown in Fig. 7.
It should also be noted that the device embodiments described above are only schematic. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates that there is a communication connection between them, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, or by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components and the like. In general, any function completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits or dedicated circuits. For the present invention, however, a software implementation is the better embodiment in most cases. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention.
In summary, the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the above embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (15)

  1. A data source matching method, characterized by comprising:
    obtaining meta information of a first entry from a first data source, and obtaining meta information of a second entry from a second data source, the meta information comprising multiple data attributes of the corresponding entry;
    searching, by means of an entry matching decision model, a basic entry library for a basic entry that matches the meta information of the first entry, and searching, by means of the entry matching decision model, the basic entry library for a basic entry that matches the meta information of the second entry, wherein the entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, and the historical data source sample data comprises historical data attributes of each entry sample in multiple data sources;
    when the first entry and the second entry are matched to the same basic entry in the basic entry library, associating the first data source and the second data source with that same basic entry in the basic entry library.
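The association step of this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the matcher is passed in as a callable that stands in for the entry matching decision model, and `associate_sources`, `toy_match`, `BASE_LIBRARY`, and the sample titles are hypothetical names and data introduced only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Set

Meta = Dict[str, str]  # meta information: attribute name -> attribute value


def associate_sources(
    entries_by_source: Dict[str, Meta],
    match: Callable[[Meta], Optional[str]],
) -> Dict[str, Set[str]]:
    """Group data sources by the basic entry their entry was matched to."""
    links: Dict[str, Set[str]] = defaultdict(set)
    for source, meta in entries_by_source.items():
        base_id = match(meta)  # stands in for the entry matching decision model
        if base_id is not None:
            links[base_id].add(source)
    return dict(links)


# Hypothetical basic entry library keyed by (normalized title, year).
BASE_LIBRARY = {("inception", "2010"): "base-001"}


def toy_match(meta: Meta) -> Optional[str]:
    """Placeholder matcher (an assumption): exact lookup on title and year."""
    return BASE_LIBRARY.get((meta["title"].strip().lower(), meta["year"]))


links = associate_sources(
    {
        "source_a": {"title": "Inception", "year": "2010"},
        "source_b": {"title": " inception ", "year": "2010"},
        "source_c": {"title": "Unknown Film", "year": "1999"},
    },
    toy_match,
)
```

Here `source_a` and `source_b` resolve to the same basic entry and are therefore associated with it, while `source_c` finds no basic entry and is left out of the mapping.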
  2. The method according to claim 1, characterized in that, before the searching, by means of the entry matching decision model, the basic entry library for the basic entry that matches the meta information of the first entry, the method further comprises:
    dividing the historical data source sample data according to each data attribute respectively, to obtain a data division result corresponding to each data attribute;
    calculating the information gain of each data division result respectively, and selecting the data attribute corresponding to the data division result with the largest information gain as the split point;
    dividing the historical data source sample data into two sample data subsets according to the split point, recalculating the information gain of each data attribute for each sample data subset, and continuing to divide the sample data subsets according to the principle of maximum information gain until the sample data in a sample data subset all belong to the same class, and outputting the decision tree model obtained when the division is complete.
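The training procedure in this claim follows the familiar pattern of growing a decision tree by maximum information gain. The sketch below is one possible reading under stated assumptions, not the patent's own code: every data attribute is reduced to a boolean (for example, whether a candidate's title equals the basic entry's title), class labels are the strings `"match"` and `"no_match"`, and all function names and the toy sample set are hypothetical.

```python
import itertools
import math
from collections import Counter
from typing import Dict, List

Sample = Dict[str, bool]


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(samples: List[Sample], labels: List[str], attr: str) -> float:
    """Entropy reduction obtained by dividing the samples on one boolean attribute."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(samples, labels) if x[attr] == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def build_tree(samples: List[Sample], labels: List[str], attrs: List[str]):
    """Recursively split on the attribute with the largest information gain."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return majority  # leaf: subset is pure, or no attribute is left
    best = max(attrs, key=lambda a: information_gain(samples, labels, a))
    rest = [a for a in attrs if a != best]
    node: dict = {"attr": best}
    for value in (True, False):  # two sample data subsets per split point
        idx = [i for i, x in enumerate(samples) if x[best] == value]
        node[value] = (
            build_tree([samples[i] for i in idx], [labels[i] for i in idx], rest)
            if idx
            else majority
        )
    return node


def predict(tree, sample: Sample) -> str:
    """Walk the tree from the root to a leaf label."""
    while isinstance(tree, dict):
        tree = tree[sample[tree["attr"]]]
    return tree


# Toy historical sample data (an assumption): an entry matches when the title
# agrees and at least one of year or director agrees.
ATTRS = ["title_same", "year_same", "director_same"]
samples = [dict(zip(ATTRS, bits)) for bits in itertools.product([True, False], repeat=3)]
labels = [
    "match" if s["title_same"] and (s["year_same"] or s["director_same"]) else "no_match"
    for s in samples
]
tree = build_tree(samples, labels, ATTRS)
```

On this toy data the first split point is `title_same`, because dividing on it removes the most label uncertainty; the subset where the title differs is already pure and becomes a leaf.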
  3. The method according to claim 2, characterized in that, after the outputting the decision tree model obtained when the division is complete, the method further comprises:
    verifying the accuracy of the decision tree model using prior data source sample data.
  4. The method according to claim 2, characterized in that the searching, by means of the entry matching decision model, the basic entry library for the basic entry that matches the meta information of the first entry comprises:
    calculating, by means of the entry matching decision model, a matching score between each basic entry in the basic entry library and the meta information of the first entry;
    selecting, from the multiple matching scores, the basic entry with the highest score as the basic entry that matches the meta information of the first entry.
  5. The method according to claim 4, characterized in that the searching, by means of the entry matching decision model, the basic entry library for the basic entry that matches the meta information of the first entry further comprises:
    when at least two basic entries have the highest score, obtaining the weight corresponding to each data attribute;
    calculating the total weight score of each of the at least two basic entries according to the weight corresponding to each data attribute, the total weight score being the sum of the products of each data attribute in a basic entry and the weight of the corresponding data attribute;
    selecting, from the at least two total weight scores, the basic entry with the largest score as the basic entry that matches the meta information of the first entry.
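The selection logic of claims 4 and 5 can be sketched as follows. This is a hedged illustration only: the matching scores are taken as given inputs rather than computed by a trained model, each data attribute is assumed to be represented by a numeric hit value (1 if the basic entry's attribute agrees with the candidate entry, otherwise 0), and `select_base_entry` together with the sample scores and weights are hypothetical.

```python
from typing import Dict


def select_base_entry(
    scores: Dict[str, float],
    attribute_hits: Dict[str, Dict[str, int]],
    weights: Dict[str, float],
) -> str:
    """Pick the highest-scoring basic entry; break ties by total weight score."""
    best = max(scores.values())
    tied = [base_id for base_id, score in scores.items() if score == best]
    if len(tied) == 1:
        return tied[0]

    def total_weight_score(base_id: str) -> float:
        # Sum over attributes of (attribute value x attribute weight).
        return sum(weights[attr] * hit for attr, hit in attribute_hits[base_id].items())

    return max(tied, key=total_weight_score)


# Hypothetical inputs: two basic entries share the highest matching score.
scores = {"base-001": 0.9, "base-002": 0.9, "base-003": 0.4}
attribute_hits = {
    "base-001": {"title": 1, "year": 0, "director": 1},
    "base-002": {"title": 1, "year": 1, "director": 0},
    "base-003": {"title": 0, "year": 1, "director": 0},
}
weights = {"title": 0.5, "year": 0.2, "director": 0.3}

chosen = select_base_entry(scores, attribute_hits, weights)
```

With these assumed weights, `base-001` totals 0.5 + 0.3 and `base-002` totals 0.5 + 0.2, so the tie is resolved in favor of `base-001`. How the weights are obtained is not specified in the claim and is left open here.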
  6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
    when no basic entry is matched for the first entry and the second entry by means of the entry matching decision model, performing cluster analysis on the first entry and the second entry;
    when the first entry and the second entry are divided into the same category, associating the first data source and the second data source with that same category.
  7. The method according to claim 6, characterized in that the performing cluster analysis on the first entry and the second entry comprises:
    obtaining the intersection-containing data source of the first entry, and obtaining the intersection-containing data source of the second entry;
    if the first entry and the second entry have the same intersection-containing data source, determining that the first entry and the second entry are divided into the same category.
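The fallback described in claims 6 and 7 can be sketched as a simple grouping step. The sketch rests on an interpretation that the source does not confirm: the data source named in claim 7 is modeled as a plain string key attached to each unmatched entry, and entries sharing that key are placed in the same category. `UnmatchedEntry`, `containing_key`, and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class UnmatchedEntry(NamedTuple):
    source: str          # data source the entry came from
    containing_key: str  # assumed identifier compared in claim 7


def cluster_unmatched(entries: Iterable[UnmatchedEntry]) -> Dict[str, Set[str]]:
    """Associate data sources whose unmatched entries fall in the same category."""
    categories: Dict[str, Set[str]] = defaultdict(set)
    for entry in entries:
        categories[entry.containing_key].add(entry.source)
    return dict(categories)


categories = cluster_unmatched(
    [
        UnmatchedEntry(source="source_a", containing_key="set-42"),
        UnmatchedEntry(source="source_b", containing_key="set-42"),
        UnmatchedEntry(source="source_c", containing_key="set-7"),
    ]
)
```

In this toy run `source_a` and `source_b` share a key and land in one category, while `source_c` forms a category of its own.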
  8. A data source matching device, characterized by comprising:
    a meta information acquisition module, configured to obtain meta information of a first entry from a first data source, and obtain meta information of a second entry from a second data source, the meta information comprising multiple pieces of data attribute information of the corresponding entry;
    a model matching module, configured to search, by means of an entry matching decision model, a basic entry library for a basic entry that matches the meta information of the first entry, and to search, by means of the entry matching decision model, the basic entry library for a basic entry that matches the meta information of the second entry, wherein the entry matching decision model is obtained by training and testing a decision tree model with historical data source sample data, and the historical data source sample data comprises historical data attribute information of each entry sample in multiple data sources;
    a data source aggregation module, configured to, when the first entry and the second entry are matched to the same basic entry in the basic entry library, associate the first data source and the second data source with that same basic entry in the basic entry library.
  9. The device according to claim 8, characterized in that the data source matching device further comprises:
    a sample data division module, configured to, before the model matching module searches, by means of the entry matching decision model, the basic entry library for the basic entry that matches the meta information of the first entry, divide the historical data source sample data according to each data attribute respectively, to obtain a data division result corresponding to each data attribute;
    a split point determination module, configured to calculate the information gain of each data division result respectively, and select the data attribute corresponding to the data division result with the largest information gain as the split point;
    a model training module, configured to divide the historical data source sample data into two sample data subsets according to the split point, recalculate the information gain of each data attribute for each sample data subset, continue to divide the sample data subsets according to maximum information gain until the sample data in a sample data subset all belong to the same class, and output the decision tree model obtained when the division is complete.
  10. The device according to claim 8, characterized in that the data source matching device further comprises:
    a model verification module, configured to, after the model training module outputs the decision tree model obtained when the division is complete, verify the accuracy of the decision tree model using prior data source sample data.
  11. The device according to claim 9, characterized in that the model matching module comprises:
    a matching score calculation module, configured to calculate, by means of the entry matching decision model, a matching score between each basic entry in the basic entry library and the meta information of the first entry;
    a basic entry selection module, configured to select, from the multiple matching scores, the basic entry with the highest score as the basic entry that matches the meta information of the first entry.
  12. The device according to claim 10, characterized in that the model matching module further comprises a weight calculation module, wherein
    the weight calculation module is configured to, when at least two basic entries have the highest score, obtain the weight corresponding to each piece of data attribute information, and calculate the total weight score of each of the at least two basic entries according to the weight corresponding to each piece of data attribute information, the total weight score being the sum of the products of each data attribute in a basic entry and the weight of the corresponding data attribute;
    the basic entry selection module is further configured to select, from the at least two total weight scores, the basic entry with the largest score as the basic entry that matches the meta information of the first entry.
  13. The device according to any one of claims 8 to 12, characterized in that the data source matching device further comprises a cluster module, wherein
    the cluster module is configured to, when no basic entry is matched for the first entry and the second entry by means of the entry matching decision model, perform cluster analysis on the first entry and the second entry;
    the data source aggregation module is further configured to, when the first entry and the second entry are divided into the same category, associate the first data source and the second data source with that same category.
  14. The device according to claim 13, characterized in that the cluster module is specifically configured to obtain the intersection-containing data source of the first entry and obtain the intersection-containing data source of the second entry, and, if the first entry and the second entry have the same intersection-containing data source, determine that the first entry and the second entry are divided into the same category.
  15. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 7.
CN201711159895.3A 2017-11-20 2017-11-20 Data source matching method and device Active CN107918657B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711159895.3A CN107918657B (en) 2017-11-20 2017-11-20 Data source matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711159895.3A CN107918657B (en) 2017-11-20 2017-11-20 Data source matching method and device

Publications (2)

Publication Number Publication Date
CN107918657A (en) 2018-04-17
CN107918657B (en) 2021-10-08

Family

ID=61897424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711159895.3A Active CN107918657B (en) 2017-11-20 2017-11-20 Data source matching method and device

Country Status (1)

Country Link
CN (1) CN107918657B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764348A (en) * 2018-05-30 2018-11-06 口口相传(北京)网络技术有限公司 Collecting method based on multiple data sources and system
CN109447276A (en) * 2018-09-17 2019-03-08 烽火通信科技股份有限公司 A kind of machine learning method, system, equipment and application method
CN110096504A (en) * 2019-03-29 2019-08-06 北京奇安信科技有限公司 Streaming events feature matching method and device
CN110929111A (en) * 2019-11-19 2020-03-27 支付宝(杭州)信息技术有限公司 Automatic generation method, device and equipment for matching pattern for matching private data
CN110942078A (en) * 2018-09-22 2020-03-31 北京微播视界科技有限公司 Method and device for aggregating point of interest data, media file server and storage medium
CN111241056A (en) * 2019-12-31 2020-06-05 国网浙江省电力有限公司电力科学研究院 Power energy consumption data storage optimization method based on decision tree model
CN111563545A (en) * 2020-04-27 2020-08-21 平安医疗健康管理股份有限公司 Code matching method and device for medical entity, computer equipment and storage medium
CN112348583A (en) * 2020-11-04 2021-02-09 贝壳技术有限公司 User preference generation method and generation system
CN112836087A (en) * 2021-01-26 2021-05-25 湖南快乐阳光互动娱乐传媒有限公司 Video attribute information acquisition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823907A (en) * 2014-03-19 2014-05-28 北京奇虎科技有限公司 Method, device and engine for integrating on-line video resource addresses
WO2015094311A1 (en) * 2013-12-20 2015-06-25 Thomson Licensing Quote and media search method and apparatus
US20150296228A1 (en) * 2014-04-14 2015-10-15 David Mo Chen Systems and Methods for Performing Multi-Modal Video Datastream Segmentation
CN106096748A (en) * 2016-04-28 2016-11-09 武汉宝钢华中贸易有限公司 Entrucking forecast model in man-hour based on cluster analysis and decision Tree algorithms
CN106127114A (en) * 2016-06-16 2016-11-16 北京数智源科技股份有限公司 Intelligent video analysis method
CN106484774A (en) * 2016-09-12 2017-03-08 北京歌华有线电视网络股份有限公司 A kind of correlating method of multisource video metadata and system
CN106886565A (en) * 2017-01-11 2017-06-23 北京众荟信息技术股份有限公司 A kind of basic house type auto-polymerization method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094311A1 (en) * 2013-12-20 2015-06-25 Thomson Licensing Quote and media search method and apparatus
CN103823907A (en) * 2014-03-19 2014-05-28 北京奇虎科技有限公司 Method, device and engine for integrating on-line video resource addresses
US20150296228A1 (en) * 2014-04-14 2015-10-15 David Mo Chen Systems and Methods for Performing Multi-Modal Video Datastream Segmentation
CN106096748A (en) * 2016-04-28 2016-11-09 武汉宝钢华中贸易有限公司 Entrucking forecast model in man-hour based on cluster analysis and decision Tree algorithms
CN106127114A (en) * 2016-06-16 2016-11-16 北京数智源科技股份有限公司 Intelligent video analysis method
CN106484774A (en) * 2016-09-12 2017-03-08 北京歌华有线电视网络股份有限公司 A kind of correlating method of multisource video metadata and system
CN106886565A (en) * 2017-01-11 2017-06-23 北京众荟信息技术股份有限公司 A kind of basic house type auto-polymerization method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAO, XIAOJIAN et al.: "Video recommendation over multiple information sources", 《MULTIMEDIA SYSTEM》 *
石燕志: "一种多源视频融合系统设计方法", 《中国安防》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764348A (en) * 2018-05-30 2018-11-06 口口相传(北京)网络技术有限公司 Collecting method based on multiple data sources and system
CN108764348B (en) * 2018-05-30 2020-07-10 口口相传(北京)网络技术有限公司 Data acquisition method and system based on multiple data sources
CN109447276A (en) * 2018-09-17 2019-03-08 烽火通信科技股份有限公司 A kind of machine learning method, system, equipment and application method
CN109447276B (en) * 2018-09-17 2021-11-02 烽火通信科技股份有限公司 Machine learning system, equipment and application method
CN110942078B (en) * 2018-09-22 2024-01-12 北京微播视界科技有限公司 Method, device, media file server and storage medium for aggregating interest point data
CN110942078A (en) * 2018-09-22 2020-03-31 北京微播视界科技有限公司 Method and device for aggregating point of interest data, media file server and storage medium
CN110096504A (en) * 2019-03-29 2019-08-06 北京奇安信科技有限公司 Streaming events feature matching method and device
CN110096504B (en) * 2019-03-29 2021-08-20 奇安信科技集团股份有限公司 Streaming event feature matching method and device
CN110929111A (en) * 2019-11-19 2020-03-27 支付宝(杭州)信息技术有限公司 Automatic generation method, device and equipment for matching pattern for matching private data
CN110929111B (en) * 2019-11-19 2023-03-31 支付宝(杭州)信息技术有限公司 Automatic generation method, device and equipment for matching pattern for matching private data
CN111241056B (en) * 2019-12-31 2024-03-01 国网浙江省电力有限公司营销服务中心 Power energy data storage optimization method based on decision tree model
CN111241056A (en) * 2019-12-31 2020-06-05 国网浙江省电力有限公司电力科学研究院 Power energy consumption data storage optimization method based on decision tree model
CN111563545A (en) * 2020-04-27 2020-08-21 平安医疗健康管理股份有限公司 Code matching method and device for medical entity, computer equipment and storage medium
CN112348583A (en) * 2020-11-04 2021-02-09 贝壳技术有限公司 User preference generation method and generation system
CN112348583B (en) * 2020-11-04 2022-12-06 贝壳技术有限公司 User preference generation method and generation system
CN112836087A (en) * 2021-01-26 2021-05-25 湖南快乐阳光互动娱乐传媒有限公司 Video attribute information acquisition method and device

Also Published As

Publication number Publication date
CN107918657B (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN107918657A (en) Data source matching method and device
CN102567464B (en) Based on the knowledge resource method for organizing of expansion thematic map
CN112667877A (en) Scenic spot recommendation method and equipment based on tourist knowledge map
CN108205766A (en) Information-pushing method, apparatus and system
CN110674312B (en) Method, device and medium for constructing knowledge graph and electronic equipment
CN110909182A (en) Multimedia resource searching method and device, computer equipment and storage medium
CN110019616A (en) A kind of POI trend of the times state acquiring method and its equipment, storage medium, server
CN114418035A (en) Decision tree model generation method and data recommendation method based on decision tree model
CN106557480A (en) Implementation method and device that inquiry is rewritten
CN108629358A (en) The prediction technique and device of object type
CN113010705B (en) Label prediction method, device, equipment and storage medium
CN110928986A (en) Legal evidence sorting and recommending method, device, equipment and storage medium
CN115712780A (en) Information pushing method and device based on cloud computing and big data
CN111680506A (en) External key mapping method and device of database table, electronic equipment and storage medium
CN103761286B (en) A kind of Service Source search method based on user interest
CN113326432A (en) Model optimization method based on decision tree and recommendation method
CN116362790A (en) Client type prediction method, client type prediction device, electronic device, medium and program product
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN112598405B (en) Business project data management method and system based on big data
CN116150470A (en) Content recommendation method, device, apparatus, storage medium and program product
CN107291722B (en) Descriptor classification method and device
CN108345620A (en) Brand message processing method, device, storage medium and electronic equipment
CN116823410A (en) Data processing method, object processing method, recommending method and computing device
CN104615605B (en) The method and apparatus of classification for prediction data object
CN116957128A (en) Service index prediction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant