CN108804544A - Internet video display multi-source data fusion method and device - Google Patents

Internet video display multi-source data fusion method and device

Info

Publication number
CN108804544A
CN108804544A (application CN201810475686.8A)
Authority
CN
China
Prior art keywords
attribute
entity
video display
data
similarity
Prior art date
Legal status
Pending
Application number
CN201810475686.8A
Other languages
Chinese (zh)
Inventor
张家栋
胡俊杰
宁伟
Current Assignee
Shenzhen Small Frog Data Technology Co Ltd
Original Assignee
Shenzhen Small Frog Data Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Small Frog Data Technology Co Ltd
Priority to CN201810475686.8A
Publication of CN108804544A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses an internet video display multi-source data fusion method and device. The method includes: obtaining data related to video display works from two or more internet video display platforms and preprocessing the data to obtain standardized entities; calculating attribute similarities between the entities of different internet video display platforms, where the attributes include basic attributes and multimedia attributes; calculating the weight of each attribute based on information entropy, and calculating the similarity of entities from different video display platforms based on the attributes and the weights; and grouping the entities based on their similarities, then merging the two or more entities within each group. The method takes every attribute of an entity into account, computes the similarity of multi-source entities from attribute similarities, and uses it as the key reference for fusing multi-source entities; the factors considered are more comprehensive, and the result is closer to the true situation and more reasonable.

Description

Internet video display multi-source data fusion method and device
Technical field
This application relates to the field of data fusion technology, and in particular to an internet video display multi-source data fusion method and device.
Background technology
With the rapid development of society and the economy, living standards have improved greatly. The film and television entertainment industry accounts for a growing share of daily life and of the national economy, and the number of practitioners keeps increasing. In particular, the rapid development of the mobile internet has produced a large number of internet video display platforms, for example iQIYI, Youku, Tencent Video, Douban, Maoyan, and Mtime. These platforms attract large numbers of users and have accumulated massive amounts of user-contributed data, including various descriptions of films, television works, and performers. Because movie data comes from many sources, the data formats differ, and the data of each platform is incomplete; moreover, the data of the platforms is mutually redundant yet complementary. The prior art does not fully fuse the information of multiple data sources to build a complete knowledge base for the video display industry.
Invention content
The application aims to overcome the above problems, or to at least partly solve or mitigate them.
According to one aspect of the application, an internet video display multi-source data fusion method is provided, including:
A data collection step: obtaining data related to video display works from two or more internet video display platforms, and preprocessing the data to obtain standardized entities;
An attribute similarity calculation step: calculating attribute similarities between the entities of different internet video display platforms, where the attributes include basic attributes and multimedia attributes;
An entity similarity calculation step: calculating the weight of each attribute based on information entropy, and calculating the similarity of entities from different video display platforms based on the attributes and the weights; and
An entity fusion step: grouping the entities based on their similarities, and merging the two or more entities within a group.
The method takes every attribute of an entity into account, computes the similarity of multi-source entities from attribute similarities, and uses it as the key reference for fusing multi-source entities; the factors considered are more comprehensive, and the result is closer to the true situation and more reasonable.
Optionally, the data collection step includes:
A data acquisition step: obtaining data related to video display works from two or more internet video display platforms; and a data preprocessing step, where the data preprocessing step includes:
A data cleansing step: cleansing the data; and
A data normalization step: replacing the attribute names or attribute values of the entity data of the two or more internet video display platforms with standardized attribute names or attribute values, to obtain the standardized entities.
Optionally, the attribute similarity is computed for one or a combination of the following attribute types: keyword attributes, set attributes, short string attributes, long text attributes, and image content attributes.
Optionally, the entity similarity calculation step includes:
Calculating the weight of each attribute based on information entropy, and calculating the similarity of different entities of the two or more internet video display platforms based on the attributes and the weights; if the maximum similarity between two entities exceeds a set threshold, the two entities are judged to be the same entity.
Optionally, the entity fusion step includes:
Grouping the entities based on their similarities, and merging the two or more entities within a group according to the attributes of the entities in each group, where the attributes include one or more of the following types: single-value attributes, set attributes, and additive attributes.
According to another aspect of the application, an internet video display multi-source data fusion device is also provided, including:
A data collection module, configured to obtain data related to video display works from two or more internet video display platforms and to preprocess the data to obtain standardized entities;
An attribute similarity calculation module, configured to calculate attribute similarities between the entities of different internet video display platforms, where the attributes include basic attributes and multimedia attributes;
An entity similarity calculation module, configured to calculate the weight of each attribute based on information entropy and to calculate the similarity of entities from different video display platforms based on the attributes and the weights; and
An entity fusion module, configured to group the entities based on their similarities and to merge the two or more entities within a group.
The device takes every attribute of an entity into account, computes the similarity of multi-source entities from attribute similarities, and uses it as the key reference for fusing multi-source entities; the factors considered are more comprehensive, and the result is closer to the true situation and more reasonable.
Optionally, the data collection module includes:
A data acquisition module, configured to obtain data related to video display works from two or more internet video display platforms; and a data preprocessing module,
where the data preprocessing module includes:
A data cleansing module, configured to cleanse the data; and
A data normalization module, configured to replace the attribute names or attribute values of the entity data of the two or more internet video display platforms with standardized attribute names or attribute values, to obtain the standardized entities.
Optionally, the entity similarity calculation module is configured to calculate the weight of each attribute based on information entropy and to calculate, based on the attributes and the weights, the similarity of different entities of the two or more internet video display platforms; if the maximum similarity between two entities exceeds a set threshold, the two entities are judged to be the same entity.
According to another aspect of the application, a computer device is also provided, including a memory, a processor, and a computer program stored in the memory and executable by the processor, where the processor implements the method described above when executing the computer program.
According to another aspect of the application, a computer-readable storage medium, preferably a non-volatile readable storage medium, is also provided, storing a computer program that implements the method described above when executed by a processor.
From the following detailed description of specific embodiments of the application with reference to the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the application.
Description of the drawings
Hereinafter, some specific embodiments of the application are described in detail, by way of example rather than limitation, with reference to the accompanying drawings. Identical reference numerals in the drawings denote identical or similar components or parts. Those skilled in the art should appreciate that the drawings are not necessarily drawn to scale. In the drawings:
Fig. 1 is a schematic flow chart of one embodiment of an internet video display multi-source data fusion method according to the application;
Fig. 2 is a schematic flow chart of the sequence-to-sequence neural network model;
Fig. 3 is a schematic block diagram of one embodiment of an internet video display multi-source data fusion device according to the application;
Fig. 4 is the block diagram of one embodiment of the computer equipment of the application;
Fig. 5 is the block diagram of one embodiment of the computer readable storage medium of the application.
Specific implementation mode
From the following detailed description of specific embodiments of the application with reference to the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the application.
An embodiment of the application provides an internet video display multi-source data fusion method. Fig. 1 is a schematic flow chart of one embodiment of this method according to the application. The method includes:
S100, a data collection step: obtaining data related to video display works from two or more internet video display platforms, and preprocessing the data to obtain standardized entities;
S200, an attribute similarity calculation step: calculating attribute similarities between the entities of different internet video display platforms, where the attributes include basic attributes and multimedia attributes;
S300, an entity similarity calculation step: calculating the weight of each attribute based on information entropy, and calculating the similarity of entities from different video display platforms based on the attributes and the weights; and
S400, an entity fusion step: grouping the entities based on their similarities, and merging the two or more entities within a group.
The method takes every attribute of an entity into account, computes the similarity of multi-source entities from attribute similarities, and uses it as the key reference for fusing multi-source entities; the factors considered are more comprehensive, and the result is closer to the true situation and more reasonable.
Optionally, the S100 data collection step includes:
A data acquisition step: obtaining data related to video display works from two or more internet video display platforms; and a data preprocessing step. The data preprocessing step includes a data cleansing step, in which the data is cleansed, and a data normalization step, in which the attribute names or attribute values of the entity data of the two or more internet video display platforms are replaced with standardized attribute names or attribute values, to obtain the standardized entities.
Here, the data acquisition step collects data on films, television works, performers, and the like from internet video display platforms via data acquisition techniques. Internet video display platforms include, but are not limited to, iQIYI, Youku, Tencent Video, Douban, Maoyan, Mtime, Baidu Baike, Taopiaopiao, Xinpianchang, Egg Tart Data, and so on. Because these platforms put new works online every day, add new performer information, or update the information of existing works and performers, this step fetches data from each platform on a schedule and can obtain a large amount of data related to video display works. Optionally, this step may use incremental crawling to acquire the latest data. With this step, the collected movie database can be updated continuously.
In the data cleansing step: because the entity data on internet video display platforms is usually written by users, it contains much noise, so data cleansing is required. For example: unify the character encoding to the international standard UTF-8; convert traditional Chinese characters to simplified Chinese; and remove useless characters, such as HTML and other format escape sequences, special characters such as emoticons, and uninformative attribute values such as "unknown".
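As a concrete illustration of the cleansing rules above, here is a minimal Python sketch. The placeholder set and the function name are illustrative assumptions, not taken from the patent:

```python
import html
import re

# Hypothetical set of uninformative placeholder values; the patent's
# examples include "unknown" and similar uncertainty markers.
NOISE_VALUES = {"unknown", "n/a", "未知"}

def clean_value(raw):
    """Normalize one attribute value: decode HTML escape sequences, strip
    control and zero-width characters, trim whitespace, and map
    uninformative placeholders to None."""
    if raw is None:
        return None
    text = html.unescape(raw)                          # e.g. "&amp;" -> "&"
    text = re.sub(r"[\u0000-\u001f\u200b]", "", text)  # control/zero-width
    text = text.strip()
    if not text or text.lower() in NOISE_VALUES:
        return None
    return text
```

Encoding unification and traditional-to-simplified conversion would happen upstream of such a function and typically require dedicated libraries.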
In the data normalization step: the attribute names of the entity data differ across internet video display platforms, with phenomena such as one name carrying several meanings and several names sharing one meaning. For example, different platforms use "title" and "name" respectively, but both in fact denote the title of a work or the name of a performer. This step therefore defines standardized entity attribute names and replaces each platform's attribute names with the standardized ones, ensuring a one-to-one mapping between names and meanings. Further, the attribute values of the entity data are standardized. For example, some platforms label the gender attribute with "男" and "女" while others use "male" and "female"; gender values are unified into English keywords to meet internationalization requirements. Date values are likewise standardized: the release dates of works and the birth dates of performers are unified into the date format YYYY-MM-DD, with unknown parts denoted by "x".
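The YYYY-MM-DD convention with "x" for unknown parts can be sketched as follows. This is a best-effort parser covering a few common input shapes; the patent does not specify which raw formats occur, so the regular expression is an assumption:

```python
import re

def normalize_date(raw):
    """Coerce assorted date strings (e.g. '2018-05-03', '2018年5月3日',
    '1994') to YYYY-MM-DD, writing 'x' characters for unknown parts."""
    m = re.match(r"(\d{4})[-/年.]?(\d{1,2})?[-/月.]?(\d{1,2})?", raw or "")
    if not m:
        return "xxxx-xx-xx"
    year, month, day = m.group(1), m.group(2), m.group(3)
    return "-".join([year,
                     month.zfill(2) if month else "xx",
                     day.zfill(2) if day else "xx"])
```

A fixed-width, lexicographically sortable format like this also makes the later string-similarity comparison of dates better behaved.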
In the S200 attribute similarity calculation step, optionally, the attribute similarity is computed for one or a combination of the following attribute types: keyword attributes, set attributes, short string attributes, long text attributes, and image content attributes.
1. Keyword attributes: if two keyword attribute values are identical, the similarity is recorded as 1, otherwise as 0. For example, a keyword attribute can be a performer's name or gender.
2. Set attributes: similarity is computed by comparing how many elements two sets share; for example, the director and cast of a work are set attributes. Specifically, given two sets A and B, three ways to compute the similarity sim(A, B) are:
sim(A, B) = |A ∩ B| / |A ∪ B|, sim(A, B) = 2|A ∩ B| / (|A| + |B|), or sim(A, B) = |A ∩ B| / min(|A|, |B|),
where ∩ denotes set intersection, ∪ denotes set union, and |·| denotes the number of elements of a set.
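The patent's formula images are not reproduced in this text; the Jaccard, Dice, and overlap coefficients below are standard set-similarity measures consistent with the operators the surrounding text names (intersection, union, cardinality), offered as a plausible reading rather than the patent's exact definitions:

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """2|A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0
```

For cast lists, the overlap coefficient is forgiving when one platform lists only the leading actors while another lists the full cast.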
3. Short string attributes: similarity is determined from the edit distance between two strings; examples include work titles, release dates, and performers' birth dates. Specifically, given two strings s1 and s2 and their edit distance EditDistance(s1, s2), the similarity sim(s1, s2) is defined as
sim(s1, s2) = 1 - EditDistance(s1, s2) / max(length(s1), length(s2)),
where length denotes the length of a string.
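A self-contained sketch of this similarity, using the classic Levenshtein distance (the normalization by the longer string's length follows the definition above):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(s1), len(s2)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,                        # deletion
                         row[j - 1] + 1,                    # insertion
                         prev + (s1[i - 1] != s2[j - 1]))   # substitution
            prev = cur
    return row[n]

def string_similarity(s1, s2):
    """sim(s1, s2) = 1 - EditDistance(s1, s2) / max(len(s1), len(s2))."""
    longest = max(len(s1), len(s2))
    return 1.0 - edit_distance(s1, s2) / longest if longest else 1.0
```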
4. Long text attributes: long texts include, for example, the synopsis descriptions of works and performers. Because long text contains complex natural-language structure, its similarity cannot be computed by simple comparison; for this reason, existing movie-data integration techniques ignore long text attributes. The present invention uses deep learning to capture the semantics of long text and then computes similarity. Specifically, the similarity computation for long text attributes takes three steps:
(4.1) Learning word semantics: the invention uses a word embedding model to learn a vector representation of each word that expresses its implicit semantics. Common word embedding models include the continuous bag-of-words (CBOW) model and the Skip-Gram model. Both learn word semantics from co-occurrence relations, i.e., from which words appear together; the difference is that CBOW estimates the probability of a word given its context, while Skip-Gram computes the probability of the context given the word. Vector representations of words make it easy to compute relations between words, such as word similarity or word distance, and are the basis for learning the semantics of longer text (paragraphs, articles).
Besides the neural-network-based CBOW and Skip-Gram, other word embedding schemes can serve the same purpose, including principal component analysis (PCA), matrix factorization, and latent semantic analysis (LSA).
(4.2) Learning long-text semantics: the application learns the semantics of text with a sequence-to-sequence (Seq2Seq) neural network model.
Fig. 2 is a schematic flow chart of the sequence-to-sequence neural network model. The Seq2Seq model suits tasks that convert an input sequence X into an output sequence Y and is widely used in natural language processing, for example machine translation and text summarization. The model has two main stages: the encoder learns the semantics of the input sequence X and produces a final state; the decoder uses the encoder's final state to predict the output sequence Y. The encoder and decoder are each a recurrent neural network (RNN).
In this application, the input and the output of the Seq2Seq model are the same word sequence of a text; note that each word has already gone through semantic learning and been converted into a vector w_t. The core cell of the RNN takes two inputs, the previous state h_{t-1} and the current word vector w_t, and produces the current state h_t through internal linear transformations and an activation function. The encoding stage learns the semantics and structure of the long text and yields a final state h_n, which serves as the vectorized representation of the input text, since the vector h_n can reproduce the input text in the decoding stage. It is worth noting that decoding is needed only during the training stage of the Seq2Seq model, to learn the model parameters; once training is complete, real applications only need the encoder: given a long text sequence X, compute its semantic vector representation h_n.
The core cell of the RNN in this application can be a plain neural network unit, a long short-term memory (LSTM) unit, a gated recurrent unit (GRU), or the like. The advantage of LSTM and GRU is that, while training the RNN, they can learn both short- and long-range dependencies of the text structure while avoiding problems such as exploding or vanishing gradients. Common activation functions include tanh, sigmoid, relu, and maxout.
The neural network model can also be realized with the various variants of the Seq2Seq model, including attention-based models; and the recurrent networks of the encoding and decoding stages can use any RNN variant, such as bidirectional RNNs, multi-layer RNNs, and RNNs with regularization.
(4.3) Computing long-text similarity: given the semantic vectors h1 and h2 of two texts, the application uses the cosine function to compute their similarity sim(h1, h2), i.e.,
sim(h1, h2) = (h1 · h2) / (‖h1‖ ‖h2‖),
where ‖·‖ denotes the norm of a vector.
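The cosine similarity over semantic vectors, as a plain-Python sketch:

```python
import math

def cosine_similarity(h1, h2):
    """sim(h1, h2) = (h1 · h2) / (||h1|| * ||h2||); returns 0.0 when
    either vector has zero norm."""
    dot = sum(x * y for x, y in zip(h1, h2))
    n1 = math.sqrt(sum(x * x for x in h1))
    n2 = math.sqrt(sum(x * x for x in h2))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

In practice the vectors would come from the trained Seq2Seq encoder; any fixed-length real vectors work with this function.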
5. Image content attributes: for image content such as work posters, performer stills, and avatars, the application computes picture similarity with the perceptual hash algorithm (pHash), in the following steps:
(5.1) Reduce the picture size: down-sample the picture to 32×32 pixels to improve computational efficiency.
(5.2) Simplify color: convert the picture to a grayscale image to further reduce computation.
(5.3) Compute the DCT: compute the discrete cosine transform (DCT) of the picture, obtaining a 32×32 DCT coefficient matrix. The DCT is an image compression transform that maps an image from the pixel domain to the frequency domain.
(5.4) Reduce the DCT: keep the 8×8 matrix in the upper-left corner of the 32×32 matrix, which represents the lowest frequencies of the picture.
(5.5) Compute the mean: compute the mean of the 8×8 DCT coefficients.
(5.6) Binarize the DCT: from the 8×8 DCT matrix, set each of the 64 hash bits to "1" if the coefficient is greater than or equal to the DCT mean and to "0" otherwise. Although the binarized result no longer represents the true low frequencies, it roughly captures each coefficient's proportion relative to the mean. As long as the overall structure of the picture is unchanged, the resulting hash is unchanged, which makes the fingerprint robust to gamma correction or color histogram adjustments.
(5.7) Build the hash vector: arrange the binarized 8×8 DCT matrix into a 64-bit integer vector, forming the fingerprint of the picture. The ordering is unimportant as long as all pictures use the same order.
(5.8) Compute picture similarity: given the pHash vectors v1 and v2 of two pictures and their Hamming distance HammingDistance(v1, v2), the similarity sim(v1, v2) is defined as
sim(v1, v2) = 1 - HammingDistance(v1, v2) / length(v1),
where length denotes the length of a vector.
In picture similarity computation, the perceptual hash algorithm (pHash) can be replaced by other algorithms, for example the average hash algorithm (aHash), the difference hash algorithm (dHash), or the wavelet hash algorithm (wHash). The reduced picture size also need not be 8×8; other values such as 4×4 or 16×16 may be used.
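To make the fingerprinting concrete without image-library or DCT dependencies, here is a sketch of the simpler average hash (aHash), mentioned above as a substitute for pHash, together with the Hamming-distance similarity of step (5.8). The input is assumed to be an already down-sampled grayscale image given as a 2-D list of pixel values; real use would add the resize and grayscale steps:

```python
def average_hash(gray):
    """aHash fingerprint: each bit is 1 if the pixel is >= the mean gray
    level, else 0 (analogous to the DCT-mean binarization of pHash)."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p >= mean else 0 for p in pixels]

def hamming_similarity(v1, v2):
    """sim(v1, v2) = 1 - HammingDistance(v1, v2) / length(v1)."""
    distance = sum(b1 != b2 for b1, b2 in zip(v1, v2))
    return 1.0 - distance / len(v1)
```

aHash thresholds raw pixels against their mean, so it is cheaper but less robust to brightness changes than pHash, which thresholds low-frequency DCT coefficients.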
Optionally, the S300 entity similarity calculation step includes:
Calculating the weight of each attribute based on information entropy, and calculating the similarity of different entities of the two or more internet video display platforms based on the attributes and the weights; if the maximum similarity between two entities exceeds a set threshold, the two entities are judged to be the same entity.
To compute the similarity of multi-source entities, the application derives attribute weights from information entropy, because the entropy of an attribute is a principled measure of how much information the attribute carries, i.e., of how well the attribute can identify or distinguish the corresponding entities. Given an entity attribute X, first estimate the probability mass function of X, i.e., the probability of each of its values. The values of attribute X are assumed to be discrete; continuous attributes can be discretized. Assuming the probability that X takes its i-th value is p_i, the information entropy H(X) of attribute X is defined as
H(X) = - Σ_{i=1..N} p_i log(p_i),
where N is the number of distinct values of attribute X.
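The attribute entropy can be estimated directly from the observed values. A sketch follows; the base-2 logarithm is an assumption, since the patent does not fix the base:

```python
import math
from collections import Counter

def attribute_entropy(values):
    """H(X) = -sum_i p_i * log2(p_i), with p_i estimated as the relative
    frequency of the i-th distinct value among the observations."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

A constant attribute (e.g. every record has the same country of production) gets entropy 0 and therefore, per the weighting below, contributes nothing to distinguishing entities.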
Based on the attribute entropies, the weight w_j of attribute X_j is defined as
w_j = H(X_j) / Σ_{k=1..M} H(X_k),
where M is the total number of attributes participating in the video display entity fusion.
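Normalizing the entropies into weights, following the definition above (function and parameter names are illustrative):

```python
def entropy_weights(entropies):
    """w_j = H(X_j) / sum_k H(X_k): attributes carrying more information
    receive proportionally larger weights; weights sum to 1."""
    total = sum(entropies)
    if total == 0:           # degenerate case: every attribute is constant
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]
```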
Further, given two entities E1 and E2, and denoting by s_j their similarity on attribute X_j as obtained from the attribute similarity calculations above, the entity similarity sim(E1, E2) is defined as
sim(E1, E2) = Σ_{j=1..M} w_j s_j.
With this entity similarity measure, the most similar entity of each entity can be found; if the maximum similarity exceeds a given threshold, the two entities are judged to be the same, otherwise they are different entities. The similarity threshold can be set according to the actual situation to achieve the best effect; in this embodiment, the similarity threshold is set to 0.8.
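Putting the weights and per-attribute similarities together, with the 0.8 threshold from this embodiment as the default:

```python
def entity_similarity(weights, attr_sims):
    """sim(E1, E2) = sum_j w_j * s_j over the participating attributes."""
    return sum(w * s for w, s in zip(weights, attr_sims))

def is_same_entity(weights, attr_sims, threshold=0.8):
    """Judge two entities identical when their weighted similarity
    exceeds the threshold (0.8 in the embodiment above)."""
    return entity_similarity(weights, attr_sims) > threshold
```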
Besides information entropy, other information-theoretic techniques can be substituted, for example computing attribute weights based on information gain.
Optionally, the S400 entity fusion step includes:
Grouping the entities based on their similarities, and merging the two or more entities within a group according to the attributes of the entities in each group, where the attributes include one or more of the following types: single-value attributes, set attributes, and additive attributes.
The multi-source entities are grouped according to the results of step S300. If a group contains only one entity, that entity is added to the fusion results directly. If a group contains multiple entities, they need to be merged, i.e., entity merging. When multiple entities carry multiple values for some attribute, the values are reconciled with the following rules:
(4.1) Single-value attributes, such as a work's title or a performer's name: take the majority value first; if that does not decide, take the value from the highest-priority data source.
(4.2) Set attributes, such as a work's director and cast, or a performer's aliases: take the union of all attribute values.
(4.3) Additive attributes, such as a work's like count or a performer's follower count: sum all attribute values.
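The three merge rules can be sketched as one function. The field names, the "source" key, and the source-priority list are illustrative assumptions, not taken from the patent:

```python
from collections import Counter

def merge_entities(entities, single_attrs, set_attrs, additive_attrs,
                   source_priority):
    """Merge duplicate entity records per the three rules above:
    majority vote (then data-source priority) for single-value
    attributes, union for set attributes, sum for additive attributes."""
    merged = {}
    for attr in single_attrs:
        values = [e[attr] for e in entities if e.get(attr)]
        top = Counter(values).most_common()
        if top and (len(top) == 1 or top[0][1] > top[1][1]):
            merged[attr] = top[0][0]          # clear majority value
        else:                                  # tie: fall back to priority
            for src in source_priority:
                for e in entities:
                    if e.get("source") == src and e.get(attr):
                        merged[attr] = e[attr]
                        break
                if attr in merged:
                    break
    for attr in set_attrs:
        merged[attr] = set().union(*(e.get(attr, set()) for e in entities))
    for attr in additive_attrs:
        merged[attr] = sum(e.get(attr, 0) for e in entities)
    return merged
```

Usage would iterate over each similarity group from step S400 and replace its members with the single merged record.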
The method of the application takes every attribute of an entity into account. To fuse the works and performers of multi-source movie data, the invention uses not only the basic attributes of entities, for example a work's title and aliases, production year, release date, country/region of production, genre, director, cast, and screenwriter, and a performer's name and aliases, birth date, and occupation, but also the multimedia attributes of entities, for example the synopsis descriptions of works and performers, and image content such as work posters and performer stills.
The application compares the multimedia attributes of entities using deep learning and image processing techniques: for the long texts and image content of works and performers, the invention uses deep learning and image processing to compute multi-source entity similarity and uses it as a key reference for fusing multi-source entities. This both compensates for the shortcomings of the basic attributes and broadens the applicable scope of the invention to complex applications with long text and multimedia data.
The application computes attribute weights from information entropy: the invention defines an attribute's weight via its information entropy, i.e., the weight is proportional to the attribute's entropy, because entropy is a principled measure of an attribute's information content and thus of how well the attribute can identify or distinguish the corresponding entities. Information entropy is widely used in data processing and data mining; for example, decision tree learning algorithms choose split attributes according to attribute entropy (or information gain) when generating decision tree branches.
An embodiment of the application further provides an internet video display multi-source data fusion device. Fig. 3 is a schematic block diagram of one embodiment of such a device according to the application. The device includes:
A data collection module 100, configured to obtain data related to video display works from two or more internet video display platforms and to preprocess the data to obtain standardized entities;
An attribute similarity calculation module 200, configured to calculate attribute similarities between the entities of different internet video display platforms, where the attributes include basic attributes and multimedia attributes;
An entity similarity calculation module 300, configured to calculate the weight of each attribute based on information entropy and to calculate, based on the attributes and the weights, the similarity of entities from different video display platforms; and
An entity fusion module 400, configured to group the entities based on their similarities and to merge the two or more entities within a group.
The device takes every attribute of an entity into account, computes the similarity of multi-source entities from attribute similarities, and uses it as the key reference for fusing multi-source entities; the factors considered are more comprehensive, and the result is closer to the true situation and more reasonable.
Optionally, the data collection module 100 includes:
A data acquisition module, configured to obtain data related to video display works from two or more internet video display platforms; and a data preprocessing module,
where the data preprocessing module includes:
A data cleansing module, configured to cleanse the data; and
A data normalization module, configured to replace the attribute names or attribute values of the entity data of the two or more internet video display platforms with standardized attribute names or attribute values, to obtain the standardized entities.
Optionally, the attribute similarity is computed for one or a combination of the following attribute types: keyword attributes, set attributes, short string attributes, long text attributes, and image content attributes.
Optionally, the entity similarity calculation module is configured to calculate the weights of the attributes based on information entropy, to calculate similarities between different entities of the two or more internet video display platforms based on the attributes and the weights, and, if the maximum similarity between two entities exceeds a set threshold, to judge that the two entities are the same entity.
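A minimal sketch of the entropy-based weighting and thresholded matching described above, assuming normalized Shannon entropy over attribute values as the weight (the patent does not fix the exact formula or the threshold value):

```python
import math
from collections import Counter

def entropy_weights(entities, attrs):
    """Weight each attribute by the Shannon entropy of its values, normalized to sum to 1."""
    weights = {}
    for attr in attrs:
        counts = Counter(e[attr] for e in entities if e.get(attr) is not None)
        total = sum(counts.values())
        weights[attr] = -sum((c / total) * math.log2(c / total)
                             for c in counts.values()) if total else 0.0
    s = sum(weights.values())
    return {a: (w / s if s else 1.0 / len(attrs)) for a, w in weights.items()}

def entity_similarity(e1, e2, attr_sims, weights):
    # attr_sims maps each attribute to a similarity function over its values.
    return sum(w * attr_sims[a](e1[a], e2[a]) for a, w in weights.items())

THRESHOLD = 0.8  # illustrative value; the patent only requires "a set threshold"

def same_entity(e1, e2, attr_sims, weights):
    return entity_similarity(e1, e2, attr_sims, weights) > THRESHOLD
```

An attribute that takes the same value for every entity carries no discriminating information, gets zero entropy, and therefore contributes nothing to the weighted similarity.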
Optionally, the entity fusion module 400 is configured to group the entities based on the similarities between them, and to merge two or more entities located in the same group according to the entity attributes within each group, wherein the attributes include one or more of the following: single-value attributes, set attributes, and cumulative attributes.
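The three merge rules named above can be sketched per attribute kind; which attribute falls into which kind, and the choice of "first non-empty value wins" for single-value attributes, are assumptions of this illustration:

```python
# Hypothetical partition of attributes into the three kinds named above.
SINGLE_VALUE = {"title", "year"}    # exactly one value survives the merge
SET_VALUED = {"actors", "genres"}   # values are unioned across sources
CUMULATIVE = {"play_count"}         # values are summed across sources

def merge_group(group):
    """Merge a group of entities judged to be the same work into one entity."""
    merged = {}
    for attr in SINGLE_VALUE:
        merged[attr] = next((e[attr] for e in group if e.get(attr)), None)
    for attr in SET_VALUED:
        merged[attr] = set().union(*(e.get(attr, set()) for e in group))
    for attr in CUMULATIVE:
        merged[attr] = sum(e.get(attr, 0) for e in group)
    return merged
```

A cumulative attribute such as a play count is summed because each platform reports only its own share of the total.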
An embodiment of the present application also provides a computing device. Referring to Fig. 4, the computing device includes a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and runnable by the processor 1110. The computer program is stored in a space 1130 for program code in the memory 1120, and the program code 1131, when executed by the processor 1110, performs any one of the steps of the method according to the present application.
An embodiment of the present application also provides a computer-readable storage medium. Referring to Fig. 5, the computer-readable storage medium includes a storage unit for program code, the storage unit being provided with a program 1131' for performing the steps of the method according to the present application, the program being executed by a processor.
An embodiment of the present application also provides a computer program product including instructions. When the computer program product runs on a computer, it causes the computer to perform the method steps according to the present application.
An embodiment of the present application also provides a computer program product including instructions. When the computer program product runs on a computer, it causes the computer to perform any one of the above methods.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When a computer loads and executes the computer instructions, the flows or functions described in the embodiments of the present application are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another web site, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that the computer can access, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid-state disk (SSD)), or the like.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be realized with electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered to exceed the scope of the present application.
One of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be completed by a program instructing a processor. The program may be stored in a computer-readable storage medium, the storage medium being a non-transitory medium such as a random access memory, read-only memory, flash memory, hard disk, solid-state disk, magnetic tape, floppy disk, optical disc, or any combination thereof.
The above are only preferred specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement that can readily occur to any person skilled in the art within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the scope of the claims.

Claims (10)

1. An internet video display multi-source data fusion method, comprising:
a data collection step: obtaining data related to video display works from two or more internet video display platforms, and obtaining standardized entities after preprocessing the data;
an attribute similarity calculation step: calculating attribute similarities between the entities of different internet video display platforms, wherein the attributes include basic attributes and multimedia attributes;
an entity similarity calculation step: calculating weights of the attributes based on information entropy, and calculating similarities between the entities of different video display platforms based on the attributes and the weights; and
an entity fusion step: grouping the entities based on the similarities between them, and merging two or more entities located in the same group.
2. The method according to claim 1, wherein the data collection step includes:
a data acquisition step: obtaining data related to video display works from two or more internet video display platforms; and
a data preprocessing step, wherein the data preprocessing step includes:
a data cleaning step: cleaning the data; and
a data standardization step: replacing the attribute names or attribute values of the entity data of the two or more internet video display platforms with standardized attribute names or attribute values, so as to obtain the standardized entities.
3. The method according to claim 1, wherein the attribute similarity covers one of the following attribute types, or a combination thereof: keyword attributes, set attributes, short string attributes, long text attributes, and image content attributes.
4. The method according to claim 1, wherein the entity similarity calculation step includes:
calculating weights of the attributes based on information entropy, calculating similarities between different entities of the two or more internet video display platforms based on the attributes and the weights, and, if the maximum similarity between two entities exceeds a set threshold, judging that the two entities are the same entity.
5. The method according to any one of claims 1 to 4, wherein the entity fusion step includes:
grouping the entities based on the similarities between them, and merging two or more entities located in the same group according to the entity attributes within each group, wherein the attributes include one or more of the following: single-value attributes, set attributes, and cumulative attributes.
6. An internet video display multi-source data fusion device, comprising:
a data collection module, configured to obtain data related to video display works from two or more internet video display platforms, and to obtain standardized entities after preprocessing the data;
an attribute similarity calculation module, configured to calculate attribute similarities between the entities of different internet video display platforms, wherein the attributes include basic attributes and multimedia attributes;
an entity similarity calculation module, configured to calculate weights of the attributes based on information entropy, and to calculate similarities between the entities of different video display platforms based on the attributes and the weights; and
an entity fusion module, configured to group the entities based on the similarities between them, and to merge two or more entities located in the same group.
7. The device according to claim 6, wherein the data collection module includes:
a data acquisition module, configured to obtain data related to video display works from two or more internet video display platforms; and a data preprocessing module,
wherein the data preprocessing module includes:
a data cleaning module, configured to clean the data; and
a data standardization module, configured to replace the attribute names or attribute values of the entity data of the two or more internet video display platforms with standardized attribute names or attribute values, so as to obtain the standardized entities.
8. The device according to claim 6 or 7, wherein the entity similarity calculation module is configured to calculate weights of the attributes based on information entropy, to calculate similarities between different entities of the two or more internet video display platforms based on the attributes and the weights, and, if the maximum similarity between two entities exceeds a set threshold, to judge that the two entities are the same entity.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable by the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 5.
10. A computer-readable storage medium, preferably a non-volatile readable storage medium, storing a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 5.
CN201810475686.8A 2018-05-17 2018-05-17 Internet video display multi-source data fusion method and device Pending CN108804544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810475686.8A CN108804544A (en) 2018-05-17 2018-05-17 Internet video display multi-source data fusion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810475686.8A CN108804544A (en) 2018-05-17 2018-05-17 Internet video display multi-source data fusion method and device

Publications (1)

Publication Number Publication Date
CN108804544A true CN108804544A (en) 2018-11-13

Family

ID=64092522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810475686.8A Pending CN108804544A (en) 2018-05-17 2018-05-17 Internet video display multi-source data fusion method and device

Country Status (1)

Country Link
CN (1) CN108804544A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070255755A1 (en) * 2006-05-01 2007-11-01 Yahoo! Inc. Video search engine using joint categorization of video clips and queries based on multiple modalities
CN102930063A (en) * 2012-12-05 2013-02-13 电子科技大学 Feature item selection and weight calculation based text classification method
CN103995804A (en) * 2013-05-20 2014-08-20 中国科学院计算技术研究所 Cross-media topic detection method and device based on multimodal information fusion and graph clustering
CN104484459A (en) * 2014-12-29 2015-04-01 北京奇虎科技有限公司 Method and device for combining entities in knowledge map
CN104834686A (en) * 2015-04-17 2015-08-12 中国科学院信息工程研究所 Video recommendation method based on hybrid semantic matrix


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZOU XINGYU: "Research on Violence Video Recognition Technology Based on Multimodal Fusion" (基于多模态融合的暴力视频识别技术研究), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Section *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740012A (en) * 2018-12-14 2019-05-10 南京理工大学 The method that understanding and question and answer are carried out to image, semantic based on deep neural network
CN109740012B (en) * 2018-12-14 2023-08-25 南京理工大学 Method for understanding and asking and answering image semantics based on deep neural network
CN110222200A (en) * 2019-06-20 2019-09-10 京东方科技集团股份有限公司 Method and apparatus for entity fusion
CN112579770A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Knowledge graph generation method, device, storage medium and equipment
CN111563192A (en) * 2020-04-28 2020-08-21 腾讯科技(深圳)有限公司 Entity alignment method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111291212B (en) Zero sample sketch image retrieval method and system based on graph convolution neural network
JP7195365B2 (en) A Method for Training Convolutional Neural Networks for Image Recognition Using Image Conditional Mask Language Modeling
CN106980683B (en) Blog text abstract generating method based on deep learning
US11847113B2 (en) Method and system for supporting inductive reasoning queries over multi-modal data from relational databases
CN113641820B (en) Visual angle level text emotion classification method and system based on graph convolution neural network
JP7360497B2 (en) Cross-modal feature extraction method, extraction device, and program
CN109918671A (en) Electronic health record entity relation extraction method based on convolution loop neural network
CN108804544A (en) Internet video display multi-source data fusion method and device
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
Yao et al. A novel sentence similarity model with word embedding based on convolutional neural network
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112183747A (en) Neural network training method, neural network compression method and related equipment
CN111858940B (en) Multi-head attention-based legal case similarity calculation method and system
CN112163099A (en) Text recognition method and device based on knowledge graph, storage medium and server
CN113094509A (en) Text information extraction method, system, device and medium
CN108268629B (en) Image description method and device based on keywords, equipment and medium
CN111522979B (en) Picture sorting recommendation method and device, electronic equipment and storage medium
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN114547230A (en) Intelligent administrative law enforcement case information extraction and case law identification method
CN114358020A (en) Disease part identification method and device, electronic device and storage medium
Nam et al. A survey on multimodal bidirectional machine learning translation of image and natural language processing
CN116863116A (en) Image recognition method, device, equipment and medium based on artificial intelligence
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN113742494B (en) Domain text similarity calculation method and system based on label graph conversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200515

Address after: 510000 room 2803-a27, No. 140-148, TIYU East Road, Tianhe District, Guangzhou City, Guangdong Province (only for office use)

Applicant after: GUANGZHOU JIANYIXUN INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 518000 Hua Fuyang four, E08, 1 Guangdong Road, Nanshan District, Guangdong Province.

Applicant before: SHENZHEN XIAOWA DATA TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20181113
