Summary of the invention
In view of the deficiencies in the prior art, the purpose of the present invention is to provide a wrong-source checking method based on video text information. When video playback links are crawled, the crawled videos are checked for errors on the basis of text information, which solves the "same name, different source" problem that arises in TV internet video aggregation applications.
To achieve the above purpose, the technical solution adopted by the present invention is as follows:
A wrong-source checking method based on video text information, characterized by comprising the following steps:
Step 1: search each video website by program name, and determine whether each video website has a video source corresponding to the program name;
Step 2: crawl video playback links from the video websites that have the video source; crawling is performed periodically, and at least the following content is crawled:
the video playback link corresponding to the video source,
the text information used to identify the video content of the video source;
Step 3: store the crawl results of step 2 to form a crawl history;
Step 4: analyze the text information in the crawl history, find the wrong sources, and delete the crawl results corresponding to the wrong sources from the crawl history accordingly;
Step 5: based on the wrong-source-free crawl history finally formed in step 4, aggregate the results on the television side as agreed, present them in the form of programs, and enable playback.
On the basis of the above technical solution, in step 1 the program name includes but is not limited to a TV series title or a movie title.
On the basis of the above technical solution, step 1 is implemented by a shell script.
On the basis of the above technical solution, in step 2, periodic crawling means the following:
the update cycle and the priority of each program are preset;
the higher the priority of a program, the shorter its update cycle;
the lower the priority of a program, the longer its update cycle;
the update cycle of a program is selected from the range of 1 hour to 1 week;
the crawl operation is carried out periodically according to the update cycle of the program.
On the basis of the above technical solution, in step 2 the text information used to identify the video content of the video source includes at least: director, lead actors, year, program category, region, number of episodes, episode duration, alias and synopsis.
On the basis of the above technical solution, in step 3 the crawl history is formed and stored in the form of a metadata list;
the metadata list contains at least one metadata record;
each metadata record stores at least the following content: the program name, and the text information used to identify the video content of the video source.
On the basis of the above technical solution, in step 3 the crawl history is formed by the following specific steps:
judge whether the program name of the newly crawled video source already exists in the crawl history;
if it does not exist, create a metadata record to store the newly crawled video source;
if it exists, create a new metadata record to store the newly crawled video source, and execute step 4 after the crawl operation has completed.
On the basis of the above technical solution, the specific steps of step 4 are as follows:
perform similarity matching on the metadata records in the crawl history that have the same program name:
for two metadata records with the same program name, the text information used to identify the video content of the video source is matched for similarity item by item;
the results of the item-by-item similarity matching are combined;
if the similarity meets or exceeds the criterion, the newly crawled video source and the video source stored in the earlier metadata record are considered to be the same program;
the text information elements in the metadata record are then completed from the text information elements of the newly crawled video source;
if the similarity does not reach the criterion, the newly crawled video source is considered a wrong source and is excluded, and the newly crawled video source is treated as a new program.
On the basis of the above technical solution, the similarity criterion is as follows:
each text information element is regarded as one parameter of a vector;
the similarity judgment compares the parameters of the vectors one by one to obtain a similar value for each, and the similar values are then added to obtain the similarity of the text information used to identify the video content of the video source;
the similar values and the similarity are normalized, and the resulting similarity of the text information used to identify the video content of the video source is expressed as a percentage.
With the wrong-source checking method based on video text information of the present invention, crawled videos are checked for errors on the basis of text information when video playback links are crawled, which solves the "same name, different source" problem that arises in TV internet video aggregation applications.
Specific embodiment
The invention is described in further detail below with reference to the accompanying drawings.
As shown in Figures 1 to 3, the wrong-source checking method based on video text information of the present invention comprises the following steps:
Step 1: search each video website by program name (also called the title), and determine whether each video website has a video source corresponding to the program name;
Step 2: crawl video playback links from the video websites that have the video source; crawling is performed periodically, and at least the following content is crawled:
the video playback link corresponding to the video source,
the text information used to identify the video content of the video source;
Step 3: store the crawl results of step 2 to form a crawl history (hereinafter the history);
Step 4: analyze the text information in the crawl history, find the wrong sources, and delete the crawl results corresponding to the wrong sources from the crawl history accordingly;
Step 5: based on the wrong-source-free crawl history finally formed in step 4, aggregate the results on the television side as agreed, present them in the form of programs, and enable playback.
On the basis of the above technical solution, in step 1 the program name includes but is not limited to a TV series title or a movie title.
Further, the program name may be one or several keywords from the TV series title or movie title.
Further, the program name may be in Simplified Chinese, Traditional Chinese, Korean, Japanese or English.
On the basis of the above technical solution, step 1 is implemented by a shell script, wherein:
the shell script contains a default video website list, and searching each video website means searching the websites in the default list one by one;
the default video website list is stored in the shell script;
and/or: the shell script contains a user-defined video website list, and searching each video website means searching the websites in the user-defined list one by one;
the user-defined video website list is stored locally on the device where the shell script resides;
and/or: the shell script contains a cloud video website list, and searching each video website means searching the websites in the cloud list one by one;
the cloud video website list is stored on one or more cloud servers.
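The three kinds of website list can be combined into one search order before step 1 runs. The following is a minimal sketch only, assuming each list is a plain sequence of site addresses; the function name `merge_site_lists` and the example addresses are hypothetical and are not part of the method described here.

```python
from typing import Iterable, List


def merge_site_lists(*site_lists: Iterable[str]) -> List[str]:
    """Merge default, user-defined and cloud site lists, keeping first-seen order."""
    seen = set()
    merged = []
    for sites in site_lists:
        for site in sites:
            key = site.strip().lower()
            if key and key not in seen:  # skip blank entries and duplicates
                seen.add(key)
                merged.append(site.strip())
    return merged


# Hypothetical lists: built into the script, stored locally, fetched from a cloud server.
default_sites = ["https://video-a.example", "https://video-b.example"]
custom_sites = ["https://video-b.example", "https://video-c.example"]
cloud_sites = ["https://video-d.example"]

sites_to_search = merge_site_lists(default_sites, custom_sites, cloud_sites)
```

The merged list is then searched one by one; a site that appears in more than one list is searched only once.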
On the basis of the above technical solution, in step 2, periodic crawling means the following:
the update cycle and the priority of each program are preset;
the higher the priority of a program, the shorter its update cycle;
the lower the priority of a program, the longer its update cycle;
the update cycle of a program is selected from the range of 1 hour to 1 week;
the crawl operation is carried out periodically according to the update cycle of the program.
Wherein:
the priority of programs is ranked according to how frequently the program name has been searched recently;
"recently" includes but is not limited to: the current day, the last three days, the last week or the last month;
and/or: the priority of programs is ranked according to how long ago the program was released;
and/or: the priority of programs is ranked according to the user's viewing history;
the content recorded in the user's viewing history includes but is not limited to: the duration and the date and time of each viewing, the genres the user has watched, the production companies the user has watched, the directors the user has watched, or the lead actors the user has watched;
Preferably, the viewing history includes at least the viewing duration and the viewing date and time. Whether a viewing took place on a working day or at the weekend is derived from the viewing date and time; then, depending on whether the current day is a working day or a weekend, and in combination with the user's viewing duration, programs are ranked according to their duration.
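The relation between priority and update cycle can be sketched as follows. This is an illustration under stated assumptions, not the mapping used by the invention: the priority is assumed to be normalized to [0, 1] with 1 as the highest, the interpolation between 1 hour and 1 week is assumed to be on a log scale, and the weekend is assumed to be Saturday and Sunday. The names `update_cycle_hours` and `is_weekend` are hypothetical.

```python
import math
from datetime import datetime

MIN_HOURS = 1.0        # shortest update cycle: 1 hour
MAX_HOURS = 7 * 24.0   # longest update cycle: 1 week


def update_cycle_hours(priority: float) -> float:
    """Map a priority in [0, 1] (1 = highest) to an update cycle in hours.

    Log-scale interpolation is an assumption; any mapping in which a higher
    priority gives a shorter cycle would satisfy the text.
    """
    p = min(max(priority, 0.0), 1.0)
    log_hours = math.log(MAX_HOURS) * (1.0 - p) + math.log(MIN_HOURS) * p
    return math.exp(log_hours)


def is_weekend(viewed_at: datetime) -> bool:
    """Return True if a viewing took place on a Saturday or Sunday."""
    return viewed_at.weekday() >= 5
```

A program with priority 1.0 is re-crawled every hour and one with priority 0.0 once a week; `is_weekend` gives the working-day/weekend flag derived from the viewing date and time.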
On the basis of the above technical solution, in step 2 the text information used to identify the video content of the video source includes at least: director, lead actors, year, program category, region, number of episodes, episode duration, alias and synopsis.
Further, "director, lead actors, year, program category, region, number of episodes, episode duration, alias and synopsis" are the text information elements (the elements of the text information used to identify the video content of the video source). If one or several of these text information elements are missing, the missing elements are left blank, or are filled with a word such as "none" or "missing" to mark the difference.
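The handling of missing elements can be sketched as below. The key names in `ELEMENTS` and the function `normalize_record` are hypothetical, chosen only for illustration; the placeholder may be `None` (left blank) or a word such as "missing", as the text allows.

```python
ELEMENTS = ("director", "lead_actors", "year", "category", "region",
            "episodes", "episode_duration", "alias", "synopsis")


def normalize_record(crawled: dict, placeholder=None) -> dict:
    """Return a record in which every text information element is present."""
    record = {}
    for name in ELEMENTS:
        value = crawled.get(name)
        if value in ("", [], ()):   # treat empty values as missing
            value = None
        record[name] = placeholder if value is None else value
    return record


partial = {"director": "Li Guoli", "year": 2008, "alias": []}
full = normalize_record(partial, placeholder="missing")
```

After normalization every record has the same nine keys, so two records can later be compared element by element without special cases for absent keys.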
On the basis of the above technical solution, in step 3 the crawl history is formed and stored in the form of a metadata list;
the metadata list contains at least one metadata record, that is, a number of metadata records together constitute the crawl history of the present invention;
each metadata record stores at least the following content: the program name (title), and the text information used to identify the video content of the video source.
Definition of metadata: information about the organization of data, data fields and their relationships; in short, metadata is data about data.
Another embodiment is as follows: each metadata record stores the program name (title), the video playback link, and the text information used to identify the video content of the video source. It should be noted that how the video playback link is processed is not the key content that the present invention seeks to protect, so content relating to the video playback link is not described in further detail.
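One possible in-memory shape for a metadata record is sketched below. The class name `MetadataRecord`, the field names and the field types are assumptions made for illustration; the text only states which items a record stores, not how they are typed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataRecord:
    """One entry of the metadata list; None or an empty list marks a missing element."""
    program_name: str
    play_url: Optional[str] = None           # stored only in the second embodiment
    director: Optional[str] = None
    lead_actors: List[str] = field(default_factory=list)
    year: Optional[int] = None
    category: List[str] = field(default_factory=list)
    region: Optional[str] = None
    episodes: Optional[int] = None
    episode_duration: Optional[int] = None   # unit is not stated in the text
    alias: List[str] = field(default_factory=list)
    synopsis: Optional[str] = None


# The crawl history is then simply a list of such records.
metadata_list: List[MetadataRecord] = [
    MetadataRecord(program_name="Program X", director="Director A", year=2008),
]
```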
On the basis of the above technical solution, in step 3 the crawl history is formed by the following specific steps:
judge whether the program name of the newly crawled video source already exists in the crawl history;
if it does not exist, create a metadata record to store the newly crawled video source;
if it exists, create a new metadata record to store the newly crawled video source, and execute step 4 after the crawl operation has completed.
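The two branches above differ only in whether step 4 has to run afterwards, which the following sketch makes explicit. It assumes the history is kept as a mapping from program name to a list of records; the function name `store_crawl_result` and the example values are hypothetical.

```python
from collections import defaultdict

# Crawl history: program name -> list of metadata records with that name.
history = defaultdict(list)


def store_crawl_result(program_name: str, text_info: dict) -> bool:
    """Store a newly crawled source as a new metadata record.

    Returns True if the program name already existed, which signals that the
    similarity check of step 4 should run once the crawl operation has finished.
    """
    name_exists = program_name in history and len(history[program_name]) > 0
    history[program_name].append({"program_name": program_name, **text_info})
    return name_exists


first = store_crawl_result("Program X", {"director": "Director A", "year": 2008})
second = store_crawl_result("Program X", {"director": "Director B", "year": 1994})
```

Here `first` is `False` (new name, nothing to compare) and `second` is `True` (same name seen before, so the two records must be compared in step 4).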
On the basis of the above technical solution, the specific steps of step 4 are as follows:
perform similarity matching on the metadata records in the crawl history that have the same program name:
for two metadata records with the same program name, the text information used to identify the video content of the video source is matched for similarity item by item;
the results of the item-by-item similarity matching are combined;
if the similarity meets or exceeds the criterion, the newly crawled video source and the video source stored in the earlier metadata record are considered to be the same program;
the text information elements in the metadata record are then completed from the text information elements of the newly crawled video source;
if the similarity does not reach the criterion, the newly crawled video source is considered a wrong source and is excluded, and the newly crawled video source is treated as a new program.
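The decision and the completion step can be sketched as follows. The text does not spell out the completion rule, so the rule used here (fill only elements that are missing in the earlier record) is an assumption, as are the function names and the default threshold of 0.5.

```python
def complete_record(existing: dict, new: dict) -> dict:
    """Fill elements missing from the earlier record with values from the new one."""
    merged = dict(existing)
    for name, value in new.items():
        if merged.get(name) is None and value is not None:
            merged[name] = value
    return merged


def check_source(existing: dict, new: dict, similarity: float, threshold: float = 0.5):
    """Return (is_wrong_source, record_to_keep) for a newly crawled record."""
    if similarity >= threshold:
        return False, complete_record(existing, new)  # same program: complete the record
    return True, existing                             # wrong source: earlier record unchanged


existing = {"director": "Li Guoli", "episodes": None}
new = {"director": "Li Guoli", "episodes": 50}

wrong_a, kept_a = check_source(existing, new, similarity=0.65)
wrong_b, kept_b = check_source(existing, new, similarity=0.235)
```

With a similarity of 0.65 the earlier record gains the episode count from the new one; with 0.235 the new record is flagged as a wrong source and the earlier record is left as it was.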
On the basis of the above technical solution, the similarity criterion is as follows:
each text information element (director, lead actors, year, program category, region, number of episodes, episode duration, alias, synopsis) is regarded as one parameter of a vector;
the similarity judgment compares the parameters of the vectors one by one to obtain a similar value for each, and the similar values are then added to obtain the similarity of the text information used to identify the video content of the video source;
the similar values and the similarity are normalized, and the resulting similarity of the text information used to identify the video content of the video source is expressed as a percentage.
On the basis of the above technical solution, the parameters are compared individually in three main ways:
Mode 1, also called discrete (Boolean-type) comparison: if the comparison of a parameter has only two possible outcomes, identical or different, the similar value takes only one of two values;
Example: if the directors of the two programs being compared are identical, the similar value of the "director" parameter is 1, otherwise it is 0. This is only an illustration: depending on how the algorithm performs in practice, the value for "different" is generally not 0 and the value for "identical" is not necessarily 1, but the comparison result always takes one of exactly two values;
Mode 2, also called continuous comparison: when the parameters being compared differ, a normalized mapping is applied and the similar value is some value in [0,1];
Commonly used normalization methods include the cosine method, the sigmoid function and the exponential method;
Example: if the years of the two programs being compared are identical, the output is 1; if they differ, a similar value is given according to the cosine method. For example, if the year in the metadata record in the history is 2016, then the closer the year in the newly stored information is to 2016, the larger its similar value: 2015 gives a similar value of 0.9, and 2000 gives 0.2;
Mode 3, also called simhash-type comparison: for rich text information, the well-known simhash method is first used to obtain the hash values of the two rich texts, the Hamming distance between the hash values is then computed, and finally the Hamming distance is normalized according to the number of bits of the hash value to obtain the similar value;
Example: hash values are computed for the synopses of two programs A and B; assume they are represented with 6 bits as hashA = 110001 and hashB = 101011. The Hamming distance of the two hash values is then: hammingD(hashA, hashB) = count_1(hashA xor hashB) = count_1(011010) = 3. The value range of the Hamming distance depends on the number of bits of the hash value, so the distance can be normalized. The normalization can be stated simply as: when the Hamming distance is 6 the similar value is 1, when it is 0 the similar value is 0, and other values are quantized uniformly over the interval [0, maxbit(hash)]. In this example, the similar value for a distance of 3 is:
1*bit[3, maxbit(hash)]/count[0, maxbit(hash)] = 1*bit[3,6]/count[0,6] = 1*4/7 = 0.57
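The three comparison modes can be sketched side by side as below. This is an illustration under several assumptions, not the implementation of the invention: mode 2 uses the exponential method rather than the cosine method, with a decay constant `tau = 9.5` chosen only so that the year example comes out near 0.9 and 0.2; the simhash uses whitespace tokens with unit weights and MD5 as the token hash, which would need a proper tokenizer for Chinese text; and mode 3 normalizes with the common convention that identical hashes score 1, which is not necessarily the quantization used in the example above. All function names are hypothetical.

```python
import hashlib
import math


def compare_discrete(a, b, same: float = 1.0, different: float = 0.0) -> float:
    """Mode 1: the similar value takes one of exactly two values."""
    return same if a == b else different


def compare_continuous(a: float, b: float, tau: float = 9.5) -> float:
    """Mode 2: exponential decay of the absolute difference, mapped into (0, 1]."""
    return math.exp(-abs(a - b) / tau)


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(hash_a ^ hash_b).count("1")


def simhash(text: str, bits: int = 64) -> int:
    """Minimal simhash over whitespace-separated tokens with unit weights."""
    counts = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit of the fingerprint is set where the tokens voted mostly for 1.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def compare_simhash(text_a: str, text_b: str, bits: int = 64) -> float:
    """Mode 3: 1 minus the Hamming distance as a fraction of the hash width."""
    distance = hamming_distance(simhash(text_a, bits), simhash(text_b, bits))
    return 1.0 - distance / bits
```

For the two 6-bit values of the example, `hamming_distance(0b110001, 0b101011)` counts the set bits of 011010. Which mode is applied to which element is a design choice: names and regions suit mode 1, numeric elements such as the year suit mode 2, and long free text such as the synopsis suits mode 3.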
After the similar value of each parameter of the two vectors has been obtained using the three modes above, each item is multiplied by its weight factor and the products are added to give the final similarity. The weight factors come from experience accumulated over long-term work in the video business;
Example: the weight factor of the director is 0.2 and that of the lead actors is 0.3. This can be understood as follows: for two programs with the same name, identical lead actors make it more likely that the programs are the same than identical directors do, because the set of directors has fewer elements than the set of lead actors. This inference is only one possibility deduced backwards from the results; the real situation is considerably more complex.
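The weighted combination can be written compactly. The notation is introduced here for illustration only: $s_i$ is the similar value of the $i$-th text information element, $w_i$ is its weight factor, and $n$ is the number of elements. The condition that the weights sum to 1 is an assumption; it keeps the final similarity $S$ in $[0, 1]$ whenever every $s_i$ lies in $[0, 1]$.

```latex
S = \sum_{i=1}^{n} w_i \, s_i ,
\qquad \sum_{i=1}^{n} w_i = 1 ,
\qquad 0 \le s_i \le 1 .
```

With $w_{\text{director}} = 0.2$ and $w_{\text{lead actors}} = 0.3$, and taking a similar value of 1 for an exact match, an identical cast contributes 0.3 to $S$ while an identical director contributes 0.2.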
The process of forming the history, including the similarity comparison, is described in detail below by way of example. The example corresponds to step 3 and step 4 of the specific embodiment.
Suppose the history already contains the following record:
Metadata_ORG{
Program name: The Legend of the Condor Heroes,
Director: Li Guoli,
Lead actors: (actor1: Lin Yichen, actor2: Hu Ge ...),
Year: 2008,
Program category: (tag1: wuxia),
Region: Mainland China,
Number of episodes: null,
Episode duration: null,
Alias: (name1: Condor Heroes 2008 version),
Synopsis: (During the Southern Song dynasty, the monarch ...)
};
Suppose two newly crawled metadata records are to be stored:
Metadata1{
Program name: The Legend of the Condor Heroes,
Director: Li Tiansheng,
Lead actors: (actor1: Zhang Zhilin, actor2: Zhu Yin ...),
Year: 1994,
Program category: (tag1: wuxia),
Region: Hong Kong,
Number of episodes: 35,
Episode duration: null,
Alias: (name1: The Legend of the Condor Heroes),
Synopsis: (The story takes place ...)
};
Metadata2{
Program name: The Legend of the Condor Heroes,
Director: Li Guoli,
Lead actors: (actor1: Hu Ge, actor2: Lin Yichen ...),
Year: 2007,
Program category: (tag1: wuxia, tag2: romance, tag3: costume drama),
Region: Mainland China,
Number of episodes: 50,
Episode duration: 43,
Alias: (name1: New Legend of the Condor Heroes, name2: Condor Heroes 2008 version),
Synopsis: (During the Southern Song dynasty, the monarch ...)
};
The calculation proceeds as follows:
Step 1: a lookup in the history finds that metadata1 has the same program name as metadata_ORG, so calculation of the similarity of the two records begins.
Step 2: the two records are regarded as two vectors containing several parameters. First, the similar value between the value "Li Tiansheng" of the "director" parameter in metadata1 and the value "Li Guoli" of the "director" parameter in metadata_ORG is calculated. Assuming this parameter is calculated using mode 1 (discrete Boolean-type comparison), the result is 0.1 because the directors differ. By analogy, one of the three modes is selected according to the type of each parameter to calculate its similar value. Assume the following similar values are finally obtained:
simVector1 = (director: 0.1, lead actors: 0.1, year: 0.2, program category: 0.7, region: 0.2, number of episodes: null, episode duration: null, alias: 0.1, synopsis: 0.8)
Step 3: each item in simVector1 is multiplied by its weight factor and the products are added. The weight factors can also be regarded as a vector; assume the weight factor vector is:
weightVector = (director: 0.2, lead actors: 0.3, year: 0.05, program category: 0.1, region: 0.1, number of episodes: 0.05, episode duration: 0.05, alias: 0.05, synopsis: 0.1)
The final similarity, with null items counted as 0, is then:
simValue1 = simVector1*weightVector = 0.1*0.2 + 0.1*0.3 + 0.2*0.05 + 0.7*0.1 + 0.2*0.1 + 0*0.05 + 0*0.05 + 0.1*0.05 + 0.8*0.1 = 0.235
Step 4: judge whether the similarity reaches the criterion. If the criterion is that a similarity below 0.5 indicates a wrong source, metadata1 is judged to be a wrong source; it is excluded from the record holding metadata_ORG, and a new record "metadata1" is created.
Step 5: a lookup in the history finds that metadata2 has the same program name as metadata_ORG, so calculation of the similarity of the two records begins.
Step 6: repeat steps 2 and 3 above; assume the similarity calculated this time is simValue2 = 0.65.
Step 7: judge whether the similarity reaches the criterion. With the same criterion (a similarity below 0.5 indicates a wrong source), metadata2 is judged to be the same source, and the content of metadata_ORG is completed from the content of metadata2 according to the update rule.
Summary: metadata2 and metadata_ORG describe the same program, yet not all of their parameters match: the year and the order of the actors differ, and some parameters are incomplete. The similarity calculation nevertheless shows that the two records are highly similar, and metadata_ORG is completed with the information in metadata2. When the similarity does not reach the criterion (as for metadata1), the video source identified by that text information is considered a wrong source and is excluded.
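The arithmetic of the worked example can be reproduced with a few lines. The sketch below takes the similar values and weight factors from the example as given; the key names and function names are hypothetical, and a null similar value is counted as 0, as in the calculation of simValue1.

```python
ELEMENTS = ("director", "lead_actors", "year", "category", "region",
            "episodes", "episode_duration", "alias", "synopsis")

weight_vector = dict(zip(ELEMENTS, (0.2, 0.3, 0.05, 0.1, 0.1, 0.05, 0.05, 0.05, 0.1)))
sim_vector1 = dict(zip(ELEMENTS, (0.1, 0.1, 0.2, 0.7, 0.2, None, None, 0.1, 0.8)))

THRESHOLD = 0.5  # criterion used in the example


def weighted_similarity(sim_vector: dict, weights: dict) -> float:
    """Weighted sum of similar values; a missing (None) value counts as 0."""
    return sum(weights[name] * (sim_vector[name] or 0.0) for name in weights)


def is_wrong_source(similarity: float, threshold: float = THRESHOLD) -> bool:
    """A similarity below the threshold marks the new record as a wrong source."""
    return similarity < threshold


sim_value1 = weighted_similarity(sim_vector1, weight_vector)
sim_value2 = 0.65  # assumed in the example, not computed here
```

`sim_value1` evaluates to 0.235 up to floating-point rounding, so metadata1 is flagged as a wrong source, while the assumed value of 0.65 for metadata2 passes the criterion.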
Content not described in detail in this specification belongs to the prior art well known to those skilled in the art.