Embodiment
In embodiments of the present invention, ontology knowledge for searching news video websites is constructed from semantic association information, and this ontology knowledge is used to find news video websites on the Internet. The timeliness of each news video website is evaluated, and the evaluation result is used to set the crawl interval of that website. The content of each news video website is then crawled in real time at its crawl interval, and the news videos in that content are obtained through a set search method.
To facilitate understanding of the embodiments of the invention, several specific embodiments are further explained below by way of example with reference to the accompanying drawings; none of these embodiments limits the embodiments of the invention.
Embodiment one
The principle of the news video search method provided by this embodiment is shown schematically in Figure 1, and its specific processing flow is shown in Figure 2. The method comprises the following processing steps:
Step 21: construct ontology knowledge for searching news video websites from semantic association information; use this ontology knowledge, a meta-search technique and a website topic identification method to find news video websites on the Internet; and store the news video websites found in the news video site database.
First, a news video database is built in advance from the news video data of a small number of seed websites. This database stores each news video together with its descriptive information. The seed websites include sites such as "www.xinhuanet.com's news" and "rising fast net news".
In embodiments of the present invention, a news video site database is also built in advance. This database stores each news video website together with information such as its evaluation results and crawl interval.
The ontology knowledge for searching news video websites is constructed from semantic association information; the construction principle is shown schematically in Figure 3. The semantic association information mainly comprises: the search keywords provided by the search engine itself, the content keywords of news video websites already found, the content organization structure keywords of news video websites already found, and the content description keywords of news video websites already found. The content keywords of a news video website include the keywords in the titles of its content, and the content description keywords include the titles of its hot videos. The ontology knowledge therefore mainly contains four kinds of keywords: search keywords, content keywords, content organization structure keywords and content description keywords.
For each keyword in the ontology knowledge, the meta-search technique is used to construct search requests to search engines on the Internet, a set number of the search results returned by those search engines is retrieved, and the URLs (Uniform Resource Locators) contained in the returned results are extracted. The website topic identification method then identifies which of these URLs belong to news video websites.
The processing flow of a website topic identification method provided by this embodiment is shown in Figure 4. The procedure mainly comprises:
First, the pattern information of each URL contained in the returned results, such as its length, depth and form, is used with a technique such as a decision tree or a rule set to identify whether the URL is a website URL or a web page URL.
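The rule-set variant of this step can be illustrated with a minimal sketch. The text only names length, depth and form as the features used; the concrete thresholds, the suffix list and the function name `classify_url` below are hypothetical choices made for illustration, not values taken from this embodiment.

```python
from urllib.parse import urlparse

# Hypothetical rule parameters (not specified in the text).
MAX_SITE_URL_LENGTH = 40
MAX_SITE_PATH_DEPTH = 0
PAGE_SUFFIXES = (".html", ".htm", ".shtml", ".php", ".jsp", ".asp", ".aspx")


def classify_url(url: str) -> str:
    """Return "site" or "page" using simple rules on length, depth and form."""
    parsed = urlparse(url)
    path = parsed.path
    # Depth: number of non-empty path segments.
    depth = len([segment for segment in path.split("/") if segment])
    # Form: a query string or a document-like suffix suggests a single page.
    if parsed.query or path.lower().endswith(PAGE_SUFFIXES):
        return "page"
    if depth > MAX_SITE_PATH_DEPTH:
        return "page"
    if len(url) > MAX_SITE_URL_LENGTH:
        return "page"
    return "site"
```

A decision tree trained on labelled URLs could replace these hand-written rules while using the same three features.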
For each website URL identified, all web pages in the first layer of the website are fetched, and the playback page recognition technique is used to calculate the proportion of video playback pages among them. If this proportion is less than a predefined video playback page threshold, the website URL is considered irrelevant to the news video website topic and is excluded; otherwise, the website URL is considered relevant to the news video website topic.
For each website considered relevant to the news video website topic, the link text (anchor text) of its video playback pages is used to run fuzzy queries against the news video database built in advance, and the total number of similar results is counted. The average number of similar results per link text is then calculated. If this average is less than a predefined similar result threshold, the website is considered irrelevant to the news video website topic; otherwise, the website is considered relevant to the topic, that is, it is identified as a news video website.
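The two-stage test described above (playback page proportion, then average similar results per anchor text) might be sketched as follows. The playback page recognizer and the fuzzy query are passed in as callables because the text does not specify them; the default threshold values and all function names are assumptions for illustration only.

```python
from typing import Callable, Sequence


def playback_page_ratio(pages: Sequence[str],
                        is_playback_page: Callable[[str], bool]) -> float:
    """Proportion of first-layer pages recognized as video playback pages."""
    if not pages:
        return 0.0
    return sum(1 for page in pages if is_playback_page(page)) / len(pages)


def average_similar_results(anchor_texts: Sequence[str],
                            count_similar: Callable[[str], int]) -> float:
    """Average number of fuzzy-query hits per anchor text."""
    if not anchor_texts:
        return 0.0
    return sum(count_similar(text) for text in anchor_texts) / len(anchor_texts)


def is_news_video_site(pages: Sequence[str],
                       anchor_texts: Sequence[str],
                       is_playback_page: Callable[[str], bool],
                       count_similar: Callable[[str], int],
                       playback_threshold: float = 0.2,   # hypothetical value
                       similar_threshold: float = 1.0) -> bool:  # hypothetical value
    # Stage 1: exclude sites with too few video playback pages.
    if playback_page_ratio(pages, is_playback_page) < playback_threshold:
        return False
    # Stage 2: require enough matches against the news video database.
    return average_similar_results(anchor_texts, count_similar) >= similar_threshold
```

In a real system `count_similar` would query the news video database built from the seed websites; here it is left abstract.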
The news video websites identified are then stored in the news video site database built in advance.
In embodiments of the present invention, the news video websites identified by the website topic identification method can also be used to evaluate the constructed ontology knowledge in two respects: new-URL generation capability and topic relevance. The processing flow of this evaluation is shown in Figure 5 and mainly comprises the following:
For each keyword in the ontology knowledge, the meta-search technique is used to construct search requests to search engines on the Internet, a set number of the search results returned by those search engines is retrieved, and the URLs contained in the returned results are extracted.
The URLs of news video websites among these URLs are obtained through the website topic identification method, and the ratio of the number of news video website URLs to the total number of URLs contained in the returned results is calculated. If this ratio is less than a predefined topic relevance threshold, the keyword is considered irrelevant to the news video website topic and is removed from the ontology knowledge; otherwise, the keyword is considered relevant to the topic, and its new-URL generation capability is evaluated next.
All of the identified news video website URLs are looked up in the news video site database. From the lookup results, the ratio of the number of news video website URLs not yet contained in the news video site database to the total number of identified news video website URLs is calculated. If this ratio is less than a predefined new-URL generation capability threshold, the keyword is considered to have no new-URL generation capability and is removed from the ontology knowledge; otherwise, the keyword is considered to have both topic relevance and new-URL generation capability.
In general, setting both the topic relevance threshold and the new-URL generation capability threshold to 0.1 gives good results.
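The keyword evaluation can be summarized as two ratio tests applied in sequence. The following is a minimal sketch under the assumption that the URLs returned for a keyword, the subset identified as news video websites, and the set of URLs already in the site database are available as plain collections; the function and parameter names are hypothetical, while the 0.1 defaults follow the text.

```python
from typing import Collection, Sequence


def keep_keyword(result_urls: Sequence[str],
                 news_site_urls: Sequence[str],
                 known_site_urls: Collection[str],
                 relevance_threshold: float = 0.1,
                 new_url_threshold: float = 0.1) -> bool:
    """Return True if the keyword should stay in the ontology knowledge."""
    if not result_urls or not news_site_urls:
        return False
    # Topic relevance: share of returned URLs that are news video websites.
    relevance = len(news_site_urls) / len(result_urls)
    if relevance < relevance_threshold:
        return False
    # New-URL generation capability: share of identified sites not yet stored.
    new_urls = [url for url in news_site_urls if url not in known_site_urls]
    return len(new_urls) / len(news_site_urls) >= new_url_threshold
```

A keyword that returns only already-known sites is dropped even if it is highly relevant, which keeps the search effort focused on keywords that still discover new websites.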
Step 22: evaluate the timeliness, novelty and originality of the news video websites stored in the news video database, and use the timeliness evaluation result of each news video website to set its crawl interval.
The news video websites stored in the news video database are evaluated in three respects: timeliness, novelty and originality.
The processing flow provided by this embodiment for evaluating the timeliness of the news video websites stored in the news video database is shown in Figure 6. The procedure comprises:
A certain number of the current day's news videos are obtained from the seed websites, and fuzzy queries are run against the news video database using these videos. For each news video website in the news video database, the number of news videos similar to the current day's news videos is counted; when one news video matches several similar news videos belonging to the same news video website, they are counted only once.
All news video websites are sorted in descending order of the number of news videos similar to the current day's news videos that they contain. Websites ranked in the top 10% are given 5 points, those ranked 10%~30% are given 4 points, those ranked 30%~70% are given 3 points, those ranked 70%~90% are given 2 points, and those in the last 10% are given 1 point. In addition, any news video website for which this number is 0 is directly given 0 points.
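The banded scoring above might be sketched as follows. The text does not say how ties or band boundaries are handled, so this sketch assumes a 0-based rank position compared against the band limits with integer arithmetic; the names `band_score` and `timeliness_scores` are hypothetical.

```python
from typing import Dict


def band_score(position: int, total: int) -> int:
    """Map a 0-based rank position among `total` sites to a score from 1 to 5."""
    # Integer arithmetic avoids floating-point edge cases at band boundaries.
    if 10 * position < 1 * total:
        return 5  # top 10%
    if 10 * position < 3 * total:
        return 4  # 10%~30%
    if 10 * position < 7 * total:
        return 3  # 30%~70%
    if 10 * position < 9 * total:
        return 2  # 70%~90%
    return 1      # last 10%


def timeliness_scores(similar_counts: Dict[str, int]) -> Dict[str, int]:
    """Score each site from its count of videos similar to today's news videos."""
    ranked = sorted(similar_counts, key=lambda site: similar_counts[site], reverse=True)
    scores = {}
    for position, site in enumerate(ranked):
        if similar_counts[site] == 0:
            scores[site] = 0  # sites with no similar videos score 0 directly
        else:
            scores[site] = band_score(position, len(ranked))
    return scores
```

The same banding could be reused for the novelty and originality evaluations by changing the sort key and direction.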
Finally, the timeliness evaluation result of each news video website is stored in the news video site database as the measure of that website's content timeliness.
The timeliness evaluation result of each news video website is used to set its crawl interval. The crawl interval of each news video website is set according to the number of news videos it contains that are similar to the current day's news videos: the more such videos a website contains, the shorter its crawl interval.
One feasible way of setting the crawl interval is: 5 minutes for news video websites with a timeliness score of 5 points, 10 minutes for a score of 4 points, 20 minutes for a score of 3 points, 40 minutes for a score of 2 points, 80 minutes for a score of 1 point, and 1 day for a score of 0 points.
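This score-to-interval mapping is small enough to express as a lookup table. The sketch below uses the values given in the text; the names `CRAWL_INTERVALS` and `crawl_interval` are hypothetical.

```python
from datetime import timedelta

# Timeliness score -> crawl interval, using the values given in the text.
CRAWL_INTERVALS = {
    5: timedelta(minutes=5),
    4: timedelta(minutes=10),
    3: timedelta(minutes=20),
    2: timedelta(minutes=40),
    1: timedelta(minutes=80),
    0: timedelta(days=1),
}


def crawl_interval(timeliness_score: int) -> timedelta:
    """Return the crawl interval for a timeliness score from 0 to 5."""
    return CRAWL_INTERVALS[timeliness_score]
```

For scores 5 down to 1 the interval doubles with each point lost, so the least timely scored sites are visited 16 times less often than the most timely ones.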
The processing flow provided by this embodiment for evaluating the novelty of the news video websites stored in the news video database is shown in Figure 7. The procedure comprises:
A content-based duplicate detection technique is used to cluster the news videos newly obtained from each news video website, and from each cluster a certain number of news videos with the earliest discovery times are retained. The total number of clicks on all retained news videos of each news video website is then counted, and the average number of clicks per news video is calculated.
The news video websites are sorted in descending order of the average number of clicks per news video. Websites ranked in the top 10% are given 5 points, those ranked 10%~30% are given 4 points, those ranked 30%~70% are given 3 points, those ranked 70%~90% are given 2 points, and those in the last 10% are given 1 point. In addition, any news video website whose average number of clicks per video is 0 is directly given 0 points.
Finally, the novelty evaluation result of each news video website is stored in the news video site database as the measure of that website's novelty.
The processing flow provided by this embodiment for evaluating the originality of the news video websites stored in the news video database is shown in Figure 8. The procedure comprises:
A content-based duplicate detection technique is used to cluster the news videos newly obtained from each news video website. From each cluster a certain number of news videos with the earliest discovery times are retained, and the remaining news videos are treated as follow-up (repeated) news videos. The total number of videos and the number of repeated videos of each news video website are counted, and the repetition ratio of each news video website is calculated. All news video websites are sorted in ascending order of repetition ratio. Websites ranked in the top 10% are given 5 points, those ranked 10%~30% are given 4 points, those ranked 30%~70% are given 3 points, those ranked 70%~90% are given 2 points, and those in the last 10% are given 1 point. In addition, any news video website whose repetition ratio is 100% is directly given 0 points.
Finally, the originality evaluation result of each news video website is stored in the news video site database as the measure of that website's originality.
The processing flow of a content-based duplicate detection technique provided by this embodiment is shown in Figure 9. The procedure is as follows:
First, a certain number of video key frames are extracted from each news video. The Harris operator is used to detect corner points in each video key frame, SIFT (scale-invariant feature transform) features are used to construct feature vectors for the corner sub-regions of the key frame, and PCA (principal component analysis) is used to reduce the dimension of these feature vectors. For each pair of video key frames taken from the two news videos, the KNN (K-nearest neighbors) algorithm is used to compute the K closest feature vector pairs. The BIC (Bayesian information criterion) algorithm is then applied to compare the feature value sequence X = {x1, x2, ..., xN} (N = 2K) formed by these K feature vector pairs: if a jump point exists in the sequence X, the two video key frames are judged not to be duplicates; otherwise, they are judged to be duplicates.
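The jump point test is not spelled out in the text. One common way to use BIC for this purpose is to compare a single Gaussian model of the whole sequence against two Gaussians split at a candidate position, and to report a jump when the split model wins after the BIC penalty. The sketch below follows that formulation as an assumption; it treats X as a one-dimensional sequence of real values, and the variance floor, penalty weight and function names are all hypothetical.

```python
import math
from typing import Sequence

VARIANCE_FLOOR = 1e-6  # hypothetical guard against log(0) on constant segments


def _variance(values: Sequence[float]) -> float:
    mean = sum(values) / len(values)
    return max(sum((v - mean) ** 2 for v in values) / len(values), VARIANCE_FLOOR)


def has_jump_point(x: Sequence[float],
                   penalty_weight: float = 1.0,
                   min_segment: int = 2) -> bool:
    """Return True if some split of x is preferred by BIC over no split."""
    n = len(x)
    if n < 2 * min_segment:
        return False
    full = n * math.log(_variance(x))
    # For one-dimensional Gaussians the extra model costs two parameters
    # (mean and variance), giving a penalty of 0.5 * 2 * log(n) = log(n).
    penalty = penalty_weight * math.log(n)
    for i in range(min_segment, n - min_segment + 1):
        left, right = x[:i], x[i:]
        split = i * math.log(_variance(left)) + (n - i) * math.log(_variance(right))
        if 0.5 * (full - split) - penalty > 0:
            return True
    return False
```

Under the rule in the text, `has_jump_point(x)` returning True would mean the two key frames are judged not to be duplicates.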
The number of duplicate video key frames between the two news videos is counted, and the ratio of duplicate video key frames to the total number of video key frames is calculated. If this ratio is greater than a set video key frame threshold, the two news videos are judged to be duplicates; otherwise, they are judged not to be duplicates.
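The video-level decision can be sketched once a key frame comparison is available. The text does not define exactly how duplicate key frames are counted, so this sketch assumes that a key frame counts as duplicate if it matches at least one key frame of the other video, and that the total is the sum of both videos' key frames; the threshold value and the name `videos_are_duplicates` are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")


def videos_are_duplicates(frames_a: Sequence[Frame],
                          frames_b: Sequence[Frame],
                          frames_match: Callable[[Frame, Frame], bool],
                          threshold: float = 0.5) -> bool:  # hypothetical value
    """Judge two videos as duplicates from the share of matching key frames."""
    total = len(frames_a) + len(frames_b)
    if total == 0:
        return False
    # Count key frames in each video that match some key frame in the other.
    duplicates_a = sum(1 for fa in frames_a
                       if any(frames_match(fa, fb) for fb in frames_b))
    duplicates_b = sum(1 for fb in frames_b
                       if any(frames_match(fa, fb) for fa in frames_a))
    return (duplicates_a + duplicates_b) / total > threshold
```

Here `frames_match` stands in for the key frame comparison described in the preceding paragraph.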
Step 23: using the crawl interval of each news video website, crawl the news videos in the news video website in real time through the set search method, and store the crawled news videos in the news video database.
The processing flow provided by this embodiment for crawling in real time the content of the news video websites stored in the news video database is shown in Figure 10. The procedure is as follows:
First, the URL and timeliness evaluation result of each news video website are obtained from the news video site database, and the timeliness evaluation result is used to set the crawl interval of the website. One feasible way of setting the crawl interval is: 5 minutes for news video websites with a timeliness score of 5 points, 10 minutes for a score of 4 points, 20 minutes for a score of 3 points, 40 minutes for a score of 2 points, 80 minutes for a score of 1 point, and 1 day for a score of 0 points.
Each news video website in the news video site database is checked in turn, in a certain order, to determine whether the time elapsed since its last crawl finished has exceeded its crawl interval. If so, a new round of content crawling is started for that news video website; otherwise, the next website is checked in the same way.
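The due check described above might look like the following minimal sketch. The record layout `SiteRecord`, the treatment of never-crawled sites as due, and the function name `sites_due` are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class SiteRecord:
    """Hypothetical row of the news video site database."""
    url: str
    interval: timedelta
    last_finished: Optional[datetime] = None  # None: never crawled (assumed due)


def sites_due(sites: Iterable[SiteRecord], now: datetime) -> List[SiteRecord]:
    """Return, in input order, the sites whose crawl interval has been exceeded."""
    due = []
    for site in sites:
        if site.last_finished is None or now - site.last_finished > site.interval:
            due.append(site)
    return due
```

A scheduler loop would call `sites_due` periodically, crawl each returned site, and write the finish time back to the site record.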
For each news video website to be crawled, its content is crawled through the set search method. The set search method includes methods such as depth-limited breadth-first search.
The news video website is traversed using depth-limited breadth-first search; the depth limit may be a global constant or may vary from one news video website to another. For each web page encountered during the traversal, the playback page recognition technique is first used to judge whether it is a video playback page. For a video playback page, a web page noise removal technique is used to remove the noise information it contains; the noise here includes background noise, random noise and residual noise. The information remaining in the video playback page is taken as the news video.
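The traversal itself can be sketched as a standard depth-limited breadth-first search. Page fetching and link extraction are abstracted behind a `fetch_links` callable, which is a hypothetical stand-in; playback page recognition and noise removal would then be applied to each URL the traversal returns.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_breadth_first(start_url: str,
                        fetch_links: Callable[[str], Iterable[str]],
                        max_depth: int) -> List[str]:
    """Visit pages breadth-first from start_url, following links up to max_depth."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not expand this page
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Passing a per-site `max_depth` covers the case where the depth limit varies between news video websites, while a shared constant covers the global case.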
The content-based duplicate detection technique described above is applied to this news video. For a news video that passes duplicate detection, the iterative back-projection method in the video compression domain is used to improve its image quality. After the news video has been transcoded with an existing tool, a news video in MP4 or FLV (a streaming media format) container format is obtained. The news video and its descriptive information are then stored in the news video database. When the crawl of a news video website finishes, the finish time is stored in the news video site database.
The news videos in the news video database can be used by a video-on-demand system oriented to a TV news portal. The descriptions and associated information of the news videos can be pushed to the portal website. After a user's set-top box (STB) accesses the portal website, the latest news video list is shown, and the user can browse, order and play on demand the news videos in the list.
Embodiment two
The structure of a news video search apparatus provided by this embodiment is shown schematically in Figure 11. The apparatus comprises the following modules:
News video site search module 11, configured to construct ontology knowledge for searching news video websites from semantic association information, and to use this ontology knowledge to find news video websites on the Internet;
News video website evaluation module 12, configured to evaluate the timeliness of the news video websites found by the news video site search module, and to use the timeliness evaluation result to set the crawl interval of each news video website;
News video acquisition module 13, configured to use the crawl interval set by the news video website evaluation module to crawl the content of each news video website in real time through a set search method, and to obtain the news videos in that content.
The news video search apparatus may further comprise:
Ontology knowledge evaluation module 14, configured to, for each keyword in the ontology knowledge, use the meta-search technique to construct search requests to search engines on the Internet, retrieve a set number of the search results returned by those search engines, and extract the URLs contained in the returned results.
The URLs of news video websites among these URLs are obtained through the website topic identification method, and the ratio of the number of news video website URLs to the total number of URLs contained in the returned results is calculated. If this ratio is less than a predefined topic relevance threshold, the keyword is considered irrelevant to the news video website topic and is removed from the ontology knowledge; otherwise, the keyword is considered relevant to the topic, and its new-URL generation capability is evaluated next.
All of the identified news video website URLs are looked up in the news video site database. From the lookup results, the ratio of the number of news video website URLs not yet contained in the news video site database to the total number of identified news video website URLs is calculated. If this ratio is less than a predefined new-URL generation capability threshold, the keyword is considered to have no new-URL generation capability and is removed from the ontology knowledge; otherwise, the keyword is considered to have both topic relevance and new-URL generation capability.
The news video site search module 11 may specifically comprise:
Search module 111, configured to, for each keyword in the ontology knowledge, use the meta-search technique to construct search requests to search engines on the Internet, retrieve a set number of the search results returned by those search engines, and extract the uniform resource locators (URLs) contained in the returned results;
Identification module 112, configured to identify, through the website topic identification method, the URLs of news video websites among the URLs extracted by the search module, and to store the news video websites identified in the news video site database built in advance.
The news video website evaluation module 12 may specifically comprise:
Statistical module 121, configured to obtain a certain number of the current day's news videos from the seed websites, run fuzzy queries against the news video database using these videos, count, for each news video website in the news video database, the number of news videos similar to the current day's news videos, and store this number in the news video site database as the timeliness evaluation result of the news video website;
Setting module 122, configured to set the crawl interval of each news video website according to the number of news videos it contains that are similar to the current day's news videos; the more such news videos a website contains, the shorter its crawl interval.
The news video acquisition module 13 may specifically comprise:
Crawl module 131, configured to crawl the content of a news video website in the news video site database through the set search method once the time elapsed since that website's last crawl finished exceeds its crawl interval;
Identification module 132, configured to use the playback page recognition technique to judge whether each web page crawled from the news video website is a video playback page and, for each page judged to be a video playback page, to remove the noise information it contains and take the remaining information as the news video;
Detection and enhancement module 133, configured to apply the content-based duplicate detection technique to the news video, use the iterative back-projection method in the video compression domain to enhance the quality of news videos that pass duplicate detection, and then store the news video and its descriptive information in the news video database.
The news video website evaluation module 12 may further comprise:
Novelty evaluation module 123, configured to use a content-based duplicate detection technique to cluster the news videos newly obtained from each news video website, and to retain from each cluster a certain number of news videos with the earliest discovery times. The total number of clicks on all retained news videos of each news video website is then counted, and the average number of clicks per news video is calculated.
The crawl interval of each news video website is set according to the number of news videos it contains that are similar to the current day's news videos; the more such news videos a website contains, the shorter its crawl interval.
The novelty of each news video website is evaluated according to the average number of clicks per news video, and the novelty evaluation result of each news video website is stored in the news video site database as the measure of that website's novelty.
Originality evaluation module 124, configured to use a content-based duplicate detection technique to cluster the news videos newly obtained from each news video website, to retain from each cluster a certain number of news videos with the earliest discovery times, and to treat the remaining news videos as follow-up (repeated) news videos. The total number of videos and the number of repeated videos of each news video website are counted, and the repetition ratio of each news video website is calculated.
The originality of each news video website is evaluated according to its repetition ratio, and the originality evaluation result of each news video website is stored in the news video site database as the measure of that website's originality.
A person of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In summary, the embodiments of the invention effectively address the problem of searching for and integrating Internet news videos automatically, accurately and in a timely manner: news video websites can be identified quickly and accurately, and news videos can be discovered and integrated automatically and promptly.
The embodiments of the invention propose a system and method for searching and integrating Internet news videos for a TV news portal, which can provide abundant, high-quality Internet news video resources for a video-on-demand system oriented to the TV news portal, and can provide the TV news portal with the necessary news video material and descriptive information.
The above is merely a preferred embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.