CN102984162A - Identifying method and collecting system for credible websites - Google Patents

Publication number
CN102984162A
Authority
CN
China
Prior art keywords
current site
website
download
confidence level
sample size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105184712A
Other languages
Chinese (zh)
Other versions
CN102984162B (en)
Inventor
于春功
张超旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qizhi Business Consulting Co ltd
Beijing Qihoo Technology Co Ltd
360 Digital Security Technology Group Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201210518471.2A priority Critical patent/CN102984162B/en
Publication of CN102984162A publication Critical patent/CN102984162A/en
Application granted granted Critical
Publication of CN102984162B publication Critical patent/CN102984162B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an identifying method and a collecting system for credible websites. The collecting system comprises a credible sample database and a server. The server is adapted to extract the download logs of a current website within a set time period; to count, from the user identifiers and downloaded-file identifiers in the extracted download logs, the number of samples and the number of users of the download links of the current website on which download operations were performed in the set time period; to obtain the credibility of the current website from the counted number of samples and number of users; and to identify whether the current website is an official website according to the obtained credibility and the counted number of samples. The credible sample database is adapted to collect the official websites identified by the server. With this technical scheme, official websites with higher credibility can be identified, so that reliable download websites are provided for users who need to download software, the risk of users downloading malicious samples is reduced, and the users' network security is improved.

Description

Recognition method and gathering system for credible websites
Technical field
The present invention relates to the field of networks, and in particular to a recognition method and a gathering system for credible websites.
Background
In the Internet era, most software is distributed over the Internet, and download websites, forums and the download links of official websites are the main channels for releasing software. At present, most download websites and forums allow users to submit content freely. For example, many download websites and forums provide upload components through which an ordinary user can upload software that he or she wishes to publish, so that other users can download it. Lawless persons can exploit exactly this to spread malicious samples such as viruses, Trojan horses and forcibly bundled plug-ins. This creates a serious network security hazard and exposes users who need to download software to considerable risk.
By contrast, software released by official websites is highly reliable. Therefore, to protect the network security of users who need to download software, the download links of the websites with higher confidence levels on the Internet need to be identified so that users can download safely.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a recognition method and a gathering system for credible websites that overcome the above problems, or at least partially solve or mitigate them.
According to one aspect of the present invention, a recognition method for credible websites is provided, comprising:
Extracting the download log of a current site within a set time period and, according to the user identifiers and download file identifiers in the download log, counting the sample size and the number of users of the download links of the current site on which download operations were performed in the set time period;
Obtaining the confidence level of the current site according to the sample size and the number of users of the current site, and identifying whether the current site is an official website according to the confidence level and the sample size of the current site;
Wherein obtaining the confidence level of the current site according to the sample size and the number of users further comprises: the confidence level of the current site is inversely proportional to the sample size and directly proportional to the number of users.
According to another aspect of the present invention, a gathering system for credible websites is provided, comprising an authentic specimen database and a server, wherein the server is adapted to extract the download log of a current site within a set time period; to count, according to the user identifiers and download file identifiers in the extracted download log, the sample size and the number of users of the download links of the current site on which download operations were performed in the set time period; to obtain the confidence level of the current site according to the counted sample size and number of users; and to identify whether the current site is an official website according to the obtained confidence level and the counted sample size;
The authentic specimen database is adapted to collect the official websites judged by the server.
Optionally, the server comprises:
An extraction module, used to extract the download log of the current site within a set time period;
A statistical module, used to count, according to the user identifiers and download file identifiers in the download log extracted by the extraction module, the sample size and the number of users of the download links of the current site on which download operations were performed in the set time period;
An acquisition module, used to obtain the confidence level of the current site according to the sample size and the number of users counted by the statistical module;
An identification module, used to identify whether the current site is an official website according to the confidence level obtained by the acquisition module and the sample size counted by the statistical module.
The links of a download site may contain compressed packages that include malicious scripts or files exploited by rogue programs. The recognition method and gathering system for credible websites of the present invention can identify official websites with higher confidence levels. On the one hand, this improves the efficiency with which the server collects correct credible websites and prevents the server from downloading files exploited by malware. On the other hand, it provides reliable download sites for users who need to download software, which reduces the risk of users downloading malicious samples and improves the users' network security.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Description of drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, identical reference symbols denote identical parts. In the drawings:
Fig. 1 schematically shows a flow chart of a recognition method for credible websites according to an embodiment of the invention;
Fig. 2 schematically shows another flow chart of a recognition method for credible websites according to an embodiment of the invention;
Fig. 3 schematically shows a flow chart of making the confidence level judgment by updating the sample threshold in a recognition method for credible websites according to another embodiment of the invention;
Fig. 4 schematically shows a block diagram of a recognition device for credible websites according to an embodiment of the invention;
Fig. 5 schematically shows another block diagram of a recognition device for credible websites according to an embodiment of the invention;
Fig. 6 schematically shows a block diagram of a gathering system for credible websites according to an embodiment of the invention.
Embodiment
The invention is further described below with reference to the drawings and specific embodiments.
Embodiments of the invention can be applied to a computer system/server, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments and/or configurations suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
The computer system/server can be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. In general, program modules can include routines, programs, object programs, components, logic, data structures and so on, which perform specific tasks or implement specific abstract data types. The computer system/server can be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules can be located on local or remote computing system storage media that include storage devices.
A large number of new files appear on the Internet every day, most of them new software and upgrade patch packages, and these can be collected as files in the white list database at the server end. To include this new software and these upgrade patch packages in the white list database in time, their release channels must first be checked. The release channels can usually be determined from the official websites of the software, and those official websites are then monitored.
The white list database at the server end can also collect and update a white list of legal programs, specifically in the following ways.
First way: technicians periodically collect legal programs by hand, by spiders or web crawlers, and/or through user uploads; the program features and/or program behaviors of the legal programs are screened manually or automatically by tools and saved in the white list.
Second way: unknown program features and program behaviors are analyzed according to the legal program features and corresponding program behaviors in the existing known white list, in order to update the white list.
The system for identifying credible websites according to the embodiment of the invention can obtain the download logs of downloaded files, analyze them, extract the current sites from the download logs, confirm the official websites among the current sites, and finally filter plug-in websites and/or private-server websites out of the official websites. By analyzing the download logs of software, download information can be obtained more accurately.
Fig. 1 schematically shows a flow chart of a recognition method for credible websites according to an embodiment of the invention. As shown in Fig. 1, in this embodiment the identification process for credible websites can comprise the following steps:
Step S11: extract the download log of the current site within a set time period;
When a client device on the Internet downloads software from a download site, the download behavior of the client device can be collected and recorded as the download log of the software. The download log can record download information about the software, such as the download path of the software and the site information of the download. From this download information, the concrete circumstances of the software download can be obtained.
For example, if the site information of two software downloads in the download log is http://www.baidu.com/xxxx and http://www.baidu.com/yyyy respectively, the candidate website identifier www.baidu.com can be extracted from the site information of these two downloads. The website identifier can of course be extracted in other ways, and the present invention places no restriction on this. The current site can be a download website, a forum website, or the like.
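As a minimal sketch of the extraction described above, the host part of each download URL can serve as the website identifier. The helper name `site_identifier` is hypothetical, and using the host name as the identifier is an assumption, since the text leaves the extraction method open.

```python
from urllib.parse import urlsplit


def site_identifier(download_url: str) -> str:
    """Return the lowercase host name of a download URL (assumed identifier)."""
    host = urlsplit(download_url).hostname
    if host is None:
        raise ValueError(f"no host in URL: {download_url!r}")
    return host


urls = ["http://www.baidu.com/xxxx", "http://www.baidu.com/yyyy"]
# Both download records collapse to the same candidate website identifier.
print({site_identifier(u) for u in urls})  # {'www.baidu.com'}
```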
A download log generally contains the following information: the signature of the software downloaded by the client device, the download path of the software, the site information of the software download, and the file name of the downloaded software. The download log may of course contain other information, such as the download time of the software, and the embodiment of the invention places no restriction on this. For example, the download log may also contain the user id, the hash value of the downloaded file, the parent page of the downloaded file, and the URL (Uniform Resource Locator) of the page from which the user downloaded the file. The hash value of a downloaded file uniquely identifies that file and may also be called the md5 value. If the downloaded file is a compressed package, the download log also contains the md5 values of the files inside the package.
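The file identifiers mentioned above can be sketched as follows, assuming the md5 value is the MD5 digest of the file content and assuming, purely for illustration, that the compressed package is a ZIP archive. The function names are hypothetical.

```python
import hashlib
import io
import zipfile
from typing import List


def md5_hex(data: bytes) -> str:
    """MD5 digest of the file content, as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def file_identifiers(data: bytes) -> List[str]:
    """MD5 of the file itself, followed by the MD5 of each member if it is a ZIP."""
    identifiers = [md5_hex(data)]
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                if not member.endswith("/"):  # skip directory entries
                    identifiers.append(md5_hex(archive.read(member)))
    return identifiers


print(file_identifiers(b"hello"))  # ['5d41402abc4b2a76b9719d911017c592']
```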
Step S12: according to the user identifiers and download file identifiers in the download log extracted in step S11, count the sample size and the number of users of the download links of the current site on which download operations were performed in the set time period;
Step S13: obtain the confidence level of the current site according to the sample size and the number of users counted in step S12;
In general, few kinds of files are downloaded from an official website within a set time period, because the download files provided on an official website are updated slowly and exist in relatively few versions. If the files that individual users download from a website are otherwise fairly random, yet many clients have all downloaded the same file from that website within the set time period, that file can be judged relatively credible and the website providing it can be judged an official website.
It follows that, if m users download n kinds of samples from a website within a period of time, the smaller n is and the larger m is, the more credible the website is. On this basis, one way of obtaining the confidence level of the current site is to make it inversely proportional to the sample size and directly proportional to the number of users (both obtained in step S12).
In embodiments of the present invention, the confidence level can be calculated by the following formula (1):
W = m / n    (1)
In formula (1), W is the confidence level of the current site, m is the number of users of the download links on which download operations were performed in the set time period, and n is the sample size of the download links on which download operations were performed in the set time period.
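Steps S12 and S13 can be illustrated with a short sketch: m and n are counted as the numbers of distinct user identifiers and distinct file identifiers, and W follows from formula (1). The record type `DownloadRecord` and the function name are hypothetical, and the records are assumed to belong already to one site and one set time period.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DownloadRecord:
    user_id: str   # identifies the downloading user
    file_md5: str  # identifies the downloaded file (sample)


def confidence_level(records: Iterable[DownloadRecord]) -> Tuple[int, int, float]:
    """Return (m, n, W) for the records of one site in one set time period."""
    records = list(records)
    m = len({r.user_id for r in records})   # number of distinct users
    n = len({r.file_md5 for r in records})  # number of distinct samples
    if n == 0:
        raise ValueError("no downloads in the period, W is undefined")
    return m, n, m / n


# Six users, two distinct files: W = 6 / 2.
log = [DownloadRecord(f"user{i}", "md5-a" if i % 2 else "md5-b") for i in range(6)]
print(confidence_level(log))  # (6, 2, 3.0)
```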
Step S14: identify whether the current site is an official website according to the confidence level obtained in step S13 and the sample size counted in step S12.
Suppose the confidence level is calculated with formula (1). If n is less than a preset sample number threshold and W is greater than a preset confidence level threshold, the current site can be judged to be an official website.
The sample number threshold and the confidence level threshold can be obtained empirically. For example, with a sample number threshold ≥ 6 and a confidence level threshold ≥ 1.5, 85% of the resulting download links are official website download links (accuracy), and they account for 75% of all official download websites (recall). Lowering the sample number threshold reduces accuracy and raises recall; conversely, raising the sample number threshold improves accuracy and lowers recall. Raising the confidence level threshold improves accuracy and lowers recall.
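The judgment of step S14 can be sketched as below, following the rule stated for formula (1): n below the sample number threshold and W above the confidence level threshold. The default values 6 and 1.5 are taken from the empirical example only for illustration, and the function name is hypothetical.

```python
def is_official_site(m: int, n: int,
                     sample_threshold: int = 6,
                     confidence_threshold: float = 1.5) -> bool:
    """Apply the step S14 rule: n < sample threshold and W = m / n > confidence threshold."""
    if n <= 0:
        return False  # no samples in the period, nothing to judge
    return n < sample_threshold and (m / n) > confidence_threshold


print(is_official_site(m=30, n=3))   # True: few samples, many users
print(is_official_site(m=30, n=40))  # False: too many distinct samples
```

Keeping the two thresholds as parameters makes it easy to trade accuracy against recall in the way the paragraph above describes.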
In other embodiments of the invention, if step S14 judges the current site to be an official website, download links can further be grabbed from this official website, and the grabbed download links can further be saved in the white list. The grabbing operation can be carried out by various web crawler services and/or website monitoring services.
The official websites identified in step S14 may still include third-party websites such as plug-in websites and private-server websites. Given the particular nature of the samples from plug-in websites and private-server websites, these websites need to be handled separately. Therefore, optionally, after step S14 the plug-in websites and private-server websites can be excluded from the identified official websites to determine the credible websites. If the current site is judged to be a credible website, download links can further be grabbed from this credible website, and the grabbed download links can further be saved in the white list.
Plug-in websites and private-server websites can be removed with a Bayes classifier. In the embodiment of the invention, a Bayes text classifier gathers feature statistics on the text in a web page and calculates the probability that a given web page belongs to a plug-in website. If this probability is greater than a set probability threshold, the page is considered to belong to a plug-in website.
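A Bayes text classifier of the kind mentioned above could look like the following sketch. It is a plain multinomial naive Bayes with add-one smoothing, which is an assumption, since the text does not specify the variant. Pages are assumed to be already split into words, and the class name, labels and training words are made up for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word lists, with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {}  # label -> word frequencies
        self.doc_counts: Counter = Counter()       # label -> number of pages
        self.vocabulary: set = set()

    def train(self, label: str, words: Iterable[str]) -> None:
        words = list(words)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def probability(self, label: str, words: List[str]) -> float:
        """Posterior probability that a page with these words has the given label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for lbl, counts in self.word_counts.items():
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[lbl] / total_docs)  # class prior
            for word in words:
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            log_scores[lbl] = score
        peak = max(log_scores.values())  # subtract the peak to avoid underflow
        weights = {lbl: math.exp(s - peak) for lbl, s in log_scores.items()}
        return weights[label] / sum(weights.values())


clf = NaiveBayesTextClassifier()
clf.train("plug-in", ["cheat", "aimbot", "hack"])
clf.train("other", ["official", "download", "release"])
p = clf.probability("plug-in", ["cheat", "hack"])
print(p > 0.5)  # True: the page would exceed a probability threshold of 0.5
```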
Besides plug-in websites, private-server websites can also be removed. A concrete method is as follows:
First, obtain reference samples of private-server websites, use a Bayes text classifier to segment the web page content of the reference samples into words, and count, within the private-server website category, the word frequency of each extracted term to obtain reference vectors of the form:
V-SOFT={word1_count,word2_count,…,wordn_count}
Second, obtain a web page to be classified and segment its content into words to obtain the vector:
V-UNKNOWN={word1_count,word2_count,…,wordn_count}
Then, calculate the distance from V-UNKNOWN to V-SOFT and compare it with the corresponding threshold. If the distance is less than the threshold, the web page to be classified is close to the private-server website category, so whether it is a private-server website can be determined and the website to be classified can be classified accordingly. This approach is of course not limited to classifying private-server websites and can also be used to classify other websites.
Finally, remove the private-server websites and plug-in websites from the official websites.
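The comparison between V-UNKNOWN and V-SOFT can be sketched as follows. The text does not name a distance measure, so Euclidean distance is used here as an assumption; the vocabulary, the reference counts and the threshold are invented for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def count_vector(words: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Word-frequency vector of a segmented page over a fixed vocabulary."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_near_reference(page_words: Sequence[str], reference: Sequence[float],
                      vocabulary: Sequence[str], distance_threshold: float) -> bool:
    """True when V-UNKNOWN lies closer to the reference vector than the threshold."""
    v_unknown = count_vector(page_words, vocabulary)
    return euclidean_distance(v_unknown, reference) < distance_threshold


vocabulary = ["server", "login", "patch"]
v_soft = [3, 2, 1]  # hypothetical reference word counts
page = ["server", "server", "server", "login", "login", "patch"]
print(is_near_reference(page, v_soft, vocabulary, distance_threshold=1.0))  # True
```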
The recognition method for credible websites of the embodiment of the invention can identify official websites with higher confidence levels. It thereby provides reliable download sites for users who need to download software, reduces the risk of users downloading malicious samples, and improves the users' network security.
Fig. 2 schematically shows another flow chart of a recognition method for credible websites according to an embodiment of the invention. As shown in Fig. 2, the recognition method for credible websites can comprise:
Step S21: determine the address of the corresponding log storage server according to the url of the current site. Usually, when a user downloads a resource from the current site, a series of data is produced and recorded on the log storage server in the form of a log. Each line of the log records the date, the time, the user and a description of the related operation, such as the resource downloaded from the current site.
Step S22: according to the address of the log storage server, extract the download log of the current site within a set time period;
To assess the credibility of the current site quickly and effectively, preferably only part of the download log is taken from the log storage server for processing. The log can be divided into time periods on the basis of time points, and the download log within a certain period, namely the set time period, is extracted so that it can be analyzed quickly and effectively. The length of the set time period is not specifically limited and can be chosen according to the computing efficiency and the reliability of the credibility judgment.
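Cutting the download log down to a set time period can be sketched as follows. The tuple layout of a log line and the half-open interval [start, start + length) are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

LogLine = Tuple[datetime, str, str]  # (timestamp, user identifier, file identifier)


def lines_in_period(lines: List[LogLine], start: datetime,
                    length: timedelta) -> List[LogLine]:
    """Keep only the log lines whose timestamp falls in [start, start + length)."""
    end = start + length
    return [line for line in lines if start <= line[0] < end]


log_lines = [
    (datetime(2012, 12, 1, 9, 0), "user1", "md5-a"),
    (datetime(2012, 12, 2, 9, 0), "user2", "md5-a"),
    (datetime(2012, 12, 9, 9, 0), "user3", "md5-b"),
]
week = lines_in_period(log_lines, datetime(2012, 12, 1), timedelta(days=7))
print(len(week))  # 2
```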
Step S23: obtain the user identifiers and download file identifiers from the extracted download log;
A download log mostly contains both the user identifier (id) of the user who downloaded a resource from the current site and the download file identifier (id) of the resource downloaded from the current site. The user identifier identifies the users who downloaded resources from the current site within the set time period, and the download file identifier identifies the files on the current site that users downloaded.
Step S24: according to the user identifiers and download file identifiers within the extracted set time period, count the sample size and the number of users of the download links of the current site on which download operations were performed in the set time period;
As mentioned above, because this embodiment extracts only the download log within the set time period, the statistical analysis is correspondingly performed only on the user identifiers and download file identifiers in the download log of that period. Users can be counted by the registered user name with which they logged in and downloaded resources from the current site, or by the IP address from which they anonymously accessed the current site and downloaded resources.
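The two ways of counting users mentioned above can be combined in a small sketch: a logged-in download is keyed by its registered user name, an anonymous one by its IP address. The record fields `user_name` and `ip` are hypothetical names, and combining both keys in one count is an assumption.

```python
from typing import Dict, Iterable, Optional


def user_key(record: Dict[str, Optional[str]]) -> str:
    """Registered user name when present, otherwise the client IP address."""
    name = record.get("user_name")
    if name:
        return "user:" + name
    return "ip:" + str(record["ip"])


def count_users(records: Iterable[Dict[str, Optional[str]]]) -> int:
    """Number of distinct users in the set time period."""
    return len({user_key(r) for r in records})


records = [
    {"user_name": "alice", "ip": "10.0.0.1"},
    {"user_name": "alice", "ip": "10.0.0.2"},  # same account, different address
    {"user_name": None, "ip": "10.0.0.3"},     # anonymous download
    {"user_name": None, "ip": "10.0.0.3"},
]
print(count_users(records))  # 2
```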
Step S25: obtain the confidence level of the current site, which is inversely proportional to the sample size and directly proportional to the number of users;
In embodiments of the present invention, the confidence level can be calculated by the following formula (1):
W = m / n    (1)
In formula (1), W is the confidence level of the current site, m is the number of users of the download links on which download operations were performed in the set time period, and n is the sample size of the download links on which download operations were performed in the set time period.
It will be understood that the embodiment of the invention can also use other, similar nonlinear confidence level calculation methods to obtain the confidence level of the current site; these are not described further here.
Step S26: judge whether the confidence level is not less than the set confidence level threshold; if so, execute step S27; otherwise, execute step S29;
Step S27: judge whether the sample size is not less than the set sample threshold; if so, execute step S30; otherwise, execute step S29;
Step S29: judge that the current site is an unofficial website;
Step S30: judge that the current site is an official website.
After step S30, the credible websites can be obtained by removing third-party websites such as private-server websites and plug-in websites from the official websites. After the credible websites have been collected, the files of the credible websites can be collected periodically by hand, by spiders or web crawlers, and/or through user uploads; the program features and/or program behaviors of the programs related to those files are subsequently screened manually or automatically by tools and saved in the white list database.
Unknown program features and program behaviors can further be analyzed according to the legal program features and corresponding program behaviors in the existing known white list, in order to update the white list.
Fig. 3 schematically shows a flow chart of making the confidence level judgment by updating the sample threshold in a recognition method for credible websites according to another embodiment of the invention. As shown in Fig. 3, this embodiment differs from the embodiment shown in Fig. 2 in that, to improve the accuracy of the credibility judgment and prevent misjudgment, the confidence levels of set time periods of different durations are processed while the sample threshold is updated. It can comprise the following steps:
Step S31: within the current set time period, obtain the confidence level of the current site for that period, which is inversely proportional to the sample size and directly proportional to the number of users;
In embodiments of the present invention, the confidence level can be calculated by the following formula (1):
W = m / n    (1)
In formula (1), W is the confidence level of the current site, m is the number of users of the download links on which download operations were performed in the set time period, and n is the sample size of the download links on which download operations were performed in the set time period.
It will be understood that the embodiment of the invention can also use other, similar nonlinear confidence level calculation methods to obtain the confidence level of the current site; these are not described further here.
Step S32: judge whether the confidence level for the current set time period is not less than the set confidence level threshold; if so, execute step S33; otherwise, execute step S34;
Step S33: judge whether the sample size in the current set time period is not less than the set sample threshold; if so, execute step S35; otherwise, execute step S34;
Step S34: judge that the current site is an unofficial website;
Step S35: within another set time period, obtain the confidence level of the current site for that period, which is inversely proportional to the sample size and directly proportional to the number of users, and execute step S36;
In step S35, the confidence level in the other set time period can be obtained by the calculation method described above with reference to Fig. 1 for the confidence level in the current time period, which is not repeated here.
Step S36: judge whether the confidence level for this other set time period is not less than the set confidence level threshold; if so, execute step S37; otherwise, execute step S34;
Step S37: update the sample threshold;
Step S38: judge whether the sample size in this other set time period is not less than the updated sample threshold; if so, execute step S39; otherwise, execute step S35;
Step S39: judge that the current site is an official website.
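The flow of steps S31 to S39 can be sketched as a loop over set time periods. Each period is given as a pair (m, n), the comparisons follow the "not less than" wording of the steps, and the rule for updating the sample threshold is passed in as a hypothetical `update_threshold` function because the text does not specify it. Returning `None` when the periods run out before a decision is also an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple


def classify_over_periods(periods: Iterable[Tuple[int, int]],
                          confidence_threshold: float,
                          sample_threshold: int,
                          update_threshold: Callable[[int], int]) -> Optional[bool]:
    """Return True (official), False (unofficial) or None (periods exhausted)."""
    it = iter(periods)
    first = next(it, None)
    if first is None:
        return None
    m, n = first                                    # S31: current set time period
    if n == 0 or m / n < confidence_threshold:      # S32 fails -> S34
        return False
    if n < sample_threshold:                        # S33 fails -> S34
        return False
    for m, n in it:                                 # S35: another set time period
        if n == 0 or m / n < confidence_threshold:  # S36 fails -> S34
            return False
        sample_threshold = update_threshold(sample_threshold)  # S37
        if n >= sample_threshold:                   # S38 passes -> S39
            return True
        # S38 fails -> back to S35 with the next period
    return None


result = classify_over_periods([(30, 6), (60, 8)], 1.5, 6, lambda t: t + 1)
print(result)  # True
```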
After step S39, the credible websites can be obtained by removing third-party websites such as private-server websites and plug-in websites from the official websites. After the credible websites have been collected, the files of the credible websites can be collected periodically by hand, by spiders or web crawlers, and/or through user uploads; the program features and/or program behaviors of the programs related to those files are subsequently screened manually or automatically by tools and saved in the white list database.
Unknown program features and program behaviors can further be analyzed according to the legal program features and corresponding program behaviors in the existing known white list, in order to update the white list.
Because this scheme raises the probability that the source website of a collected file is credible, it improves the efficiency of collecting the white list (credible websites).
It should be noted that, in the embodiment shown in Fig. 3, a plurality of set time periods can be used, the corresponding confidence levels counted separately, and the credibility of the current site judged from these confidence levels; the detailed procedure is not repeated here.
In addition, as described for step S14, lowering the sample number threshold reduces accuracy and raises recall; conversely, raising the sample number threshold improves accuracy and lowers recall. Raising the confidence level threshold improves accuracy and lowers recall. Therefore, in this embodiment the website credibility judgment is made only by updating the sample threshold.
In another embodiment, the website credibility judgment can also be made by updating the confidence level threshold, which is not described further here.
Fig. 4 has schematically shown the according to an embodiment of the invention block diagram of the recognition device of credible website.As shown in Figure 4, in the present embodiment, the recognition device of credible website can comprise extraction module 41, statistical module 42, acquisition module 43 and identification module 44.Extraction module 41 is used for extracting the download log of current site in a setting-up time section.Statistical module 42 is used for counting current site was carried out the download link of down operation in described setting-up time section sample size and number of users according to the user ID of the described download log of extraction module 41 extractions and download file sign.Acquisition module 43 is for the described sample size of the current site that counts according to statistical module 42 and the confidence level that number of users obtains current site.Whether identification module 44 to identify described current site be official website if being used for the confidence level of the current site obtained according to acquisition module 43 and sample size that statistical module 42 counts.Identification module 44 also is used for obtaining credible website behind the described official website cleaning third party website of identification.
Identification module 44 may further be used to judge that the current site is an official website when the sample size is less than a preset sample number threshold and the confidence level of the current site is greater than a preset confidence level threshold.
Fig. 5 schematically shows another block diagram of the recognition device for credible websites according to an embodiment of the invention. The recognition device may further comprise a handling module 45. Handling module 45 is connected to identification module 44 and is used to crawl download links from the official website when identification module 44 judges the current site to be an official website; handling module 45 is also used to crawl download links from the credible website when identification module 44 judges the current site to be a credible website. Further, the recognition device may also comprise a preservation module 46. Preservation module 46 is connected to handling module 45 and saves the download links crawled by handling module 45 to a white list database.
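One possible shape of the crawl-and-save step is sketched below. The specification does not say how links are recognized or how the white list database is organized, so everything here is an assumption: download links are guessed from a hypothetical list of file extensions, the page is a hard-coded HTML string rather than a fetched one, and the white list is a single-column SQLite table held in memory.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class DownloadLinkParser(HTMLParser):
    """Collect absolute URLs from <a href> tags that look like downloads."""

    # Hypothetical heuristic: treat these file extensions as downloads.
    DOWNLOAD_SUFFIXES = (".exe", ".zip", ".msi")

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(self.DOWNLOAD_SUFFIXES):
            self.links.append(urljoin(self.base_url, href))


def save_to_whitelist(conn, links):
    """Store links in a (hypothetical) whitelist table, ignoring duplicates."""
    conn.execute("CREATE TABLE IF NOT EXISTS whitelist (url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO whitelist (url) VALUES (?)",
        [(url,) for url in links],
    )
    conn.commit()


# Stand-in for a page fetched from a site already judged official.
page = '<a href="/dl/setup.exe">Download</a> <a href="/about.html">About</a>'
parser = DownloadLinkParser("https://example.com/")
parser.feed(page)

conn = sqlite3.connect(":memory:")
save_to_whitelist(conn, parser.links)
stored = [row[0] for row in conn.execute("SELECT url FROM whitelist")]
print(stored)  # ['https://example.com/dl/setup.exe']
```

Using the URL as the primary key means that crawling the same site again does not create duplicate white list entries.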
The confidence level of the current site may be inversely proportional to the sample size and directly proportional to the number of users.
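The simplest function with both properties is a ratio of the number of users to the sample size. The sketch below uses that form with a hypothetical scaling constant `k`; the specification states only the proportionality, so the exact formula, the function name, and the handling of a zero sample size are assumptions.

```python
def site_confidence(user_count, sample_count, k=1.0):
    """Confidence proportional to users and inversely proportional to samples.

    `k` is a hypothetical scaling constant, not taken from the specification.
    """
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return k * user_count / sample_count


# A site serving few distinct files to many users scores high ...
print(site_confidence(user_count=1000, sample_count=5))    # 200.0
# ... while a site serving many distinct files to the same users scores low.
print(site_confidence(user_count=1000, sample_count=500))  # 2.0
```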
The current site may be a download website, a forum website, or the like.
By carrying out the above recognition method for credible websites, the recognition device of the embodiment of the invention can identify official websites with a high confidence level, thereby providing reliable download sites for users who need to download files, reducing the risk that users download malicious samples, and improving users' network security.
Fig. 6 schematically shows a block diagram of a gathering system for credible websites according to an embodiment of the invention. As shown in Fig. 6, in the present embodiment the gathering system may comprise a server 51 and an authentic specimen database 52.
Server 51 comprises a processor cluster 511 with data processing capability, such as CPUs or DSPs, configured to: extract the download log of the current site within a set time period; count, according to the user IDs and download file IDs in the extracted download log, the sample size and the number of users of the download links of the current site through which download operations were performed within the set time period; obtain the confidence level of the current site according to the counted sample size and number of users; and identify whether the current site is an official website according to the obtained confidence level and the counted sample size.
Server 51 may, through its CPU or DSP, control a wired network adapter or a wireless network card to access the current site and extract its download log.
Authentic specimen database 52 is used to collect the official websites identified by server 51.
Optionally, the server comprises:
an extraction module, used to extract the download log of the current site within a set time period;
a statistical module, used to count, according to the user IDs and download file IDs in the download log extracted by the extraction module, the sample size and the number of users of the download links of the current site through which download operations were performed within the set time period;
an acquisition module, used to obtain the confidence level of the current site according to the sample size and the number of users counted by the statistical module;
an identification module, used to identify whether the current site is an official website according to the confidence level obtained by the acquisition module and the sample size counted by the statistical module.
Optionally, the identification module is also used to judge that the current site is an official website when the sample size is less than a preset sample number threshold and the confidence level of the current site is greater than a preset confidence level threshold.
Optionally, the server further comprises: a handling module, connected to the identification module and used to crawl download links from the official website when the identification module judges the current site to be an official website.
Optionally, the identification module is also used to obtain credible websites by cleaning third-party websites out of the identified official websites.
Optionally, the handling module is also used to crawl download links from the credible website when the identification module judges the current site to be a credible website.
Optionally, the server further comprises: a preservation module, connected to the handling module and used to save the download links crawled by the handling module to a white list database.
In the present embodiment, for the technical description of the official website recognition device and each of its functional modules, refer to the embodiments above; it is not repeated here.
The gathering system for credible websites of the embodiment of the invention can obtain the download logs of downloaded files, analyze the download logs, extract the current sites recorded in them, confirm the official websites among the current sites, and finally filter out third-party websites among the official websites, such as plug-in websites and/or private-server websites. By analyzing the download logs of software, more accurate download information can be obtained.
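The final filtering step can be pictured as removing known third-party hosts from the list of sites already judged official. This passage does not say how plug-in or private-server websites are recognized, so the sketch below simply assumes a curated set of such hosts is available; the function name and all host names are hypothetical.

```python
def clean_third_party(official_sites, third_party_hosts):
    """Return the official sites that are not known third-party hosts.

    `third_party_hosts` is assumed to be a curated set of plug-in and
    private-server hosts; how it is built is outside this sketch.
    """
    return [site for site in official_sites if site not in third_party_hosts]


official_sites = ["vendor.example", "plugin.example", "game.example"]
third_party_hosts = {"plugin.example", "private-server.example"}

credible_sites = clean_third_party(official_sites, third_party_hosts)
print(credible_sites)  # ['vendor.example', 'game.example']
```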
References herein to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Note also that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or limit the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made herein is illustrative and not restrictive; the scope of the invention is defined by the appended claims.

Claims (12)

1. A recognition method for credible websites, characterized by comprising:
extracting the download log of a current site within a set time period, and counting, according to the user IDs and download file IDs in the download log, the sample size and the number of users of the download links of the current site through which download operations were performed within the set time period;
obtaining the confidence level of the current site according to the sample size and the number of users of the current site, and identifying whether the current site is an official website according to the confidence level and the sample size of the current site;
wherein obtaining the confidence level of the current site according to the sample size and the number of users of the current site further comprises: the confidence level of the current site is inversely proportional to the sample size and directly proportional to the number of users.
2. The recognition method for credible websites according to claim 1, characterized by further comprising:
if the current site is judged to be an official website, crawling download links from the official website and saving the crawled download links to a white list database.
3. The recognition method for credible websites according to claim 1, characterized by further comprising:
obtaining credible websites by cleaning third-party websites out of the identified official websites.
4. The method according to claim 3, characterized by further comprising:
if the current site is judged to be a credible website, crawling download links from the credible website and saving the crawled download links to a white list database.
5. The recognition method for credible websites according to claim 1, characterized in that identifying whether the current site is an official website according to the confidence level and the sample size of the current site further comprises:
if the sample size is less than a preset sample number threshold and the confidence level of the current site is greater than a preset confidence level threshold, judging that the current site is an official website.
6. A gathering system for credible websites, comprising an authentic specimen database and a server, wherein:
the server is adapted to extract the download log of a current site within a set time period; to count, according to the user IDs and download file IDs in the extracted download log, the sample size and the number of users of the download links of the current site through which download operations were performed within the set time period; to obtain the confidence level of the current site according to the counted sample size and number of users; and to identify whether the current site is an official website according to the obtained confidence level and the counted sample size;
the authentic specimen database is adapted to collect the official websites identified by the server.
7. The gathering system according to claim 6, characterized in that the server comprises:
an extraction module, used to extract the download log of the current site within a set time period;
a statistical module, used to count, according to the user IDs and download file IDs in the download log extracted by the extraction module, the sample size and the number of users of the download links of the current site through which download operations were performed within the set time period;
an acquisition module, used to obtain the confidence level of the current site according to the sample size and the number of users counted by the statistical module;
an identification module, used to identify whether the current site is an official website according to the confidence level obtained by the acquisition module and the sample size counted by the statistical module.
8. The gathering system according to claim 7, characterized in that
the identification module is also used to judge that the current site is an official website when the sample size is less than a preset sample number threshold and the confidence level of the current site is greater than a preset confidence level threshold.
9. The gathering system according to claim 7, characterized in that the server further comprises:
a handling module, connected to the identification module and used to crawl download links from the official website when the identification module judges the current site to be an official website.
10. The gathering system according to claim 7, characterized in that
the identification module is also used to obtain credible websites by cleaning third-party websites out of the identified official websites.
11. The gathering system according to claim 10, characterized in that
the handling module is also used to crawl download links from the credible website when the identification module judges the current site to be a credible website.
12. The gathering system according to claim 9 or 11, characterized in that the server further comprises:
a preservation module, connected to the handling module and used to save the download links crawled by the handling module to a white list database.
CN201210518471.2A 2012-12-05 2012-12-05 The recognition methods of credible website and gathering system Active CN102984162B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210518471.2A CN102984162B (en) 2012-12-05 2012-12-05 The recognition methods of credible website and gathering system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210518471.2A CN102984162B (en) 2012-12-05 2012-12-05 The recognition methods of credible website and gathering system

Publications (2)

Publication Number Publication Date
CN102984162A true CN102984162A (en) 2013-03-20
CN102984162B CN102984162B (en) 2016-05-18

Family

ID=47857905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210518471.2A Active CN102984162B (en) 2012-12-05 2012-12-05 The recognition methods of credible website and gathering system

Country Status (1)

Country Link
CN (1) CN102984162B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000036539A2 (en) * 1998-12-12 2000-06-22 The Brodia Group Trusted agent for electronic commerce
CN101919219A (en) * 2007-09-19 2010-12-15 阿尔卡特朗讯美国公司 Method and apparatus for preventing phishing attacks
CN102355469A (en) * 2011-10-31 2012-02-15 北龙中网(北京)科技有限责任公司 Method for displaying credibility certification for website in address bar of browser
CN102984161A (en) * 2012-12-05 2013-03-20 北京奇虎科技有限公司 Identification method and device for reliable website

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103414758A (en) * 2013-07-19 2013-11-27 北京奇虎科技有限公司 Method and device for processing logs
CN103414758B (en) * 2013-07-19 2017-04-05 北京奇虎科技有限公司 log processing method and device
CN104765882A (en) * 2015-04-29 2015-07-08 中国互联网络信息中心 Internet website statistics method based on web page characteristic strings
CN108768934A (en) * 2018-04-11 2018-11-06 北京立思辰新技术有限公司 Rogue program issues detection method, device and medium
CN108768934B (en) * 2018-04-11 2021-09-07 北京立思辰新技术有限公司 Malicious program release detection method, device and medium
CN113010764A (en) * 2021-04-15 2021-06-22 杭州恒声科技有限公司 Public opinion monitoring system, method, computer equipment and storage medium
CN113010764B (en) * 2021-04-15 2023-08-22 德观智能控制设备涿州有限公司 Public opinion monitoring system, public opinion monitoring method, computer equipment and storage medium
CN117376033A (en) * 2023-12-06 2024-01-09 浙江网商银行股份有限公司 File processing method and device

Also Published As

Publication number Publication date
CN102984162B (en) 2016-05-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee after: Beijing Qizhi Business Consulting Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20220318

Address after: 100016 1773, 15 / F, 17 / F, building 3, No.10, Jiuxianqiao Road, Chaoyang District, Beijing

Patentee after: 360 Digital Security Technology Group Co., Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Beijing Qizhi Business Consulting Co.,Ltd.