WO2023003457A1

WO2023003457A1 - Method for increasing web security and device and system for doing the same

Info

Publication number: WO2023003457A1
Application number: PCT/NL2022/050375
Authority: WO
Inventors: Viraj Krishan BALGOBIND
Original assignee: Chaitanya B.V.; Dmattos B.V.
Priority date: 2021-07-20
Filing date: 2022-06-30
Publication date: 2023-01-26
Also published as: NL2028798B1

Abstract

Method for determining a safety score for a web address, wherein the method is a computer implemented method comprising the steps of: − providing a retrieved content collection present at the web address; − providing a domain identifier associated with the web address; − providing a content database comprising a plurality of known content collections each with a corresponding domain identifier; − for one or more known content collections stored in the content database: − generating a content match score between the retrieved content collection and the known content collection, wherein the content match score is indicative of a similarity between the retrieved content collection and the known content collection; − selecting a known domain identifier corresponding to the known content collection with the highest content match score; − generating an identity match score by measuring a similarity between the retrieved domain identifier and the selected known domain identifier; − generating a safety score for the web address on the basis of the identity match score, wherein the safety score is indicative of a measurement of a discrepancy between the highest content match score and the identity match score.

Description

METHOD FOR INCREASING WEB SECURITY AND DEVICE AND SYSTEM FOR DOING THE SAME

The invention relates to a method for generating a safety score for a web address and a system configured for doing the same.

The use of modern computer systems such as personal computers, laptops, tablets and mobile phones is becoming more and more ingrained in people's daily lives, which means that a lot of fraud-sensitive information is exchanged using these systems. Because of this, these systems and their users are targeted by criminals who try to obtain sensitive information such as credit card information, bank account details, login credentials and many other types of sensitive information. This is often done by tricking users into visiting malicious websites which try to imitate websites of well-known and trusted companies, such as banking websites, mailing websites, delivery system websites, social media websites and many more. For example, a malicious website might pretend to be a banking website used by the user and present a fake login screen or fake payment screen to steal user credentials or to perform fraudulent payments.

Security systems such as DNS filters are often in place to prevent users from accessing malicious websites. These DNS filters block access to malicious websites based on a blacklist of known malicious websites. A problem with this, however, is that a website is only added to the blacklist once it is known to be malicious. This means there is a period between the time a new malicious website is put online and the time said website is added to the blacklist, during which the user is not protected by the DNS filter and can still visit the malicious website.

An object, next to other objects, of the present invention is to obtain an indication whether a website is possibly malicious.

To meet this object, next to other objects, a method for determining a safety score for a web address is provided, wherein the method comprises the steps of:

- providing a retrieved content collection present at the web address;

- providing a domain identifier associated with the web address;

- providing a content database comprising a plurality of known content collections each with a corresponding domain identifier;

- for one or more known content collections stored in the content database: generating a content match score between the retrieved content collection and the known content collection, wherein the content match score is indicative of a similarity between the retrieved content collection and the known content collection;

- selecting a known domain identifier corresponding to the known content collection with the highest content match score;

- generating an identity match score by measuring a similarity between the retrieved domain identifier and the selected known domain identifier;

- generating a safety score for the web address on the basis of the identity match score, wherein the safety score is indicative of a measurement of a discrepancy between the highest content match score and the identity match score.

By determining the safety score for the web address as described above it is possible to get an indication whether or not the content of the web address matches with a known website, while its identity does not match. That is, the safety score is indicative of whether or not a malicious website is trying to imitate an existing website. This has the advantage that, by observing the safety score, a user or a system can be aware of a possibility of the web address being malicious, thereby increasing the security of the computer system of the user.

In short, for determining the safety score for a web site at a web address to be checked, the method selects from the content database of known websites, the website, preferably an identifier thereof, of which the content collection matches the retrieved content collection of the web address best. The website is thus preferably identified in the content database on the basis of similarity of the retrieved content collection, which may include elements determining the appearance of the website. By comparing the identified domain identifier to the retrieved domain identifier, the safety score is determined. A website having for instance an appearance, as preferably defined by the content collection, most similar to an appearance of a web site at a web address in the content database, is then expected to have the same, or substantially the same, domain identifier.
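The selection and comparison summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: content collections are represented as plain strings, `difflib.SequenceMatcher` stands in for the content similarity measure, the identity match is a direct (is-equal) comparison of domain names, and all names and database entries are hypothetical.

```python
from difflib import SequenceMatcher


def content_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]; SequenceMatcher is only a stand-in."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(retrieved_content: str, content_db: dict[str, str]) -> tuple[str, float]:
    """Return (known domain identifier, highest content match score)."""
    scores = {domain: content_similarity(retrieved_content, known)
              for domain, known in content_db.items()}
    domain = max(scores, key=scores.get)
    return domain, scores[domain]


# Hypothetical content database: domain identifier -> known content collection.
content_db = {
    "mybank.example": "MyBank login username password",
    "news.example": "Today's headlines and weather",
}

# Content retrieved from a hypothetical web address under test.
retrieved_domain = "mybank-secure.example"
known_domain, content_score = best_match(
    "MyBank login username password", content_db)

# Assumed identity match: direct comparison of the two domain identifiers.
identity_score = 1.0 if retrieved_domain == known_domain else 0.0
```

Here the retrieved content matches the known banking site exactly while the domain identifiers differ, which is the situation the safety score is meant to flag.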

It is noted that the content match score may range between zero and one, wherein a content match score of close to zero is indicative of a retrieved content collection not matching with a known content collection and wherein a higher content match score is indicative for a similarity to exist between the retrieved content collection and the known content collection.

It is noted that a web address may refer to a Uniform Resource Locator (URL), a domain name or any other identifier of a web resource. A web address may include a basis domain name and a path name. A basis domain name may be a hostname, a nodename or a domain name and may include a top level domain name. A path name may refer to a location or resource relative to the domain name. An example of a web address is www.example.com/example1/1 wherein www.example.com is a basis domain name and /example1/1 is a path name.
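As an illustration of this split, a web address can be divided into a basis domain name and a path name with Python's standard `urllib.parse` module. The helper name `split_web_address` is hypothetical, and the handling of addresses without a scheme is an assumption made for this sketch.

```python
from urllib.parse import urlsplit


def split_web_address(web_address: str) -> tuple[str, str]:
    """Split a web address into (basis domain name, path name).

    Assumption: an address without a scheme (e.g. "www.example.com/a")
    is prefixed with "//" so that the host part is parsed as such.
    """
    if "://" not in web_address and not web_address.startswith("//"):
        web_address = "//" + web_address
    parts = urlsplit(web_address)
    return (parts.hostname or "", parts.path)


basis_domain_name, path_name = split_web_address("www.example.com/example1/1")
```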

It is noted that a domain identifier may comprise one or more of: a web address or part thereof, a domain name root, a public key, a web certificate or part thereof or any other identifier suitable for identifying a website or web address.

It is further noted that the content match score may also have a different range, may have a reverse range and/or may have an at least partly negative range.

It is noted that the identity match score may range between zero and one, wherein an identity match score of or close to zero is indicative of the retrieved domain identifier and the known domain identifier not matching while an identity match score of or close to one may be indicative for a similarity between the retrieved domain identifier and the known domain identifier.

It is further noted that the identity match score may also have a different range, may have a reverse range and/or may have an at least partly negative range.

It is noted that the safety score may range between, for example, minus one and one, wherein, for example: a relatively high safety score is indicative for both the highest content match score and the identity match score being relatively high (in other words, that a similarity exists between both content collections and both corresponding domain identifiers, which may indicate the web address is relatively safe); a relatively low safety score is indicative for the highest content match score being relatively high, but the identity match score being relatively low (in other words, that there is a similarity between both content collections but not between both domain identifiers, which may indicate the web address may be malicious); and the safety score being near zero is indicative for both the content match score and the identity match score being low (in other words, that there is no similarity between both content collections and domain identifiers, which may indicate that the web address may be unknown but may not be malicious).
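One possible way to obtain a safety score with the three behaviours described above is sketched below. The formula is an assumption chosen for illustration only; the text does not prescribe a specific function. Both input scores are assumed to lie between zero and one.

```python
def safety_score(content_match: float, identity_match: float) -> float:
    """Map two scores in [0, 1] to a safety score in [-1, 1].

    Assumed formula (not taken from the source text):
        content_match * (2 * identity_match - 1)
    High content and high identity match -> close to +1 (relatively safe).
    High content and low identity match  -> close to -1 (possibly malicious).
    Low content match                    -> close to 0 (unknown website).
    """
    return content_match * (2.0 * identity_match - 1.0)


likely_safe = safety_score(0.9, 0.9)       # similar content, similar identity
likely_malicious = safety_score(0.9, 0.1)  # similar content, different identity
unknown = safety_score(0.1, 0.1)           # no similar known website
```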

In an embodiment, the method further comprises the step of determining if the highest content match score is above a predetermined match threshold. The steps of determining the identity match score and generating the safety score may then only be performed in response to the highest content match score being above the predetermined match threshold. If the highest content match score is too low, the risk of a user confusing one website with the other is low. An alternative safety score may be set in this case, as will be explained below. This saves computing time.

The method may also include, in response to the highest content match score being below the predetermined match threshold, setting the safety score to an alternative safety score. If a website is not similar to any known website, for instance of a group of relevant websites such as banking sites, the risk is assessed as low, as the user will then not confuse the (to be) retrieved website with any of the websites in the content database.

An advantage of this embodiment is that a safety score is always determined even if no match is found between the retrieved content collection and the known content collections.

In a further embodiment setting the safety score to an alternative safety score comprises skipping the steps of selecting the known content element with the highest content match score, generating the identity match score and generating the safety score and setting the safety score to a value indicative for the highest content match score being below the predetermined match threshold.

An advantage of this embodiment is that some steps are skipped when the content match score is not high enough to establish with sufficient certainty that web address is trying to imitate a website associated with one of the known content collections. As a result the method is more efficient, for instance in terms of required resources, in these cases.

In another further embodiment, the value indicative for the highest content match score being below the predetermined match threshold is a relatively high safety score value, for example 0.7, indicative for the web address probably not being malicious.

An advantage of this embodiment is that a non-malicious website, which would have a low content match score with every known content collection (since it does not copy any of the known websites), will still receive a high safety score without the need to spend computing resources such as memory, bandwidth and processing power on determining the identity match score.
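The threshold check and the alternative safety score can be sketched as follows. The threshold value of 0.6, the callable-based interface and the final scoring formula are assumptions for illustration; only the example value 0.7 comes from the text.

```python
MATCH_THRESHOLD = 0.6           # assumed value of the predetermined match threshold
ALTERNATIVE_SAFETY_SCORE = 0.7  # example value mentioned in the text


def gated_safety_score(highest_content_match, compute_identity_match):
    """Return a safety score, skipping the identity step for low matches.

    `compute_identity_match` is a zero-argument callable, so the
    (potentially expensive) identity comparison only runs when the
    highest content match score reaches the threshold.
    """
    if highest_content_match < MATCH_THRESHOLD:
        return ALTERNATIVE_SAFETY_SCORE
    identity_match = compute_identity_match()
    # Assumed combination of both scores, not taken from the source text.
    return highest_content_match * (2.0 * identity_match - 1.0)
```

Passing the identity comparison as a callable makes the saving explicit: for a low content match the callable is never invoked.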

In another further embodiment, the method further comprises: in response to the highest content match score being below the predetermined match threshold, providing one or more search content collections by performing a reverse search using the retrieved content collection and using the one or more search content collections as the one or more known content collections. An advantage of this embodiment is that, by performing a reverse search, retrieved content collections that do not match with any of the known content collections can be matched with content collections of web addresses which have similar content collections and a high search ranking. As web addresses with high search rankings are less likely to be malicious, a reasonably accurate safety score can still be determined.

In a further embodiment, the step of providing one or more search content collections and the step of using the one or more search content collections as the one or more known content collections comprise performing the steps of: providing a connection with one or more search engines; performing a reverse search with the one or more search engines using the retrieved content collection; retrieving one or more search results comprising web addresses from the reverse search, wherein optionally the retrieved one or more search results are selected by considering their search ranking to be relatively high; retrieving one or more search content collections corresponding to the web addresses from the reverse search; retrieving one or more search domain identifiers corresponding to the web addresses from the reverse search; for each of the one or more search content collections:

- generating a content match score between the retrieved content collection and the search content collection; selecting the domain identifier corresponding to the search content collection with the highest content match score; and, wherein the steps of generating the identity match score and generating the safety score are done using the selected domain identifier corresponding to the search content collection with the highest content match score.

It is noted that the search ranking corresponds to an order in which the one or more search results are presented by the reverse search. For example, the first five or ten search results have a higher search ranking compared to the successive search results.

It is noted that the one or more search engines may be search engines of third parties, for example Google Search, Bing, DuckDuckGo, TinEye etc. An advantage of this embodiment is that websites which have a relatively high search ranking are more likely to correspond to non-malicious websites, which means a reasonably accurate safety score can still be determined using the content collections of said websites.

In an embodiment, a content collection comprises one or more content elements and the step of generating the content match score comprises generating a similarity score for one or more content elements in the retrieved content collection with one or more known content elements of the known content collection, wherein the content match score is a function of the similarity scores.

An advantage of this embodiment is that, because the content match score is a function of the similarity scores between content elements, it is also possible to determine a safety score if, for example, a retrieved content collection comprises only a few content elements compared to a known content collection. This is, for example, relevant when a malicious website shows only a login form of a banking website, while the normal banking website shows more content elements (for example commercials or links to other parts of the website) next to a login form. In this case the content collection from the malicious website does not totally match the content collection of the banking website (as it does not contain all content elements of the banking website), but the content elements that are present at the malicious websites may still be similar to one or more of the content elements of the banking website.
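The partial-match behaviour described above can be sketched as follows. The aggregation used here (the average, over the retrieved content elements, of the best similarity to any known content element) is an assumption; the text only requires the content match score to be a function of the similarity scores. Content elements are represented as strings and `difflib.SequenceMatcher` is a stand-in similarity function.

```python
from difflib import SequenceMatcher


def element_similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] for two content elements."""
    return SequenceMatcher(None, a, b).ratio()


def content_match_score(retrieved: list[str], known: list[str]) -> float:
    """Average, over retrieved elements, of the best similarity to any known element."""
    if not retrieved or not known:
        return 0.0
    best = [max(element_similarity(r, k) for k in known) for r in retrieved]
    return sum(best) / len(best)


# Hypothetical example: the fake page copies only the login form.
bank_page = ["login form: username, password", "mortgage advert", "links to branches"]
fake_page = ["login form: username, password"]
score = content_match_score(fake_page, bank_page)
```

Because the score is computed per retrieved element, the fake page still matches the banking website strongly even though it lacks most of its content elements.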

In a further embodiment, each content element has a content type and the step of generating the similarity score comprises:

- determining the content type of the retrieved content element;

- selecting one or more similarity functions corresponding to the content type from a list of similarity functions wherein each similarity function is associated with one or more content types;

- calculating one or more similarity scores by applying the one or more selected similarity functions to the retrieved content element and one or more known content elements that have the same content type.

An advantage of this embodiment is that by adapting the similarity function to the content type a more accurate safety score can be determined. For example, a first similarity function appropriate for images can be used to determine similarities between images and a second similarity function appropriate for text can be used to determine similarities between text items. In an even further embodiment, a similarity function comprises a weight factor associated with the content type.

It is noted that the weight factor may be predetermined based on a risk associated with a content type. For example, a visual content type such as an image or a color may pose a higher risk, compared to text, as images and colors are more often associated with, for example, a brand. E.g. a malicious website may be a more convincing imitation of a banking website by using a logo of the bank instead of only using the name of the bank. As another example an input field on a website may be used to gather login information of a user and has therefore a relatively high associated risk factor and thus a higher weight factor.

An advantage of this embodiment is that, by applying a weight factor to the similarity function it is possible to make a safety score more influenced by content types with a higher associated risk.
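Selecting a similarity function by content type and applying a weight factor could look like the sketch below. The content type names, the weight values and the representation of images as precomputed hash strings are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Fuzzy text match in [0, 1] (stand-in for a text similarity function)."""
    return SequenceMatcher(None, a, b).ratio()


def exact_similarity(a, b) -> float:
    """Direct comparison: 1.0 if equal, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# content type -> (similarity function, weight factor); all values are assumptions.
SIMILARITY_FUNCTIONS = {
    "text": (text_similarity, 1.0),
    "image": (exact_similarity, 3.0),  # images assumed to be given as hash strings
    "input": (exact_similarity, 2.0),
}


def weighted_content_match(retrieved, known) -> float:
    """Weighted average of best per-type similarities.

    `retrieved` and `known` are lists of (content_type, value) pairs.
    """
    total, weight_sum = 0.0, 0.0
    for ctype, value in retrieved:
        func, weight = SIMILARITY_FUNCTIONS[ctype]
        # Only compare against known elements of the same content type.
        candidates = [func(value, kv) for kt, kv in known if kt == ctype]
        total += weight * max(candidates, default=0.0)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

With these assumed weights, a copied logo contributes three times as much to the content match score as a text element does.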

In an embodiment, a content type may be one of: an image, a text element, a color element, a UI element, an input element, a navigation element, a font type, a stylesheet, a media element, a QR code, or any other content type.

In a further embodiment, a content type may be a hash tag, wherein a content element which has the hash tag as content type comprises a hash value representative of its corresponding content collection and wherein the step of providing a retrieved content collection comprises: determining whether a content element with the hash tag as content type is present at the web address; and providing the retrieved content collection only containing the content element with the hash tag as content type.

An advantage of this embodiment is that agreed upon organizations, such as banks, email providers or social media platforms, may provide such a hash value on their website, which means that, if the web address is of such an agreed upon organization, the generating of the content match score can be done very quickly and efficiently. Furthermore, as another advantage, since only the hash value is stored in the content collection, its size is reduced significantly, saving processing power and memory and possibly reducing the amount of data transfer.

In a preferred embodiment the hash tag is embedded in a header.

In an embodiment, the list of one or more similarity functions may comprise one or more of: a direct comparison function (e.g. an is-equal function), a Euclidean distance, a Manhattan distance, a key point matching function, a SIFT function, a decision tree, a semantic forest, a dynamic time warping function, a hash function, a fuzzy text match function or any other suitable similarity function.

It is noted that in the context of this disclosure the term similarity function must be understood as any function which can be used to compare two content elements, also including, for example, dissimilarity functions or a direct comparison.

An advantage of this embodiment is that suitable similarity functions can be used depending on content type, resulting in a more accurate content match score.

In an embodiment the similarity function may comprise one or more preprocessing functions, such as a compression function, a scaling function, a transformation function, a hashing function, a noise reduction function or any other suitable preprocessing function.

An advantage of this embodiment is that content elements may be preprocessed to be more suitable for the similarity functions, resulting in more accurate results, memory reduction and a reduction in needed processing power.

In an embodiment a content type may be a QR code and the preprocessing step associated with a QR code may comprise the steps of decoding the QR code, for example using the ZXing decoder, and using at least a part of the decoded QR code, for example a URL hidden in the QR code, as content element instead of the QR code as a whole.

An advantage of this embodiment is that a QR code may be used to obtain a similarity between websites even though the QR code presented on a website may be different every time the website is visited. For example, a banking website may present a QR code for login purposes containing a URL and a session identifier. Due to the changing nature of QR codes they are not suitable to be used directly as content element, as they will never be similar although they may come from the same source. By decoding the QR code, parts of the information hidden in the QR codes may be used which remain similar between different QR codes of the same website.
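The preprocessing of a decoded QR code can be sketched as follows. Decoding the QR image itself is out of scope here: the function takes the already decoded text. It is an assumption of this sketch that the payload is a URL whose query string carries the session-specific data; the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def qr_content_element(decoded_qr_text: str) -> str:
    """Reduce a decoded QR payload to its stable part (host name and path).

    Assumption: the payload is a URL and per-session data such as a
    session identifier is carried in the query string, which is dropped.
    """
    parts = urlsplit(decoded_qr_text)
    return f"{parts.hostname or ''}{parts.path}"


# Two visits to the same hypothetical login page yield different QR payloads.
first_visit = qr_content_element("https://login.mybank.example/qr?session=111")
second_visit = qr_content_element("https://login.mybank.example/qr?session=222")
```

Both payloads reduce to the same content element, so they can be compared even though the QR codes themselves differ.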

In an embodiment the step of selecting one or more similarity functions comprises determining suitable similarity functions in accordance with one or more of: content type, an availability of processing power, an availability of memory, an availability of energy resources, an availability of data transfer speed, a predetermined customer type, a predetermined risk level. An advantage of this embodiment is that the used similarity functions can be adapted according to a set of circumstances wherein they are used.

In an embodiment the method further comprises the step of determining one or more website categories corresponding to the web address by:

- performing a lookup in a web address database, wherein the web address database comprises an index of web addresses and their associated website categories, and wherein the generation of the safety score is only performed if at least one of the determined one or more website categories is on a predetermined risk list; and/or

- feeding the retrieved content collection to a machine learning model trained to assign one or more website categories from a list of known website categories to the web address, and wherein the generation of the safety score is only performed if at least one of the determined one or more website categories is on a predetermined risk list.

In an example the list of known website categories may comprise one or more of banking, social networking, news, health, government, mail, shopping, adult, hate, science, information, blogs etc.

An advantage of this embodiment is that the generation of the safety score is only performed if the web address is deemed to be of a category that is on a predetermined risk list, preventing the use of processing power on low risk websites. For example, the banking category and mail categories may be considered high risk while the science category and the information category may be considered low risk.
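The lookup variant of this category check can be sketched as follows. The database contents and function name are hypothetical, and treating a web address that is absent from the database as not requiring a safety score is an assumption of this sketch; the machine learning variant is not shown.

```python
RISK_LIST = {"banking", "mail"}  # example high risk categories from the text

# Hypothetical web address database: base domain name -> website categories.
WEB_ADDRESS_DB = {
    "mybank.example": {"banking"},
    "physics.example": {"science", "information"},
}


def needs_safety_score(base_domain: str) -> bool:
    """Return True if any known category of the domain is on the risk list."""
    categories = WEB_ADDRESS_DB.get(base_domain, set())
    return bool(categories & RISK_LIST)
```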

In an embodiment the method comprises the step of filtering the web address before performing the other steps, the filtering comprising the steps of:

- pruning the web address to a base domain name;

- providing a known domain names database comprising one or more known base domain names and associated known safety scores;

- performing a lookup of the base domain name in the known domain names database; if the lookup is successful, retrieving the associated known safety score and skipping the steps of retrieving the content collection, retrieving the domain identifier, providing the content database, generating one or more content match scores, and generating the safety score.
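The filtering steps above can be sketched as follows. The pruning shown is deliberately naive and is an assumption of this sketch: it keeps the last two labels of the host name, which is wrong for suffixes such as "co.uk" (a real implementation would likely consult the Public Suffix List). The database contents and function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical known domain names database: base domain name -> known safety score.
KNOWN_SAFETY_SCORES = {"example.com": 0.9}


def prune_to_base_domain(web_address: str) -> str:
    """Naive pruning: keep the last two labels of the host name.

    Assumption: sufficient for illustration only; multi-label public
    suffixes such as "co.uk" are not handled.
    """
    if "://" not in web_address:
        web_address = "//" + web_address
    host = urlsplit(web_address).hostname or ""
    return ".".join(host.split(".")[-2:])


def filter_web_address(web_address: str):
    """Return a known safety score, or None if the full method must run."""
    return KNOWN_SAFETY_SCORES.get(prune_to_base_domain(web_address))
```

A return value of `None` signals that no previously determined safety score exists and that the remaining steps of the method still have to be performed.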

An advantage of this embodiment is that previously determined safety scores may be reused, saving processing power.

In an embodiment the method further comprises the step of performing package inspection before the generating of the safety score, the step of performing package inspection comprising:

- intercepting an encrypted package sent by a user application to an intended receiver;

- decrypting the package into a decrypted package using a private key of the user application; and,

- inspecting the decrypted package to retrieve a full web address; and

- forwarding the encrypted package to the intended receiver, wherein the full web address is used as the web address in the rest of the method for determining the safety score and wherein preferably the step of forwarding of the encrypted package is only performed when the safety score is above a predetermined safety threshold.

It is noted that the forwarded encrypted package is not adapted.

An advantage of this embodiment is that web addresses may be intercepted from applications in which the web address is normally hidden.

In a further embodiment, the steps of decrypting and inspecting the package are only performed when the intended receiver is not on a predetermined list of white-listed websites.

An advantage of this embodiment is that privacy sensitive packages are not decrypted nor inspected.

Preferably, the method is a computer implemented method. The method may be implemented on any suitable computing means, such as a personal computer, a mobile phone, a laptop, a PDA, a server, a tablet, a router, a firewall, a microprocessor or any other suitable computing means. The computing means may comprise a processor, a memory module, a storage module and/or a communication module, wherein the memory module is configured to store instructions that, when executed by the processor, perform one or more steps of the method. The computing means may comprise a single device performing all steps of the method or a plurality of devices, each device performing one or more steps of the method.

A security web server is further provided comprising a processor, a memory module and a storage device comprising a content database, wherein the security web server is configured to: receive a safety score request from a user device, the safety score request comprising a web address; generate a safety score by executing the steps of the method as described above; send the safety score to the user device. A user device is furthermore provided, wherein the user device is further configured to provide or to receive a safety score for a web address using the method. The user device may be arranged to display a graphic element and/or block the internet traffic in response to the safety score, for instance being above a predetermined safety threshold.

The user device may comprise a processor, a memory module and/or a display, wherein the processor is configured to execute one or more internet applications that generate internet traffic and/or a DNS filtering application, wherein the DNS filtering application is configured to filter the internet traffic generated by the internet applications. The step of filtering the internet traffic may include filtering a website based on a received and/or generated safety score as explained above. The step of filtering may also comprise executing one or more of the steps of: determining one or more website categories, filtering the web address and package inspection according to the corresponding steps as described above.

In an embodiment, the user device is a low end device, or a high end device in a data saving mode or a battery saving mode, wherein the low end device is configured to provide the safety score by sending a safety score request to a security web server according to claim 12 and to receive the safety score from the security web server.

An advantage of this embodiment is that steps of determining the safety score which require high computing power are offloaded to the security web server.

In an embodiment, the user device is a high end device and wherein the high end device is configured to provide the safety score by executing the steps of the method as described above.

A web scraper is further provided configured to visit a web address from a list of known web addresses and to retrieve a content collection and a domain identifier from the web address and to optionally store the content collection and the domain identifier, for instance a domain certificate element, in the content database, and wherein the web scraper is preferably configured to conceal its IP address when retrieving the content collection and domain identifier, for example using a VPN connection.

An advantage of this embodiment is that the content collections and domain identifiers are kept up to date.

In an example, the web scraper refers to a computer program comprising instructions which, when the program is executed by a first computer, cause the first computer to visit a web address from a list of known web addresses and to retrieve a content collection and a domain identifier from the web address and to optionally store the content collection and the domain identifier, for instance a domain certificate element, in the content database, and wherein the web scraper is preferably configured to conceal its IP address when retrieving the content collection and domain identifier, for example using a VPN connection.

A system is further provided comprising the web security server and the user device as described, the system preferably further comprising the web scraper as described above.

The invention is described in the foregoing as example. It is understood that those skilled in the art are capable of realizing different variants of the invention without actually departing from the scope of the invention. Further advantages, features and details of the invention are elucidated on the basis of preferred embodiments thereof, wherein reference is made to the accompanying drawings, in which:

- Figure 1 shows a schematic overview of three example websites;

- Figure 2 shows a schematic overview of an embodiment of the method;

- Figure 3 shows a schematic overview of a low end device;

- Figure 4 shows a schematic overview of a high end device according to an embodiment.

Figure 1 shows three examples of websites 102, 202 and 302. Website 102 has content collection 104 containing content elements 104a - 104e. In this example, content collection 104 of website 102 contains navigation bar 104a that includes one or more hyperlinks, each hyperlink comprising a text field (e.g. title) and a destination URL, user name input field 104b, password input field 104c, website logo 104d and text item 104e. It is noted that the content collection of website 102 may comprise more content elements besides the shown content elements 104a - 104e; these additional content elements may be of similar or different content types. Website 102 further has associated web address 106 and SSL certificate 108. Web address 106 comprises basis domain name 106a and path name 106b. SSL certificate 108 comprises domain name 108a and public key 108b. SSL certificate 108 may further comprise one or more of: a certificate validity period, a certificate authority details element, a public key algorithm, a certificate signature algorithm, an SSL version number, a thumbprint, a thumbprint algorithm, a website owner, and/or website owner details.

Similarly, website 202 has content collection 204 with content elements navigation bar 204a, user name input field 204b, password input field 204c, website logo 204d and text item 204e. It is noted that the content collection of website 202 may comprise more content elements besides the shown content elements 204a - 204e; these additional content elements may be of similar or different content types. Website 202 further has web address 206 and SSL certificate 208. Web address 206 comprises basis domain name 206a and path name 206b. SSL certificate 208 comprises domain name 208a and public key 208b. Website 302 has content collection 304 with navigation bar 304a, text item 304b, first image 304c and second image 304d. Website 302 further has web address 306 with basis domain name 306a and path name 306b and SSL certificate 308 with domain name 308a and public key 308b.

Figure 2 shows an example of a method for determining a safety score for website 102 with web address 106. It is noted that different steps in the method are displayed using arrows while results are displayed in the blocks between the arrows. The method according to the example comprises the optional step of filtering 1000 domain name 106a using a DNS filter, wherein a lookup is performed (not shown) to identify if domain name 106a is in a list of blacklisted websites. If domain name 106a is not in the list of blacklisted websites, package inspection 1002a is performed to retrieve web address 106 by performing the steps (not shown) of: intercepting an encrypted package sent to domain name 106a by a user application; decrypting the package into a decrypted package using a private key of the user application; and inspecting the decrypted package to retrieve web address 106. As an alternative to performing the step of package inspection 1002a to retrieve web address 106, domain name 106a may be used as web address 106, or web address 106 may be retrieved in some other suitable way (for example via a browser plugin or if a package is sent unencrypted).

When domain name 106a is on the blacklist, steps 1003 - 1016 are skipped and instead the step of setting 1002b safety score 116 to a score of, for example, -1 is performed, wherein safety score 116 being -1 is indicative of domain 106a being blocked by the DNS filter. It is noted that the value of -1 is purely provided as an example and that any other standard value may be used.
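The blacklist branch described above might be sketched as follows. This is a minimal illustration, assuming the blacklist is an in-memory set of lower-case domain names; the function names and the example domains are hypothetical.

```python
from typing import Optional, Set
from urllib.parse import urlsplit

BLOCKED_SCORE = -1.0  # example value from the text; any other standard value may be used


def domain_of(web_address: str) -> str:
    """Return the lower-cased host part of a web address."""
    # urlsplit only recognizes the host when a scheme or leading "//" is present.
    candidate = web_address if "://" in web_address else "//" + web_address
    return (urlsplit(candidate).hostname or "").lower()


def dns_filter(web_address: str, blacklist: Set[str]) -> Optional[float]:
    """Return BLOCKED_SCORE for a blacklisted domain, or None to continue scoring."""
    if domain_of(web_address) in blacklist:
        return BLOCKED_SCORE
    return None
```

A return value of `None` signals that the remaining steps of the method should be carried out.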

Given web address 106 (retrieved) content collection 104 of website 102 is provided in step 1004. Examples of providing content collection 104 are illustrated in figures 3 and 4.

Next domain certificate 108 is provided as domain identifier in step 1006.

As an optional alternative step to step 1006, one or more website categories may be determined corresponding to web address 106 by retrieving (not shown) one or more website categories from a web address database (not shown) and/or feeding (not shown) content collection 104 to a machine learning model trained to assign one or more website categories to web address 106. If one or more of the assigned categories is on a risk list, step 1006 is performed; otherwise steps 1006 - 1016 are skipped and the step of setting 1006b safety score 116 to a value indicative of the assigned web categories not being on a risk list is performed. For example, web address 106 is assigned “banking website” as category by the machine learning model, which is a high risk category, and thus the method continues with step 1006.
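The category gate described above can be sketched as a simple membership test. The risk list and the score used for websites outside a risk category are hypothetical values chosen for illustration only.

```python
from typing import Iterable, Optional

RISK_LIST = {"banking website", "webmail", "payment"}  # hypothetical risk list
NOT_AT_RISK_SCORE = 1.0  # hypothetical value for "no assigned category is on the risk list"


def category_gate(categories: Iterable[str]) -> Optional[float]:
    """Return None when the full check should run, else the not-at-risk score."""
    if any(category in RISK_LIST for category in categories):
        return None  # continue with step 1006 and the following steps
    return NOT_AT_RISK_SCORE  # skip steps 1006 - 1016
```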

In step 1008 content database 110 is provided containing two known content collections 204 and 304 corresponding to websites 202 and 302 respectively.

In step 1010 content match scores 112a and 112b are determined between content collection 104 and known content collections 204 and 304.

For example, the content match score 112a between retrieved content collection 104 and known content collection 204 is determined to be 0.9 (indicative of content collections 104 and 204 being relatively similar) and the content match score 112b between retrieved content collection 104 and known content collection 304 is determined to be 0.2 (indicative of content collections 104 and 304 being relatively dissimilar).
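One possible way to obtain such a content match score is a set overlap measure. The sketch below uses the Jaccard index over string-encoded content elements; this choice of measure and the element encoding are assumptions for illustration, and the resulting values are not meant to reproduce the 0.9 and 0.2 of the example.

```python
from typing import Set


def content_match_score(retrieved: Set[str], known: Set[str]) -> float:
    """Jaccard similarity: shared elements divided by all distinct elements."""
    union = retrieved | known
    if not union:
        return 0.0
    return len(retrieved & known) / len(union)


# Hypothetical encodings of content collections 104, 204 and 304.
collection_104 = {"nav:home|login", "input:user", "input:password", "logo:bank", "text:welcome"}
collection_204 = {"nav:home|login", "input:user", "input:password", "logo:bank", "text:hello"}
collection_304 = {"nav:home|login", "text:news", "image:first", "image:second"}

score_112a = content_match_score(collection_104, collection_204)
score_112b = content_match_score(collection_104, collection_304)
```

With these hypothetical inputs the score against collection 204 is clearly higher than the score against collection 304, which is the property the method relies on.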

In step 1012a domain identifier 208 is selected, as it corresponds to website 202, whose content match score 112a was highest in this example. As an alternative optional step, it may be determined that the highest content match score is below a threshold, in response to which step 1012b is performed: steps 1014 and 1016 are skipped and safety score 116 is set to a predetermined value indicative of the content match score being below the threshold. Alternatively to step 1012b, in step 1012c safety score 116 may be determined by: providing a connection with a search engine; performing a reverse search with the search engine using the retrieved content collection 104; retrieving one or more search results comprising web addresses from the reverse search; retrieving one or more search content collections corresponding to the web addresses from the reverse search; retrieving one or more search domain identifiers corresponding to the web addresses from the reverse search; and performing steps 1010 - 1016 using the search content collections and search domain identifiers as known content database 110. It is noted that steps 1012a and 1012b implicitly contain the step of determining which content match score is highest.
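The selection of the best matching known domain identifier, together with the optional threshold check, might look as follows. The threshold values and the string keys standing in for domain identifiers are hypothetical.

```python
from typing import Dict, Optional


def select_best_match(scores: Dict[str, float], match_threshold: float) -> Optional[str]:
    """Return the identifier with the highest content match score.

    Returns None when no score reaches match_threshold, which signals that
    the fallback branch (predetermined score or reverse search) applies.
    """
    if not scores:
        return None
    best_identifier = max(scores, key=lambda identifier: scores[identifier])
    if scores[best_identifier] < match_threshold:
        return None
    return best_identifier


# Scores keyed by the reference signs of the known domain identifiers.
example_scores = {"208": 0.9, "308": 0.2}
```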

In step 1014 identity match score 114 is generated by checking if public key 108b of domain identifier 108 is equal to public key 208b of domain identifier 208. In this example the identity match score 114 may be 0.0, indicating that public key 108b differs from public key 208b.
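The public key comparison can be sketched as a binary check. Stripping surrounding whitespace before comparing is an assumption added here to make the comparison robust against formatting differences; the key strings are placeholders.

```python
def identity_match_score(retrieved_public_key: str, known_public_key: str) -> float:
    """Return 1.0 when both public keys are equal, otherwise 0.0."""
    same = retrieved_public_key.strip() == known_public_key.strip()
    return 1.0 if same else 0.0
```

A graded score, for example one that also compares certificate domain names, would be a possible variation but is not shown here.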

Finally, in step 1016 safety score 116 is generated. In this example safety score 116 may be 0.1, indicating that website 102 may be maliciously copying content of known website 202.

In figure 3 low end mobile device 402 is executing application 404 and uses system 400 to block access to malicious websites.

System 400 comprises components: DNS filter 406, web security service 408, web scraper 410, content database 412 and categorization service 414. Components 406 - 414 are connected via connections C105 - C111 as shown in figure 3. System 400 is configured to perform steps 1000 - 1016 shown in figure 2 using its various components as described below. System 400 may be executed on a server, a firewall, in a cloud based execution or any other suitable machine. The components of system 400 may be executed on the same server or may be executed independently of each other. The different components may be stand alone applications or may be encompassed in a single application.

To this end DNS filter 406 is configured to block internet connections from user devices, for example user device 402, to websites, such as website 102, if said websites are on a blacklist and, if not on the blacklist, to request safety scores from web security service 408 for websites and to block websites if their corresponding safety score is below a threshold value.

Web security service 408 is configured to provide retrieved content collections using web crawler 410 that is configured to anonymously connect to websites to retrieve their content collections and domain identifiers. Web security service 408 is further configured to determine website categories using web categorizer 414 that is configured to determine one or more web categories for websites by looking up their web addresses in a category database and/or by feeding their content collections to a machine learning classifier, for example a trained neural network.
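The two-stage categorization (database lookup first, classifier as fallback) can be sketched as below. The keyword classifier is only a stand-in for a trained machine learning classifier, and all names and example data are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def categorize(
    web_address: str,
    content: Iterable[str],
    category_db: Dict[str, List[str]],
    classifier: Callable[[Iterable[str]], List[str]],
) -> List[str]:
    """Look up the address in the category database, else ask the classifier."""
    known_categories = category_db.get(web_address)
    if known_categories:
        return known_categories
    return classifier(content)


def keyword_classifier(content: Iterable[str]) -> List[str]:
    # Stand-in for a trained classifier: flags pages that ask for a password.
    text = " ".join(content).lower()
    return ["banking"] if "password" in text else ["other"]


category_db = {"news.example": ["news"]}
```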

Web security service 408 is further configured to determine if websites are in a risk category based on the determined one or more web categories, to determine safety scores for the websites by executing steps 1010 - 1016 using known content collections stored in content database 412, and to provide DNS filter 406 with said safety scores.

For example, when application 404 executed by user device 402 tries to access website 102 (figure 1), DNS filter 406 initially blocks connection C103 and first checks website 102 against its blacklist. After determining website 102 is not on its blacklist, DNS filter 406 requests a safety score from web security service 408 for website 102. Web security service 408 receives the request for the safety score and retrieves content collection 104 and domain identifier 108 by accessing website 102 via web crawler 410. Next, website 102 and retrieved content collection 104 are passed by web security service 408 to web categorizer 414, which looks up website 102 in its category database. If it is determined that website 102 is in its category database, one or more associated categories are returned. If it is determined that website 102 is not in its category database, web categorizer 414 determines the web category of website 102 by feeding retrieved content collection 104 to its neural network classifier and passes the web category back to web security service 408. Upon receiving the web category “banking”, web security service 408 determines that this is a risk category and determines that website 102 has a safety score of 0.2 by executing steps 1010 - 1016 using known content collections and known domain identifiers stored in content database 412. Web security service 408 passes said safety score back to DNS filter 406, which determines that it is below the predetermined safety threshold of 0.7 and continues blocking connection C103 to website 102 (and optionally adds website 102 to its blacklist) for user device 402. It is noted that if the safety score had been equal to or higher than 0.7, connection C103 to website 102 would have been unblocked for user device 402.
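The blocking decision at the end of this example can be sketched together with one possible safety score formula. The formula is an assumption: the text only states that the safety score reflects a discrepancy between the highest content match score and the identity match score, and the expression below was chosen so that a content match of 0.9 with an identity match of 0.0 yields roughly 0.1, as in the figure 2 example. The threshold of 0.7 is taken from the example above.

```python
SAFETY_THRESHOLD = 0.7  # predetermined safety threshold from the example


def safety_score(highest_content_match: float, identity_match: float) -> float:
    """Assumed formula: 1 minus the discrepancy between content and identity match."""
    discrepancy = max(0.0, highest_content_match - identity_match)
    return 1.0 - discrepancy


def should_block(score: float, threshold: float = SAFETY_THRESHOLD) -> bool:
    """Block when the safety score is below the threshold."""
    return score < threshold
```

Under this assumed formula, a website whose content closely matches a known website but whose identity does not match receives a low score and is blocked, while a website whose identity matches keeps a score of 1.0.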

High end device 502 (figure 4) is configured to execute user application 502, plugin 506, web security service 508, categorization function 514 and web scraper 510. Plugin 506 may perform similar functions as DNS filter 406 and is configured to plug in to application 502 to detect and intercept internet traffic and to request safety scores from web security service 508 depending on the intercepted internet traffic and to display a visual indicator in application 502 to the user indicative of the safety score. It is noted that instead of plugin 506, a standalone application, firewall application or any other suitable type of application may be used to filter web traffic of application 502. Web security service 508 may be similar to web security service 408 and is configured to receive a request for a safety score from plugin 506 when user application tries to visit an unknown website and to retrieve known content collections from content database 512 via an internet connection. Content database 512 may be similar to content database 412.

The web scraper 510 and categorization service 514 may be similar to web scraper 410 and categorization service 414.

When a user of the high end device uses application 502 to visit website 102, plugin 506 intercepts an HTTPS package between application 502 and website 102, decrypts the HTTPS package and retrieves web address 106 from the HTTPS package. Next, plugin 506 requests a safety score from web security service 508 for web address 106. Web security service 508 retrieves content collection 104 and domain certificate 108 using web scraper 510. Next it uses web categorizer 514 to determine that website 102 is of the “banking” category. Using the determined category, known content collections with the same associated category are retrieved from content database 512. Using the known content collections, a safety score is determined to be 0.2 by executing steps 1010 - 1016 of the method according to figure 2 and is passed back to plugin 506. Upon receiving the safety score, plugin 506 blocks internet traffic to website 102 in application 502 and displays a visual indicator to the user to inform him of the block.
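Restricting the comparison to known content collections of the same category, as in this example, might be sketched as follows. The record layout and the example domains are hypothetical; an actual content database would likely be queried remotely.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class KnownEntry:
    domain: str                 # known domain the collection belongs to
    categories: FrozenSet[str]  # website categories associated with the entry
    elements: FrozenSet[str]    # string-encoded known content collection


def candidates_for(categories: Iterable[str], database: List[KnownEntry]) -> List[KnownEntry]:
    """Keep only entries that share at least one category with the visited website."""
    wanted = set(categories)
    return [entry for entry in database if entry.categories & wanted]


database = [
    KnownEntry("bank.example", frozenset({"banking"}), frozenset({"logo:bank"})),
    KnownEntry("news.example", frozenset({"news"}), frozenset({"text:news"})),
]
```

Filtering by category first reduces the number of content match scores that need to be computed.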

The functions of the various elements shown in the figures, including any functional blocks labelled as “processors”, may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term “processor” or “controller” should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non volatile storage. Other hardware, conventional and/or custom, may also be included. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.

It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present invention. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer.

The present invention is by no means limited to the above described preferred embodiments thereof. It will be clear that one or more features from an embodiment may be combined with one or more features from one or more other embodiments according to the invention. It will further be clear that terms like “received”, “retrieved”, “send” or any other term which suggests any form of direction of communication are used in a non-limiting sense and should merely be interpreted as communication being present or possible. E.g. received may be interpreted as meaning retrieved and vice versa.

The rights sought are defined by the following claims within the scope of which many modifications can be envisaged.

Claims

1. Method for determining a safety score for a web address, wherein the method is a computer implemented method comprising the steps of:

- providing a retrieved content collection present at the web address;

- providing a domain identifier associated with the web address;

- for one or more known content collection stored in the content database:

- generating a content match score between the retrieved content collection and the known content collection, wherein the content match score is indicative of a similarity between the retrieved content collection and the known content collection;

2. Method according to claim 1, wherein the method further comprises the step of determining if the highest content match score is above a predetermined match threshold and, in reaction to the highest content match score being below the predetermined match threshold, setting the safety score to an alternative safety score.

3. Method according to claim 2, wherein setting the safety score to an alternative safety score comprises skipping the steps of selecting the known domain identifier with the highest content match score, generating the identity match score and generating the safety score, and setting the safety score to a value indicative of the highest content match score being below the predetermined match threshold.

4. Method according to claim 2, wherein the method further comprises: in reaction to the highest content match score being below the predetermined match threshold, providing one or more search content collections by performing a reverse search using the retrieved content collection and using the one or more search content collections as the one or more known content collections.

5. Method according to claim 4, wherein the step of providing one or more search content collections and the step of using the one or more search content collections as the one or more known content collections comprise performing the steps of: providing a connection with one or more search engines; performing a reverse search with the one or more search engines using the retrieved content collection; retrieving one or more search results comprising web addresses from the reverse search, wherein optionally the retrieved one or more search results are selected by considering their search ranking to be relatively high; retrieving one or more search content collections corresponding to the web addresses from the reverse search; retrieving one or more search domain identifiers corresponding to the web addresses from the reverse search; for each of the one or more search content collections:

6. Method according to any of the previous claims, wherein a content collection comprises one or more content elements and wherein generating the content match score comprises generating a similarity score for one or more content elements in the retrieved content collection with one or more known content elements of the known content collection and wherein the content match score is a function of the similarity scores.

7. Method according to claim 6, wherein each content element has a content type and wherein the step of generating the similarity score comprises:

- determining the content type of the retrieved content element;

- selecting one or more similarity functions corresponding to the content type from a list of similarity functions wherein each similarity function is associated with one or more content types;

- calculating one or more similarity score by applying the one or more selected similarity function to the retrieved content element and one or more known content elements that have the same content type, wherein a similarity function preferably comprises a weight factor associated with the content type.

8. Method according to claim 7, wherein a content type may be a hash tag, wherein a content element which has the hash tag as content type comprises a hash value representative of its corresponding content collection and wherein the step of providing a retrieved content collection comprises:

- determining whether a content element with the hash tag as content type is present at the web address; and

- providing the retrieved content collection only containing the content element with the hash tag as content type.

9. Method according to claim 6, 7 or 8, wherein the step of selecting one or more similarity functions comprises determining suitable similarity functions in accordance with one or more of: content type, an availability of processing power, an availability of memory, an availability of energy resources, an availability of data transfer speed, a predetermined customer type, a predetermined risk level.

10. Method according to any of the previous claims, further comprising the step of determining one or more website categories corresponding to the web address by:

- performing a lookup in a web address database, wherein the web address database comprises an index of web addresses and their associated website categories; and/or

11. Method according to any of the previous claims, further comprising the step of filtering the web address before performing the other steps, the filtering comprising the steps of:

- pruning the web address to a base domain name;

12. Method according to any of the previous claims, further comprising the step of performing package inspection performed before the generating of the safety score, the step of performing package inspection comprising:

- inspecting the decrypted package to retrieve a full web address; and

- forwarding the encrypted package to the intended receiver, wherein the full web address is used as the web address in the rest of the method.

13. Method according to claim 12, wherein the steps of decrypting and inspecting the package are only performed when the intended receiver is not on a predetermined list of white-listed websites.

14. Security web server comprising a processor, a memory module and a storage device comprising a content database, wherein the security web server is configured to: receive a safety score request from a user device, the safety score request comprising a web address; generate a safety score by executing the steps of the method according to one of the claims 1 - 13; and, send the safety score to the user device.

15. User device comprising a processor, a memory module and a display, wherein the processor is configured to execute one or more internet applications that generate internet traffic and a DNS filtering application, wherein the DNS filtering application is configured to filter the internet traffic generated by the internet applications, wherein filtering the internet traffic comprises executing one or more of the steps of: determining one or more website categories, filtering the web address and package inspection according to the corresponding steps of the method according to the invention, wherein the user device is further configured to provide a safety score for the web address using the method according to one of the claims 1 - 13 and to display a graphic element and/or block the internet traffic in response to the safety score being above a predetermined safety threshold.

16. User device according to claim 15, wherein the user device, for example a low end device or a high end device in a data saving mode or a battery saving mode, is configured to provide the safety score by sending a safety score request to a security web server according to claim 14 and to receive the safety score from the security web server.

17. User device according to claim 15, wherein the user device is a high end device and wherein the high end device is configured to provide the safety score by executing the steps of the method according to one of claims 1 - 13.

18. Web scraper configured to visit a web address from a list of known web addresses and to retrieve a content collection and a domain identifier from the web address and to store the content collection and the domain identifier, for instance a domain certificate element, in a content database, and wherein the web scraper is preferably configured to conceal its IP address when retrieving the content collection and domain identifier, for example using a VPN connection.

19. System comprising the security web server according to claim 14 and the user device according to claim 15, 16 or 17, wherein the system preferably also comprises the web scraper according to claim 18.
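By way of illustration only, the content-type dependent similarity functions and weight factors of claims 6 and 7 might be sketched as follows. The mapping of content types to functions, the weight values, and the use of a weighted average as the combining function are assumptions; the claims leave these choices open.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Element = Tuple[str, str]  # (content type, value)


def text_similarity(a: str, b: str) -> float:
    # Ratio of matching characters between the two strings, between 0.0 and 1.0.
    return SequenceMatcher(None, a, b).ratio()


def exact_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


# Hypothetical list of similarity functions, each with a weight factor per content type.
SIMILARITY_FUNCTIONS: Dict[str, Tuple[Callable[[str, str], float], float]] = {
    "text": (text_similarity, 1.0),
    "logo": (exact_similarity, 2.0),
}


def weighted_content_match_score(retrieved: List[Element], known: List[Element]) -> float:
    """Weighted average of the best per-element similarity among same-type known elements."""
    weighted_sum = 0.0
    total_weight = 0.0
    for content_type, value in retrieved:
        function, weight = SIMILARITY_FUNCTIONS.get(content_type, (exact_similarity, 1.0))
        same_type_values = [v for t, v in known if t == content_type]
        best = max((function(value, v) for v in same_type_values), default=0.0)
        weighted_sum += weight * best
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

Giving the logo a higher weight than free text reflects the idea that some content types are more telling of imitation than others; the actual weights would need to be tuned.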