SAFE VIEWING OF WEB PAGES
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a method and apparatus for preventing the damage that web pages may cause to a user's computer.
DESCRIPTION OF RELATED ART
There are around 4 billion web pages indexed by the search engine Google® at the time of writing. The first web pages were made up simply of text, described by the markup language HyperText Markup Language (HTML), which also includes the ability to insert images into the text. Since the original invention of HTML in the early 1990s, the number and types of codes that can be placed into web pages have increased immensely: not only are there various versions of HTML, but it is also now possible to include general programming language elements such as JScript or Microsoft's® Visual Basic® Scripting Edition. These languages provide wide support for controlling the appearance of a page of information. Finally, web pages have been further extended to run almost any program through Microsoft's® ActiveX® controls and Sun's® Java language.
With such a wide range of options, it is no wonder that errors in the way browsers render these pages are commonplace. Some of the errors are simple: for example, the user's version of the browser program may be unable to parse the codes in the page, so the page is distorted. In other cases, the careful "sandbox" environment that is supposed to contain web-based programs can be breached, so allowing a program to perform actions it is not supposed to be able to perform, such as accessing the hard disk of the computer on which the web page is being viewed. In other cases, the codes can lure unwary users into downloading and running programs that are potentially dangerous, such as viruses and so on.
The threat of such unwanted behaviour should not be underestimated. In simple cases, a web page might contain genuine HTML that can cause older versions of browsers to crash. In more extreme cases, a web page might cause a program to be installed that monitors passwords being typed in and sends the information to organised crime sites. In order to monitor this situation, various solutions have been proposed. Standard anti-virus programs, such as those described in US Patent 5,319,776, monitor the end user's computer for anything written to disk. A common extension to such anti-virus programs is the ability to retrieve updates of recently known viruses over the internet. Other solutions involve monitoring from within the web browser and quarantining anything that looks unwanted.
Such solutions require every user to run the anti-virus product.
US Patent 6,785,732 discloses a web server that can be set up to check for virus files arriving through web pages, and thus perform monitoring for many users.
In all these cases, the user may still want to read the web page that is potentially harmful, even after receiving a warning about the content, but the very act of reading it will cause the unwanted payload to be activated.
SUMMARY OF THE INVENTION
The present invention relates to a method and an apparatus, which seek to allow a user to view web pages in a safe manner, even when it has been identified that the web pages have potentially dangerous content.
According to a first aspect of the present invention, there is provided a method of altering data stored in a file, according to a predefined set of rules, comprising the following steps: a) identifying the type of the file; b) consulting a database file, specific to the identified file type, for rules that define changes to be made to files of that file type; and c) making changes to the file according to the rules so defined.
According to a second aspect of the present invention, there is provided a computer system, adapted to operate in accordance with the method according to the first aspect of the invention.
According to a third aspect of the present invention, there is provided a computer program product, containing computer-readable code, for causing a computer to operate in accordance with the method according to the first aspect of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block schematic diagram of a computer system, including a computer in accordance with the present invention.
Figure 2 is a representation of a part of the computer of Figure 1.
Figure 3 is a flow chart illustrating a method according to the present invention.
Figure 4 is a block schematic diagram of a computer system, including a computer in accordance with another embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Figure 1 is a block schematic diagram of a computer system, in accordance with the present invention. Specifically, a user computer 10 has a central processor (CPU) 12, a disk 14, and a network interface 16. It will be appreciated by the person skilled in the art that the user computer 10, which is generally conventional, has various other features, which will be well known to the person skilled in the art. However, these features will not be described, except in so far as they are relevant to the operation of the present invention.
The user computer 10 has a connection via its network interface 16 over the internet 20 to a first web server computer 30. As is well known, information can be stored on the web server 30 in the form of web pages, and the user of the computer 10 can access these web pages, and download the stored information for viewing on the computer 10. The user of the computer 10 can also access a second web server computer 32, which contains a threat database 34, as will be described in more detail below.
Figure 2 is a further block schematic diagram, representing the contents of the disk 14 of the user computer 10. Again, it will be appreciated by the person skilled in the art that the disk contains additional data, which will be well known to the person skilled in the art, but which will not be described, except in so far as it is relevant to the operation of the present invention.
As shown in Figure 2, the disk 14 stores operating system software 36. The disk 14 also contains a program directory 40, which itself contains an initialisation file 42, a database file 44, browser software 46, and a file processing program 48. The disk 14 also contains one or more quarantine areas 50, and one or more safe areas 52.
Figure 3 is a flow chart, illustrating the operation of the file processing program 48, in accordance with a preferred embodiment of the present invention.
In step 70 of the process, the computer 10 is turned on by the user in the conventional way. The operating system software 36 then arranges that the file processing program 48 runs whenever the computer is turned on. The program 48 starts to run in step 72 of the process shown in Figure 3.
When the program 48 starts up, it reads the initialisation file 42, which contains various optional settings and also describes the location on the disk 14 of the one or more quarantine areas 50 and the safe areas 52. Thus, the program records the options and, in step 74, starts monitoring the defined quarantine directories 50.
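By way of illustration only, the reading of the initialisation file 42 might be sketched as follows. The file layout, the section names and the key names (`scan_interval_seconds`, `quarantine`, `safe`) are assumptions made for this example; the description above does not prescribe any particular format.

```python
import configparser

# Hypothetical contents of initialisation file 42; the layout is an assumption.
EXAMPLE_INI = """
[options]
scan_interval_seconds = 30

[directories]
quarantine = /var/quarantine/a, /var/quarantine/b
safe = /var/safe
"""


def split_paths(value):
    """Split a comma-separated list of directory paths."""
    return [part.strip() for part in value.split(",") if part.strip()]


def load_settings(text):
    """Return the optional settings and the quarantine/safe area locations."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "scan_interval": parser.getint("options", "scan_interval_seconds"),
        "quarantine_dirs": split_paths(parser.get("directories", "quarantine")),
        "safe_dirs": split_paths(parser.get("directories", "safe")),
    }
```

In such a sketch, the program would call `load_settings` once at start-up and then begin monitoring the directories listed under `quarantine_dirs`.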
In operation of the computer 10, the user may use the browser software 46 to access web pages that are stored on remote web servers, such as the web server 30, shown in Figure 1. When the user wishes to view a web page, the browser software 46 causes a copy of the web page to be downloaded onto the computer 10 using the Hypertext Transfer Protocol (HTTP). Web pages, or other files, that are determined by the browser program 46, or by an extension to the browser program, to be potentially dangerous or unsuitable are placed in the quarantine area 50, and an alternative web page, describing what has been done, is displayed to the user. Similarly, web pages or other files can be placed in the quarantine area 50 by anti-virus programs, web-based monitors, web proxies, or other types of technology that can identify such pages. Files placed in the quarantine area 50 are preferably timestamped, for use in later steps of the process, as described below.
In step 76 of the process, it is determined whether the directory content of the quarantine directories 50 has changed. If there has been no change since the quarantine directories 50 were last monitored, the process returns to step 74, where, after a predetermined time, the quarantine directories are again monitored. As is well known in the art, the file processing program can record the time of each directory scan, and compare this with the timestamps on the files in the quarantine directories, in order to determine which files are new.
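The comparison of scan time against file timestamps in step 76 might, purely as a sketch, look like the following. It assumes that the timestamp in question is the file modification time reported by the operating system; the function name `find_new_files` is hypothetical.

```python
import os


def find_new_files(quarantine_dirs, last_scan_time):
    """Return paths of files modified after the previous directory scan.

    last_scan_time is a POSIX timestamp (seconds since the epoch), as
    returned by time.time() when the previous scan was performed.
    """
    new_files = []
    for directory in quarantine_dirs:
        with os.scandir(directory) as entries:
            for entry in entries:
                # Assumption: the modification time serves as the timestamp.
                if entry.is_file() and entry.stat().st_mtime > last_scan_time:
                    new_files.append(entry.path)
    return sorted(new_files)
```

An empty result corresponds to the "no change" branch, in which the process returns to step 74 and waits for the predetermined time.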
If, in step 76, one or more new files are detected then, in step 80, the exact filetype of each new file is detected. For example, it is determined whether the file is generated using the Microsoft® Word® word processing program. In a preferred embodiment of the invention, the filetype is determined by examining the filename, and specifically the file extension, that is ".doc" in the case of a file generated using the Microsoft® Word® word processing program. In an alternative embodiment of the invention, the filetype may be determined by examining the content of the file.
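The two detection strategies of step 80 might be sketched as shown below: the file extension is examined first, with the leading bytes of the content as a fallback. The type labels and the extension table are assumptions for this example. The eight-byte signature is that of the OLE2 compound file container used by legacy ".doc" files; other formats share that container, so a content match is only a hint.

```python
import os

# Hypothetical mapping from file extension to a filetype label.
EXTENSION_TYPES = {".doc": "msword", ".htm": "html", ".html": "html"}

# Signature of the OLE2 compound file container used by legacy .doc files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def detect_filetype(filename, head=b""):
    """Return a filetype label, or None if the type is not recognised.

    head holds the first bytes of the file and is only consulted when
    the extension is not in the table.
    """
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTENSION_TYPES:
        return EXTENSION_TYPES[extension]
    if head.startswith(OLE2_MAGIC):
        return "msword"  # assumption: treat any OLE2 container as Word
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return None
```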
Having detected the filetype, then, in step 82 of the process, a relevant part of the database file 44 is consulted. The database file 44 contains parts which are relevant to many of the different filetypes which may be detected. In step 83, it is determined whether the database file 44 contains any entries for the filetype detected in step 80. If not, the process returns to step 74, where, after a predetermined time, the quarantine directories are again monitored. If it is determined in step 83 that the database file 44 does contain entries for the filetype detected in step 80, the process passes to step 84.
In each part of the database file 44, there are descriptions of data that may be contained within files of the relevant type, and which may be unsafe. For example, the data might be potentially unsafe tags in HTML files that may contain scripts, programs, images or references to other websites which may themselves have unsafe content.
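As an illustrative sketch only, the database file 44 could be organised as one entry per filetype, each listing the data considered unsafe for that type. The JSON layout, the key names and the particular tags listed are assumptions; the description does not specify the internal format of the database file.

```python
import json

# Hypothetical contents of database file 44, keyed by filetype.
EXAMPLE_DATABASE = json.loads("""
{
  "html": {
    "unsafe_tags": ["script", "object", "applet", "iframe"],
    "unsafe_attribute_prefixes": ["on"]
  }
}
""")


def rules_for(filetype, database=EXAMPLE_DATABASE):
    """Return the rules for a filetype, or None if there is no entry.

    A None result corresponds to the branch of step 83 in which the
    database file contains no entries for the detected filetype.
    """
    return database.get(filetype)
```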
As mentioned above, these potentially unsafe data inside the files are identified by reference to the database file 44. This database is updated over the internet 20 from the threat database 34. Thus, the provider of the web server 32 can continually maintain the threat database 34, and this can then be used to update the database file 44 on a regular basis.
The database file 44 can be updated periodically, either by the file processing program 48 itself or by some other scheduled event, allowing potential threats to be identified in a timely manner. The maintenance of the database file 44 can operate in a way which is similar to the way that anti-virus definitions are updated in existing commercially available products.
The process then passes to step 84, in which it is determined if the files, which have been newly added to the quarantine area, contain any of these potentially unsafe data.
Thus, in this embodiment of the invention, files are sent to a quarantine area, and it is then determined whether those files contain any of the potentially unsafe data. In another embodiment of the invention, the database file may be consulted, in order to determine whether to send the files to the quarantine area.
For each newly added file, if it is determined in step 84 that the file does not contain any of the potentially unsafe data relevant to that filetype, the process jumps to step 90.
However, if it is determined in step 84 that the file contains one or more of the potentially unsafe data relevant to that filetype, then, in step 86, the potentially unsafe data is removed.
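For HTML files, the removal in step 86 might be sketched as follows, using the HTML parser from the Python standard library. The set of unsafe tags and the rule that attributes beginning with "on" are event handlers are assumptions for this example, not rules taken from the database file 44. The sketch also drops comments and document type declarations, and it does not inspect URLs inside attributes, so it should not be regarded as a complete sanitiser.

```python
from html import escape
from html.parser import HTMLParser

# Assumed rule set; in the described system this would come from database file 44.
UNSAFE_TAGS = {"script", "object", "applet", "iframe"}


class UnsafeDataRemover(HTMLParser):
    """Re-emit a page while dropping unsafe elements and event attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # greater than zero while inside an unsafe element

    def handle_starttag(self, tag, attrs):
        if tag in UNSAFE_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            kept = [(k, v) for k, v in attrs if not k.startswith("on")]
            rendered = "".join(
                f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
                for k, v in kept
            )
            self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in UNSAFE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def remove_unsafe_data(html_text):
    """Return html_text with unsafe elements and attributes removed."""
    remover = UnsafeDataRemover()
    remover.feed(html_text)
    remover.close()
    return "".join(remover.out)
```

The text and ordinary formatting tags pass through unchanged, which is the property relied upon when the safe copy is later offered to the user.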
Next, in step 90, the files that could be dealt with, or that were determined in step 84 not to contain any unsafe data, are moved from the quarantine area 50 to the safe area 52.
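The move of step 90 might be sketched as below. The function name `release_to_safe_area` is hypothetical, and the sketch assumes that the file keeps its original name in the safe area; handling of name collisions is omitted.

```python
import os
import shutil


def release_to_safe_area(path, safe_dir):
    """Move a processed file out of quarantine and return its new path."""
    os.makedirs(safe_dir, exist_ok=True)
    destination = os.path.join(safe_dir, os.path.basename(path))
    return shutil.move(path, destination)
```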
The process finally returns to step 74, where, after a predetermined time, the quarantine directories are again monitored. Thus, as is conventional, the user of the computer 10 can be warned that a requested web page has been identified as potentially unsafe, and has been put into the quarantine area 50.
Further, the user can now be informed that a safe copy of the file is available in the safe area 52, and that, although the potentially unsafe data, such as certain tags, scripts, programs or references to other websites, have been removed, the text, and any normal formatting, are retained, so that the contents of the page can be read, albeit perhaps not quite as the web page designer originally intended.
Figures 1-3 relate to an embodiment of the present invention, in which the file processing program 48 runs on a client computer 10, and modifies files stored in a quarantine area of the disk 14 of the computer.
In an alternative embodiment of the invention, a web page processing program runs on a web proxy server, through which multiple client computers can access the internet.
Figure 4 is a block schematic diagram of a computer system operating in accordance with this alternative embodiment of the invention. Specifically, Figure 4 shows a computer 110, having a CPU 112 and a disk 114, and acting as a web proxy server in a manner which is generally conventional, as will be well known to the person skilled in the art. The web proxy server 110 has a connection over the internet 20 to a first web server computer 30. As is well known, information can be stored on the web server 30 in the form of web pages, and the users of other internet-connected computers can access these web pages, and download the stored information for viewing on their computers. The web proxy server 110 can also access a second web server computer 32, which contains a threat database 34, as described with reference to Figure 1.
As is well known to the person skilled in the art, users of client computers can connect to the internet 20 through the web proxy server 110, and the web proxy server 110 collects web pages and other data for clients, using HTTP. Again, in a generally conventional way, the web proxy server 110 may pre-process the data which it collects, and may keep a local copy for reasons of efficiency. Figure 4 shows two such client computers 120, 122, although it will be appreciated that any number of such client computers may be connected in this way.
In this second embodiment of the invention, the file processing program runs on the web proxy server 110, but otherwise operates generally in accordance with Figure 3 and the associated description. Thus, when a web page is identified by the proxy software as potentially unsafe, and is moved to a quarantine area of the disk 114, a web page is presented to the client computer, identifying the problem and informing the user. If the program is able to remove the potentially unsafe data, the web page presented to the client computer can also contain a hypertext link to the safe version of the file.
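The page presented to the client computer might, as a minimal sketch, be generated as follows. The wording of the page, the function name `build_notice_page` and the form of the URLs are assumptions for this example.

```python
from html import escape


def build_notice_page(original_url, safe_url=None):
    """Build the replacement page shown in place of a quarantined page.

    safe_url is supplied only when a safe version of the file exists.
    """
    lines = [
        "<html><body>",
        "<h1>Page quarantined</h1>",
        f"<p>The page {escape(original_url)} was identified as potentially unsafe.</p>",
    ]
    if safe_url is not None:
        href = escape(safe_url, quote=True)
        lines.append(f'<p><a href="{href}">View the safe version</a></p>')
    lines.append("</body></html>")
    return "\n".join(lines)
```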
In a further, related, embodiment of the invention, the web proxy software itself can advantageously perform many of the steps of the process shown in Figure 3. For example, the web proxy software can identify potentially dangerous web pages or other files, then modify those pages if possible using the relevant database file.
There is therefore described a system which allows a user to have access to a potentially dangerous file, after the potentially dangerous part of the file contents has been removed.