GB2483246A - Identifying Plagiarised Material - Google Patents

Identifying Plagiarised Material

Info

Publication number
GB2483246A
Authority
GB
United Kingdom
Prior art keywords
computer system
file
fingerprint
suspect
available
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1014476.4A
Other versions
GB201014476D0 (en
Inventor
Daniel Goodman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to GB1014476.4A
Publication of GB201014476D0
Publication of GB2483246A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/106 Enforcing content protection by specific content processing
    • G06F21/1063 Personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G06F17/2211

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Technology Law (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Storage Device Security (AREA)

Abstract

A computer-implemented method of identifying plagiarised material in a suspect file stored in a first computer system by comparison against a database (DBF) of available files in a second, remote computer system, the method comprising processing the suspect file to produce a suspect file fingerprint (S100); and using the result of a comparison (S110) with available file fingerprints from the second computer system database to identify any part of the suspect file which may be plagiarised. The database (DBF) in the second computer system is preferably a master database and the first computer system preferably incorporates a slave database transmitted from the second computer system. The fingerprint is preferably generated using a feature element matrix by producing a hash value for each keyword in a character string.

Description

Method and Apparatus for Detection of Plagiarised Material
The present invention relates to automatic detection of plagiarism. It has applications in the business, government and education spheres. Plagiarism is the copying or (close) imitation of material belonging to another party in preparing a work product, usually followed by the presentation of the product as one's own work.
Plagiarism is a growing problem in the modern world, and while traditionally this is thought of as a problem of students handing in work copied from the internet, the issue is not limited to schools and universities. Any organisation that places information into the public domain, such as publishing houses, government departments, companies and universities, is concerned with both the legal ramifications of publishing work that may not belong to them, and the public embarrassment of publishing work that does not belong to them. An example of the latter case is the Master's student dissertation on weapons of mass destruction in Iraq that was published by UK ministers as part of the UK Government's case for a war.
As a result of these concerns many techniques to detect plagiarism using computer matching techniques have been developed by companies and are now sold as services. The sophistication of these services varies from simple string matching to more complex techniques based around analysing various issues relating to the structure of documents or other files (such as spreadsheets or images). These more sophisticated techniques aim to be able to detect plagiarism even when the original document or other file has been substantially altered. However, while universities may be happy to send student essays to be checked by a third party providing such a service, companies and government organisations may be less happy about releasing commercially sensitive or classified information to a third party organisation.
Detecting plagiarism requires a large database of available documents, for example documents that are already in the public domain (or conceivably available to a smaller group of potential recipients), against which to check the document in question. The building of such a database is normally impractical for a non-specialist organisation, so such services are normally purchased from other companies that specialise in this field. However, an organisation that wants to check that documents they are providing/using do not contain plagiarised content may not wish to release the content of these documents to an external organisation (for example, if these documents are being used to brief ministers in the build up to making policy decisions, or if the company is preparing to release information on a new product). In these circumstances they do not want others to pre-empt the release of information; however, they do want to know that what they are working from is first-hand information, and that they are neither breaching copyright, nor using information that will turn out to be an embarrassment later.
It is thus desirable to provide a method of identifying plagiarised material that does not require release or requires very limited release only of the suspect file (the document or other file to be checked for plagiarism) outside the internal ("home") computer system where it is held.
According to a first aspect, embodiments of the present invention provide a computer-implemented method of identifying plagiarised material in a suspect file stored in a first computer system by comparison against a database of available files in a second, remote computer system, the method comprising processing the suspect file to produce a suspect file fingerprint; and using the result of a comparison with available file fingerprints from the second computer system database to identify any part of the suspect file which may be plagiarised.
The use of "fingerprinting" (producing a unique data file representing the file's characteristics) allows release of the file in a very controlled manner, or not at all, since the fingerprints of files are compared, rather than the files themselves. The method may be carried out in either computer system, dependent on where the various processing operations are more advantageously carried out and on security considerations. Often, the suspect file is required to remain within its home computer system. Therefore the suspect document may be stored within its home computer system (the first computer system) with only its fingerprint sent to the remote computer system (the second computer system) belonging, for example, to a file checking service provider. Here, the term "remote" includes situations in which the two systems are physically separated, but essentially refers to separately controlled (and probably owned) systems, material in one system being potentially confidential with respect to the other system, and possibly vice versa.
Thus in a first option, according to some embodiments, the method set out above is carried out within the first computer system and the suspect file fingerprint is transmitted to the second computer system for the comparison against available file fingerprints, the result of the comparison being transmitted to the first computer system.
The skilled person will appreciate that both computer systems can include transmission and reception components to transfer fingerprints and/or other data, such as File Transfer Protocol (FTP) or an HTTP server for an internet implementation.
As a second option, in other embodiments, the suspect file and its fingerprint may both remain within the first computer system, without transfer to the second computer system. For this methodology, the fingerprints of the files against which the suspect file is checked (the available files) may be transmitted to the first computer system.
Thus in these other embodiments, the method above is carried out within the first computer system, and the second computer system database is a master database, with a slave database of available file fingerprints in the first computer system used for comparison of the suspect file against the available file fingerprints, the slave database being populated by transmission from the second computer system at the start of the method and the result of the comparison being held exclusively within the first computer system.
Here, the slave database in the first (home) computer system can be set up once (in a one-off set up phase, rather than each time a suspect file is checked), and updated as necessary for continuing use of the fingerprint data therein, to avoid unnecessary transmission. Any transmission of the available files themselves from the second computer system can be requested on an ad-hoc basis by the home computer system, although it should be noted that this may give an external indication of suspect file contents.
As a third option, the suspect file itself may be transmitted to the second computer system, the method carried out in the second computer system, and results transmitted to the first computer system. Fingerprinting can still improve security under these circumstances. For example, the suspect file may be held in a protected storage environment in the second computer system and only its fingerprint released further into less secure sections of the second computer system for comparison with the fingerprints of the available files.
According to a further aspect of the invention relating to a method in the second computer system, there is provided a computer-implemented method of detecting plagiarised material in a suspect file from a first computer system by comparison against a master database of available files held by a second, remote computer system, the method comprising, in the second computer system: processing the available files to produce a fingerprint for each available file; receiving from the first computer system a suspect file fingerprint or a suspect file from which a suspect file fingerprint is produced; comparing the suspect file fingerprint (which is first produced, or as transmitted from the first computer system) against the fingerprint for each available file to detect a part of the suspect file which may be plagiarised; and transmitting the results to the first computer system.
Such a method in the second computer system corresponds to the first and third options set out above for the first computer system and has the advantages of a secure implementation (by use of fingerprinting) and, in comparison with the second option, requires a lower level of sophistication and processing capability in the first computer system (which could even be a stand-alone PC) as well as transfer of less information between the two systems. In fact, a representation of the suspect file could even be printed from the first computer system and transferred in paper form to the second computer system, where it could be scanned in and automatic character recognition used to recognise the suspect file contents. The same 'manual' methodology could be used also for transfer of other information between the systems, where practical.
According to a further aspect of the invention relating to a method in the second computer system, there is provided a computer-implemented method of aiding detection of plagiarised material in a suspect file from a first computer system by comparison against a master database of available files held by a second, remote computer system, the method comprising, in the second computer system: processing the available files to produce a fingerprint for each available file; storing the fingerprints in a master database and transmitting the available file fingerprints to the first computer system to provide a slave database in the first computer system for comparison against a fingerprint of a suspect file.
This method in the second computer system corresponds to the second option in the first computer system.
The skilled reader will appreciate that the second computer system could provide any or all of the options described above with respect to the method in the first computer system for different first computer systems (clients), depending on the level of sophistication and security requirements of these clients.
In all these different embodiments, the fingerprinting methodology used for the available files and suspect file is important. In practical implementations, the fingerprinting method is likely to be the same in both cases, or at least substantially similar, perhaps with some minor differences.
Any fingerprinting method is likely to have the effect of obscuring the contents of the file, at least to some extent. Preferably, the fingerprint is produced with reference to key features within the file, and the fingerprinting method produces a fingerprint from which the key features cannot be derived. That is, the fingerprint alone will not allow identification of the key features themselves. This gives improved security.
In a preferred embodiment, using fingerprinting methodology set out in more detail later, the fingerprint for the suspect file and the available files is provided using a methodology of identifying key features in the file in question (for fingerprinting) and building a fingerprint in the form of a signature data structure derived from key features and the relative placement of groups of key features within the file in question. Each group may consist of two (a pair) or more key features.
As set out above, preferably the data structure holds only information from which the key features cannot be derived. To this end, the signature data structure may be a feature element matrix derived from a string in the file for fingerprinting, by producing a hash value for each key feature in the string; storing the hash value and an appearance position of each key feature; determining whether another key feature is within a predetermined range of any given key feature; and constructing the feature element matrix, a matrix element being obtained by associating the two hash values of the key features as a row and column address of the matrix, wherein a '1' indicates a present matrix element, and a '0' indicates that there is no matrix element and thus that the other key feature is not within a certain distance of the given key feature. In this embodiment, the method of providing the key feature hash value from the key feature may be confidential.
The key features may be universal to all computer systems using the fingerprinting methodology. Preferably, the key features are determined in the second computer system, and optionally a list of key features is made available to the first computer system (particularly for use in the embodiments in which the signature is produced in the first computer system).
Further operations in the method can aid the eventual recipient of the results in a manual cross-check. In some embodiments, the method in any of the above alternatives, in either the first or second computer system as appropriate, further comprises producing a highlighted copy of the suspect file and/or the or each available original file from which possible plagiarism has been detected, wherein the highlighting indicates the part of the suspect file which may be plagiarised and/or the part of the or each available file which may have been plagiarised. A different type of highlighting (for example a different colour) may be used for each separate possible instance of plagiarism. If both the available and the suspect file are highlighted, the different highlighting types can be used to match the plagiarised part of the suspect file with the relevant part of the available file which has been plagiarised.
Preferably the available files are files in the public domain, but their availability may be more limited, for example to a group of co-operating companies, or a standards organisation.
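The per-instance highlighting described above can be illustrated with a minimal Python sketch. It is not taken from the application: the function name `highlight`, the word-index ranges and the bracketed numeric labels (standing in for different colours) are all assumptions made for illustration, and the ranges are assumed not to overlap.

```python
def highlight(words, instances):
    """Mark each possible instance of plagiarism with its own label.

    words     -- list of words making up the file text
    instances -- list of (start, end) word-index ranges, end exclusive;
                 assumed non-overlapping (hypothetical input format)
    A numeric label stands in for a distinct highlighting colour.
    """
    out = list(words)
    for label, (start, end) in enumerate(instances, start=1):
        out[start] = f"[{label}:" + out[start]   # open the marker
        out[end - 1] = out[end - 1] + "]"        # close the marker
    return " ".join(out)


marked = highlight("a b c d e".split(), [(1, 3), (4, 5)])
```

If the same label is applied to the matching range in the available file, a reader could pair each marked part of the suspect file with its possible source.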
In many practical implementations, the suspect file and available files are documents.
According to a further aspect of the present invention, there is provided an apparatus which in use identifies plagiarised material in a suspect file stored in a first computer system by comparison against a database of available files in a second, remote computer system, the apparatus comprising a fingerprinting device operable to process the suspect file to produce a suspect file fingerprint; and an identifying device operable to use the result of a comparison with available file fingerprints from the second computer system database to identify a part of the suspect file which may be plagiarised. The apparatus defined may form part of (or all of) the first or second computer system.
The fingerprinting device and identifying device may be embodied as processing means (with access to storage capability) in practical implementations, either as different functionality programmed into the same processor or in different processors.
Two further aspects relate to an apparatus which may form part of (or all of) the second computer system.
According to one of these aspects of the present invention, there is provided an apparatus which in use detects plagiarised material in a suspect file from a first computer system by comparison against a master database of available files held by a second, remote computer system, the apparatus comprising a fingerprinting device operable to process the available files to produce a fingerprint for each available file (and optionally also to process the suspect file to produce a suspect file fingerprint); a receiver operable to receive from the first computer system a suspect file fingerprint or a suspect file from which a suspect file fingerprint is produced; a comparator operable to compare the suspect file fingerprint against the fingerprint for each available file to detect a part of the suspect file which may be plagiarised; and a transmitter operable to send results to the first computer system.
According to the other of these aspects of the invention, there is provided an apparatus which in use aids detection of plagiarised material in a suspect file in a first computer system by comparison against a master database of available files held by a second, remote computer system, the apparatus comprising: a fingerprinting device operable to process the available files to produce a fingerprint for each available file; storage arranged to (operable to) store the fingerprints in a master database and a transmitter operable to transmit the available file fingerprints to the first computer system to provide a slave database in the first computer system for comparison against a fingerprint of a suspect file.
In this aspect the comparison is not directly against the master database, but against a slave database which is effectively a copy of the master database. The copy may be provided each time a suspect file is checked or the copy may be periodically updated.
As before, the devices mentioned can be part of the same or different processors or other units. In essence, the devices defined in each of the aspects may employ memory and processing capability. The parts may be configured by software to provide the functions described.
Further apparatus features correspond to the preferable features set out above in the method aspects. Moreover, the different apparatus aspects can be combined, particularly if the second computer system provides a different service for different clients, as briefly mentioned above.
In one aspect, a plagiarism checking network is provided, comprising at least one first computer system as described above and a second computer system as described above.
According to one computer program aspect there may be provided a computer program which when executed on a first or second computer system carries out any of the above methods. In another computer program aspect, a computer program may be downloaded onto an apparatus to cause that apparatus to operate as any of the apparatus described above.
The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The invention can be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, one or more hardware modules. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a data processing environment. A computer program can be deployed to be executed on one module or on multiple modules at one site or distributed across multiple sites and interconnected by a communication network.
Method steps/operations of the invention can be performed by one or more programmable processors executing a computer program to perform functions of the invention by operating on input data and generating output. Apparatus of the invention can be implemented as programmed hardware or as special purpose logic circuitry, including e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both.
The essential elements of a computer are a processor for executing instructions coupled to one or more memory devices for storing instructions and data.
Test scripts and script objects can be created in a variety of computer languages.
Representing test scripts and script objects in a platform independent language, e.g., Extensible Markup Language (XML) allows one to provide test scripts that can be used on different types of computer platforms.
The invention has been described in terms of particular embodiments. Other embodiments are within the scope of the following claims. For example, the different processes of the invention can be performed in a different order and still achieve desirable results. Multiple test script versions can be edited and invoked as a unit without using object-oriented programming technology; for example, the elements of a script object can be organized in a structured database or a file system, and the operations described as being performed by the script object can be performed by a test control program.
The related art and preferred features of the present invention will now be described, purely by way of example, with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram showing hardware used in the first (home) and second computer system and the link between the two systems;
Figure 1a represents the hardware in both systems according to a first invention embodiment in which a file is transferred to the second computer system for checking;
Figure 1b represents the hardware in both systems according to a second invention embodiment in which a fingerprint is transferred to the second computer system for checking;
Figure 1c represents the hardware in both systems according to a third invention embodiment in which a slave database is provided in the first computer system against which a suspect document can be checked;
Figure 2 is a flow diagram showing related art detection of use of confidential material using proprietary technology;
Figure 3 is a flow diagram showing identification of plagiarised material according to the first invention embodiment in which a file is transferred to the second computer system for checking;
Figure 4 is a flow diagram showing identification of plagiarised material according to the second invention embodiment in which a fingerprint is transferred to the second computer system for checking; and
Figure 5 is a flow diagram showing identification of plagiarised material according to the third invention embodiment in which a slave database is provided in the first computer system against which a suspect document can be checked.
The present invention requires a fingerprinting method. A suitable related art fingerprinting method is set out in the annex, which is the filed version of European Patent Application EP 10155231.3, in the name of Fujitsu Ltd, filed on 2 March 2010.
This document is hereby incorporated by reference.
While applicable to any form of data for which a suitable fingerprint technique exists, the initial embodiments of this invention use an adaptation of related art techniques for fingerprinting documents described in the above Fujitsu application. This art is an email scanning tool that will scan documents and identify keywords within the document. Having identified these it then builds a data structure around their placement within the document, and from this builds a fingerprint of the document.
These fingerprints can then be used to compare the original document with any new document to detect similarity in the documents. This allows a database of documents to be constructed which other documents can be compared against. In the related art document, the database is used for detecting confidential information in emails by building the database from a collection of confidential documents.
A diagram outlining how the detection process could be used to check and highlight a single document is shown as Figure 2. In step S10, fingerprints of confidential documents B are generated. These fingerprints are held in a database of fingerprints DBF. The reader will appreciate that the fingerprints may be accumulated in the database gradually, so that the step S10 is a continuing operation rather than a discrete step. When a "suspect" document A is to be checked (for example when an external email is to be sent), a fingerprint of the document is generated in step S20. In step S30 the suspect document fingerprint is compared against the fingerprints in the database. The result of this comparison is used in step S40 to report suspect parts of the suspect document (if any). Finally, in step S40 the suspect document is highlighted to form document A'.
The data structure used as a fingerprint in this related art document is derived from a character string (corresponding to the text in the suspect document). Keywords may be any word in the character string apart from extremely common words such as "this", "that" and "computer". The process provides a (numerical) hash value for each keyword and stores the hash value and an appearance position of the keyword in a table (see figure 3). Each keyword may appear multiple times in the table, each time with a different appearance position corresponding to its use in a different place in the text.
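As an illustration of the keyword table just described, the following Python sketch stores one (hash value, appearance position) row per keyword occurrence. The stop-word list, the bucket count of 1024 and the use of SHA-256 are assumptions chosen for the example; the related art document does not specify them here, and the actual hash may be confidential.

```python
import hashlib
import re

# Assumption: a tiny stop list standing in for the "extremely common words".
STOP_WORDS = {"this", "that", "computer", "the", "a", "of", "and", "is", "in"}


def keyword_hash(word, buckets=1024):
    """Map a keyword to a small integer (hypothetical hash function)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def build_keyword_table(text):
    """Return (hash value, appearance position) rows, one per occurrence."""
    words = re.findall(r"[A-Za-z']+", text)
    return [
        (keyword_hash(word), position)
        for position, word in enumerate(words)
        if word.lower() not in STOP_WORDS
    ]


table = build_keyword_table("The fox saw that the fox ran")
```

Here the word "fox" contributes two rows with the same hash value but different appearance positions, matching the statement that a keyword may appear several times in the table.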
A feature element matrix is then derived. Using the keyword table, it is possible to determine whether another keyword is within a predetermined range of any given keyword (for example a certain number of words before or behind the given keyword).
In this case, a matrix element is obtained by associating the two hash values of the keywords as a row and column address of the matrix. A '1' indicates a present matrix element, whereas a '0' indicates no matrix element (the other keyword is not within a certain distance of the given keyword).
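The derivation of the feature element matrix can be sketched in Python as follows. This is a simplified reading rather than the related art implementation: the function name, the dense list-of-lists representation, the window of three words and the quadratic pairwise scan are all assumptions made to keep the example short.

```python
def build_feature_matrix(table, size, window=3):
    """Build a 0/1 feature element matrix from a keyword table.

    table  -- list of (hash value, appearance position) rows
    size   -- number of possible hash values (matrix is size x size)
    window -- assumed "predetermined range", counted in words
    """
    matrix = [[0] * size for _ in range(size)]
    for hash_a, pos_a in table:
        for hash_b, pos_b in table:
            # Another keyword counts if it lies within the window,
            # before or behind the given keyword.
            if pos_a != pos_b and abs(pos_a - pos_b) <= window:
                matrix[hash_a][hash_b] = 1
    return matrix


# Hypothetical table: two keywords close together, a third far away.
matrix = build_feature_matrix([(2, 0), (5, 1), (7, 10)], size=8)
```

Only the hash values and their proximity survive in the matrix; the keywords and their absolute positions are not stored in it.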
The feature element matrix may be converted to another form, for example serialised into a string, such as a character string.
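One simple way to serialise such a matrix into a character string is shown below as a hedged Python sketch. The row-per-line format of '0' and '1' characters is an assumption made for illustration; the related art document may use a different encoding, and the sketch assumes a non-empty matrix.

```python
def serialise(matrix):
    """Turn a 0/1 matrix into a string, one row of digits per line."""
    return "\n".join("".join(str(bit) for bit in row) for row in matrix)


def deserialise(text):
    """Rebuild the 0/1 matrix from the string produced by serialise()."""
    return [[int(ch) for ch in line] for line in text.split("\n")]


original = [[1, 0], [0, 1]]
as_string = serialise(original)
```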
For the comparison with the feature element matrix of another document, a logical product of the two matrices may be computed, to form a common matrix, which can be analysed for the number of common elements (number of 1 values). This number can then be compared against a preselected limit. Moreover, other analyses can be carried out, in particular combining three keywords, as detailed further in the related art document.
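The logical-product comparison can be written out as a short Python sketch. The function names are hypothetical, and treating "at or above the limit" as a possible match is an assumption, since the text only says the count is compared against a preselected limit.

```python
def common_matrix(matrix_a, matrix_b):
    """Element-wise logical product (AND) of two equal-sized 0/1 matrices."""
    return [
        [a & b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(matrix_a, matrix_b)
    ]


def common_count(matrix):
    """Number of common elements, i.e. the number of 1 values."""
    return sum(sum(row) for row in matrix)


def may_be_plagiarised(matrix_a, matrix_b, limit):
    """Assumed decision rule: flag when the common count reaches the limit."""
    return common_count(common_matrix(matrix_a, matrix_b)) >= limit


suspect = [[1, 0], [1, 1]]
available = [[1, 1], [0, 1]]
shared = common_matrix(suspect, available)
```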
If the database described in the Fujitsu application is constructed from publicly available documents instead of confidential documents it can be used to detect plagiarism. This could be used either as a standalone system, or in collaboration with the database of confidential information. Such a database would be very large, so this technology would probably best be provided as a service from an existing plagiarism detection company or another service provider.
Figure 3 is a flowchart showing an embodiment in which a suspect file A is transferred from its home computer system for checking by a second, remote computer system.
The transfer may be by any electronic or paper method. The available files B1, B2, B3 etc are held in the second computer system, and their fingerprints are generated in step S50 and held in database DBF. In step S60, the fingerprint of the suspect document A is generated in the second computer system and in step S70, the fingerprint is compared with the available file fingerprints. The reporting of the suspect parts of the file in step S80 may take place in either computer system, and a highlighted copy of A, shown as A', may be provided.
In addition to plagiarism detected by the submission of documents directly, it would be possible for an organisation to submit just the fingerprint, and receive back the information about documents in the public domain which match these fingerprints. As the fingerprints obfuscate the contents of the document, this would mean that documents could be checked without having to release to the third party the document that is being checked.
Figure 4 is a diagram demonstrating the workflow for this process. Here the comparison of the fingerprints takes place in the second computer system and only the suspect file fingerprint is transferred, rather than the entire file. In step S90 fingerprints are generated for available documents (for example on an ongoing basis, rather than as a single discrete step) and stored in database DBF. In step S100 the suspect fingerprint is generated in the first computer system. It is transferred to the second computer system and in step S110 the suspect fingerprint is compared with the available fingerprints in the second computer system. Using the comparison from the second computer system, a report is produced in the first computer system of any suspect parts of the document. For example, since the hash function for generating the fingerprint is known, then this can be reversed and combined with the results of the comparison to generate the highlighted copy of the original document. That is, the result of the comparison may be returned to the first system to enable the report to be generated. Finally, a highlighted copy of the document A' may be produced.
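One possible reading of how the first computer system could turn the returned comparison result into suspect positions is sketched below in Python. It assumes, purely for illustration, that the first system retains its own keyword table of (hash value, appearance position) rows and that the second system returns the set of matching (row, column) matrix addresses; neither detail is stated in this form in the text, and the window of three words is likewise an assumption.

```python
def suspect_positions(table, matched_pairs, window=3):
    """Map matched matrix addresses back to word positions in the suspect file.

    table         -- (hash value, appearance position) rows kept locally
    matched_pairs -- set of (row, column) addresses reported as common
    window        -- assumed range used when the fingerprint was built
    """
    positions = set()
    for hash_a, pos_a in table:
        for hash_b, pos_b in table:
            nearby = pos_a != pos_b and abs(pos_a - pos_b) <= window
            if nearby and (hash_a, hash_b) in matched_pairs:
                positions.update((pos_a, pos_b))
    return sorted(positions)


local_table = [(2, 0), (5, 1), (7, 2), (9, 10)]
returned = {(2, 5), (5, 2)}  # hypothetical result from the second system
flagged = suspect_positions(local_table, returned)
```

The flagged word positions could then drive the report and the highlighted copy A' without the document text leaving the first computer system.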
However, the above approach might still lead to the leakage of some information since if any documents had partial matches with fingerprints in the database, it would be possible to look and see the content of the publicly available documents that caused this match. An even more secure approach would be for the database to be fed by the third party, but be hosted and administered by the party wishing to check their documents. This would prevent any potential analysis of which documents are checked, including the prevention of frequency analysis of document checking, which might allow the service provider to predict that the user was about to release a large amount of information. A diagram demonstrating the workflow for this alternative arrangement can be seen in Figure 5.
In Figure 5, the suspect document fingerprint is generated in step S140 and available document fingerprints are generated in step S130 and held in a master database MDB, with a slave database DBF in the first computer system (which may correspond to all or part of the master database) being used for fingerprint comparison in step S150. The slave database is updated as necessary. Step S160 provides the report of the suspect document parts.
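A minimal sketch of the master/slave arrangement of Figure 5 is given below, under stated assumptions: fingerprints are modelled as sets of opaque integers, the master keeps an append-only log so that the slave can request only the entries it has not yet seen, and the class and method names (MasterDatabase, SlaveDatabase, changes_since, sync) are hypothetical. The description does not specify how the slave database is updated.

```python
class MasterDatabase:
    """Master database MDB in the second computer system (step S130 output)."""
    def __init__(self):
        self._log = []  # append-only list of (doc_id, fingerprint)

    def add(self, doc_id, fingerprint):
        self._log.append((doc_id, frozenset(fingerprint)))

    def changes_since(self, version):
        # Return entries added after `version`, plus the new version number.
        return self._log[version:], len(self._log)

class SlaveDatabase:
    """Slave database DBF hosted in the first computer system."""
    def __init__(self):
        self.version = 0
        self.fingerprints = {}

    def sync(self, master):
        # Pull only new fingerprints; nothing about the suspect document is sent.
        entries, self.version = master.changes_since(self.version)
        for doc_id, fingerprint in entries:
            self.fingerprints[doc_id] = fingerprint
        return len(entries)

    def compare(self, suspect_fp):
        # Step S150: the comparison runs entirely within the first system.
        suspect_fp = set(suspect_fp)
        return {doc_id: len(suspect_fp & fp)
                for doc_id, fp in self.fingerprints.items() if suspect_fp & fp}
```

Because the only traffic is the periodic sync from master to slave, the second system learns neither which documents are checked nor how often.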
The methods described above require both the customer and the service provider to be using the same set of keywords (key features) for the analysis, but this can be achieved through the service provider making this list available to the customer, for example through some out-of-band mechanism, such as an RSS feed.
Figure 1a shows the hardware used for the first embodiment of the present invention, in which a file is transferred to the second computer system for checking. The home system is linked to a storage device 130 which holds suspect document A and includes a processing unit 50 with a display adapter 60, for display on a terminal with VDU 70.
The user can provide input to the home system via a keyboard 80 and/or other standard input devices, such as a mouse. The home system 10 is connected to the second system 20 over a network, using network interfaces (which effectively provide transmitter/receiver functionality) 30 (in the home system) and 40 (in the second system). In the second system, there is provided a processing unit 90, including a fingerprint generator (fingerprinting device) 100, comparison engine (comparator) 110, and report generator and highlighter (identifying device) 120. The second system is linked to a fingerprint database DBF and available documents B are stored in a storage device. The hardware may function as follows. From the home system, a suspect file A is transferred from storage device 130 over the network interfaces 30, 40 to the second system. The file may be transferred as part of an automatic checking system, or following specific user input. In the second system, the fingerprint of the suspect file A is generated in the fingerprint generator. Fingerprints of available documents B have also been generated using the fingerprint generator, and the comparison engine 110 compares the fingerprint of the suspect document with each of the available document fingerprints in turn. A report generator and highlighter 120 uses the result of the comparison to produce a report and/or highlight parts of the suspect document which may be plagiarised and/or parts of the available file(s) which may have been plagiarised. The report/highlighted document(s) are then returned to the home system via the network interfaces for display.
Figure 1b represents the hardware in both systems according to the second invention embodiment, in which a fingerprint is transferred to the second computer system for checking. Parts having the same functionality as in the Figure 1a embodiment are identically numbered and therefore detailed description thereof will be omitted. In this embodiment a further fingerprint generator 140 is required in the home system 10, and the report generator and highlighter 120 is situated in the home system rather than in the second system. A fingerprint of a suspect document generated in the home system is transferred via interfaces 30, 40 to processing unit 90 in the second system, where it is compared in comparison engine 110 with fingerprints of available documents that have been generated in the generator 100 of the second system. The result of the comparison is sent over the network interfaces to the report generator and highlighter in the home system. The report/highlighted document(s) can be displayed on display 70.
Figure 1c represents the hardware in both systems according to the third invention embodiment, in which a slave database is provided in the first computer system against which a suspect document fingerprint can be checked. Parts having the same functionality as in the Figure 1a and Figure 1b embodiments are identically numbered and therefore detailed description thereof will be omitted. Fingerprints for all the available documents are produced by fingerprint generator 100 in the second system and the fingerprints are stored in master database MDB. Data is sent over the network interfaces to provide a slave database DBF of fingerprints in the home system. Thus the comparison engine 110 and report generator and highlighter 120 are both provided in the home system, along with fingerprint generator 140, to produce a fingerprint of the suspect document.
In all of the above cases, sections of the fingerprints may be matched pairwise, and it is possible to return, together with the highlighted document, the documents that are believed to have been plagiarised. This was not a possible operation in the original use of the related art method for preventing release of confidential material, as the documents that were fingerprinted were confidential and so were to remain hidden.
As stated earlier in this document, the initial embodiment of this technique is described in relation to the detection of plagiarism in documents, but the system can be applied to any other field for which a suitable fingerprinting technique is available. The properties of such a technique are that it is able to generate a fingerprint or 'hash' of a piece of data that both obfuscates the original data, and can be compared with other fingerprints of similar but not identical data to determine the level of similarity.
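As an illustration of the second property, the level of similarity between two fingerprints might be scored as sketched below. This assumes that a fingerprint can be treated as a set of opaque elements (for example, the positions of the set entries of a feature element matrix); the two measures shown, Jaccard similarity and containment, are common choices and are not taken from the description.

```python
def jaccard(fp_a, fp_b):
    """Symmetric similarity: shared elements over all elements in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def containment(suspect_fp, available_fp):
    """Fraction of the suspect fingerprint that also appears in the available one.
    Useful when a short suspect passage is lifted from a much longer source."""
    suspect = set(suspect_fp)
    if not suspect:
        return 0.0
    return len(suspect & set(available_fp)) / len(suspect)
```

Neither function needs access to the underlying data, so similar but not identical data can be scored using the fingerprints alone.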
The invention has been described with reference to a suspect document and a collection of available documents, but the available documents may be any other collection of documents against which a suspect document may be checked; they need not be available, and in one application may in fact be a collection of confidential documents, against which a third party is checking suspect documents for one or more clients.
Summary of Benefits of Invention Embodiments
Invention embodiments can provide a way of preventing/reducing data disclosure while performing plagiarism detection. They can provide a service for use by both lower-level and sophisticated clients. While other systems may be able to generate fingerprints, it is not always clear that such a fingerprint would sufficiently obfuscate the original documents. The way that the related art document produces a feature element matrix using hash values can be used to hide the document contents, in the sense that the way the hash value is computed can be confidential to the second computer system or held as joint confidential material between the first and second computer systems.
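Purely as a sketch, a feature element matrix of the kind referred to above might be built as follows. Several details are assumptions rather than part of the description: keywords are found by whitespace tokenisation, the hash is a truncated SHA-256 reduced modulo a small matrix size, the row is taken from the earlier keyword and the column from the later one, and an optional secret prefix models a hash computation that is kept confidential. The names keyword_hash and feature_element_matrix are hypothetical.

```python
import hashlib

def keyword_hash(word, size=64, secret=b""):
    """Hash a keyword into the range [0, size); `secret` keeps the mapping confidential."""
    digest = hashlib.sha256(secret + word.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size

def feature_element_matrix(text, keywords, max_distance=5, size=64, secret=b""):
    """Set element [h1][h2] to 1 when keyword h2 appears within max_distance words
    after keyword h1; all other elements stay 0."""
    keywords = {k.lower() for k in keywords}
    # Store the hash value and appearance position of each keyword, in text order.
    hits = [(keyword_hash(w, size, secret), pos)
            for pos, w in enumerate(text.lower().split()) if w in keywords]
    matrix = [[0] * size for _ in range(size)]
    for i, (h1, p1) in enumerate(hits):
        for h2, p2 in hits[i + 1:]:
            if p2 - p1 > max_distance:
                break  # hits are ordered by position, so later ones are further away
            matrix[h1][h2] = 1
    return matrix
```

The matrix records only which hashed keywords occur near one another, so the same keyword list and the same hash computation must be used for the suspect file and for the available files if the resulting matrices are to be comparable.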

Claims (16)

  1. A computer-implemented method of identifying plagiarised material in a suspect file stored in a first computer system by comparison against a database of available files in a second, remote computer system, the method comprising processing the suspect file to produce a suspect file fingerprint; and using the result of a comparison with available file fingerprints from the second computer system database to identify any part of the suspect file which may be plagiarised.
  2. The method according to claim 1, carried out within the first computer system, wherein the suspect file fingerprint is transmitted to the second computer system for the comparison against available file fingerprints, the result of the comparison being received by the first computer system.
  3. The method according to claim 1, carried out within the first computer system, wherein the second computer system database is a master database, with a slave database of available file fingerprints in the first computer system used for comparison of the suspect file against the available file fingerprints, the method further comprising initial population of the slave database using a transmission from the second computer system, wherein the result of the comparison is held exclusively within the first computer system.
  4. A computer-implemented method of detecting plagiarised material in a suspect file from a first computer system by comparison against a master database of available files held by a second, remote computer system, the method comprising, in the second computer system: processing the available files to produce a fingerprint for each available file; receiving from the first computer system a suspect file fingerprint or a suspect file from which a suspect file fingerprint is produced; comparing the suspect file fingerprint against the fingerprint for each available file to detect a part of the suspect file which may be plagiarised; and transmitting the results to the first computer system.
  5. A computer-implemented method of aiding detection of plagiarised material in a suspect file from a first computer system by comparison against a master database of available files held by a second, remote computer system, the method comprising, in the second computer system: processing the available files to produce a fingerprint for each available file; storing the fingerprints in a master database; and transmitting the available file fingerprints to the first computer system to provide a slave database in the first computer system for comparison against a fingerprint of a suspect file.
  6. The method according to any of the preceding claims, further comprising transmission of any available files identified as possibly plagiarised by the result of the comparison to the first computer system.
  7. A method according to any of the preceding claims, wherein the fingerprinting method is the same for the suspect file as for the available files.
  8. A method according to any of the preceding claims, wherein the fingerprint is produced with reference to key features within the file, and the fingerprinting method produces a fingerprint from which the key features cannot be derived.
  9. A method according to any of the preceding claims, wherein the fingerprint is provided using a methodology of identifying key features in the file for fingerprinting and building a fingerprint in the form of a signature data structure derived from key features and the relative placement of groups of key features within the file in question.
  10. A method according to claim 9, wherein the signature data structure is a feature element matrix derived from a character string in the file for fingerprinting, by producing a hash value for each keyword in the character string; storing the hash value and an appearance position of each keyword; determining whether another keyword is within a predetermined range of any given keyword; and constructing the feature element matrix, a matrix element being obtained by associating the two hash values of the keywords as a row and column address of the matrix, wherein a '1' indicates a present matrix element, and a '0' indicates that there is no matrix element and thus that the other keyword is not within a certain distance of the given keyword.
  11. A method according to any of the preceding claims, further comprising producing a highlighted copy of the suspect file and/or the or each available plagiarised file from which possible plagiarism has been detected, wherein the highlighting indicates the part of the suspect file which may be plagiarised and/or the part of the or each available file which may have been plagiarised.
  12. An apparatus which in use identifies plagiarised material in a suspect file stored in a first computer system by comparison against a database of available files in a second, remote computer system, the apparatus comprising a fingerprinting device operable to process the suspect file to produce a suspect file fingerprint; and an identifying device operable to use the result of a comparison with available file fingerprints from the second computer system database to identify a part of the suspect file which may be plagiarised.
  13. An apparatus which in use detects plagiarised material in a suspect file from a first computer system by comparison against a master database of available files held by a second, remote computer system, the apparatus comprising a fingerprinting device operable to process the available files to produce a fingerprint for each available file; a receiver operable to receive from the first computer system a suspect file fingerprint or a suspect file from which a suspect file fingerprint is produced; a comparator operable to compare the suspect file fingerprint against the fingerprint for each available file to detect a part of the suspect file which may be plagiarised; and a transmitter operable to send results to the first computer system.
  14. An apparatus according to claim 13, wherein the fingerprinting device is also operable to process the suspect file to produce a suspect file fingerprint.
  15. An apparatus which in use aids detection of plagiarised material in a suspect file in a first computer system by comparison against a master database of available files held by a second, remote computer system, the apparatus comprising: a fingerprinting device operable to process the available files to produce a fingerprint for each available file; storage arranged to store the fingerprints in a master database; and a transmitter operable to transmit the available file fingerprints to the first computer system to provide a slave database in the first computer system for comparison against a fingerprint of a suspect file.
  16. A remote computer system for use in detection of plagiarism in a suspect file held in one or more home computer systems, the remote computer system comprising any of the features of claims 13 to 15.
GB1014476.4A 2010-09-01 2010-09-01 Identifying Plagiarised Material Withdrawn GB2483246A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1014476.4A GB2483246A (en) 2010-09-01 2010-09-01 Identifying Plagiarised Material

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1014476.4A GB2483246A (en) 2010-09-01 2010-09-01 Identifying Plagiarised Material

Publications (2)

Publication Number Publication Date
GB201014476D0 GB201014476D0 (en) 2010-10-13
GB2483246A true GB2483246A (en) 2012-03-07

Family

ID=43013478

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1014476.4A Withdrawn GB2483246A (en) 2010-09-01 2010-09-01 Identifying Plagiarised Material

Country Status (1)

Country Link
GB (1) GB2483246A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095735A (en) * 2016-06-06 2016-11-09 北京中加国道科技有限责任公司 A kind of method plagiarized based on deep neural network detection academic documents
GB2561177A (en) * 2017-04-03 2018-10-10 Edinburgh Napier Univ Method for identification of digital content
EP3767489A1 (en) * 2019-07-16 2021-01-20 National Tsing Hua University Privacy-kept text comparison method, system and computer program product

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Academic plagiarism: An Analysis of Current Technological Issues *
Distributed Similarity and Plagiarism Search *
Winnowing: Local Algorithms for Document Fingerprinting *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095735A (en) * 2016-06-06 2016-11-09 北京中加国道科技有限责任公司 A kind of method plagiarized based on deep neural network detection academic documents
GB2561177A (en) * 2017-04-03 2018-10-10 Edinburgh Napier Univ Method for identification of digital content
GB2561177B (en) * 2017-04-03 2021-06-30 Cyan Forensics Ltd Method for identification of digital content
US11762959B2 (en) 2017-04-03 2023-09-19 Cyacomb Limited Method for reducing false-positives for identification of digital content
EP3767489A1 (en) * 2019-07-16 2021-01-20 National Tsing Hua University Privacy-kept text comparison method, system and computer program product
US11232157B2 (en) 2019-07-16 2022-01-25 National Tsing Hua University Privacy-kept text comparison method, system and computer program product

Also Published As

Publication number Publication date
GB201014476D0 (en) 2010-10-13

Similar Documents

Publication Publication Date Title
Horsman Tool testing and reliability issues in the field of digital forensics
CN104063664B (en) The safety detection method of software installation bag, client, server and system
CN109376078B (en) Mobile application testing method, terminal equipment and medium
US9081987B2 (en) Document image authenticating server
US8286171B2 (en) Methods and systems to fingerprint textual information using word runs
US11677783B2 (en) Analysis of potentially malicious emails
US20190005268A1 (en) Universal original document validation platform
EP1995681A1 (en) Authenticity assurance system for spreadsheet data
US20060190988A1 (en) Trusted file relabeler
CN108140084A (en) Using multilayer tactical management come managing risk
CA3033144A1 (en) Tracing objects across different parties
CN111191246A (en) Spring annotation based security development verification method
CN109829304A (en) A kind of method for detecting virus and device
CN106021237A (en) Language independent probabilistic content matching
KR101742041B1 (en) an apparatus for protecting private information, a method of protecting private information, and a storage medium for storing a program protecting private information
GB2483246A (en) Identifying Plagiarised Material
CN114047854B (en) Information interaction method and device for document processing, electronic equipment and storage medium
Mainka et al. Shadow Attacks: Hiding and Replacing Content in Signed PDFs.
Dubettier et al. File type identification tools for digital investigations
US20100325156A1 (en) Systems and methods for secure data entry and storage
CN113282550A (en) File preview method and device, computer equipment and storage medium
EP3149648A2 (en) Document meta-data repository
US20240265103A1 (en) Systems and Methods for Detecting, Localizing, and Visualizing Manipulations of Portable Document Format Files
CN116384352B (en) Data set generation method, device, equipment and medium
RU2778460C1 (en) Method and apparatus for clustering phishing web resources based on an image of the visual content

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)