GB2561241A

GB2561241A - A managed file transfer system and method

Info

Publication number: GB2561241A
Application number: GB1705658.1A
Authority: GB
Inventors: O'Keeffe Shane; Cussen Danielle; O'Dwyer Micheál; Tracey Michael
Original assignee: Iconx Solutions
Current assignee: Iconx Solutions
Priority date: 2017-04-07
Filing date: 2017-04-07
Publication date: 2018-10-10
Also published as: GB201705658D0

Abstract

A method and system for the pre-processing of a file before transfer over a file transfer system. The file is processed in accordance with a workflow, which comprises a number of steps such as anonymisation or pseudonymisation to be performed on data held in one or more fields of the file. A preview 33 of the file is then generated and is shown to a user to determine if the processed file is suitable for transfer. This preview may be alongside a display of the unprocessed fields 31. If the processed file is unsuitable for transmission, further steps of the workflow may be specified. The file format, header or trailer may be defined. The anonymisation or pseudonymisation action may be a blanking, masking, hashing, filtering, or a number function operation. The anonymisation or pseudonymisation action may also be a field replace function based on other fields picked either randomly, repeatably, from a defined list or from a statistical analysis.

Description

(54) Title of the Invention: A managed file transfer system and method

Abstract Title: Anonymisation or pseudonymisation of data in a file before transfer of the file


At least one drawing originally filed was informal and the print reproduced here is taken from a later filed formal copy.


Fig. 6: Anonymisation Workflow 1 - Customer Care

Step  Action       Field  Description
1     Masking      4      Last 4 digits of phone number
2     Blanking     8      Remove bank account number
3     Blanking     24     Remove PPS Number
4     Rounding     11     Round salary down to nearest 10k
5     Replacement  2      Replace first name from list


“A managed file transfer system and method”

Introduction

This invention relates to a managed file transfer system and method.

There are many situations where it is desirable or necessary to transfer large files containing data from one location to another. For example, a system administrator may wish to copy the contents of a database and store the copy in a remote back-up memory. Alternatively, a worker may wish to take a copy of the contents of a large file so that they or another party may carry out analysis on the data. It is common for organisations, particularly large organisations, to transfer data from one location to another location for storage, processing and/or analysis of the data. For example, it is not uncommon for one company department to send data to another company department or for one company to send data to another company. The company departments may be situated in the same or in different locations. Indeed, the company departments or the companies, in the case of a company to company transfer, may be located in the same or in different jurisdictions.

Take for example a large telecommunications company. There are many instances where the workers in the telecommunications company would wish to transfer large files of data from one location to another. There may be a record of all data traffic handled by the telecommunications company on a switch in their network stored in a central repository. An engineer in the engineering department may wish to analyse all of the network traffic through that switch for a given time period to see whether or not their network resources are adequate for the volume of traffic experienced. On the other hand, a marketing professional may wish to analyse all of the network traffic through that switch for a given time period to analyse their customer demographic and how those customers are using the network so that they may target their marketing efforts more effectively. Finally, an accounting professional may wish to transfer the call data from the switch to a third party clearing house to reconcile accounts with other telecommunications operators.

However, the transfer of files and data represents a material risk for any organisation. Whenever moving data, even in an intracompany transfer, there is a risk that the data may be sent to the wrong destination or that the data may be intercepted en route and copied, redirected or modified. Alternatively, or in addition to this, whenever moving a file, there is a risk that data in the file that is not relevant or appropriate for the target location is transferred along with other appropriate data. For example, it would not be appropriate to transfer a customer’s billing information along with their call information to the engineering department.

It is clear that there are substantial risks when moving files with customers’ credit card details and/or personally identifiable information (PII) from one location to another. Organisations have a duty of care to their customers to ensure that their data is not inadvertently released and there are heavy penalties and serious consequences for breaches of this duty of care. For example, the EU General Data Protection Regulation, due to come into force in May 2018, has proposed fines of up to €20,000,000 or 4% of global company turnover, whichever is the larger, for wilful negligence in the case of inadvertent release of customer data. Furthermore, many companies have already found how detrimental a serious breach of customer data can be for their business, both reputationally and financially, in many cases resulting in the closure of the business.

There are also many instances when customer information is transferred unnecessarily. For example, if a network engineer is analysing network traffic through a switch between phones having a first area code to phones having a second area code, they will typically only require a small subset of the data contained in the call log for the switch. In this instance, they will only require calls originating/terminating from phones having a first area code prefix to or from calls having the second area code prefix, the number of calls, the duration of those calls, and the time of those calls. They will not need to know the full number of the calling or called party, simply their prefix and they will not need to know other information that could be used to identify the calling/called party such as name, address, billing information and the like. It is not however uncommon for this additional superfluous data to be sent which not only represents a data breach risk but also is inefficient from a bandwidth usage and storage point of view and increases the data that must be processed at the receiver’s end.

It is an object of the present invention to provide a managed file transfer system and method that overcomes at least some of the above-identified problems and that provides an alternative choice to the consumer.

Statements of Invention

According to the invention there is provided a method of transferring a file between a first host connector and a second host connector, the method comprising the steps of:

specifying a source location in the first host connector;

specifying a destination location in the second host connector;

specifying the file to be transferred between the source location in the first host connector and the destination location in the second host connector;

specifying a file transfer schedule;

characterised in that, the method comprises the step of processing the file prior to the transfer of the file, the file having a plurality of records containing data in a plurality of fields arranged in a structured format, the processing steps comprising:

processing the file in accordance with a workflow, the workflow comprising one or more steps, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields, thereby anonymizing or pseudonymizing the data in one or more fields of the file;

generating a preview of the processed file;

displaying the preview of the processed file on a user interface for evaluation by a user; and on the processed file being deemed suitable for transfer upon evaluation by the user, transferring the file to the destination location in the second host connector.

By having such a method, any superfluous data contained in the file may be anonymized or pseudonymized prior to the file being saved in local or remote storage. This will reduce the likelihood of sensitive data being released inadvertently, thereby minimising risk for the company, and may also reduce the size of the file being transferred and/or stored. Advantageously, a preview of the processed file is generated and thereafter displayed on a user interface so that the user may evaluate the data and assess whether there is any superfluous data being sent or whether any of the data in the file that has not been anonymized or pseudonymized should be anonymized or pseudonymized prior to the processed file being shared or stored in local or remote memory. This will allow the user to quickly determine whether the file has been adequately processed.

In one embodiment of the invention there is provided a method of processing a file in which the method further comprises the step of simultaneously displaying the preview of the processed file with the corresponding portions of the original file on the user interface. This is seen as a particularly advantageous aspect of the present invention. In this way, the operator will be able to determine whether the data contained in the processed file has indeed been sufficiently altered so that it does not represent a data breach risk.

In one embodiment of the invention there is provided a method of processing a file in which the step of generating a preview of the processed file comprises generating a preview of a sub-set of the plurality of records in the file. It is envisaged that this will be a more effective way of presenting the file and the processed file for evaluation and will enable the operator to more quickly evaluate whether or not the file has been anonymized or pseudonymized sufficiently.
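The processing, preview and approval steps described above can be sketched as follows. This is a minimal illustration in Python and not the implementation of the invention: the names `blank`, `process`, `transfer_if_approved`, `approve` and `send` are hypothetical, records are assumed to be simple dictionaries, and the preview is assumed to be the first few records shown alongside the corresponding originals.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[Record], Record]  # one anonymisation action on one field


def blank(field: str) -> Step:
    """Return a workflow step that empties the named field (hypothetical helper)."""
    def step(record: Record) -> Record:
        return {**record, field: ""}
    return step


def process(records: List[Record], workflow: List[Step]) -> List[Record]:
    """Apply every workflow step, in order, to every record."""
    processed = []
    for record in records:
        for step in workflow:
            record = step(record)
        processed.append(record)
    return processed


def transfer_if_approved(
    records: List[Record],
    workflow: List[Step],
    approve: Callable[[List[Record], List[Record]], bool],
    send: Callable[[List[Record]], None],
    preview_size: int = 5,
) -> bool:
    """Process the file, show a preview of a sub-set of records next to the
    originals, and transfer the processed file only if the reviewer approves."""
    processed = process(records, workflow)
    original_preview = records[:preview_size]
    processed_preview = processed[:preview_size]
    if approve(original_preview, processed_preview):
        send(processed)
        return True
    return False
```

In this sketch a rejected preview simply returns `False`; the reviewer would then add further steps to the workflow and run it again.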

In one embodiment of the invention there is provided a method comprising the initial step of:

creating a workflow for the file including specifying the one or more steps of the workflow, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields.

In one embodiment of the invention there is provided a method of processing a file in which the step of specifying one or more steps of the workflow further comprises specifying one or more parameters for the anonymization or pseudonymization action.

In one embodiment of the invention there is provided a method of processing a file in which if the processed file is deemed unsuitable upon evaluation, the method comprises the further step of specifying one or more additional steps of the workflow, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields. In this way, if the file has not been sufficiently anonymized or pseudonymized so that it still represents a data breach risk, the operator may thereafter specify further steps to the workflow and these steps will then be performed on the file when it is next processed.

In one embodiment of the invention there is provided a method in which the step of specifying the file transfer schedule comprises scheduling periodic processing of the file subsequent to the initial processing of the file, the subsequent periodic processing of the file comprising the steps of:

processing the file in accordance with the workflow thereby anonymizing or pseudonymizing the data in one or more fields of the file; and thereafter transferring the file to the second host connector.

In other words, the method according to the invention may be formed as an integral component of a MFT/MDT system in which once the workflow is created, it may be used repeatedly on subsequent instances of the file. This may be useful for example if a given file is being exported to remote memory periodically. Once the workflow has been created and evaluated, it may be used again and again without requiring the operator to review the processed file each time.

In one embodiment of the invention there is provided a method of processing a file in which the method comprises the initial step of defining the format of the file.

In one embodiment of the invention there is provided a method of processing a file in which the method comprises the initial step of specifying a header portion of the file.

In one embodiment of the invention there is provided a method of processing a file in which the method comprises the initial step of specifying a trailer portion of the file.

In one embodiment of the invention there is provided a method of processing a file in which the anonymization or pseudonymization action comprises a blanking operation. By blanking data in a field of a record, the entire field entry will be removed thereby anonymizing that part of the data making that part of the file general data protection regulation (GDPR) compliant. Furthermore, this will reduce the amount of data being sent and will reduce the size of the file being sent and/or stored.

In one embodiment of the invention there is provided a method of processing a file in which the anonymization or pseudonymization action comprises a masking operation. Masking data may be useful if only part of a record is required by an end user. For example, if handling phone numbers and it is desired to review calls between numbers of a given prefix, only the prefix would be left unmasked whereas the remainder of the number could be masked to obviate the possibility of a data breach.
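The blanking and masking operations can be illustrated with the short Python sketch below. The function names, the default prefix length of three characters and the `*` mask character are assumptions made for the example; the specification does not prescribe them.

```python
def blank(value: str) -> str:
    """Blanking: remove the entire field entry."""
    return ""


def mask(value: str, keep_prefix: int = 3, mask_char: str = "*") -> str:
    """Masking: keep the first keep_prefix characters and hide the remainder."""
    hidden = max(len(value) - keep_prefix, 0)
    return value[:keep_prefix] + mask_char * hidden


print(mask("0871234567"))  # 087*******
```

Keeping the masked value the same length as the original is a design choice of this sketch; it preserves the field width, which matters for fixed width files.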

In one embodiment of the invention there is provided a method of processing a file in which the anonymization or pseudonymization action comprises a hashing operation. Hashing is seen as a useful technique because it can effectively be made irreversible so that a person cannot reverse engineer the hashed value into the original value but will still allow the hashed value to be used to compare this record with other records containing this hashed value. For example, if it was desirable to track a number of users’ calls but it was not desirable to release their phone numbers, the phone numbers in the file could be hashed and these hashed numbers could then be used to analyse the calls of one or more people in the group.
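A hashing operation of this kind might be sketched as below. The specification does not name a hash algorithm; the use of HMAC-SHA-256 with a secret key is an assumption of this example, chosen because an unkeyed hash of a short value such as a phone number can be recovered by hashing every possible number. The function name `hash_field` is hypothetical.

```python
import hashlib
import hmac


def hash_field(value: str, secret_key: bytes) -> str:
    """Replace a field value with a keyed SHA-256 digest (hex encoded).

    The same input and key always give the same digest, so records belonging
    to the same subscriber can still be matched after hashing.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

The hex digest is 64 characters long, which is considerably wider than a typical phone number or date field.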

In one embodiment of the invention there is provided a method of processing a file in which the anonymization or pseudonymization action comprises a number function operation.

In one embodiment of the invention there is provided a method of processing a file in which the anonymization or pseudonymization action comprises a filtering operation. A filtering operation is seen as particularly useful to reduce the amount of data that is transmitted from one place to another and/or stored in memory. For example, when analysing call traffic through a switch, it may be possible to filter out all call attempts that were unsuccessful. These may be filtered out by omitting any calls with a call duration of 0. These records will not have to be stored or analysed.

In one embodiment of the invention there is provided a method of processing a file in which the anonymization or pseudonymization action comprises a field replacement operation. This is seen as a particularly useful embodiment of the invention. In this way, the data in the file will appear as though it is a real data set with the original names and/or information when in actual fact, these names or information will have been replaced by other, different information. For example, a surname field may be replaced with another surname from a list of surnames. The record will look genuine but the person’s surname will have been changed from their real surname to another surname from the list.
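The call-duration example of a filtering operation can be sketched in Python as follows. The record layout and the field name `duration` are assumptions made for illustration.

```python
from typing import Dict, List


def drop_unsuccessful_calls(
    records: List[Dict[str, str]], duration_field: str = "duration"
) -> List[Dict[str, str]]:
    """Filtering: keep only the records whose call duration is greater than 0."""
    return [r for r in records if int(r[duration_field]) > 0]
```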

In one embodiment of the invention there is provided a method of processing a file in which the field replacement operation comprises random field replacement. In this way, the records will be altered significantly and it will not be possible to identify common records from an individual.

In one embodiment of the invention there is provided a method of processing a file in which the field replacement operation comprises repeatable field replacement. This is seen as a useful embodiment of the invention. In this way, the value in the field will be replaced by the same value each time. For example, the surname Smith may be replaced with the surname Jones each time the name Smith is encountered in the file. This will enable tracking of multiple records in the file from the individual users.
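The difference between random and repeatable field replacement can be sketched as below. This is one possible approach rather than the method of the specification: the replacement list `SURNAMES` is hypothetical, and deriving the list index from a keyed hash of the original value is an assumption of the example.

```python
import hashlib
import hmac
import random
from typing import List

SURNAMES = ["Jones", "Murphy", "Byrne", "Walsh"]  # hypothetical replacement list


def replace_random(value: str, names: List[str], rng: random.Random) -> str:
    """Random replacement: the same input may receive a different name each time."""
    return rng.choice(names)


def replace_repeatable(value: str, names: List[str], secret_key: bytes) -> str:
    """Repeatable replacement: the same input always maps to the same name."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(names)
    return names[index]
```

With a short replacement list, two different original surnames can map to the same replacement, and a surname already in the list can map to itself; a longer list reduces both effects.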

In one embodiment of the invention there is provided a method of processing a file in which the field replacement operation utilizes a pre-defined list for field replacement. This is seen as an efficient and effective way of performing the field replacement.

In one embodiment of the invention there is provided a method of processing a file in which the field replacement operation utilizes statistical accuracy for field replacement. This is seen as a particularly preferred embodiment of the present invention and will ensure that the processed data does not look out of line with an expected data set. By using statistical accuracy, a relatively common name such as Smith will be replaced by another relatively common name such as Jones rather than an uncommon name. If there were noticeably larger instances of the uncommon name, this in itself could be a clear indicator that the name had been changed from a more common name. This could lead to reverse engineering of the data in a data breach and in this way, the method obviates the dangers of such a data breach.
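One possible reading of statistically accurate replacement is to draw replacement names with a probability proportional to how common they are in a reference population, so that rare names stay rare in the processed file. The Python sketch below illustrates that reading only; the frequency table and the function name are hypothetical and the figures are invented for the example.

```python
import random
from typing import Dict

# Hypothetical reference frequencies (relative counts, not real statistics).
SURNAME_FREQUENCY = {"Jones": 90, "Murphy": 80, "Byrne": 60, "Featherstonehaugh": 1}


def replace_by_frequency(frequency: Dict[str, int], rng: random.Random) -> str:
    """Pick a replacement name, weighting each name by its reference frequency."""
    names = list(frequency)
    weights = list(frequency.values())
    return rng.choices(names, weights=weights, k=1)[0]
```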

In one embodiment of the invention there is provided a method of processing a file in which the field replacement operation comprises a plurality of cases for the field replacement step. This is seen as useful and by having multiple cases, it is possible for the different rules to be applied to the data being replaced. For example, first names in a first name field may be selected based on gender.
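A replacement with several cases might look like the following sketch, in which the value of a gender field selects the list that the replacement first name is drawn from. The lists, the gender codes `F` and `M`, and the fallback list are assumptions made for the example.

```python
import random
from typing import Dict, List

# Hypothetical replacement lists, one per case.
FIRST_NAMES_BY_GENDER: Dict[str, List[str]] = {
    "F": ["Mary", "Anne"],
    "M": ["John", "Paul"],
}
FALLBACK_NAMES: List[str] = ["Alex", "Sam"]  # used when no case matches


def replace_first_name(gender: str, rng: random.Random) -> str:
    """Choose a replacement first name from the list that matches the case."""
    candidates = FIRST_NAMES_BY_GENDER.get(gender, FALLBACK_NAMES)
    return rng.choice(candidates)
```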

In one embodiment of the invention there is provided a computer program product having program instructions stored thereon that when loaded onto a computer cause the computer to carry out one or more of the steps of the method.

In one embodiment of the invention there is provided a managed file transfer system comprising:

a plurality of host connectors;

a plurality of managed file transfer links for transfer of a file from one of the host connectors to or from another of the host connectors;

at least one user computer; and a managed file transfer link establishment module to enable a user to create a managed file transfer link and to specify for the managed file transfer link: a source location in one of the host connectors; a destination location in one of the host connectors; the file to be transferred between the source location and the destination location; and the file transfer schedule;

and in which the managed file transfer link establishment module comprises a file anonymization/pseudonymization module integral therewith operable to create a workflow for the file including one or more steps, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields of the file thereby anonymizing or pseudonymizing the data in one or more fields of the file prior to the file being transferred in accordance with the file transfer schedule.

By having such a system, files may be anonymized and transferred from one host connector to another host connector with ease. Importantly, some or all of the data in the files can be anonymized or pseudonymized to obviate the possibility of a data breach. By having the file anonymization/pseudonymization module integral with the managed file transfer link establishment module, the anonymization/pseudonymization may form part of the establishment of the transfer link itself and may be used to prevent the establishment of a transfer link until appropriate anonymization/pseudonymization steps have been established.

In one embodiment of the invention there is provided a managed file transfer system in which there is provided a managed file transfer hub and in which the user computer is in communication with the managed file transfer hub.

In one embodiment of the invention there is provided a managed file transfer system in which there is provided a secure memory and in which the managed file transfer link and the workflow associated therewith is stored in secure memory.

Detailed Description of the Invention

The invention will now be more clearly understood from the following description of some embodiments thereof given by way of example only with reference to the accompanying drawings, in which:

Figure 1 is a diagrammatic representation of a managed file transfer network in which the method according to the invention may be performed;

Figure 2 is a diagrammatic representation of a system in which the method can operate;

Figure 3 is a view of a user interface demonstrating a file that has been partially anonymized and/or pseudonymized;

Figure 4 is a view of an editable form for the creation of a workflow for a managed file transfer;

Figure 5 is a view of an editable form for the input of file attributes;

Figure 6 is a diagrammatic representation of the steps and actions of a workflow;

Figures 7 to 20 are various screenshots illustrating the steps of the creation of a workflow of the method according to the invention; and

Figure 21 is a diagram illustrating the operation of statistical distribution in replacement operations.

Referring to Figure 1, there is shown a diagrammatic representation of a managed file transfer network, indicated generally by the reference numeral 10, in which the method according to the invention may also be performed. The system for the transfer of data comprises a plurality of host connectors 11(a)-11(g) and a managed data transfer hub 12 through which data being transferred from one of the host connectors to another host connector is controlled. There is further shown a pair of laptops 13, 14 and a pair of PCs 15, 16. Finally, there are provided a plurality of managed data transfer links 17(a)-17(g) for transfer of data from one of the host connectors 11(a)-11(g) to another of the host connectors. Typically, the laptops 13, 14 and the PCs 15, 16 are operated by staff in an organisation that are involved in the process of the establishment of the managed data transfer links. For example, PCs 15, 16 may be operated by IT personnel in the organisation. Laptop 13 may be operated by an IT manager in the organisation and laptop 14 may be operated by a Data Protection Officer (DPO) in the organisation. Each of these will have access to the managed data transfer hub 12. In the embodiment shown, the host connectors 11(a)-11(g) are locations where data is stored. By way of example, the host connectors 11(a)-11(g) may be servers, databases or a combination of servers and databases (i.e. some are servers, some are databases). The host connectors 11(a)-11(g) may be in one or more locations in one or more jurisdictions.

Referring now to Figure 2, there is shown a diagrammatic representation of another system in which the method according to the invention may be performed, indicated generally by the reference numeral 20. The system 20 comprises a data source which may comprise a plurality of files 21 or a database 22. A pair of application servers 23 and 24 are used to process jobs 25. More or less than two application servers could be provided if desired. A database 26 contains the configuration data and a web interface accessible by a user is used to manage the configuration data in the database 26, to manage workflows, and to control the data that is to be brought into the application servers 23, 24. Once the data file has been processed, it is stored in the data destination which for simplicity shows a plurality of processed files 29(a) and a database 29(b) with processed records.

Referring now to Figure 3, there is shown a user interface, indicated generally by the reference numeral 30, illustrating the content of at least a portion of an original file 31 and a preview of a portion of a partially anonymized/pseudonymized processed file 33. It can be seen that the original file comprises data in a plurality of fields 35(a)-35(m) including First Name field 35(a), Last Name field 35(b), Address1 field 35(c), Address2 field 35(d), Town field 35(e), County field 35(f), Postcode field 35(g), Country field 35(h), Gender field 35(i), Date Of Birth field 35(j), Marital Status field 35(k), Length of Service field 35(l) and Current Salary field 35(m). The file shown may for example be taken from a human resources management (HRM) database of a company.

It may be desirable to transport a representation of this file to a remote location. For example, it may be desirable to share a representation of this file with a recruitment agency to provide the recruitment agency with a profile of the current team members that they are attempting to complement with new hires. The sender company however would like to keep private some details of their employees’ records that they deem are not relevant to the recruitment agency and indeed may not wish to provide the recruitment agency with details that may allow the recruitment agency to identify specific individuals with ease. A workflow having a plurality of steps is created and the original file is processed in accordance with the workflow.

The result of the processing is the processed file with anonymized/pseudonymized data, a preview 33 of which is shown at the bottom of the user interface 30. In this simplistic example, the data has been pseudonymized by modifying the last name in the last name field 35(b) of each of the records in the file by replacing the name in the last name field 35(b) with another name. Furthermore, the date of birth in the date of birth field 35(j) of each of the records has been replaced with a hashed value and the marital status of each of the records in the marital status field 35(k) has been blanked. In the embodiment shown, the Current Salary field 35(m) is not shown but it has not been deleted, it is simply “off-screen” due to the increase in size of the date of birth field. The output file attributes and in particular the number of fields of data in each record is maintained through the processing stage.

The various setup steps that are used to get to the stage of having the preview of the processed file with the original file on the user interface will now be explained in more detail. Referring first of all to Figure 4, there is shown a view of an editable form for the creation of a workflow for a managed file transfer, indicated generally by the reference numeral 40. The editable form comprises a plurality of editable fields 41(a)-41(m) for population of data as well as a save button 43 and a cancel button 45. The editable fields include a workflow name field 41(a), workflow status field 41(b), a source folder field 41(c), a destination folder field 41(d), an error folder field 41(e), a file selection criteria field 41(f), an instances field 41(g), a batch size field 41(h), a scheduling field 41(i), an active from field 41(j), an active to field 41(k), an attachments tickbox field 41(l), and a description field 41(m).

When creating the workflow for the first time, the user inserts a workflow name in the workflow name field 41(a). There is only one workflow associated with each file. The current status of the workflow such as active, expired or pending authorisation is inserted into the workflow status field 41(b). The workflow form shown is specifically for a managed file transfer and the location of the file to be processed is entered into a source folder field 41(c) and the location of the output, i.e. the location of the processed file is inserted into a destination folder field 41(d). An error folder field 41(e) indicates where files that are not suitable for onward use or transmission are stored for review. A file selection criteria field 41(f) is provided to outline the attributes of the files along with an instances field 41(g). A batch size field 41(h) provides an indication of the size of each batch of records that is brought in for processing each time (it will be understood that the files are suitable for parallel processing simultaneously across a number of processors). A scheduling field 41(i) provides the frequency with which the file is to be processed (e.g. if this is a write to external memory this may be done each day, several times a day or indeed upon a fixed number of records being amended or added). An active from field 41(j) provides the date that the workflow is set up and an active to field 41(k) provides the date that the workflow will be used until. An attachments tickbox field 41(l) is provided. Finally, a description field 41(m) is provided to give a plain language description of why the workflow has been set up and/or what it is intended for and/or who the processed file is intended for.

Referring now to Figure 5, there is shown a view of an editable form 50 for the input and setting of file attributes. Each file will have its own set of attributes. The present invention is intended for use with data arranged in a structured format. For example, the present invention is intended for use with data presented in an Excel (registered trade mark ®) spreadsheet format, .csv file format, database table format, pipe separated, delimited format and the like so that the different fields can be identified and manipulated with ease. The form 50 comprises a data type field 51(a), in this case delimited or fixed width. If delimited is chosen, the type of delimiter (e.g. comma delimiter) is entered in field 51(b). If the file contains one or more rows of headers, it will be important that these are not subjected to processing and there is a start import at row field 51(c). If the file contains a trailer, this will be indicated by clicking on checkbox 51(d). Finally, a representation of the fixed width file is shown in box 53. In the example shown, the file is a fixed width file and there are arrows 55 showing the borders between the adjacent fields. These arrows and by extension the nature of the fields and the field widths may be modified by inserting arrows, deleting arrows or dragging arrows across the box 53.
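The file attributes captured by the form of Figure 5 are enough to split a file into records and fields. The Python sketch below shows one way this could be done; the function name and parameter names are hypothetical, `boundaries` stands in for the character positions of the arrows 55, and the simple `split` used for delimited files does not handle quoted delimiters.

```python
from typing import List, Optional, Sequence


def parse_records(
    lines: Sequence[str],
    delimiter: Optional[str] = None,
    boundaries: Sequence[int] = (),
    start_row: int = 1,
    has_trailer: bool = False,
) -> List[List[str]]:
    """Split the body of a structured file into lists of field values.

    delimiter   -- field separator for delimited files; None means fixed width
    boundaries  -- character positions of the field borders for fixed width files
    start_row   -- 1-based row at which to start importing (skips header rows)
    has_trailer -- True if the last row is a trailer that must not be processed
    """
    body = list(lines[start_row - 1:])
    if has_trailer and body:
        body = body[:-1]
    records = []
    for line in body:
        if delimiter is not None:
            fields = line.split(delimiter)
        else:
            cuts = [0, *boundaries, len(line)]
            fields = [line[a:b].strip() for a, b in zip(cuts, cuts[1:])]
        records.append(fields)
    return records
```

Moving an arrow in box 53 would correspond to changing one entry in `boundaries`, and inserting or deleting an arrow to adding or removing an entry.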

Referring now to Figure 6, there is shown a diagrammatic representation of the steps and actions of a workflow in list 60. The list 60 comprises 5 steps, each of which has an entry 61, 63, 65, 67 and 69 in the list 60. Step 1 entry 61 indicates the action, in this case masking, the field that the action is to be carried out on, in this case field 4, and a description of the action being taken, in this case masking of the last four digits of a telephone number. Step 2 entry 63 indicates that a blanking action will be taken on field 8 to remove the bank account number from the file record. Step 3 entry 65 also indicates that there will be a blanking operation, in this instance on field 24 of the file record, to remove the PPS number from that file record. Step 4 entry 67 indicates a rounding operation on field 11 where the person's salary in that file record will be rounded down to the nearest figure divisible by ten thousand, resulting in an integer. Step 5 entry 69 proposes to replace the name in field 2 with a name from a list of alternative names. More or fewer steps could be provided and the order of the steps can be adjusted if desired. An add step button 62 is provided to allow a step to be added to the list. If desired, there may be a step for each field in the record. Masking, blanking, rounding and replacement are all anonymization/pseudonymization techniques that will be described in further detail below. Essentially, these are techniques that alter the data in a field in some way so that the data is anonymized/pseudonymized relative to its original value.
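Purely as an illustrative sketch, an ordered workflow such as list 60 could be modelled as a sequence of steps applied in turn to each record. The Step class and the run_workflow function are hypothetical names introduced here for illustration; the specification does not prescribe any particular data structure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    field: int                      # 1-based field number, as in list 60
    action: Callable[[str], str]    # the anonymization/pseudonymization action
    description: str = ""


def run_workflow(record: List[str], steps: List[Step]) -> List[str]:
    """Apply each step in order and return a new, processed record."""
    out = list(record)              # leave the original record untouched
    for step in steps:
        out[step.field - 1] = step.action(out[step.field - 1])
    return out


# Two hypothetical steps: mask the last four digits of field 1, blank field 2.
steps = [
    Step(1, lambda v: v[:-4] + "****", "mask last four digits of phone number"),
    Step(2, lambda v: "", "blank bank account number"),
]
processed = run_workflow(["0871234567", "12345678"], steps)
```

Because the steps are held in a list, reordering them or adding a step (as with the add step button 62) only changes the list contents.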

Referring now to Figures 7 to 20 inclusive, there is shown a number of screen shots of the user interface used in the creation of an anonymization job for a managed file transfer application or simply to anonymize a file. In Figure 7, a general screen 70 is provided with a plurality of fields 71(a)-71(h), a save button 73 and a cancel button 75. The fields 71(a)-71(h) comprise a name field 71(a) where the name of the workflow can be inserted, a workflow step field 71(b) where a new or existing workflow step can be chosen for creation or modification, a scheduler group field 71(c), an instances field 71(d), a manage instances field 71(e), an active from field 71(f), an active to field 71(g) and a description field 71(h). The user can insert the relevant data into the fields 71(a)-71(h) in much the same way as described above in relation to Figure 4 before saving or cancelling their changes using buttons 73, 75 respectively. Once the user has populated the fields shown in Figure 7, they progress to the next screen by clicking on arrow 77. They may return to a previous screen by clicking on arrow 79. If the user clicks on arrow 77, they will progress onto the user interface screen 80 shown in Figure 8.

Referring to Figure 8, there is shown a file selection screen 80. In the file selection screen, there are a number of fields 81(a)-81(f) including source host connector field 81(a) which provides the address of where the file to be processed resides, destination host connector field 81(b) which provides the address of where the processed file is to be delivered, search pattern field 81(c), batch size field 81(d), and run every fields 81(e) and 81(f) which determine the frequency with which the file will be processed and transferred to the destination host connector. A confirm button 83 and a cancel button 85 are also provided to save or discard any changes that are made.

Referring now to Figure 9, there is shown a first file setup screen 90. The file setup screen comprises a plurality of fields 91(a)-91(d) used to indicate the type of file that will be processed including file type field 91(a), the delimiter field 91(b), the contains header tickbox field 91(c) and the start import at row field 91(d). There is further provided an upload sample data button 93 along with a save button 95 and a cancel button 97. Referring to Figure 10, there is shown the second file setup screen 100, where like parts have been given the same reference numeral as before. In this instance, the upload sample data button 93 was depressed in the known manner and a preview of the data to be processed is illustrated in box 101.

Referring now to Figures 11 to 19 inclusive, once the file has been selected for the application of the workflow, the workflow steps are then created for one or more of the fields of data of the original data record 101. In Figure 11, there is shown a data anonymization setup screen 110. The screen comprises an Add Step button 111. If the Add Step button 111 is depressed, a drop down menu 113 is presented with a number of anonymization steps including blanking, filtering, hashing, masking, numbers and replacement. The user scrolls up or down the menu in the known manner until a pointer or cursor is on the correct option and then the user selects the option from the list. Other menus, tickboxes or forms will then be presented for population by the user including a query concerning the field that the anonymization step is to apply to and a description of the anonymization step being performed. When the user is satisfied with the field, they will click on a save button 115 or a cancel button 117 if they do not wish to save the changes.

Referring now to Figure 12, this screen 120 is shown if the user has selected the filtering option from the drop down menu 113 in Figure 11. The screen 120 comprises a plurality of fields 121(a)-121(e) including method field 121(a) indicating that it is a filtering method being used in this step, a Field field 121(b) indicating the field that the action is to modify, in this case a first name field, a pair of filter fields 121(c), 121(d) and a Rule field 121(e) that set the parameters for the field. In other words, these fields determine when and how the rule is to be applied. A save button 123 and a close button 125 are provided to save or cancel any changes.

By setting up filtering rules, records can be filtered out of the file. For example, a user may want to filter out all records from a telecoms switch file where the Duration field is zero, or filter the file so that just roaming records are left in the file. Applying filtering rules will allow the user to send a third party only the information from a file that is relevant to them, thereby effectively anonymizing the data that is irrelevant. Filtering will remove records from the file and can be used in conjunction with the other anonymization actions so that records that remain in the file can have their fields anonymized. Ideally, for performance, filtering should occur before anonymization of the remaining fields.
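As a non-limiting sketch of the two filtering examples given above, a rule can be expressed as a predicate that decides whether a record is kept. The function filter_records and the dictionary keys used in the sample records are hypothetical and are chosen here for illustration only.

```python
def filter_records(records, keep):
    """Return only the records for which the rule `keep` is true."""
    return [record for record in records if keep(record)]


# Hypothetical telecoms switch records.
calls = [
    {"number": "0871234567", "duration": 0, "roaming": False},
    {"number": "0867654321", "duration": 95, "roaming": True},
    {"number": "0851112223", "duration": 12, "roaming": False},
]

# Filter out all records where the Duration field is zero.
non_zero = filter_records(calls, lambda r: r["duration"] != 0)

# Leave just the roaming records in the file.
roaming_only = filter_records(calls, lambda r: r["roaming"])
```

Running the filter first means that any subsequent anonymization steps only touch the records that will actually be sent.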

Referring now to Figure 13, this screen 130 is shown if the user has selected the blanking option from the drop down menu 113 in Figure 11. The purpose of the blanking action is to fully remove the contents of a field. For character delimited fields, this will result in an empty field being output. For fixed width fields, the resultant field will be padded with spaces to the same width as the original field. No additional parameters, other than the Action and Field Number, are required for blanking. The screen 130 comprises a pair of fields 131(a) and 131(b) including method field 131(a) indicating that it is a blanking method being used in this step, and a Field field 131(b) indicating the field that the action is to modify, in this case the marital status field. A save button 133 and a close button 135 are provided to save or discard changes.
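The difference between blanking a delimited field and blanking a fixed width field may be sketched as follows. The function name blank_field and its fixed_width flag are assumptions made for this illustration.

```python
def blank_field(value, fixed_width=False):
    """Remove the contents of a field.

    Delimited fields become empty; fixed-width fields are padded with
    spaces to the width of the original field.
    """
    return " " * len(value) if fixed_width else ""


delimited_result = blank_field("Married")
fixed_result = blank_field("Married", fixed_width=True)
```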

Referring now to Figure 14, this screen 140 is shown if the user has selected the hashing option from the drop down menu 113 in Figure 11. The purpose of the hashing action is to replace the contents of the field with a hashed representation of that string. In this way, fields that are subject to hashing will be changed to the same output value if they have the same input value, and therefore this action is repeatable (e.g. if the user wishes to remove identifying phone numbers from the file, but wants to be able to use the resultant file for pattern analysis of particular individual numbers, the hashing action should be used). The hashing algorithm used should be secure and non-resolvable, i.e. it should not be feasible to recover the original value from the hashed value. No additional parameters will be required from the user for the hashing action. The screen 140 comprises a pair of fields 141(a) and 141(b) including method field 141(a) indicating that it is a hashing method being used in this step, and a Field field 141(b) indicating the field that the action is to modify, in this case the postcode field. A save button 143 and a close button 145 are provided to save or discard changes.
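The specification does not name a hashing algorithm. One possible approach, offered here only as an assumption, is a keyed SHA-256 digest (HMAC) computed with the Python standard library, where the key is a secret held by the system rather than a parameter supplied by the user. The function name hash_field and the key value are hypothetical.

```python
import hashlib
import hmac


def hash_field(value: str, key: bytes) -> str:
    """Return a keyed SHA-256 digest of the field as a hex string.

    The same input and key always give the same output, so the action
    is repeatable across records and across files.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


system_key = b"example-secret-key"   # hypothetical system-held secret
first = hash_field("0871234567", system_key)
second = hash_field("0871234567", system_key)
other = hash_field("0867654321", system_key)
```

A keyed digest is suggested because short, guessable values such as phone numbers could otherwise be recovered by hashing every candidate value; this is a design choice of the sketch and not a requirement stated in the specification.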

Referring now to Figure 15, this screen 150 is shown if the user has selected the masking option from the drop down menu 113 in Figure 11. The screen 150 comprises a plurality of fields 151(a)-151(f) including method field 151(a) indicating that it is a masking method being used in this step, a Field field 151(b) indicating the field that the action is to modify, in this case a date of birth field, a replacement character field 151(c) indicating the character that will be used to mask the data, a start position field 151(d) indicating where in the field the masking is to begin (defaults to the start of the field), a specify replacement length tickbox field 151(e) that if ticked will allow the user to specify the length of string to be replaced, and a truncate leading characters tickbox 151(f) that if ticked will allow the user to truncate leading characters in the field. A save button 153 and a close button 155 are provided to save or cancel any changes.

The purpose of the Masking action is to replace a set of characters in a field. The set of characters can be replaced with a defined character either for each character in the set or a single character for the whole set. In addition to the Action and Field Number parameters, the following should also be supplied: (i) Position of first character to be masked in the string (default to the start of the string); (ii) Position of the last character in the section of the string to be masked (default to the last character in the string); (iii) Character to be used for masking. This may be left blank to replace the characters with nulls. In the case of fixed width file format, replacing character strings with null will result in space padding at the end of the field, rather than replacing the characters with spaces, where the character string does not extend to the end of the field; (iv) Whether the masking should be truncated to a single character to replace the entire masked string, or whether the masking should replace each character of the masked string with the masking character. In the case of fixed width files, if the truncated option is chosen the resultant field should be padded with spaces to the size of the original field to preserve the overall file format. A possible requirement may be to mask the day and month portions of a date based on the supplied date format.
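The masking parameters (i) to (iv) may be sketched, under stated assumptions, as follows. The function name mask_field and its parameter names are hypothetical; positions are taken to be 1-based and inclusive, and a blank masking character is modelled as an empty string.

```python
def mask_field(value, start=1, end=None, mask_char="*",
               truncate=False, fixed_width=False):
    """Mask characters start..end (1-based, inclusive) of a field.

    truncate=True replaces the whole masked section with a single
    masking character. fixed_width=True pads the result with trailing
    spaces to the width of the original field.
    """
    end = len(value) if end is None else end
    head, middle, tail = value[:start - 1], value[start - 1:end], value[end:]
    if truncate:
        replaced = mask_char if middle else ""
    else:
        replaced = mask_char * len(middle)
    result = head + replaced + tail
    if fixed_width:
        result = result.ljust(len(value))
    return result


phone = mask_field("0871234567", start=7)                      # last four digits
date = mask_field("1985-07-23", start=6, truncate=True, fixed_width=True)
nulled = mask_field("ABCDEF", start=2, end=4, mask_char="", fixed_width=True)
```

In the third call the masked characters are removed rather than replaced with spaces, and the padding appears at the end of the field, in keeping with the fixed width behaviour described above.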

Referring now specifically to Figures 16 to 19 inclusive, and initially to Figure 16, the screen 160 is shown if the user has selected the replacement option from the drop down menu 113 in Figure 11. The screen 160 comprises a plurality of fields 161(a)-161(e) including method field 161(a) indicating that it is a replacement method being used in this step, and a Field field 161(b) indicating the field that the action is to modify, in this case a first name field. It will be understood that screen 120 also demonstrated an action to be performed on the first name field. It will be understood that during operation, only one action will be performed on each field and this option is used simply for the purpose of demonstrating the modification to a field using the replacement option rather than a composite example to be combined with the other actions described above. The screen 160 further comprises a replacement type field 161(c) which comprises a drop down menu of different replacement type options, a look-up type field 161(d) with a pair of options, standard and conditional, and a Lookup field 161(e) with a drop down menu. A save button 163 and a close button 165 are provided to save or cancel any changes.

Referring now to Figure 17, where like parts have been given the same reference numeral as before, this screen 170 illustrates some of the options of the drop down list of the replacement type field 161(c) of Figure 16. The replacement field type options include random lookup replacement option 171(a), random lookup replacement with statistical distribution option 171(b), consistent lookup replacement option 171(c), and consistent lookup replacement with statistical distribution option 171(d). The user can scroll up and down the list in the known manner or select one of the options from the list using a pointing device.

Referring now to Figure 18, where like parts have been given the same reference numeral as before, this screen 180 illustrates some of the options from the drop down menu of the Lookup field 161(e) from screen 160 shown in Figure 16. The drop down menu comprises a plurality of options including US boys names, US girls names, UK boys names, UK girls names, Irish boys names, Irish girls names, Irish Towns and Cities, English Surnames and Irish Surnames. Each of these represents a predefined list of options for replacing the existing word with one of the entries on the predefined list.

Referring now to Figure 19, this screen 190 illustrates how one or more conditions 191 can be set up for a replacement step. For example, the operator may provide conditions that in the replacement of first name operation, if the gender field of the record indicates that the entry is male, then the replacement set to be used will be UK boys names. Otherwise, UK girls names will be used to replace the first name. Other conditions may be provided as well and it is not limited to a single condition. For example, it may be decided to use UK boys names and Irish girls names or to replace boys names in the original record for girls names in the processed record and replace girls names in the original record for boys names in the processed record. A save button 193 and a close button 195 are provided to save or cancel any changes. It will be understood from the foregoing how the order of steps in the anonymization could be relevant as one step may rely on the true data from another field and it is important that the step is completed before the other field is modified.

The various Replacement actions will replace the values in fields from pre-defined lists of other values. In all cases, it will be assumed that the field type of the pre-defined list is the same as the type of the field that is being replaced, although no system validation will be done on this (it will be up to the user to validate). Before going through the actual replacements, it is helpful to discuss the concept of (i) pre-defined lists and (ii) statistical accuracy.

(i) Pre-Defined Lists:

For any of the Replacement actions described herein, the system will use pre-defined lists to read replacement fields from. It will be assumed that these pre-defined lists can either come as standard lists packaged with the system, or can be added by the user. These lists can be made up of a set of any number of strings, and should be presented to the system to be uploaded ordered by the most frequently used (or rather the most frequent occurring in the files for which the replacement will take place).

The user should be able to edit lists previously uploaded to the system, although in the replacement cases where this will possibly return different replacement results to those encountered previously, the user should be warned before any edits are saved. Upper, lower and mixed case fields should all be converted to be in the same case as the predefined list. Where the pre-defined list is in upper case, if a surname field in a file is in mixed case then the name should be treated as if it is in upper case, and the resultant field will be in upper case.

In all of the Replacement Actions, it should be possible to have a choice of which predefined list to use based on data in other fields in the record. For example, it should be possible for the user to use the list Male First Names if the field representing Gender starts with M, Female First Names if it starts with F and Combined First Names for any other value. For each step, it should be possible to define multiple cases for one field (as in the Gender case), but also to check multiple fields. The complete replacement step will contain a number of sub steps, and the field will be replaced once the first sub step that the record satisfies is reached. There should always be a default sub step at the end that records which do not satisfy any of the previous sub steps fall into.
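The sub step logic described above, in which the first satisfied sub step determines the pre-defined list and a default sub step catches everything else, might be sketched as follows. The function choose_list, the record layout and the use of a "gender" key are hypothetical; the list names are those of the example above.

```python
def choose_list(record, sub_steps, default):
    """Return the list name of the first sub step the record satisfies.

    sub_steps is an ordered sequence of (condition, list_name) pairs;
    `default` is used when no condition is satisfied.
    """
    for condition, list_name in sub_steps:
        if condition(record):
            return list_name
    return default


sub_steps = [
    (lambda r: r["gender"].upper().startswith("M"), "Male First Names"),
    (lambda r: r["gender"].upper().startswith("F"), "Female First Names"),
]

male = choose_list({"gender": "Male"}, sub_steps, "Combined First Names")
female = choose_list({"gender": "f"}, sub_steps, "Combined First Names")
unknown = choose_list({"gender": ""}, sub_steps, "Combined First Names")
```

A condition may inspect any number of fields of the record, so checking multiple fields in one sub step requires no change to choose_list.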

(ii) Statistical Accuracy:

In both the Random and Repeatable replacements described here, the user should be able to select whether the replacement should be performed with statistical accuracy. This means that the replacement value will be a value that has a similar statistical chance of occurring as the value being replaced. To do this, the system will first check if the value in the field to be replaced occurs in the selected list. If it does, and n is the position of that value in the list, then the replacement value will be picked from the set of list values n-x to n+x where x is the statistical range to use, which can either be a whole number or a percentage. To illustrate this, we will use the example of British surnames. A pre-defined list is uploaded to the system containing a list of all surnames in a database and is ordered by popularity from most popular (Smith) to least popular. Assuming a statistical range of 5 (a whole number, not a percentage), this means that if you are replacing a name field in the file, the replacement name will be taken from the subset of the complete list that is within 5 names of the name to be replaced. This will be described in more detail with reference to Figure 21 below.
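A minimal sketch of the n-x to n+x selection, for a whole number range only, is given below. The functions candidates and random_replacement are hypothetical names, the list entries are placeholders rather than real surnames, and the sketch assumes that the original value is itself excluded from the candidate set and that a value not on the list may be replaced by any entry. A percentage range is not covered because the specification does not state what the percentage is taken of.

```python
import random


def candidates(names, original, x):
    """Return the entries within x places of `original` in an ordered list."""
    if original not in names:
        return list(names)            # not on the list: any entry may be used
    n = names.index(original)
    lo, hi = max(0, n - x), min(len(names), n + x + 1)
    return [name for name in names[lo:hi] if name != original]


def random_replacement(names, original, x, rng=random):
    """Pick one candidate at random; rng may be a seeded random.Random."""
    return rng.choice(candidates(names, original, x))


# Placeholder list of 20 entries ordered from most to least frequent.
ranked = [f"Name{i:02d}" for i in range(1, 21)]

near_top = candidates(ranked, "Name02", 5)     # one above, five below
mid_list = candidates(ranked, "Name14", 5)     # entries 9 to 19, less 14
pick = random_replacement(ranked, "Name14", 5, rng=random.Random(1))
```

The window is clipped at the ends of the list, so a value near the top simply has fewer candidates above it.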

Replacements may be either random or repeatable. With random replacements, the purpose of this action is to replace the value in a field with a random value from the selected pre-defined list. In addition to the Action and Field Number parameters, the user should be able to select the pre-defined list to use, and to apply field rules. In addition, a checkbox should exist to indicate if statistical accuracy is used. If this is selected, then the range is to also be supplied. With repeatable replacements, the purpose of this action is to allow for the replacement value for a particular field value to be the same each time the same field value is processed, even if the source file being processed is different. For example, two different files both contain a surname field. When the first file is processed and a surname field in one of the records in the file is replaced from the Surname pre-defined list, if the same surname value occurs either in the same file or in a different file that uses the same list for repeatable replacement, it will be expected that the same replacement value will be used. The replacement should be based on an algorithm applied to the original field in such a way that the algorithm will always return the same value for that field, and the replacement value should be based on the result of that algorithm. As above, a checkbox should exist to indicate if statistical accuracy is used. If this is selected, then the range is to also be supplied.
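One way in which a repeatable replacement could be derived from the original field value is to hash the value and use the result as an index into the pre-defined list. This is only an assumed realisation of "an algorithm applied to the original field"; the function name repeatable_replacement and the sample list are hypothetical, and the statistical accuracy option is omitted for brevity.

```python
import hashlib


def repeatable_replacement(value, names):
    """Map a field value to the same list entry every time it is seen."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest, "big") % len(names)
    return names[index]


surnames = ["Alpha", "Bravo", "Charlie", "Delta", "Echo"]   # placeholder list

from_first_file = repeatable_replacement("Murphy", surnames)
from_second_file = repeatable_replacement("Murphy", surnames)
```

Because the index depends on the length and order of the list, editing the list can change the replacement returned for a given value, which is consistent with warning the user before list edits are saved.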

Referring now to Figure 20, this screen 200 is shown if the user has selected the numbers anonymization step option from the drop down menu 113 in Figure 11. The screen 200 comprises a plurality of fields 201(a)-201(g) including method field 201(a) indicating that it is a numbers anonymization method being used in this step, a Field field 201(b) indicating the field that the action is to modify, in this case a Current Salary field, and a numeric method field 201(c) with a drop down menu of different numeric operations, in this case rounding is chosen. The screen 200 also comprises a round to nearest value field 201(d), in this case 10,000, a rounding option field 201(e), in this case a standard rounding option, a minimum value field 201(f) and a maximum value field 201(g) that set the parameters for the action. A save button 203 and a close button 205 are provided to save or cancel any changes.

There will be a few different actions which cover the anonymization of number fields in the file. These can be applied to fields that hold either whole numbers or decimals. These include: (i) Number Variance; (ii) Number Rounding; and (iii) Number Banding; each of which will be described in more detail below:

(i) Number Variance:

The purpose of this action is to alter numbers to a random number that is within either a whole number (e.g. +/- 100) or percentage variance of the original number in the file (e.g. +/- 10%). The user should be able to define a maximum and minimum value that the resulting number should fall between. If the resulting number falls outside the limits it should default to the maximum or minimum. The parameters required for this action are as follows: (i) +, - or +/-, i.e. to always add the random value, always subtract the random value or to either add or subtract the random value; (ii) the lower and upper limits, defaulted to 0 and infinity respectively; (iii) whether the variance range is a whole number or a percentage of the field value; and (iv) the variance range value.
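The four parameters of the Number Variance action may be sketched as follows. The function vary_number and its parameter names are hypothetical, and drawing the random amount from a uniform distribution is an assumption, since the specification does not state a distribution.

```python
import random


def vary_number(value, spread, percent=False, direction="+/-",
                minimum=0.0, maximum=float("inf"), rng=random):
    """Shift `value` by a random amount no larger than the variance range.

    spread is a whole-number amount, or a percentage of the field value
    when percent=True. direction is "+", "-" or "+/-". Results outside
    the limits default to the minimum or maximum.
    """
    limit = abs(value) * spread / 100 if percent else spread
    delta = rng.uniform(0, limit)
    if direction == "-" or (direction == "+/-" and rng.random() < 0.5):
        delta = -delta
    return min(max(value + delta, minimum), maximum)


rng = random.Random(0)                       # seeded for a repeatable demo
whole = vary_number(1000, 100, rng=rng)                              # +/- 100
pct = vary_number(200, 10, percent=True, direction="+", rng=rng)     # + 10%
capped = vary_number(1000, 100, direction="+", maximum=950, rng=rng)
```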

(ii) Number Rounding:

The purpose of the Number Rounding Action is to allow number fields to be rounded up or down to a fixed multiple. For example, for ages the user may want to round down to the nearest 10 so that all ages will be changed to the decade (e.g. anybody in their 20’s will be changed to 20) or salaries may be rounded down to the nearest 10000. As well as rounding, the user should be able to define maximum and minimum ranges. Any rounded figures outside this range will default to either the defined maximum or minimum (so that outliers cannot be identified). The format of the original number should be preserved, so if the number is read in with decimal places, the rounded number should also have decimal places, even though the minimum rounding will be to the nearest 1 (in which case the decimal will always be .00). The parameters required for this action are as follows: (i) the ability to round up, down or to the nearest whole number; (ii) the multiple to be rounded to; and (iii) the number of decimal places in the result.
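A minimal sketch of the Number Rounding action, including the maximum and minimum limits and the preservation of decimal places, is set out below. The function round_number and its parameter names are hypothetical, and "nearest" is assumed here to round halves upwards.

```python
import math


def round_number(value, multiple, mode="nearest",
                 minimum=None, maximum=None, decimals=0):
    """Round `value` to a multiple and return it as formatted text.

    mode is "up", "down" or "nearest". Results outside the optional
    limits default to the limit, so that outliers cannot be identified.
    """
    quotient = value / multiple
    if mode == "down":
        steps = math.floor(quotient)
    elif mode == "up":
        steps = math.ceil(quotient)
    else:
        steps = math.floor(quotient + 0.5)   # assumption: halves round up
    result = steps * multiple
    if minimum is not None:
        result = max(result, minimum)
    if maximum is not None:
        result = min(result, maximum)
    return f"{result:.{decimals}f}"


age = round_number(27, 10, mode="down")
salary = round_number(43750.50, 10000, mode="down", decimals=2)
outlier = round_number(250000, 10000, maximum=100000)
```

Returning formatted text rather than a number is a choice made in this sketch so that a value read in with two decimal places is written out with two decimal places.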

(iii) Number Banding:

This is not a requirement, but is added here for discussion. The system should have the ability to place numbers into bands, and the user should be able to define bands. For instance, where the user wants age fields banded into particular demographics.
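Were banding to be provided, user-defined bands could be represented by an ordered list of lower boundaries and a matching list of labels, as in the following sketch. The function band and the particular age bands shown are hypothetical examples only.

```python
import bisect


def band(value, edges, labels):
    """Return the label of the band that `value` falls into.

    edges are the ascending lower boundaries of every band after the
    first, so len(labels) must equal len(edges) + 1.
    """
    return labels[bisect.bisect_right(edges, value)]


age_edges = [18, 35, 65]
age_labels = ["0-17", "18-34", "35-64", "65+"]

child = band(17, age_edges, age_labels)
adult = band(18, age_edges, age_labels)
senior = band(70, age_edges, age_labels)
```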

Referring now to Figure 21, in order to illustrate the operation of statistical accuracy, there is shown a table 210 of the 20 most common British Surnames. When replacing the data from one file with another, it may be desirable to replace the data entry with a data value of similar likelihood. In other words, if a name is quite common, it is preferable to replace that name with another, different common name rather than an obscure name. If the common name was routinely replaced with an obscure name this would result in a disproportionate number of obscure names in the processed file, which in itself might be indicative that the obscure name is simply a replacement for a very common name. By using statistical distribution, a common name will be replaced by a correspondingly common name from the list. The user can set the parameter of how close in the list to the original name the replacement name has to be. The user can also set other parameters for selecting a suitable replacement name from the list.

For example, the user may set the parameter that the replacement name should be within 5 places up or down the list from the original name. If the name is not shown on the list, a name from the list may be taken at random. In the drawing, the four example names are Wright, Thomas, Jones and Fox. Wright appears at no. 14 on the list and therefore any of the surnames between no. 9 - Wilson, and no. 19 - Green, may be used as a replacement for Wright. Jones appears at no. 2 on the list so there is only one entry above it and 5 entries below it on the list that can be used. Fox is not on the list and therefore any name can be chosen at random from the list. Finally, Thomas is at no. 8 on the list. Typically, this would include any entry from entry 3 to 7 (the five entries above it on the list) to 9 to 14 (the five entries below it on the list). However, entry 14 is Thompson and there may be an additional rule in place that the first character or number of characters cannot be the same as the original name being replaced and therefore only entries 3 to 13 can be used to replace Thomas.

Throughout this specification, reference has been made to anonymization and pseudonymization. Anonymization will be understood to mean when the information is rendered anonymous, such that the data subject is no longer identifiable. If personal data has been anonymized, it is no longer deemed personal data and is therefore believed to be not subject to the GDPR. Pseudonymization will be understood to mean when the information is rendered neither anonymous nor directly identifying. This is a process that enables personal identifiers to be replaced by pseudonyms, thereby enabling businesses to use this data for legitimate business purposes other than those for which it was originally intended (e.g. for conducting medical research, statistical analysis and testing purposes), whilst protecting the anonymity of the data subject/particular individual. While the personal data is still deemed identifiable once it has been pseudonymized, the level of difficulty in the re-identification may be taken into consideration in the case of a data breach.

It will be understood that various parts of the present invention are performed in hardware and other parts of the invention may be performed either in hardware and/or software. It will be understood that the method steps and various components of the present invention will be performed largely in software and therefore the present invention extends also to computer programs, on or in a carrier, comprising program instructions for causing a computer or a processor to carry out steps of the method or provide functional components for carrying out those steps. The computer program may be in source code format, object code format or a format intermediate source code and object code. The computer program may be stored on or in a carrier, in other words a computer program product, including any computer readable medium, including but not limited to a floppy disc, a CD, a DVD, a memory stick, a tape, a RAM, a ROM, a PROM, an EPROM or a hardware circuit. In certain circumstances, a transmissible carrier such as a carrier signal when transmitted either wirelessly and/or through wire and/or cable could carry the computer program in which cases the wire and/or cable constitute the carrier.

It will be further understood that the present invention may be performed on two, three or more devices with certain parts of the invention being performed by one device and other parts of the invention being performed by another device. The devices may be connected together over a communications network. The present invention and claims are intended to also cover those instances where the system is operated across two or more devices or pieces of apparatus located in one or more locations.

In the embodiments described above the method has been described in terms of a Managed File Transfer system. However, it will be understood that the present invention may be operated in the manner of a managed data transfer system rather than a managed file transfer system per se and the present invention also relates to managed data transfer systems in which one or more fields of data in the data to be transferred may be anonymized and/or pseudonymized.

In this specification the terms “comprise, comprises, comprised and comprising” and the terms “include, includes, included and including” are all deemed totally interchangeable and should be afforded the widest possible interpretation.

The invention is not limited to the embodiments hereinbefore described but may be varied in both construction and detail within the scope of the appended claims.

Claims

(1) A method of transferring a file between a first host connector and a second host connector, the method comprising the steps of:

specifying a source location in the first host connector;

specifying a destination location in the second host connector;

specifying the file to be transferred between the source location in the first host connector and the destination location in the second host connector;

specifying a file transfer schedule;

characterised in that, the method comprises the step of processing the file prior to the transfer of the file, the file having a plurality of records containing data in a plurality of fields arranged in a structured format, the processing steps comprising:

processing the file in accordance with a workflow, the workflow comprising one or more steps, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields, thereby anonymizing or pseudonymizing the data in one or more fields of the file;

generating a preview of the processed file;

displaying the preview of the processed file on a user interface for evaluation by a user; and on the processed file being deemed suitable for transfer upon evaluation by the user, transferring the file to the destination location in the second host connector.

(2) A method as claimed in claim 1 in which the method further comprises the step of simultaneously displaying the preview of the processed file with the corresponding portions of the original file on the user interface.

(3) A method as claimed in claim 1 or 2 in which the step of generating a preview of the processed file comprises generating a preview of a sub-set of the plurality of records in the file.
(4) A method as claimed in any preceding claim comprising the initial step of:

creating a workflow for the file including specifying the one or more steps of the workflow, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields.

(5) A method as claimed in any preceding claim in which the step of specifying one or more steps of the workflow further comprises specifying one or more parameters for the anonymization or pseudonymization action.
(6) A method as claimed in any preceding claim in which if the processed file is deemed unsuitable upon evaluation, the method comprises the further step of specifying one or more additional steps of the workflow, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields.

(7) A method as claimed in any preceding claim in which the step of specifying the file transfer schedule comprises scheduling periodic processing of the file subsequent to the initial processing of the file, the subsequent periodic processing of the file comprising the steps of:

processing the file in accordance with the workflow thereby anonymizing or pseudonymizing the data in one or more fields of the file; and thereafter transferring the file to the second host connector.

(8) A method as claimed in any preceding claim in which the method comprises the initial step of defining the format of the file.
(9) A method as claimed in any preceding claim in which the method comprises the initial step of specifying a header portion of the file.
(10) A method as claimed in any preceding claim in which the method comprises the initial step of specifying a trailer portion of the file.
(11) A method as claimed in any preceding claim in which the anonymization or pseudonymization action comprises a blanking operation.
(12) A method as claimed in any preceding claim in which the anonymization or pseudonymization action comprises a masking operation.
(13) A method as claimed in any preceding claim in which the anonymization or pseudonymization action comprises a hashing operation.
(14) A method as claimed in any preceding claim in which the anonymization or pseudonymization action comprises a number function operation.
(15) A method as claimed in any preceding claim in which the anonymization or pseudonymization action comprises a filtering operation.
(16) A method as claimed in any preceding claim in which the anonymization or pseudonymization action comprises a field replacement operation.
(17) A method as claimed in claim 16 in which the field replacement operation comprises random field replacement.
(18) A method as claimed in claim 16 in which the field replacement operation comprises repeatable field replacement.

(19) A method as claimed in claims 16 to 18 in which the field replacement operation utilizes a pre-defined list for field replacement.
(20) A method as claimed in claims 16 to 19 in which the field replacement operation utilizes statistical accuracy for field replacement.
(21) A method as claimed in claims 16 to 20 in which the field replacement operation comprises a plurality of cases for the field replacement step.
(22) A computer program product having program instructions stored thereon that when loaded onto a computer cause the computer to carry out one or more of the steps of the method of any one of claims 1 to 21 inclusive.
(23) A managed file transfer system comprising:

a plurality of host connectors;

a plurality of managed file transfer links for transfer of a file from one of the host connectors to or from another of the host connectors;

at least one user computer; and

a managed file transfer link establishment module to enable a user to create a managed file transfer link and to specify for the managed file transfer link: a source location in one of the host connectors; a destination location in one of the host connectors; the file to be transferred between the source location and the destination location; and the file transfer schedule;

and in which the managed file transfer link establishment module comprises a file anonymization/pseudonymization module integral therewith operable to create a workflow for the file including one or more steps, each step comprising an anonymization or pseudonymization action to be performed on the data in one of the fields of the file thereby anonymizing or pseudonymizing the data in one or more fields of the file prior to the file being transferred in accordance with the file transfer schedule.

(24) A managed file transfer system as claimed in claim 23 in which there is provided a managed file transfer hub and in which the user computer is in communication with the managed file transfer hub.
(25) A managed file transfer system as claimed in claim 23 or 24 in which there is provided a secure memory and in which the managed file transfer link and the workflow associated therewith is stored in secure memory.
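
By way of illustration only, a workflow of anonymization or pseudonymization actions (blanking, masking, hashing, and repeatable field replacement from a pre-defined list) and a preview of a sub-set of records shown alongside the original fields might be sketched as follows. This is a minimal Python sketch under stated assumptions and is not the claimed implementation: the function names (`blank`, `mask`, `hash_value`, `repeatable_replace`, `process`, `preview`), the salt and seed strings, the replacement list and the sample records are all hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Action = Callable[[str], str]
Workflow = List[Tuple[str, Action]]  # ordered (field name, action) steps


def blank(_value: str) -> str:
    """Blanking operation: discard the field content."""
    return ""


def mask(value: str, visible: int = 4, char: str = "*") -> str:
    """Masking operation: hide all but the last `visible` characters."""
    if len(value) <= visible:
        return char * len(value)
    return char * (len(value) - visible) + value[-visible:]


def hash_value(value: str, salt: str = "demo-salt") -> str:
    """Hashing operation: salted SHA-256, truncated for display (assumed choice)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def repeatable_replace(value: str, choices: List[str], seed: str = "demo-seed") -> str:
    """Repeatable field replacement: the same input always maps to the
    same entry of the pre-defined list `choices`."""
    digest = hashlib.sha256((seed + value).encode("utf-8")).digest()
    return choices[int.from_bytes(digest[:8], "big") % len(choices)]


def process(records: List[Record], workflow: Workflow) -> List[Record]:
    """Apply every workflow step to a copy of each record."""
    processed = []
    for record in records:
        result = dict(record)  # leave the original record untouched
        for field, action in workflow:
            result[field] = action(result[field])
        processed.append(result)
    return processed


def preview(records: List[Record], workflow: Workflow, limit: int = 2):
    """Pair a sub-set of original records with their processed versions."""
    subset = records[:limit]
    return list(zip(subset, process(subset, workflow)))


# Hypothetical replacement list and sample data, for illustration only.
NAMES = ["Alex", "Sam", "Jo"]
workflow: Workflow = [
    ("name", lambda v: repeatable_replace(v, NAMES)),
    ("card", mask),
    ("email", hash_value),
    ("notes", blank),
]
records = [
    {"name": "Mary Byrne", "card": "4111111111111111",
     "email": "mary@example.com", "notes": "VIP"},
    {"name": "John Walsh", "card": "5500000000000004",
     "email": "john@example.com", "notes": ""},
]

for original, processed in preview(records, workflow):
    print(original, "->", processed)
```

In this sketch a user who finds the preview unsuitable would append further `(field, action)` steps to `workflow` and generate the preview again before the file is scheduled for transfer.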