WO2016039676A1

WO2016039676A1 - Pre-processing of user data

Info

Publication number: WO2016039676A1
Application number: PCT/SE2014/051053
Authority: WO
Inventors: Yi Cheng; Christian Schaefer
Original assignee: Telefonaktiebolaget L M Ericsson (Publ)
Priority date: 2014-09-11
Filing date: 2014-09-11
Publication date: 2016-03-17

Abstract

There is provided a method for policy based pre-processing of user data. The method is performed by a pre-processing node. The method comprises receiving a request from a job management node for a client to access a set of user data. The request comprises an identity of an application. The method comprises acquiring a policy for the application based on the identity. The method comprises determining a modified set of user data from the set of user data based on the policy. The method comprises providing a reference to the modified set of user data to the job management node.

Description

PRE-PROCESSING OF USER DATA

TECHNICAL FIELD

Embodiments presented herein relate to pre-processing of user data, and particularly to methods, a pre-processing node, a job management node, computer programs, and a computer program product for policy based pre-processing of user data.

BACKGROUND

In data generating networks, there may be a challenge to obtain good performance and capacity. For example, one parameter in providing good performance and capacity for a given data generating network is to enable efficient collection, processing, and analysis of the generated data.

In general terms, data sets may grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. Such large amounts of data may be generated, collected, processed, and analyzed to provide insights that are valuable for operation efficiency and business innovation. Such large amounts of data are sometimes referred to as big data. Big data may be defined as high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. In general terms, big data may thus be regarded as an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.

Further, large amounts of data may also bring privacy concerns. User privacy is at risk if user data, or data about users, is not properly protected. To allow valuable data usage while preserving user privacy, data anonymization may be a precondition for analytics applications to access sensitive user data such as user location, medical records, financial transactions, etc. One mechanism to preserve privacy is to pre-process the data, for example by subjecting the data to pseudo-anonymization, before the data is analyzed. Pseudo-anonymization may involve removing the user identifier or replacing it with some other value (e.g. a hash of the real identifier). More sophisticated anonymization mechanisms, for example k-anonymity, l-diversity and t-closeness, could make very extensive modifications to the original data. It is usually not possible to perform anonymization at the time the data is being accessed. One common approach to preserve privacy is to anonymize the original data in advance and create anonymized versions for analytics applications.
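
As a minimal sketch of the pseudo-anonymization mentioned above, the following Python snippet replaces a user identifier with a hash of the real identifier. The field name user_id, the sample records, and the optional salt parameter are illustrative assumptions and are not taken from the disclosure.

```python
import hashlib


def pseudonymize(user_id, salt=""):
    """Return a SHA-256 hex digest standing in for the real identifier.

    The salt is an assumption of this sketch, not part of the disclosure.
    """
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


def pseudonymize_records(records, id_field="user_id", salt=""):
    """Return copies of the records with the identifier field replaced."""
    return [{**r, id_field: pseudonymize(r[id_field], salt)} for r in records]


# Hypothetical sample data.
records = [
    {"user_id": "alice", "cell": "A17"},
    {"user_id": "bob", "cell": "B02"},
    {"user_id": "alice", "cell": "C33"},
]
pseudo = pseudonymize_records(records, salt="example-salt")
```

In this sketch the same real identifier always maps to the same replacement value, so records belonging to one user can still be linked to each other.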

Apache Hadoop is a non-limiting example of a platform for big data storage and processing. Currently, Hadoop provides access control based on file permissions or access control lists (ACL) which specify who can do what on the associated data. With the increasing concern on user privacy, it is becoming more and more common to put data anonymization as a precondition for analytics applications to access sensitive user data.

Hence, there is still a need for an improved pre-processing of user data.

SUMMARY

An object of embodiments herein is to provide efficient pre-processing of user data.

The inventors of the herein disclosed embodiments have realized that it is currently not possible (for example in Hadoop) to specify conditions under which access can be granted to privacy-protected data. The inventors of the herein disclosed embodiments have further realized that depending on the properties of individual applications and the types of the data they use, per-application anonymization may be needed. At present Hadoop does not have native support for per-application pre-processing in order to perform anonymization and other kinds of data modifications.

According to a first aspect there is therefore presented a method for policy based pre-processing of user data. The method is performed by a pre-processing node. The method comprises receiving a request from a job management node for a client to access a set of user data. The request comprises an identity of an application. The method comprises acquiring a policy for the application based on the identity. The method comprises determining a modified set of user data from the set of user data based on the policy. The method comprises providing a reference to the modified set of user data to the job management node.

Advantageously this provides efficient pre-processing of user data.

Advantageously, such policy-based pre-processing allows anonymization, filtering, encryption and other privacy/security mechanisms to be applied on sensitive data before being accessed by applications, supporting conditional access.

Advantageously, the pre-processing node may make policy-based pre-processing an integral part of Hadoop MapReduce, thereby reducing manual administrative overhead. Because the pre-processing is per application, it is more dynamic and flexible in addressing application specifics.

Advantageously, the pre-processing may be carried out in a MapReduce job and may thus take advantage of the high performance of Hadoop.

Advantageously, data that is not needed may be filtered out in advance. The amount of data the application needs to deal with may thereby be reduced and runtime access control simplified.

Advantageously, applications will not have a chance to try to snoop on data they are not supposed to access, regardless of unintentional programming error or malicious code.

Advantageously, as data may be modified just before an application accesses it, and the modified version of the data may be deleted after usage, there is no need to keep several versions of the data in the storage for different use cases all the time.

According to a second aspect there is presented a pre-processing node for policy based pre-processing of user data. The pre-processing node comprises a processing unit. The processing unit is configured to receive a request from a job management node for a client to access a set of user data. The request comprises an identity of an application. The processing unit is configured to acquire a policy for the application based on the identity. The processing unit is configured to determine a modified set of user data from the set of user data based on the policy. The processing unit is configured to provide a reference to the modified set of user data to the job management node.

According to a third aspect there is presented a computer program for policy based pre-processing of user data, the computer program comprising computer program code which, when run on a processing unit of a pre-processing node, causes the pre-processing node to perform a method according to the first aspect.

According to a fourth aspect there is presented a method for policy based pre-processing of user data. The method is performed by a job management node. The method comprises receiving a request from a client to execute an application which needs access to a set of user data. The request comprises an identity of the application. The method comprises forwarding the request to a pre-processing node. The method comprises receiving a reference to a modified set of user data of the set of user data from the pre-processing node.

According to a fifth aspect there is presented a job management node for policy based pre-processing of user data. The job management node comprises a processing unit. The processing unit is configured to receive a request from a client to execute an application which needs access to a set of user data. The request comprises an identity of the application. The processing unit is configured to forward the request to a pre-processing node. The processing unit is configured to receive a reference to a modified set of user data of the set of user data from the pre-processing node.

According to a sixth aspect there is presented a computer program for policy based pre-processing of user data, the computer program comprising computer program code which, when run on a processing unit of a job management node, causes the job management node to perform a method according to the fourth aspect.

According to a seventh aspect there is presented a computer program product comprising a computer program according to at least one of the third aspect and the sixth aspect and a computer readable means on which the computer program is stored.

It is to be noted that any feature of the first, second, third, fourth, fifth, sixth and seventh aspects may be applied to any other aspect, wherever appropriate. Likewise, any advantage of the first aspect may equally apply to the second, third, fourth, fifth, sixth, and/or seventh aspect, respectively, and vice versa. Other objectives, features and advantages of the enclosed embodiments will be apparent from the following detailed disclosure, from the attached dependent claims as well as from the drawings.

Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the element, apparatus, component, means, step, etc." are to be interpreted openly as referring to at least one instance of the element, apparatus, component, means, step, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.

BRIEF DESCRIPTION OF THE DRAWINGS

The inventive concept is now described, by way of example, with reference to the accompanying drawings, in which:

Fig. 1 is a schematic block diagram illustrating nodes for pre-processing according to embodiments;

Fig. 2a is a schematic diagram showing functional units of a pre-processing node according to an embodiment;

Fig. 2b is a schematic diagram showing functional modules of a pre-processing node according to an embodiment;

Fig. 3a is a schematic diagram showing functional units of a job management node according to an embodiment;

Fig. 3b is a schematic diagram showing functional modules of a job management node according to an embodiment;

Fig. 4 shows one example of a computer program product comprising computer readable means according to an embodiment;

Figs. 5, 6, 7, and 8 are flowcharts of methods according to embodiments; and

Fig. 9 is a signalling diagram according to an embodiment.

DETAILED DESCRIPTION

The inventive concept will now be described more fully hereinafter with reference to the accompanying drawings, in which certain embodiments of the inventive concept are shown. This inventive concept may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete, and will fully convey the scope of the inventive concept to those skilled in the art. Like numbers refer to like elements throughout the description. Any step or feature illustrated by dashed lines should be regarded as optional.

The embodiments disclosed herein relate to policy based pre-processing of user data. In order to obtain such policy based pre-processing of user data there is provided a pre-processing node, methods performed by the pre-processing node, and a computer program comprising code, for example in the form of a computer program product, that when run on a processing unit of the pre-processing node, causes the pre-processing node to perform the methods. In order to obtain such policy based pre-processing of user data there is further provided a job management node, methods performed by the job management node, and a computer program comprising code, for example in the form of a computer program product, that when run on a processing unit of the job management node, causes the job management node to perform the methods.

Fig. 2a schematically illustrates, in terms of a number of functional units, the components of a pre-processing node 21 according to an embodiment. A processing unit 22 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA) etc., capable of executing software instructions stored in a computer program product 41a (as in Fig. 4), e.g. in the form of a storage medium 24. The processing unit 22 is thereby arranged to execute methods as herein disclosed. The storage medium 24 may also comprise persistent storage, which, for example, can be any single one or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory. The pre-processing node 21 may further comprise a communications interface 23 for communications with at least a job management node 31. As such the communications interface 23 may comprise one or more transmitters and receivers, comprising analogue and digital components and a suitable number of antennas or wired ports. The processing unit 22 controls the general operation of the pre-processing node 21 e.g. by sending data and control signals to the communications interface 23 and the storage medium 24, by receiving data and reports from the communications interface 23, and by retrieving data and instructions from the storage medium 24. The functionality of the pre-processing node 21 may be implemented in a server. Other components, as well as the related functionality, of the pre-processing node 21 are omitted in order not to obscure the concepts presented herein.

Fig. 2b schematically illustrates, in terms of a number of functional modules, the components of a pre-processing node 21 according to an embodiment. The pre-processing node 21 of Fig. 2b comprises a number of functional modules: a receive module 22a, an acquire module 22b, a determine module 22c, and a provide module 22d. The pre-processing node 21 of Fig. 2b may further comprise a number of optional functional modules, such as any of a map module 22e, a filter module 22f, a generate module 22g, a replace module 22h, and a delete module 22j. The functionality of each functional module 22a-j will be further disclosed below in the context of which the functional modules 22a-j may be used. In general terms, each functional module 22a-j may be implemented in hardware or in software. Preferably, one or more or all functional modules 22a-j may be implemented by the processing unit 22, possibly in cooperation with functional units 23 and/or 24. The processing unit 22 may thus be arranged to fetch, from the storage medium 24, instructions as provided by a functional module 22a-j and to execute these instructions, thereby performing any steps as will be disclosed hereinafter.

Fig. 3a schematically illustrates, in terms of a number of functional units, the components of a job management node 31 according to an embodiment. A processing unit 32 is provided using any combination of one or more of a suitable central processing unit (CPU), multiprocessor, microcontroller, digital signal processor (DSP), application specific integrated circuit (ASIC), field programmable gate array (FPGA) etc., capable of executing software instructions stored in a computer program product 41b (as in Fig. 4), e.g. in the form of a storage medium 34. The processing unit 32 is thereby arranged to execute methods as herein disclosed. The storage medium 34 may also comprise persistent storage, which, for example, can be any single one or combination of magnetic memory, optical memory, solid state memory or even remotely mounted memory. The job management node 31 may further comprise a communications interface 33 for communications with at least a client 11, a pre-processing node 21, and a task processing node 13. As such the communications interface 33 may comprise one or more transmitters and receivers, comprising analogue and digital components and a suitable number of antennas or wired ports. The processing unit 32 controls the general operation of the job management node 31 e.g. by sending data and control signals to the communications interface 33 and the storage medium 34, by receiving data and reports from the communications interface 33, and by retrieving data and instructions from the storage medium 34. The functionality of the job management node 31 may be implemented in a server. Other components, as well as the related functionality, of the job management node 31 are omitted in order not to obscure the concepts presented herein.

Fig. 3b schematically illustrates, in terms of a number of functional modules, the components of a job management node 31 according to an embodiment. The job management node 31 of Fig. 3b comprises a number of functional modules: a receive module 32a, and a forward module 32b. The job management node 31 of Fig. 3b may further comprise a number of optional functional modules, such as a provide module 32c. The functionality of each functional module 32a-c will be further disclosed below in the context of which the functional modules 32a-c may be used. In general terms, each functional module 32a-c may be implemented in hardware or in software. Preferably, one or more or all functional modules 32a-c may be implemented by the processing unit 32, possibly in cooperation with functional units 33 and/or 34. The processing unit 32 may thus be arranged to fetch, from the storage medium 34, instructions as provided by a functional module 32a-c and to execute these instructions, thereby performing any steps as will be disclosed hereinafter.

Fig. 4 shows one example of a computer program product 41a, 41b comprising computer readable means 43. On this computer readable means 43, a computer program 42a can be stored, which computer program 42a can cause the processing unit 22 and thereto operatively coupled entities and devices, such as the communications interface 23 and the storage medium 24, to execute methods according to embodiments described herein. The computer program 42a and/or computer program product 41a may thus provide means for performing any steps of the pre-processing node 21 as herein disclosed. Further, on this computer readable means 43, a computer program 42b can be stored, which computer program 42b can cause the processing unit 32 and thereto operatively coupled entities and devices, such as the communications interface 33 and the storage medium 34, to execute methods according to embodiments described herein. The computer program 42b and/or computer program product 41b may thus provide means for performing any steps of the job management node 31 as herein disclosed.

In the example of Fig. 4, the computer program product 41a, 41b is illustrated as an optical disc, such as a CD (compact disc) or a DVD (digital versatile disc) or a Blu-Ray disc. The computer program product 41a, 41b could also be embodied as a memory, such as a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM) and more particularly as a non-volatile storage medium of a device in an external memory such as a USB (Universal Serial Bus) memory or a Flash memory, such as a compact Flash memory. Thus, while the computer program 42a, 42b is here schematically shown as a track on the depicted optical disk, the computer program 42a, 42b can be stored in any way which is suitable for the computer program product 41a, 41b.

As disclosed above, pseudo-anonymization may be used to hide user identity. A database with the user identifier removed or randomized may be used to serve multiple applications.

As an illustrative example, the following steps may be taken (manually) for each application in order to create pseudo-anonymized or anonymized user data:

First, based on legal requirements, organization and business rules, it is determined whether data anonymization is required before the application is allowed to access the requested data. If so, the anonymization algorithm and parameters are determined.

Second, the anonymization is performed on the original data and a new version of the data is created.

Third, the new (anonymized) version of the data is, instead of the original data, provided as input to the application.

Fourth, the anonymized version of the data is deleted if no longer needed by the application.

However, with the fast advancement in data mining and data analytics techniques, pseudo-anonymization may not be efficient or strong enough to prevent sophisticated re-identification attacks. More advanced anonymization mechanisms that take into account the properties of the application and the types of the data it uses may therefore be needed to effectively protect user privacy. Even if the same anonymization algorithm can be applied to multiple applications, the relevant parameters should be tuned according to the application specifics.

As hereinafter disclosed, there are therefore proposed mechanisms that involve anonymization on a per-application basis. Particularly, below is proposed a policy-based pre-processing which may be used within the Hadoop framework to make per-application anonymization an integral part of a MapReduce job execution.

In general terms, in the Hadoop authorization model, file permissions or ACLs specify who can access the associated data. It is currently not possible to specify the condition under which access can be granted. Therefore application based policies that can accommodate more application-specific access control policies are hereby introduced. Depending on application properties (purpose, core or non-core business, internal or external application, etc.), relevant legal requirements, organization and business rules should be used to define application policies.

Figs. 5 and 6 are flowcharts illustrating embodiments of methods for policy based pre-processing of user data as performed by the pre-processing node 21. Figs. 7 and 8 are flowcharts illustrating embodiments of methods for policy based pre-processing of user data as performed by the job management node 31. The methods are advantageously provided as computer programs 42a, 42b.

Reference is now made to Fig. 5 illustrating a method for policy based pre-processing of user data as performed by a pre-processing node 21 according to an embodiment. Parallel reference is made to Fig. 9 illustrating a signalling diagram according to embodiments. Parallel reference is also made to Fig. 1 illustrating a schematic block diagram 10 of nodes for pre-processing according to embodiments. The schematic block diagram 10 comprises a client 11. The client 11 is operatively connected to a job management node 31. The job management node 31 comprises a job queue entity 12. The job management node 31 is operatively connected to a task processing node 13 and a pre-processing node 21. The pre-processing node 21 is operatively connected to a database 14. The database 14 comprises application policies. The pre-processing node 21 further has access to a set of user data 15 and is configured to determine a modified set of user data 16 from the set of user data 15 for the task processing node 13 to operate on when executing an application of the client 11. This is schematically illustrated by the dotted arrow in Fig. 1. The functionality of each node in Fig. 1 will now be described in more detail.

The pre-processing node 21 is configured to, in a step S102, receive a request from the job management node 31. The request from the job management node 31 is for the client 11 to access the set of user data 15. The request comprises an identity of an application.

Different applications (and/or applications with different properties) may be allowed to access different amounts, or different kinds, of user data 15. The pre-processing node 21 therefore uses the identity to determine what data the application is allowed to access. Particularly, the pre-processing node 21 is configured to, in a step S104, acquire a policy for the application. The policy is based on the identity of the application. The policy may be acquired from the database 14.

The pre-processing node 21 then uses this policy to determine what data the application is allowed to access. Particularly, the pre-processing node 21 is configured to, in a step S106, determine the modified set of user data 16. The modified set of user data 16 is determined from the set of user data 15 and is based on the policy. As will be further disclosed below there are different ways for the pre-processing node 21 to determine the modified set of user data 16. For example, the modified set of user data 16 may be a subset of the set of user data 15 (i.e., a version of the set of user data 15 where some data posts have been removed) or a version of the set of user data 15 where the values of some of the data posts have been changed.

The job management node 31 is then made aware of the modified set of user data 16 instead of the (original) set of user data 15. Particularly, the pre-processing node 21 is configured to, in a step S108, provide a reference to the modified set of user data 16 to the job management node 31.

The pre-processing node 21 is thereby configured to adapt the set of user data 15 based on the entity to which the application belongs and the entity to which the result of the application is to be provided.
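
Steps S102 to S108 above can be sketched as follows. This is a simplified illustration only: the in-memory dictionaries standing in for the database 14 and the stored user data, the drop_fields policy key, the request layout, and the naming of the returned reference are all assumptions made for this sketch.

```python
# Hypothetical stand-ins for the policy database 14 and the stored user data.
POLICY_DB = {"app-analytics": {"drop_fields": ["user_id"]}}
DATA_STORE = {"/data/sessions": [{"user_id": "u1", "cell": "A17"}]}


def handle_request(request):
    """Process a request forwarded by the job management node (S102)."""
    # S104: acquire the policy for the application based on its identity.
    policy = POLICY_DB[request["app_id"]]

    # S106: determine the modified set of user data based on the policy.
    original = DATA_STORE[request["data_ref"]]
    modified = [
        {k: v for k, v in row.items() if k not in policy["drop_fields"]}
        for row in original
    ]

    # S108: store the modified set and return a reference to it.
    modified_ref = request["data_ref"] + "." + request["app_id"] + ".modified"
    DATA_STORE[modified_ref] = modified
    return modified_ref


ref = handle_request({"app_id": "app-analytics", "data_ref": "/data/sessions"})
```

The original set of user data is left untouched; only the reference to the modified copy is returned to the caller.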

Embodiments relating to further details of policy based pre-processing of user data as performed by the pre-processing node 21 will now be disclosed.

There may be different kinds of policies. For example, the policy may comprise conditions under which the application is allowed (or not allowed) to access the set of user data. If the condition is anonymization, the policy may indicate an anonymization algorithm and necessary parameters, and/or whether the anonymized data should be deleted or not after usage.

Additionally or alternatively the policy may comprise identification of what data types in the set of user data the application is allowed (or not allowed) to access. Other information, e.g., an encryption algorithm, may also be specified in the policy if needed.

There may be different conditions for determining the policy. These conditions may depend on a relation between the application and the set of user data 15. For example, the policy may then depend on whether the application is an internal core application, an internal non-core application, or an external application in relation to the set of user data 15.
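
One possible way to represent such a policy is sketched below. The class and field names, the category values, and the example parameter k are hypothetical and chosen for illustration; the disclosure does not prescribe a policy format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class AppCategory(Enum):
    """Relation between the application and the set of user data."""
    INTERNAL_CORE = "internal core"
    INTERNAL_NON_CORE = "internal non-core"
    EXTERNAL = "external"


@dataclass
class ApplicationPolicy:
    app_id: str
    category: AppCategory
    allowed_data_types: frozenset = frozenset()
    anonymization_algorithm: Optional[str] = None  # e.g. "k-anonymity"
    anonymization_params: dict = field(default_factory=dict)
    delete_after_use: bool = False
    encryption_algorithm: Optional[str] = None

    def requires_anonymization(self):
        """True if access is conditional on anonymization."""
        return self.anonymization_algorithm is not None

    def may_access(self, data_type):
        """True if the application is allowed to access this data type."""
        return data_type in self.allowed_data_types


# Hypothetical policy for an external analytics application.
policy = ApplicationPolicy(
    app_id="app-analytics",
    category=AppCategory.EXTERNAL,
    allowed_data_types=frozenset({"location"}),
    anonymization_algorithm="k-anonymity",
    anonymization_params={"k": 5},
    delete_after_use=True,
)
```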

There may be different kinds of user data in the set of user data. For example, the user data may be network data and relate to session records, location of portable wireless devices, radio access network node data, and/or charging data.

The application may be configured to access the modified set of user data 16 during execution of the application.

Reference is now made to Fig. 6 illustrating methods for policy based pre-processing of user data as performed by the pre-processing node 21 according to further embodiments. Parallel reference is continued to Fig. 9 and Fig. 1.

There may be different ways to determine the modified set of user data 16. Different embodiments relating thereto will now be described in turn.

For example, as noted above, the policy may comprise identification of what data types in the set of user data the application is allowed to access. The pre-processing node 21 may use this information to determine the modified set of user data 16. Particularly, the pre-processing node 21 may be configured to, in an optional step S106a, determine the modified set of user data 16 by mapping the data types to data posts in the set of user data 15.

Additionally or alternatively, the modified set of user data 16 may be determined by applying a filter to the set of user data 15. Filtering may include anonymization, encryption, etc. Particularly, the pre-processing node 21 may be configured to, in an optional step S106b, determine the modified set of user data 16 by filtering the set of user data 15 based on the policy, thereby removing or changing data posts of the set of user data 15.
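
The mapping of step S106a and the filtering of step S106b can be illustrated with the short sketch below. It assumes, for illustration only, that a data post corresponds to a named field of a record; the mapping table, the field names, and the sample record are hypothetical.

```python
# Hypothetical mapping from policy-level data types to record fields.
TYPE_TO_FIELDS = {
    "location": ["cell"],
    "charging": ["amount"],
    "identity": ["user_id"],
}


def map_types_to_fields(allowed_types):
    """S106a: resolve the allowed data types to the fields they cover."""
    return {f for t in allowed_types for f in TYPE_TO_FIELDS.get(t, [])}


def filter_records(records, allowed_fields):
    """S106b: keep only the allowed fields of each record."""
    return [{k: v for k, v in r.items() if k in allowed_fields} for r in records]


user_data = [{"user_id": "u1", "cell": "A17", "amount": 12.5}]
allowed = map_types_to_fields(["location", "charging"])
modified = filter_records(user_data, allowed)
```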

Additionally or alternatively, a job may be generated in order to determine the modified set of user data 16. Particularly, the pre-processing node 21 may be configured to, in an optional step S106c, determine the modified set of user data 16 by generating a computer implemented job to determine the modified set of user data 16 from the set of user data 15 by processing the set of user data 15 based on the policy.

There may be different kinds of computer implemented jobs. Different embodiments relating thereto will now be described in turn. For example, the computer implemented job may be based on the policy. The computer implemented job may comprise identification of an anonymization algorithm for determining the modified set of user data from the set of user data, identification of which data posts in the set of user data are to be pseudo-anonymized, and/or identification of which data posts in the set of user data are to be removed. The computer implemented job may further comprise instructions to generalize values of data posts to intervals, to merge at least two data posts into one data post, and/or to change an order of the data posts in the set of user data.
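
Two of the instructions mentioned above, generalizing values to intervals and changing the order of the data posts, can be sketched as plain functions. The sketch assumes integer values and treats each record as one data post; the age field, the interval width, and the fixed seed are illustrative choices only.

```python
import random


def generalize_to_interval(value, width):
    """Map an integer to the closed interval of the given width containing it."""
    low = (value // width) * width
    return (low, low + width - 1)


def generalize_field(records, field_name, width):
    """Return copies of the records with one field generalized to intervals."""
    return [
        {**r, field_name: generalize_to_interval(r[field_name], width)}
        for r in records
    ]


def shuffle_records(records, seed=None):
    """Return the records in a changed order, leaving the input untouched."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical sample data.
rows = [{"age": 37}, {"age": 42}, {"age": 30}]
generalized = generalize_field(rows, "age", 10)
reordered = shuffle_records(generalized, seed=0)
```

Here the values 37 and 30 both fall into the interval (30, 39), so the two records can no longer be told apart by this field.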

There may be different ways to handle the computer implemented job. For example, the computer implemented job may be given a high priority level for execution.

There may be different ways in which the computer implemented job may be implemented. For example, the computer implemented job may be a Hadoop MapReduce job.

There may be different ways to provide the reference to the modified set of user data 16 as in step S108. For example, an identification of the set of user data 15 may be replaced with an identification of the modified set of user data 16. Hence, the pre-processing node 21 may be configured to, in an optional step S108c, replace an identification of the set of user data 15 with an identification of the modified set of user data 16.

There may be different ways to handle the modified set of user data 16 once it has been accessed. Particularly, the pre-processing node 21 may be configured to, in an optional step S110, receive an indication that access to the modified set of user data 16 by the application has been terminated. The pre-processing node 21 may then be configured to, in an optional step S112, and in response to step S110, delete the modified set of user data 16. There may be conditions regarding if and/or when the modified set of user data 16 should be deleted. For example, the modified set of user data 16 may only be deleted if required by the policy for the application.

Reference is now made to Fig. 7 illustrating a method for policy based pre-processing of user data as performed by a job management node 31 according to an embodiment. Parallel reference is continued to Fig. 9 and Fig. 1.

The job management node 31 is configured to, in a step S202, receive a request from a client 11. The request is for the job management node 31 to execute an application. The application needs access to a set of user data 15. The request comprises an identity of the application.

Before executing the application the job management node 31 forwards the request to a pre-processing node 21. Hence, the job management node 31 is configured to, in a step S204, forward the request to a pre-processing node 21. The request is received by the pre-processing node 21 as in step S102.

The pre-processing node 21 may then process the request as in steps S104 and S106, and optionally as in any of steps S106a, S106b, and S106c, before providing a reference to the modified set of user data 16 to the job management node 31. The job management node 31 is configured to, in a step S206, receive the reference to the modified set of user data 16 of the set of user data 15 from the pre-processing node 21.
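The exchange between the two nodes in steps S202 to S206 can be sketched as follows. This is a simplified, single-process illustration only: the classes `PreProcessingNode` and `JobManagementNode`, the dictionary used as storage, the `_modified` naming of the reference, and the choice to deny all columns to an unknown application identity are assumptions of the example and are not specified by the disclosure.

```python
class PreProcessingNode:
    """Hypothetical stand-in for the pre-processing node 21."""

    def __init__(self, policies):
        # Assumed policy form: application identity -> set of allowed columns.
        self.policies = policies

    def handle_request(self, request, storage):
        # S102/S104: receive the request and acquire the policy by identity.
        allowed = self.policies.get(request["app_id"], set())
        # S106: determine the modified set of user data (here: column filtering).
        source = storage[request["input"]]
        modified = [{k: v for k, v in row.items() if k in allowed} for row in source]
        reference = request["input"] + "_modified"
        storage[reference] = modified
        # S108: provide a reference to the modified set of user data.
        return reference


class JobManagementNode:
    """Hypothetical stand-in for the job management node 31."""

    def __init__(self, pre_processing_node, storage):
        self.pre_processing_node = pre_processing_node
        self.storage = storage

    def submit(self, request):
        # S202: the request from the client arrives here.
        # S204: forward the request to the pre-processing node.
        reference = self.pre_processing_node.handle_request(request, self.storage)
        # S206: the reference to the modified set of user data is received.
        return reference


storage = {"T": [{"msisdn": "46700000001", "bytes": 1200}]}
pre = PreProcessingNode({"S": {"bytes"}})
job_manager = JobManagementNode(pre, storage)
reference = job_manager.submit({"app_id": "S", "input": "T"})
```

Here the application with identity "S" only ever receives a reference to the filtered copy, and the original set of user data stays in place.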

Embodiments relating to further details of policy based pre-processing of user data as performed by the job management node 31 will now be disclosed.

There may be different kinds of requests that are received by the job management node 31 in step S202. For example, the request may comprise a job Java Archive (JAR) file, a configuration file, input specifications, and/or output specifications. Additionally or alternatively the request may comprise a Hadoop MapReduce job.

Reference is now made to Fig. 8 illustrating methods for policy based pre-processing of user data as performed by the job management node 31 according to further embodiments. Parallel reference is continued to Fig. 9 and Fig. 1.

There may be different ways for the job management node 31 to handle the received reference to the modified set of user data 16. Different embodiments relating thereto will now be described in turn.

The application may be executed and may be configured to access the modified set of user data 16 during its execution. Thus, the job management node 31 may be configured to, in an optional step S208, provide instructions to a task processing node 13 to execute the application. The instructions may comprise the reference to the modified set of user data 16. In general terms, there may be a plurality of such task processing nodes 13. Several such task processing nodes 13 may constitute a job processing node. Before being provided to the task processing node 13 the instructions may be processed in a job queue entity 12 in the job management node 31. The job queue entity 12 may be configured to monitor which instructions have been sent to which task processing node 13, in what order the instructions have been sent, and/or the status of the instructions. Once the application has been executed by the task processing node(s) 13 a result may be provided to the job management node 31. The job management node 31 may therefore be configured to, in an optional step S210, receive a result of the execution from the task processing node 13.
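A minimal sketch of the bookkeeping that the job queue entity 12 may perform is given below. The class `JobQueueEntity`, its method names, the instruction identifiers, and the status strings are hypothetical and serve only to illustrate tracking of destination, send order, and status.

```python
class JobQueueEntity:
    """Hypothetical tracker for instructions sent to task processing nodes."""

    def __init__(self):
        # Python dicts preserve insertion order, which records the send order.
        self._records = {}

    def dispatch(self, instruction_id, task_node, input_reference):
        # Record which instruction went to which node and with which input.
        self._records[instruction_id] = {
            "node": task_node,
            "input": input_reference,
            "status": "sent",
        }

    def update_status(self, instruction_id, status):
        self._records[instruction_id]["status"] = status

    def status(self, instruction_id):
        return self._records[instruction_id]["status"]

    def send_order(self):
        return list(self._records)

    def instructions_for(self, task_node):
        return [i for i, r in self._records.items() if r["node"] == task_node]


queue = JobQueueEntity()
queue.dispatch("i1", "task-node-a", "T_modified")
queue.dispatch("i2", "task-node-b", "T_modified")
queue.update_status("i1", "done")
```

Each record carries the reference to the modified set of user data rather than to the original set, so a task processing node never learns the identification of the unmodified data from its instructions.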

There may be different actions that the job management node 31 may perform once it has received the result of the execution from the task processing node 13. For example, the job management node 31 may be configured to, in an optional step S212, provide the result to the client 11. Additionally or alternatively the job management node 31 may be configured to locally store the result.

Additionally or alternatively the job management node 31 may be configured to, in an optional step S214, provide an indication to the pre-processing node 21 that access to the modified set of user data 16 by the application has been terminated. As noted above, this may cause the pre-processing node 21 to delete the modified set of user data 16.
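The policy-conditional deletion triggered by such an indication (steps S110 and S112) may, purely as an example, be organized as below. The class `ModifiedDataRegistry`, the `delete_after_use` flag, and the dictionary-based storage are assumptions of this sketch.

```python
class ModifiedDataRegistry:
    """Hypothetical record of modified data sets held by the pre-processing node."""

    def __init__(self, storage):
        self.storage = storage
        self._delete_required = {}  # reference -> whether the policy requires deletion

    def register(self, reference, delete_after_use):
        self._delete_required[reference] = delete_after_use

    def on_access_terminated(self, reference):
        # S110: indication received that the application's access has ended.
        if self._delete_required.pop(reference, False):
            # S112: delete only if the policy for the application requires it.
            self.storage.pop(reference, None)
            return True
        return False


storage = {"T": ["original"], "T_prime": ["modified"], "U_prime": ["modified"]}
registry = ModifiedDataRegistry(storage)
registry.register("T_prime", delete_after_use=True)
registry.register("U_prime", delete_after_use=False)
deleted_t = registry.on_access_terminated("T_prime")
deleted_u = registry.on_access_terminated("U_prime")
```

Only the modified copy whose policy requires deletion is removed; the original set of user data is never touched by this path.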

One particular embodiment based on at least some of the above disclosed embodiments will now be disclosed. The particular embodiment is based on a modified MapReduce job submission procedure. The particular embodiment is based on applications that are enabled to access data in an HDFS (Hadoop Distributed File System) file or a database table (HBase or Hive that is built on top of HDFS). For illustrative purposes only it is assumed that user data is stored in tables with rows and columns.

S302: The client 11 (on whose behalf the application will run) sends a request to the job management node 31 acting as a job tracker for creating a new MapReduce job. The request includes a job JAR file, a configuration file, and input and output specifications. Table T is specified as the data input and thus represents the set of user data 15. In addition to these parameters, an application identity "S" is also provided. One way to implement step S302 is to perform step S202.

S304: The application job request is forwarded by the job management node 31 to the pre-processing node 21. One way to implement step S304 is to perform steps S204 and S102.

S306: The pre-processing node 21 retrieves the application policy for S from a local storage and based on the application policy identifies which data types S is allowed or not allowed to access and under which conditions. Using the table metadata of T, the pre-processing node 21 maps the identified data types to the columns in T. One way to implement step S306 is to perform step S104.

S308: The pre-processing node 21 creates a MapReduce job to pre-process the data in T, using an existing MapReduce program in the form of a JAR file that can perform anonymization (pseudo or more advanced anonymization), filtering, encryption and other data modifications. The pre-processing node 21 also specifies necessary parameters, e.g. the anonymization algorithm to use, the column(s) to be pseudo-anonymized, the column(s) to be filtered out, etc. based on the policy for S. This will provide instructions to create a modified set of user data 16. One way to implement step S308 is to perform step S106.

S310: The job created by the pre-processing node 21 is given high priority and put into execution. In addition to the processing the pre-processing node 21 has specified, the map tasks could also perform filtering etc. based on user specified policies (if any) that are cached in the memory of each task processing node 13. The job created by the pre-processing node 21 outputs the modified data into a new table T'. This table T' thus corresponds to the modified set of user data 16. One way to implement step S310 is to perform any of steps S106, S106a, S106b, and S106c.

S312: When the job created by the pre-processing node 21 is finished, the pre-processing node 21 replaces T with T' (i.e., replaces an identification of the set of user data 15 with an identification of the modified set of user data 16) in the original job's input specification and sends it to the job queue. One way to implement step S312 is to perform any of steps S108, S108a, and S206.

S314: When the application job is finished, the pre-processing node 21 deletes table T' if required by the policy for S. One way to implement step S314 is to perform steps S110 and S112.
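Steps S306, S308, and S312 above can be sketched in Python as follows. The policy layout, the data type labels, the column names, the algorithm label, and the function names are hypothetical; in particular, treating data types that the policy does not mention as "remove" is an assumption of the example, not a statement of the disclosed method.

```python
def map_types_to_columns(policy, metadata):
    """S306: translate per-data-type rules into per-column actions."""
    actions = {}
    for column, data_type in metadata.items():
        # Assumed default: data types not mentioned in the policy are removed.
        actions[column] = policy["rules"].get(data_type, "remove")
    return actions


def build_job_parameters(policy, actions):
    """S308: collect the parameters handed to the pre-processing job."""
    return {
        "algorithm": policy["algorithm"],
        "pseudonymize": sorted(c for c, a in actions.items() if a == "pseudonymize"),
        "remove": sorted(c for c, a in actions.items() if a == "remove"),
        "priority": "high",  # S310: the job is given high priority
    }


def rewrite_input(job_spec, modified_table):
    """S312: replace T with T' in a copy of the original job specification."""
    updated = dict(job_spec)
    updated["input"] = modified_table
    return updated


# Hypothetical application policy for S and table metadata for T.
policy_s = {
    "algorithm": "salted-hash",
    "rules": {"identifier": "pseudonymize", "usage": "allow"},
}
metadata_t = {"msisdn": "identifier", "bytes": "usage", "cell_id": "location"}

actions = map_types_to_columns(policy_s, metadata_t)
parameters = build_job_parameters(policy_s, actions)
original_job = {"jar": "app.jar", "input": "T", "output": "out", "app_id": "S"}
resubmitted_job = rewrite_input(original_job, "T_prime")
```

The resubmitted job differs from the original job only in its input specification, so the application itself needs no changes in order to run against T'.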

In summary, the Hadoop authorization model (i.e. based on file permissions and ACLs) may be extended with application policies and a policy-based pre-processing. Application policies may regulate which application can do what on which user data, and under which conditions. When an application is submitted as a job the pre-processing node 21 examines the application policy for this application and if required by the policy initiates anonymization and other pre-processing to create a new version of the input data. The pre-processing node 21 may then re-submit the job with the new input (i.e., with a modified set of user data 16). If necessary the pre-processing node 21 deletes the modified set of user data 16 after the job is completed. The pre-processing itself may be carried out in a MapReduce job, taking advantage of the parallel processing capability of Hadoop. In addition to anonymization, filtering, encryption and other data modifications may be executed in the pre-processing job according to the application policy.

The inventive concept has mainly been described above with reference to a few embodiments. However, as is readily appreciated by a person skilled in the art, other embodiments than the ones disclosed above are equally possible within the scope of the inventive concept, as defined by the appended patent claims. For example, although some examples and embodiments relate to the use of Hadoop, other embodiments and examples, as well as the herein disclosed general inventive concept are not based on the use of Hadoop.

Claims

1. A method for policy based pre-processing of user data, the method being performed by a pre-processing node (21), the method comprising the steps of:

receiving (S102) a request from a job management node (31) for a client (11) to access a set of user data (15), wherein said request comprises an identity of an application;

acquiring (S104) a policy for said application based on said identity;

determining (S106) a modified set of user data (16) from said set of user data based on said policy; and

providing (S108) a reference to said modified set of user data to said job management node.

2. The method according to claim 1, wherein said policy comprises conditions under which said application is allowed to access said set of user data.

3. The method according to claim 1 or 2, wherein said policy comprises identification of what data types in said set of user data said application is allowed to access.

4. The method according to claim 3, wherein determining said modified set of user data comprises:

mapping (S106a) said data types to data posts in said set of user data.

5. The method according to any one of the preceding claims, wherein determining said modified set of user data comprises:

filtering (S106b) said set of user data based on said policy, thereby removing or changing data posts from said set of user data.

6. The method according to any one of the preceding claims, wherein determining said modified set of user data comprises:

generating (S106c) a computer implemented job to determine said modified set of user data from said set of user data by processing said set of data based on said policy.

7. The method according to claim 6, wherein said computer implemented job is based on said policy and comprises at least one of identification of an anonymization algorithm for determining said modified set of user data from said set of user data, identification of which data posts in said set of user data to be pseudo-anonymized, and identification of which data posts in said set of user data to be removed.

8. The method according to claim 6 or 7, wherein said computer implemented job is given a high priority level for execution.

9. The method according to claim 6, 7, or 8, wherein said computer implemented job is a Hadoop MapReduce job.

10. The method according to any one of the preceding claims, wherein providing said reference comprises:

replacing (S108a) an identification of said set of user data with an identification of said modified set of user data.

11. The method according to any one of the preceding claims, further comprising:

receiving (S110) an indication that access to said modified set of user data by said application has been terminated; and in response thereto:

deleting (S112) said modified set of user data.

12. The method according to claim 11, wherein said modified set of user data only is deleted if required by said policy for said application.

13. The method according to any one of the preceding claims, wherein said application is configured to access said modified set of user data during execution of said application.

14. The method according to any one of the preceding claims, wherein said policy depends on whether said application is an internal core application, an internal non-core application, or an external application in relation to said set of user data.

15. The method according to any one of the preceding claims, wherein said user data is network data and relates to any of session records, location of portable wireless devices, radio access network node data, and charging data.

16. A method for policy based pre-processing of user data, the method being performed by a job management node (31), the method comprising the steps of:

receiving (S202) a request from a client (11) to execute an application which needs access to a set of user data (15), wherein said request comprises an identity of said application;

forwarding (S204) said request to a pre-processing node (21); and

receiving (S206) a reference to a modified set of user data (16) of said set of user data from said pre-processing node.

17. The method according to claim 16, further comprising:

providing (S208) instructions to a task processing node (13) to execute said application, wherein said instructions comprise said reference to said modified set of user data.

18. The method according to claim 17, further comprising:

receiving (S210) a result of said execution from said task processing node.

19. The method according to claim 18, further comprising:

providing (S212) said result to said client.

20. The method according to claim 19, further comprising:

providing (S214) an indication to said pre-processing node that access to said modified set of user data by said application has been terminated.

21. The method according to any one of claims 16 to 20, wherein said request comprises at least one of a job Java Archive, JAR, file, a configuration file, input specifications, and output specifications.

22. The method according to any one of claims 16 to 21, wherein said request comprises a Hadoop MapReduce job.

23. A pre-processing node (21) for policy based pre-processing of user data, the pre-processing node comprising a processing unit (22) configured to:

receive a request from a job management node (31) for a client (11) to access a set of user data (15), wherein said request comprises an identity of an application;

acquire a policy for said application based on said identity;

determine a modified set of user data (16) from said set of user data based on said policy; and

provide a reference to said modified set of user data to said job management node.

24. A job management node (31) for policy based pre-processing of user data, the job management node comprising a processing unit (32) configured to:

receive a request from a client (11) to execute an application which needs access to a set of user data (15), wherein said request comprises an identity of said application;

forward said request to a pre-processing node (21); and

receive a reference to a modified set of user data (16) of said set of user data from said pre-processing node.

25. A computer program (42a) for policy based pre-processing of user data, the computer program comprising computer program code which, when run on a processing unit (22) of a pre-processing node (21) causes the processing unit to:

receive (S102) a request from a job management node (31) for a client (11) to access a set of user data (15), wherein said request comprises an identity of an application;

acquire (S104) a policy for said application based on said identity;

determine (S106) a modified set of user data (16) from said set of user data based on said policy; and

provide (S108) a reference to said modified set of user data to said job management node.

26. A computer program (42b) for policy based pre-processing of user data, the computer program comprising computer program code which, when run on a processing unit (32) of a job management node (31) causes the processing unit to:

receive (S202) a request from a client (11) to execute an application which needs access to a set of user data (15), wherein said request comprises an identity of said application;

forward (S204) said request to a pre-processing node (21); and

receive (S206) a reference to a modified set of user data (16) of said set of user data from said pre-processing node.

27. A computer program product (41a, 41b) comprising a computer program (42a, 42b) according to at least one of claims 25 and 26, and a computer readable means (43) on which the computer program is stored.