CN110019012B

CN110019012B - Data preprocessing method, data preprocessing device and computer-readable storage medium

Info

Publication number: CN110019012B
Application number: CN201711245143.9A
Authority: CN
Inventors: 马怡安; 陆绪海; 杨迪; 王铮
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2021-05-11
Anticipated expiration: 2037-12-01
Also published as: CN110019012A

Abstract

The disclosure provides a data preprocessing method, a data preprocessing device and a computer-readable storage medium, and relates to the technical field of big data. The data preprocessing method comprises: acquiring Hypertext Transfer Protocol (HTTP) data; acquiring UserAgent field information in the HTTP data; acquiring an identifier associated with the UserAgent field information, wherein the length of the identifier is smaller than that of the UserAgent field information; and replacing the UserAgent field information with the identifier to reduce the space occupied by the HTTP data. With this method, the UserAgent field can be replaced by a shorter identifier, so that the storage space required for the UserAgent field information is compressed, and the burden and cost of big data storage are reduced.

Description

Data preprocessing method, data preprocessing device and computer-readable storage medium

Technical Field

The present disclosure relates to the field of big data technologies, and in particular, to a data preprocessing method, apparatus, and computer-readable storage medium.

Background

After receiving DPI (Deep Packet Inspection) data, a big data platform generally preprocesses it, including operations such as deleting erroneous records, checking the format, and desensitizing, and then stores the data in HDFS (Hadoop Distributed File System). In a big data processing environment, the large volume of DPI data puts great strain on storage space.

Existing networks generally address insufficient storage space by truncating Cookie fields or by linearly expanding clusters. The preprocessing cluster decompresses, decodes, merges, desensitizes, cleans, and formats the data, and then stores it on HDFS.

Disclosure of Invention

The inventors found that each user's UserAgent generally changes little, yet it occupies a large amount of space, 50 to 100 characters, so the storage redundancy is high.

One purpose of the present disclosure is to reduce the storage space occupied by DPI data and thereby reduce data storage cost.

According to an aspect of the present disclosure, a data preprocessing method is provided, including: acquiring HTTP (Hypertext Transfer Protocol) data; acquiring UserAgent field information in the HTTP data; acquiring an identifier associated with the UserAgent field information, wherein the length of the identifier is smaller than that of the UserAgent field information; and replacing the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.

Optionally, acquiring the identifier associated with the UserAgent field information comprises: determining whether an identifier associated with the UserAgent field information exists in the matching data; if such an identifier exists, extracting it from the matching data; and if no such identifier exists, assigning an associated unique identifier to the UserAgent field information and recording the association between the UserAgent field information and the identifier in the matching data.

Optionally, assigning an associated unique identifier to the UserAgent field information includes: determining whether the existing identifiers have reached the capacity of the current identifier length; if the capacity has not been reached, assigning a unique identifier of the same length as the existing identifiers; and if the capacity has been reached, increasing the identifier length by a predetermined granularity and assigning an identifier to the UserAgent field information.

Optionally, the method further comprises: acquiring information of Cookie fields in HTTP data; and arranging the information of the Cookie field into an identity map Id-Mapping format to reduce the space occupation of the HTTP data.

Optionally, arranging the information of the Cookie field into the Id-Mapping format includes: parsing the information of the Cookie field, and determining whether the stored Cookie information includes the same user information and the same URL information as the Cookie field; if both the same user information and the same URL information are included, creating a new timestamp to update the matching user information and URL information in the stored Cookie information; if the same user information is included but the same URL information is not, creating, from the URL information in the Cookie field, new URL information stored with the user information as an index; and if the same user information is not included, creating new stored user information and URL information from the user information and URL information in the Cookie field.

Optionally, the method further comprises: if the information of the Cookie field cannot be parsed successfully, storing the Cookie information in a Cookie table.

With this method, the UserAgent field can be replaced by a shorter identifier, so that the storage space required for the UserAgent field information is compressed, and the burden and cost of big data storage are reduced.

According to another aspect of the present disclosure, a data preprocessing apparatus is provided, including: a data acquisition unit configured to acquire HTTP data; a field information acquisition unit configured to acquire UserAgent field information in the HTTP data; an identifier acquisition unit configured to acquire an identifier associated with the UserAgent field information, the length of the identifier being smaller than that of the UserAgent field information; and a replacement unit configured to replace the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.

Optionally, the field information acquisition unit includes: a judging subunit configured to determine whether an identifier associated with the UserAgent field information exists in the matching data; an identifier extraction subunit configured to extract the identifier associated with the UserAgent field information from the matching data if such an identifier exists; and an identifier allocation subunit configured to assign an associated unique identifier to the UserAgent field information if no such identifier exists, and to record the association between the UserAgent field information and the identifier in the matching data.

Optionally, the identifier allocation subunit is configured to: determine whether the existing identifiers have reached the capacity of the current identifier length; if the capacity has not been reached, assign a unique identifier of the same length as the existing identifiers; and if the capacity has been reached, increase the identifier length by a predetermined granularity and assign an identifier to the UserAgent field information.

Optionally, the field information acquisition unit is further configured to acquire information of a Cookie field in the HTTP data; the data preprocessing apparatus further includes a Cookie information collating unit configured to arrange the information of the Cookie field into an Id-Mapping format so as to reduce the space occupied by the HTTP data.

Optionally, the Cookie information collating unit is configured to: parse the information of the Cookie field, and determine whether the stored Cookie information includes the same user information and the same URL information as the Cookie field; if both the same user information and the same URL information are included, create a new timestamp to update the matching user information and URL information in the stored Cookie information; if the same user information is included but the same URL information is not, create, from the URL information in the Cookie field, new URL information stored with the user information as an index; and if the same user information is not included, create new stored user information and URL information from the user information and URL information in the Cookie field.

Optionally, the Cookie information collating unit is further configured to store the Cookie information in a Cookie table if the information of the Cookie field cannot be parsed successfully.

According to still another aspect of the present disclosure, a data preprocessing apparatus is provided, including: a memory; and a processor coupled to the memory, the processor configured to perform any of the data preprocessing methods above based on instructions stored in the memory.

The device can replace the Useragent field with the identifier with shorter length, thereby compressing the storage space required by storing the Useragent field information and reducing the burden of big data storage and the cost of data storage.

According to yet another aspect of the present disclosure, a computer-readable storage medium is proposed, on which computer program instructions are stored, which instructions, when executed by a processor, implement any of the above data pre-processing methods.

When its instructions are executed, the computer-readable storage medium can replace the UserAgent field with a shorter identifier, so that the storage space required for the UserAgent field information is compressed, and the burden and cost of big data storage are reduced.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:

FIG. 1 is a flow chart of one embodiment of a data preprocessing method of the present disclosure.

Fig. 2 is a flowchart of an embodiment of acquiring a user agent associated identifier in the data preprocessing method according to the present disclosure.

Fig. 3 is a flow chart of another embodiment of a data preprocessing method of the present disclosure.

FIG. 4 is a flow chart of one embodiment of processing Cookie in the data preprocessing method of the present disclosure.

Fig. 5 is a schematic diagram of an embodiment of a data preprocessing apparatus according to the present disclosure.

Fig. 6 is a schematic diagram of another embodiment of a data preprocessing apparatus according to the present disclosure.

Fig. 7 is a schematic diagram of another embodiment of a data preprocessing apparatus according to the present disclosure.

Detailed Description

The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.

A flow diagram of one embodiment of a data preprocessing method of the present disclosure is shown in fig. 1.

In step 101, HTTP data is acquired. In one embodiment, HTTP protocol type data may be retrieved from DPI data.

In step 102, the user agent field information in the HTTP data is acquired. In one embodiment, the user agent field information may be obtained by field segmentation.

In step 103, an identifier associated with the UserAgent field information is acquired, the length of the identifier being smaller than that of the UserAgent field information. In one embodiment, the association between UserAgent field information and identifiers may be stored in a database (e.g., an in-memory database). In one embodiment, a browser identifier, an operating system identifier, an encryption level, a browser language, version information, and the like are extracted from the UserAgent field in the DPI data. Each of these items takes a finite, enumerable set of values, so the number of their combinations is also finite; the correspondence can therefore be stored in the in-memory database, and a short character string can replace the original field.

In step 104, the Useragent field information is replaced with an identification to reduce space usage of the HTTP data.
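Steps 102 to 104 can be illustrated with a minimal Python sketch. The record layout (fields separated by "|" with the UserAgent at index 3), the sample record, and the name `replace_user_agent` are assumptions made for illustration; the disclosure does not specify the DPI record format.

```python
from typing import Dict

# Hypothetical record layout: "|"-separated fields, UserAgent at index 3.
FIELD_SEP = "|"
UA_INDEX = 3


def replace_user_agent(record: str, ua_to_id: Dict[str, str]) -> str:
    """Split the record, look up the identifier, and substitute it."""
    fields = record.split(FIELD_SEP)          # step 102: field segmentation
    user_agent = fields[UA_INDEX]
    identifier = ua_to_id.get(user_agent)     # step 103: look up the identifier
    # Step 104: substitute only when a strictly shorter identifier is known.
    if identifier is not None and len(identifier) < len(user_agent):
        fields[UA_INDEX] = identifier
    return FIELD_SEP.join(fields)


UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
ua_table = {UA: "0101101110"}  # made-up matching data
raw = FIELD_SEP.join(["20171201", "10.0.0.1", "GET /index.html", UA, "200"])
compact = replace_user_agent(raw, ua_table)
```

A record whose UserAgent has no entry in the table is returned unchanged in this sketch; allocating a new identifier for it is a separate concern.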

In one embodiment, the components of the UserAgent field, including the browser identifier, the operating system identifier, the encryption level, the browser language, the rendering engine identifier and the version information, may be encoded as follows:

Browser identifier: a two-digit hexadecimal number, which can represent 256 different browsers in total;

Operating system identifier: a two-digit hexadecimal number, which can represent 256 different operating systems in total;

Encryption level: a one-digit hexadecimal number, which can represent 16 different encryption levels in total;

Browser language: a two-digit hexadecimal number, which can represent 256 different languages in total;

Rendering engine identifier: a one-digit hexadecimal number, which can represent 16 different rendering engines in total;

Version information: a two-digit hexadecimal number, which can represent 256 different versions in total.

The identifier is only 15 characters long, whereas most UserAgent strings stored in existing networks need 50 to 100 characters, so the method in this embodiment can save a large amount of storage space without losing information.
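The component-wise encoding listed above can be sketched as follows. The hexadecimal widths come from the list; the code tables in `CODES`, the component names, and the function `encode` are invented for illustration and are not taken from the disclosure.

```python
from typing import Dict

# Hex-digit widths per component, as listed in the description.
WIDTHS = {"browser": 2, "os": 2, "encryption": 1,
          "language": 2, "engine": 1, "version": 2}

# Made-up code tables; a real system would hold these in its matching data.
CODES = {
    "browser": {"Chrome": 0x01, "Firefox": 0x02},
    "os": {"Windows": 0x01, "Android": 0x02},
    "encryption": {"N": 0x0, "U": 0x1},
    "language": {"zh-CN": 0x01, "en-US": 0x02},
    "engine": {"WebKit": 0x1, "Gecko": 0x2},
    "version": {"62.0": 0x10, "57.0": 0x11},
}


def encode(components: Dict[str, str]) -> str:
    """Concatenate the fixed-width hexadecimal code of every component."""
    parts = []
    for name, width in WIDTHS.items():
        code = CODES[name][components[name]]
        if code >= 16 ** width:
            raise ValueError(f"{name} code does not fit in {width} hex digit(s)")
        parts.append(format(code, f"0{width}x"))
    return "".join(parts)


ua_id = encode({"browser": "Chrome", "os": "Windows", "encryption": "U",
                "language": "zh-CN", "engine": "WebKit", "version": "62.0"})
```

In this sketch the six codes concatenate to ten hexadecimal digits. Because every component has a fixed width, the identifier can be split back into its components by position.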

In one embodiment, when no identifier associated with the UserAgent field information has been stored, a unique identifier can be allocated to the UserAgent field information in real time and stored, so that each kind of UserAgent field information has a unique identifier. The probability of a successful match therefore rises as the method is used, which improves data processing efficiency.

A flowchart of one embodiment of the data preprocessing method of the present disclosure for obtaining the identifier associated with the user agent is shown in fig. 2.

In step 201, it is determined whether the matching data contains an identifier associated with the UserAgent field information. If such an identifier exists, step 202 is executed; if not, step 203 is executed so that a new unique identifier can be assigned to the UserAgent field information.

In step 202, the identity associated with the Useragent field information in the matching data is extracted.

In step 203, it is determined whether the existing identifiers have reached the capacity of the current identifier length. If the capacity has not been reached, step 205 is executed; if it has been reached, step 204 is executed.

In step 204, the identifier length is increased by a predetermined granularity and an identifier is assigned to the UserAgent field information. In one embodiment, it can be analyzed which component of the identifier has exceeded the capacity of its allotted length, for example the number of browser categories exceeding 256, the range of the two hexadecimal digits allotted to it, or the number of rendering engines exceeding 16, the range of the one hexadecimal digit allotted to it.

In step 205, an associated unique identifier is assigned to the UserAgent field information. In one embodiment, it can be analyzed which part or parts of the UserAgent field information have no associated identifier, and an identifier is then assigned only to those parts. For example, if only the operating system has no associated identifier while the other parts of the UserAgent field information match successfully, an identifier is assigned only to the operating system; this identifier and the identifiers of the successfully matched parts together form the identifier of the UserAgent field information. In addition, the identifier assigned to the operating system is stored for use in later matching operations.

By the method, data processing errors caused by insufficient pre-allocated space can be prevented, and the expandability of the system is improved.
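The flow of fig. 2 can be sketched for a single component (for example, the rendering engine) as below. The class name `IdAllocator`, the sequential numbering scheme, and the choice to leave previously issued identifiers at their original length are assumptions for illustration; the disclosure does not state how existing identifiers are treated after the length grows.

```python
from typing import Dict


class IdAllocator:
    """Issues fixed-width hex identifiers and widens them when capacity runs out."""

    def __init__(self, width: int = 2, granularity: int = 1) -> None:
        self.width = width               # current identifier length in hex digits
        self.granularity = granularity   # digits added when capacity is reached
        self.table: Dict[str, str] = {}  # matching data: value -> identifier

    def get_or_assign(self, value: str) -> str:
        # Steps 201/202: return the stored identifier when one exists.
        if value in self.table:
            return self.table[value]
        # Step 203: has the current length been used up?
        if len(self.table) >= 16 ** self.width:
            # Step 204: grow the length by the configured granularity.
            self.width += self.granularity
        # Step 205: assign the next unused number and record the association.
        identifier = format(len(self.table), f"0{self.width}x")
        self.table[value] = identifier
        return identifier


alloc = IdAllocator(width=1)
ids = [alloc.get_or_assign(f"engine-{i}") for i in range(17)]
```

With a starting width of one hexadecimal digit, the first sixteen values receive "0" through "f", and the seventeenth triggers the growth branch and receives a two-digit identifier.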

In one embodiment, in addition to compressing the UserAgent field information, the Cookie field may also be processed. A Cookie is data that a website stores on the user's local terminal in order to identify the user and track the session. A flow chart of another embodiment of the data preprocessing method of the present disclosure is shown in fig. 3.

In step 301, HTTP data is acquired. In one embodiment, HTTP protocol type data may be retrieved from DPI data.

In step 302, information of a Cookie field in HTTP data is acquired.

In step 303, the information of the Cookie field is arranged into an Id-Mapping format to reduce the space occupation amount of the HTTP data. In one embodiment, when data is stored in the Id-Mapping format, URL information accessed by each user may be recorded by using user information (such as a user Id) as an index.

With this method, the information of the Cookie field can be stored in the Id-Mapping format, so that the storage space required for the Cookie field information is compressed and the storage space that the preprocessed DPI data occupies in HDFS is significantly reduced, which in effect lowers cost and improves efficiency. In addition, processing the UserAgent field and the Cookie field in advance makes them easier for subsequent applications to call directly.

In one embodiment, Cookie information stored in the Id-Mapping format may be as follows:

With this method, the URL information related to a user can be recorded with the user information as the index. Compared with storing plain Cookie text, this saves storage space, organizes the data better, and facilitates later analysis applications.
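One possible in-memory shape for such an index is sketched below. It is a hypothetical layout, not the table from the original document: the user identifiers, URLs, timestamps, and the helper `urls_for` are all invented for illustration.

```python
from typing import Dict, List

# Hypothetical Id-Mapping layout: user id -> {URL -> last-seen timestamp}.
id_mapping: Dict[str, Dict[str, int]] = {
    "user-001": {
        "http://example.com/a": 1512086400,
        "http://example.com/b": 1512090000,
    },
    "user-002": {
        "http://example.com/a": 1512093600,
    },
}


def urls_for(user_id: str) -> List[str]:
    """Return the URLs recorded for one user, in sorted order."""
    return sorted(id_mapping.get(user_id, {}))


visited = urls_for("user-001")
```

Keying on the user means a repeated visit to the same URL only overwrites a timestamp instead of adding another copy of the Cookie text.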

A flow diagram of one embodiment of processing cookies in the data pre-processing method of the present disclosure is shown in fig. 4.

In step 401, the information of the Cookie field is parsed. In one embodiment, the user identification included in the Cookie field and the URL information accessed by the user may be parsed out.

In step 402, it is determined whether the parsing is successful. If the analysis is successful, go to step 403; if the parsing is not successful, go to step 408.

In step 403, it is determined whether the stored Cookie information includes the same user information as that in the Cookie field. If yes, go to step 405; if not, go to step 404.

In step 404, new stored user information and URL information are created from the user information and URL information in the Cookie field.

In step 405, it is determined whether the stored Cookie information includes the same URL information as that in the Cookie field. If yes, go to step 407; if not, go to step 406.

In step 406, the URL information stored with the user information as an index is newly created according to the URL information in the Cookie field.

In step 407, a new timestamp is created to update the matching user information and URL information in the stored Cookie information.

In step 408, the Cookie information is stored in a Cookie table to avoid data loss due to parsing failure.

With this method, Cookie data is stored in the Id-Mapping format, and the storage space can be compressed further by updating the timestamp of existing entries instead of adding new ones. In addition, data that fails to parse is still stored, which avoids data loss.
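The flow of fig. 4 can be sketched as follows. The Cookie syntax accepted by `parse_cookie` (a `uid=` pair and a `url=` pair separated by ";") is an assumption for illustration, as are the function names and the use of a plain list for the Cookie table.

```python
from typing import Dict, List, Optional, Tuple


def parse_cookie(cookie: str) -> Optional[Tuple[str, str]]:
    """Hypothetical parser: expects 'uid=<user>; url=<url>', else returns None."""
    pairs = dict(p.strip().split("=", 1) for p in cookie.split(";") if "=" in p)
    if "uid" in pairs and "url" in pairs:
        return pairs["uid"], pairs["url"]
    return None


def process_cookie(cookie: str, id_mapping: Dict[str, Dict[str, int]],
                   cookie_table: List[str], now: int) -> None:
    parsed = parse_cookie(cookie)            # steps 401 and 402
    if parsed is None:
        cookie_table.append(cookie)          # step 408: keep the raw Cookie
        return
    user, url = parsed
    if user not in id_mapping:               # step 403, "no" branch
        id_mapping[user] = {url: now}        # step 404: new user and URL
    elif url not in id_mapping[user]:        # step 405, "no" branch
        id_mapping[user][url] = now          # step 406: new URL under the user
    else:
        id_mapping[user][url] = now          # step 407: refresh the timestamp


mapping: Dict[str, Dict[str, int]] = {}
table: List[str] = []
process_cookie("uid=u1; url=http://example.com/a", mapping, table, 100)
process_cookie("uid=u1; url=http://example.com/b", mapping, table, 200)
process_cookie("uid=u1; url=http://example.com/a", mapping, table, 300)
process_cookie("garbled-cookie", mapping, table, 400)
```

With a dictionary, steps 406 and 407 end up as the same assignment; they are kept as separate branches here only to mirror the flow chart.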

In existing networks, after the source DPI data is received, the preprocessing cluster decompresses, decodes, merges, desensitizes, cleans, and formats the data and then stores it directly on HDFS, so the UserAgent and Cookie fields in the DPI data are stored on HDFS unchanged. With the method in the embodiments of the present disclosure, the UserAgent and Cookie information can be specially processed during preprocessing after the source DPI data is received, which reduces the occupied space and relieves pressure on HDFS storage. It also facilitates later analysis and application.
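As a rough worked example of the saving on the UserAgent field alone, the lengths quoted earlier in the description (an identifier of 15 against an original field of 50 to 100) can be plugged into a simple ratio. The function name `saving_ratio` is hypothetical, and the estimate ignores the space taken by the matching data itself.

```python
def saving_ratio(original_len: int, identifier_len: int) -> float:
    """Fraction of the field's space saved by storing the identifier instead."""
    return 1 - identifier_len / original_len


short_ua_saving = saving_ratio(50, 15)    # shortest quoted UserAgent length
long_ua_saving = saving_ratio(100, 15)    # longest quoted UserAgent length
```

Under these assumptions the per-field saving lies between about 70% and 85%.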

A schematic diagram of one embodiment of a data preprocessing apparatus of the present disclosure is shown in fig. 5. The data acquisition unit 501 can acquire HTTP data. In one embodiment, HTTP protocol type data may be retrieved from DPI data. The field information acquisition unit 502 can acquire the UserAgent field information in the HTTP data. In one embodiment, the UserAgent field information may be obtained by field segmentation. The identifier acquisition unit 503 can acquire an identifier associated with the UserAgent field information, the length of the identifier being smaller than that of the UserAgent field information. The replacement unit 504 can replace the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.

The data preprocessing device can replace the Useragent field with the identifier with shorter length, thereby compressing the storage space required by storing the Useragent field information and reducing the burden of big data storage and the cost of data storage.
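The cooperation of units 501 to 504 can be sketched as one class with a method per unit. Representing records as dictionaries with `protocol` and `user_agent` keys is an assumption for illustration, and allocation of identifiers for unseen UserAgent values is left out of this sketch.

```python
from typing import Dict, List

Record = Dict[str, str]


class DataPreprocessingDevice:
    def __init__(self, matching_data: Dict[str, str]) -> None:
        self.matching_data = matching_data  # UserAgent -> identifier

    def acquire_http(self, dpi_records: List[Record]) -> List[Record]:
        # Data acquisition unit 501: keep only HTTP records.
        return [r for r in dpi_records if r.get("protocol") == "HTTP"]

    def acquire_field(self, record: Record) -> str:
        # Field information acquisition unit 502.
        return record["user_agent"]

    def acquire_identifier(self, user_agent: str) -> str:
        # Identifier acquisition unit 503 (lookup only in this sketch).
        return self.matching_data[user_agent]

    def replace(self, record: Record, identifier: str) -> Record:
        # Replacement unit 504: return a copy with the field substituted.
        return {**record, "user_agent": identifier}

    def run(self, dpi_records: List[Record]) -> List[Record]:
        out = []
        for record in self.acquire_http(dpi_records):
            user_agent = self.acquire_field(record)
            out.append(self.replace(record, self.acquire_identifier(user_agent)))
        return out


device = DataPreprocessingDevice({"Mozilla/5.0 (X11; Linux x86_64)": "0a1"})
result = device.run([
    {"protocol": "HTTP", "user_agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    {"protocol": "DNS", "user_agent": ""},
])
```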

In one embodiment, the field information acquiring unit 502 may include a judging subunit, an identification extracting subunit, and an identification allocating subunit. The judging subunit can judge whether the matching data has an identifier associated with the Useragent field information; the identification extracting subunit can extract the identification associated with the Useragent field information in the matching data under the condition that the identification associated with the Useragent field information exists; the identification allocation subunit can allocate the associated unique identification to the Useragent field information under the condition that the identification associated with the Useragent field information does not exist, and record the association relationship between the Useragent field information and the identification in the matching data.

The data preprocessing device can allocate unique identification for the user agent field information in real time and store the unique identification, so that each type of user agent field information can have the unique identification, the successful matching probability can be improved along with application, and the data processing efficiency is improved.

In one embodiment, the identifier allocation subunit can further determine whether the existing identifiers have reached the capacity of the current identifier length; if the capacity has not been reached, it assigns the UserAgent field information a unique identifier of the same length as the existing identifiers; and if the capacity has been reached, it increases the identifier length by a predetermined granularity and assigns an identifier to the UserAgent field information. The device can thus prevent data processing errors caused by insufficient pre-allocated space and improves the scalability of the system.

In one embodiment, the field information acquisition unit 502 can also acquire information of a Cookie field in the HTTP data. As shown in fig. 5, the data preprocessing device may further include a Cookie information collating unit 505, which can arrange the information of the Cookie field into an Id-Mapping format to reduce the space occupied by the HTTP data. In one embodiment, when data is stored in the Id-Mapping format, the URL information accessed by each user may be recorded with user information (such as a user Id) as the index.

The data preprocessing device can store the information of the Cookie field in the Id-Mapping format, thereby compressing the storage space required for the Cookie field information and reducing the burden and cost of big data storage.

A schematic structural diagram of an embodiment of the data preprocessing apparatus of the present disclosure is shown in fig. 6. The data preprocessing device includes a memory 601 and a processor 602. The memory 601 may be a magnetic disk, flash memory, or any other non-volatile storage medium, and stores the instructions of the corresponding embodiments of the data preprocessing method above. The processor 602 is coupled to the memory 601 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 602 is configured to execute the instructions stored in the memory, and can thereby reduce the burden and cost of big data storage.

In one embodiment, as shown in fig. 7, the data preprocessing apparatus 700 includes a memory 701 and a processor 702. The processor 702 is coupled to the memory 701 by a bus 703. The data preprocessing apparatus 700 may further be connected to an external storage apparatus 705 through a storage interface 704 in order to call external data, and may further be connected to a network or another computer system (not shown) through a network interface 706. These are not described in detail herein.

In this embodiment, the data instructions are stored in the memory and then executed by the processor, so that the burden and cost of big data storage can be reduced.

In another embodiment, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the corresponding embodiment of the data pre-processing method. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.

The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.

Finally, it should be noted that: the above examples are intended only to illustrate the technical solutions of the present disclosure and not to limit them; although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that: modifications to the specific embodiments of the disclosure or equivalent substitutions for parts of the technical features may still be made; all such modifications are intended to be included within the scope of the claims of this disclosure without departing from the spirit thereof.

Claims

1. A method of data pre-processing, comprising:

acquiring hypertext transfer protocol (HTTP) data;

acquiring user agent field information in the HTTP data;

obtaining an identifier associated with the Useragent field information, including:

judging whether an identifier associated with the Useragent field information exists in the matching data;

if no identifier associated with the UserAgent field information exists, assigning an associated unique identifier to the UserAgent field information, wherein the assigning comprises: determining whether the existing identifiers have reached the capacity of the identifier length; and if the capacity of the identifier length has been reached, increasing the identifier length by a preset granularity and assigning an identifier to the UserAgent field information;

recording the association between the UserAgent field information and the identifier in the matching data; wherein the length of the identifier is smaller than the length of the UserAgent field information;

replacing the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.

2. The method of claim 1, wherein obtaining the identifier associated with the UserAgent field information further comprises:

if an identifier associated with the UserAgent field information exists, extracting the identifier associated with the UserAgent field information from the matching data.

3. The method of claim 2, wherein assigning the associated unique identifier to the UserAgent field information further comprises:

if the capacity limit of the identifier length has not been reached, assigning a unique identifier having the same length as the existing identifiers.

4. The method of claim 1, further comprising:

acquiring information of a Cookie field in the HTTP data; and

collating the information of the Cookie field into an identity-mapping (Id-Mapping) format to reduce the space occupied by the HTTP data, wherein the Id-Mapping format uses a user identifier (ID) as an index and records the URL information accessed by each user.

5. The method of claim 4, wherein collating the information of the Cookie field into the Id-Mapping format comprises:

parsing the information of the Cookie field, and judging whether stored Cookie information includes user information and Uniform Resource Locator (URL) information identical to those in the Cookie field;

if identical user information and identical URL information are both included, creating a new timestamp to update the identical user information and URL information in the stored Cookie information;

if identical user information is included but identical URL information is not, newly creating, according to the URL information in the Cookie field, URL information stored with the user information as an index; and

if identical user information is not included, newly creating stored user information and URL information according to the user information and the URL information in the Cookie field.

6. The method of claim 5, further comprising:

if the information of the Cookie field cannot be parsed successfully, storing the Cookie information in a Cookie table.

7. A data preprocessing apparatus, comprising:

a data acquisition unit configured to acquire hypertext transfer protocol (HTTP) data;

a field information acquiring unit configured to acquire UserAgent field information in the HTTP data;

an identifier obtaining unit configured to obtain an identifier associated with the UserAgent field information, the length of the identifier being smaller than the length of the UserAgent field information, the identifier obtaining unit comprising:

a judging subunit configured to judge whether an identifier associated with the UserAgent field information exists in matching data; and

an identifier assigning subunit configured to assign an associated unique identifier to the UserAgent field information if no identifier associated with the UserAgent field information exists, including: judging whether the existing identifiers have reached the capacity limit of the current identifier length; and if the capacity limit of the identifier length has been reached, increasing the identifier length by a preset granularity and assigning an identifier to the UserAgent field information; the identifier assigning subunit being further configured to record the association between the UserAgent field information and the identifier in the matching data; and

a replacement unit configured to replace the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.

8. The apparatus of claim 7, wherein the field information acquiring unit further comprises:

an identifier extraction subunit configured to extract the identifier associated with the UserAgent field information from the matching data if an identifier associated with the UserAgent field information exists.

9. The apparatus of claim 8, wherein the identification assignment subunit is further configured to:

10. The apparatus of claim 7, wherein,

the field information acquiring unit is further configured to acquire information of a Cookie field in the HTTP data; and

the apparatus further comprises:

a Cookie information collating unit configured to collate the information of the Cookie field into an identity-mapping (Id-Mapping) format to reduce the space occupied by the HTTP data, wherein the Id-Mapping format uses a user identifier (ID) as an index and records the URL information accessed by each user.

11. The apparatus of claim 10, wherein the Cookie information collating unit is configured to:

12. The apparatus of claim 11, the Cookie information collating unit further configured to:

13. A data preprocessing apparatus, comprising:

a memory; and

a processor coupled to the memory, the processor configured to perform the method of any of claims 1-6 based on instructions stored in the memory.

14. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 6.
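
By way of non-limiting illustration only, the Cookie collation recited in claims 4 to 6 could be sketched in Python as follows. The Cookie layout (`uid=...; url=...`), the key names `uid` and `url`, the function names, and the in-memory dictionary and list standing in for the Id-Mapping store and the Cookie table are all assumptions made for this sketch; the disclosure does not fix a Cookie syntax or a storage backend.

```python
import time


def parse_cookie(cookie):
    """Return (user_id, url), or None when the field cannot be parsed.

    Assumes a hypothetical "uid=...; url=..." layout.
    """
    pairs = {}
    for part in cookie.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep or not key:
            return None
        pairs[key] = value
    if not pairs.get("uid") or not pairs.get("url"):
        return None
    return pairs["uid"], pairs["url"]


def collate_cookie(cookie, id_mapping, cookie_table, now=None):
    """Fold one Cookie field into id_mapping: {user_id: {url: timestamp}}."""
    timestamp = time.time() if now is None else now
    parsed = parse_cookie(cookie)
    if parsed is None:
        # Parsing failed: keep the raw Cookie information in the Cookie table.
        cookie_table.append(cookie)
        return "unparsed"
    user_id, url = parsed
    urls = id_mapping.get(user_id)
    if urls is None:
        # Unknown user: create both the user entry and the URL entry.
        id_mapping[user_id] = {url: timestamp}
        return "new_user"
    if url in urls:
        # Same user and same URL: only the timestamp is refreshed.
        urls[url] = timestamp
        return "updated"
    # Known user, new URL: add the URL under the existing user index.
    urls[url] = timestamp
    return "new_url"
```

Because repeated visits by the same user to the same URL only refresh a timestamp rather than append a new record, the stored volume in this sketch grows with the number of distinct user and URL pairs rather than with the number of HTTP records.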