CN110019012B - Data preprocessing method, data preprocessing device and computer-readable storage medium - Google Patents

Publication number
CN110019012B
Authority
CN
China
Prior art keywords
information
field
cookie
identifier
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711245143.9A
Other languages
Chinese (zh)
Other versions
CN110019012A (en)
Inventor
马怡安
陆绪海
杨迪
王铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN201711245143.9A
Publication of CN110019012A
Application granted
Publication of CN110019012B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The disclosure provides a data preprocessing method, a data preprocessing device, and a computer-readable storage medium, and relates to the technical field of big data. The data preprocessing method comprises the following steps: acquiring Hypertext Transfer Protocol (HTTP) data; acquiring UserAgent field information in the HTTP data; acquiring an identifier associated with the UserAgent field information, wherein the length of the identifier is smaller than that of the UserAgent field information; and replacing the UserAgent field information with the identifier to reduce the space occupied by the HTTP data. In this way, the UserAgent field can be replaced by a shorter identifier, so that the storage space required for the UserAgent field information is compressed, and the burden and cost of big data storage are reduced.

Description

Data preprocessing method, data preprocessing device and computer-readable storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data preprocessing method, apparatus, and computer-readable storage medium.
Background
After receiving DPI (Deep Packet Inspection) data, a big data platform generally preprocesses it, including operations such as deleting erroneous records, checking the format, and desensitizing, and then stores the data in HDFS (Hadoop Distributed File System). In a big data processing environment, the large volume of DPI data puts great pressure on storage space.
Existing networks generally address insufficient storage space by truncating Cookie fields or by linearly scaling out the preprocessing cluster. The preprocessing cluster decompresses, decodes, merges, desensitizes, cleans, and formats the data, and then stores it on HDFS.
Disclosure of Invention
The inventors found that the UserAgent of each user generally changes little, yet it occupies a large amount of space, 50 to 100 characters, so the stored data is highly redundant.
One purpose of this disclosure is to reduce the storage space occupied by DPI data and thereby reduce data storage cost.
According to an aspect of the present disclosure, a data preprocessing method is provided, including: acquiring HTTP (Hypertext Transfer Protocol) data; acquiring UserAgent field information in the HTTP data; acquiring an identifier associated with the UserAgent field information, wherein the length of the identifier is smaller than that of the UserAgent field information; and replacing the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.
Optionally, acquiring the identifier associated with the UserAgent field information comprises: judging whether an identifier associated with the UserAgent field information exists in the matching data; if such an identifier exists, extracting it from the matching data; and if no such identifier exists, assigning an associated unique identifier to the UserAgent field information and recording the association between the UserAgent field information and the identifier in the matching data.
Optionally, assigning an associated unique identifier to the UserAgent field information includes: judging whether the existing identifiers have reached the capacity of the current identifier length; if the capacity has not been reached, assigning a unique identifier of the same length as the existing identifiers; and if the capacity has been reached, increasing the identifier length by a predetermined granularity and then assigning an identifier to the UserAgent field information.
Optionally, the method further comprises: acquiring the information of the Cookie field in the HTTP data; and organizing the information of the Cookie field into an identity mapping (Id-Mapping) format to reduce the space occupied by the HTTP data.
Optionally, organizing the information of the Cookie field into the Id-Mapping format includes: parsing the information of the Cookie field, and judging whether the stored Cookie information includes the same user information and the same URL information as the Cookie field; if both the same user information and the same URL information are included, creating a new timestamp to update the matching user information and URL information in the stored Cookie information; if the same user information is included but the same URL information is not, creating a new URL entry, indexed by the user information, from the URL information in the Cookie field; and if the same user information is not included, creating new stored user information and URL information from the user information and URL information in the Cookie field.
Optionally, the method further comprises: if the information of the Cookie field cannot be parsed successfully, storing the Cookie information in a Cookie table.
By this method, the UserAgent field can be replaced by a shorter identifier, so that the storage space required for the UserAgent field information is compressed, and the burden and cost of big data storage are reduced.
According to another aspect of the present disclosure, a data preprocessing apparatus is provided, including: a data acquisition unit configured to acquire HTTP data; a field information acquisition unit configured to acquire UserAgent field information in the HTTP data; an identifier acquisition unit configured to acquire an identifier associated with the UserAgent field information, the length of the identifier being smaller than that of the UserAgent field information; and a replacement unit configured to replace the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.
Optionally, the field information acquisition unit includes: a judging subunit configured to judge whether an identifier associated with the UserAgent field information exists in the matching data; an identifier extraction subunit configured to extract the identifier associated with the UserAgent field information from the matching data if such an identifier exists; and an identifier allocation subunit configured to assign an associated unique identifier to the UserAgent field information if no such identifier exists, and to record the association between the UserAgent field information and the identifier in the matching data.
Optionally, the identifier allocation subunit is configured to: judge whether the existing identifiers have reached the capacity of the current identifier length; if the capacity has not been reached, assign a unique identifier of the same length as the existing identifiers; and if the capacity has been reached, increase the identifier length by a predetermined granularity and then assign an identifier to the UserAgent field information.
Optionally, the field information acquisition unit is further configured to acquire the information of the Cookie field in the HTTP data; and the data preprocessing apparatus further includes a Cookie information organizing unit configured to organize the information of the Cookie field into the Id-Mapping format to reduce the space occupied by the HTTP data.
Optionally, the Cookie information organizing unit is configured to: parse the information of the Cookie field, and judge whether the stored Cookie information includes the same user information and the same URL information as the Cookie field; if both the same user information and the same URL information are included, create a new timestamp to update the matching user information and URL information in the stored Cookie information; if the same user information is included but the same URL information is not, create a new URL entry, indexed by the user information, from the URL information in the Cookie field; and if the same user information is not included, create new stored user information and URL information from the user information and URL information in the Cookie field.
Optionally, the Cookie information organizing unit is further configured to: if the information of the Cookie field cannot be parsed successfully, store the Cookie information in a Cookie table.
According to still another aspect of the present disclosure, a data preprocessing apparatus is provided, including: a memory; and a processor coupled to the memory, the processor configured to perform any of the data preprocessing methods above based on instructions stored in the memory.
The device can replace the UserAgent field with a shorter identifier, thereby compressing the storage space required for the UserAgent field information and reducing the burden and cost of big data storage.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored; when executed by a processor, the instructions implement any of the above data preprocessing methods.
When the stored instructions are executed, the UserAgent field can be replaced with a shorter identifier, so that the storage space required for the UserAgent field information is compressed, and the burden and cost of big data storage are reduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure. In the drawings:
Fig. 1 is a flowchart of one embodiment of a data preprocessing method of the present disclosure.
Fig. 2 is a flowchart of an embodiment of acquiring a user agent associated identifier in the data preprocessing method according to the present disclosure.
Fig. 3 is a flow chart of another embodiment of a data preprocessing method of the present disclosure.
Fig. 4 is a flowchart of one embodiment of processing Cookies in the data preprocessing method of the present disclosure.
Fig. 5 is a schematic diagram of an embodiment of a data preprocessing apparatus according to the present disclosure.
Fig. 6 is a schematic diagram of another embodiment of a data preprocessing apparatus according to the present disclosure.
Fig. 7 is a schematic diagram of another embodiment of a data preprocessing apparatus according to the present disclosure.
Detailed Description
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
A flow diagram of one embodiment of a data preprocessing method of the present disclosure is shown in fig. 1.
In step 101, HTTP data is acquired. In one embodiment, HTTP protocol type data may be retrieved from DPI data.
In step 102, the user agent field information in the HTTP data is acquired. In one embodiment, the user agent field information may be obtained by field segmentation.
In step 103, an identifier associated with the UserAgent field information is obtained; the length of the identifier is smaller than that of the UserAgent field information. In one embodiment, the association between UserAgent field information and identifiers may be stored in a database (e.g., an in-memory database). In one embodiment, the browser identifier, operating system identifier, encryption level, browser language, version information, and the like are distinguished within the UserAgent field of the DPI data. Each of these items takes values from a finite, enumerable set, and the number of their combinations is likewise finite, so the correspondence can be stored in the in-memory database and a short character string can be used in place of the original field.
In step 104, the UserAgent field information is replaced with the identifier to reduce the space occupied by the HTTP data.
By this method, the UserAgent field can be replaced by a shorter identifier, so that the storage space required for the UserAgent field information is compressed, and the burden and cost of big data storage are reduced.
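Steps 101 to 104 can be illustrated with a minimal Python sketch. It is not the patented implementation: the pipe-delimited record layout, the position of the UserAgent field, the sequential four-digit hexadecimal identifiers, and the names `ua_to_id`, `get_or_assign_id`, and `preprocess` are all assumptions made for illustration, and a plain dictionary stands in for the in-memory database.

```python
from typing import Dict, List

# Hypothetical stand-in for the in-memory "matching data":
# UserAgent string -> short identifier.
ua_to_id: Dict[str, str] = {}


def get_or_assign_id(user_agent: str) -> str:
    """Return the identifier for a UserAgent string, assigning one if absent."""
    if user_agent not in ua_to_id:
        # Assumption: identifiers are sequential 4-digit hexadecimal strings.
        ua_to_id[user_agent] = format(len(ua_to_id), "04x")
    return ua_to_id[user_agent]


def preprocess(record: str, sep: str = "|", ua_index: int = 2) -> str:
    """Split a record into fields and replace the UserAgent field (step 104)."""
    fields: List[str] = record.split(sep)          # field segmentation (step 102)
    fields[ua_index] = get_or_assign_id(fields[ua_index])  # lookup (step 103)
    return sep.join(fields)
```

Because the mapping is kept, the original UserAgent string can be recovered from the identifier, so the replacement does not lose information.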
In one embodiment, the components of the UserAgent field, namely the browser identifier, the operating system identifier, the encryption level, the browser language, the rendering engine identifier, and the version information, may be encoded as follows:
Browser identifier: a two-digit hexadecimal number, which can represent 256 different browsers;
Operating system identifier: a two-digit hexadecimal number, which can represent 256 different operating systems;
Encryption level: a one-digit hexadecimal number, which can represent 16 different encryption levels;
Browser language: a two-digit hexadecimal number, which can represent 256 different languages;
Rendering engine identifier: a one-digit hexadecimal number, which can represent 16 different rendering engines;
Version information: a two-digit hexadecimal number, which can represent 256 different versions.
The identifier field is only 15 characters long, whereas most UserAgent strings stored in existing networks need 50 to 100 characters, so the method in this embodiment can save a large amount of storage space without losing information.
In one embodiment, when no identifier associated with the UserAgent field information has been stored, a unique identifier can be assigned to the UserAgent field information in real time and stored. Each kind of UserAgent field information thus obtains a unique identifier, the probability of a successful match rises as the system is used, and data processing efficiency improves.
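The field widths listed above can be sketched as a small encoder and decoder. This is only an illustration under stated assumptions: the text does not say how the per-component codes are joined, so the sketch simply concatenates the hexadecimal digits in the listed order, and the names `FIELDS`, `encode`, and `decode`, as well as the numeric codes in the test values, are hypothetical.

```python
from typing import Dict, Tuple

# (component name, width in hexadecimal digits), following the widths above.
FIELDS: Tuple[Tuple[str, int], ...] = (
    ("browser", 2), ("os", 2), ("encryption", 1),
    ("language", 2), ("engine", 1), ("version", 2),
)


def encode(codes: Dict[str, int]) -> str:
    """Concatenate fixed-width hexadecimal codes (assumed layout, no separators)."""
    parts = []
    for name, width in FIELDS:
        value = codes[name]
        if not 0 <= value < 16 ** width:
            # The component no longer fits in its allotted hexadecimal digits.
            raise ValueError(f"{name} code {value} does not fit in {width} hex digit(s)")
        parts.append(format(value, f"0{width}x"))
    return "".join(parts)


def decode(identifier: str) -> Dict[str, int]:
    """Split an identifier produced by encode() back into component codes."""
    codes: Dict[str, int] = {}
    pos = 0
    for name, width in FIELDS:
        codes[name] = int(identifier[pos:pos + width], 16)
        pos += width
    return codes
```

The `ValueError` branch corresponds to the situation in which a component exceeds the range allotted to it, which the disclosure handles by lengthening the identifier.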
A flowchart of one embodiment of the data preprocessing method of the present disclosure for obtaining the identifier associated with the user agent is shown in fig. 2.
In step 201, it is judged whether the matching data contains an identifier associated with the UserAgent field information. If so, step 202 is executed; if not, step 203 is executed in order to assign a new unique identifier to the UserAgent field information.
In step 202, the identifier associated with the UserAgent field information is extracted from the matching data.
In step 203, it is determined whether the existing identifiers have reached the capacity of the current identifier length. If the capacity has not been reached, step 205 is executed; if it has been reached, step 204 is executed.
In step 204, the identifier length is increased by the predetermined granularity, and an identifier is then assigned to the UserAgent field information. In one embodiment, it can be analyzed which component of the UserAgent field information has exceeded the capacity of its identifier length, for example when the number of browser categories exceeds 256 and no longer fits the two hexadecimal digits allotted to it, or when the number of rendering engines exceeds 16 and no longer fits its single hexadecimal digit.
In step 205, an associated unique identifier is assigned to the UserAgent field information. In one embodiment, it can be analyzed which component or components of the UserAgent field information have no associated identifier, and identifiers are then assigned only to those components. For example, if only the operating system has no associated identifier while the other components of the UserAgent field information match successfully, an identifier is assigned only to the operating system, and it is combined with the identifiers of the successfully matched components to form the identifier of the UserAgent field information. The identifier assigned to the operating system is also stored for use in later matching.
By the method, data processing errors caused by insufficient pre-allocated space can be prevented, and the expandability of the system is improved.
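The lookup-or-assign flow of steps 201 to 205 for a single component can be sketched as follows. This is a hedged illustration, not the disclosed implementation: the class name `IdAllocator`, the sequential numbering, and the default width and granularity (both counted in hexadecimal digits) are assumptions.

```python
from typing import Dict


class IdAllocator:
    """Assigns hexadecimal identifiers and widens them when capacity is reached."""

    def __init__(self, width: int = 2, granularity: int = 1) -> None:
        self.width = width              # current identifier length in hex digits
        self.granularity = granularity  # digits added when capacity is reached
        self.table: Dict[str, str] = {}  # matching data: value -> identifier

    def get(self, value: str) -> str:
        if value in self.table:                    # steps 201-202: already known
            return self.table[value]
        if len(self.table) >= 16 ** self.width:    # step 203: capacity reached?
            self.width += self.granularity         # step 204: lengthen identifier
        ident = format(len(self.table), f"0{self.width}x")  # step 205: assign
        self.table[value] = ident                  # record the association
        return ident
```

In this sketch, identifiers issued before a widening keep their original length. A real system that concatenates component codes would also need a rule for telling old and new lengths apart when decoding, which the sketch does not address.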
In one embodiment, in addition to compressing the UserAgent field information, the Cookie field may also be processed. A Cookie is data that a website stores on the user's local terminal in order to identify the user and track sessions. A flowchart of another embodiment of the data preprocessing method of the present disclosure is shown in fig. 3.
In step 301, HTTP data is acquired. In one embodiment, HTTP protocol type data may be retrieved from DPI data.
In step 302, information of a Cookie field in HTTP data is acquired.
In step 303, the information of the Cookie field is arranged into an Id-Mapping format to reduce the space occupation amount of the HTTP data. In one embodiment, when data is stored in the Id-Mapping format, URL information accessed by each user may be recorded by using user information (such as a user Id) as an index.
By this method, the information of the Cookie field can be stored in the Id-Mapping format, so that the storage space required for the Cookie field information is compressed and the space that the preprocessed DPI data occupies in HDFS is significantly reduced, which in effect lowers cost and improves efficiency. In addition, because the UserAgent field and the Cookie field are processed in advance, subsequent applications can use them directly.
In one embodiment, Cookie information stored in the Id-Mapping format may be as follows:
[Figures BDA0001490594110000071 and BDA0001490594110000081: table of Cookie information in Id-Mapping format; images not reproduced]
By this method, the URL information related to each user can be recorded with the user information as the index. Compared with storing the Cookie as plain text, this saves storage space, makes the data better organized, and facilitates later analysis and application.
A flow diagram of one embodiment of processing cookies in the data pre-processing method of the present disclosure is shown in fig. 4.
In step 401, the information of the Cookie field is parsed. In one embodiment, the user identification included in the Cookie field and the URL information accessed by the user may be parsed out.
In step 402, it is determined whether the parsing is successful. If the analysis is successful, go to step 403; if the parsing is not successful, go to step 408.
In step 403, it is determined whether the stored Cookie information includes the same user information as that in the Cookie field. If yes, go to step 405; if not, go to step 404.
In step 404, newly creating stored user information and URL information according to the URL information and the user information in the Cookie field.
In step 405, it is determined whether the stored Cookie information includes the same URL information as that in the Cookie field. If yes, go to step 407; if not, go to step 406.
In step 406, the URL information stored with the user information as an index is newly created according to the URL information in the Cookie field.
In step 407, a new timestamp is created to update the matching user information and URL information in the stored Cookie information.
In step 408, the Cookie information is stored in a Cookie table to avoid data loss caused by a parsing failure.
By this method, Cookie data is stored in the Id-Mapping format, and storage usage can be further reduced by updating the timestamp of an existing entry instead of storing a duplicate. In addition, data that fails to parse is still stored, which avoids data loss.
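The branching in steps 401 to 408 can be sketched in Python as shown below. The Cookie layout (`uid=<user>; url=<url>`), the parser, the nested-dictionary form of the Id-Mapping store, and the names `id_mapping`, `cookie_table`, `parse_cookie`, and `process_cookie` are all assumptions for illustration; the disclosure does not specify how the user information and URL are encoded in the Cookie field.

```python
import time
from typing import Dict, List, Optional, Tuple

# Assumed Id-Mapping store: user -> {url: last-seen timestamp}.
id_mapping: Dict[str, Dict[str, float]] = {}
# Raw Cookies that could not be parsed (the "Cookie table" of step 408).
cookie_table: List[str] = []


def parse_cookie(cookie: str) -> Optional[Tuple[str, str]]:
    """Hypothetical parser that expects 'uid=<user>; url=<url>'."""
    pairs = dict(p.strip().split("=", 1) for p in cookie.split(";") if "=" in p)
    user, url = pairs.get("uid"), pairs.get("url")
    return (user, url) if user and url else None


def process_cookie(cookie: str, now: Optional[float] = None) -> str:
    """Apply steps 401-408 to one Cookie and report which branch was taken."""
    now = time.time() if now is None else now
    parsed = parse_cookie(cookie)               # step 401
    if parsed is None:                          # step 402 fails -> step 408
        cookie_table.append(cookie)
        return "unparsed"
    user, url = parsed
    if user not in id_mapping:                  # step 403 -> step 404
        id_mapping[user] = {url: now}
        return "new_user"
    if url not in id_mapping[user]:             # step 405 -> step 406
        id_mapping[user][url] = now
        return "new_url"
    id_mapping[user][url] = now                 # step 407: refresh timestamp only
    return "timestamp_updated"
```

A repeated visit by the same user to the same URL only overwrites a timestamp, which is where the additional space saving described above comes from.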
In existing networks, after the source DPI data is received, the preprocessing cluster decompresses, decodes, merges, desensitizes, cleans, and formats the data and then stores it directly on HDFS, so the UserAgent and Cookie fields in the DPI data are stored on HDFS unchanged. With the method in the embodiments of the present disclosure, the UserAgent and Cookie information can be specially processed during preprocessing after the source DPI data is received, which reduces the space they occupy and relieves the pressure on HDFS storage. It also facilitates later analysis and application.
A schematic diagram of one embodiment of a data preprocessing apparatus of the present disclosure is shown in fig. 5. The data acquisition unit 501 can acquire HTTP data. In one embodiment, HTTP protocol type data may be retrieved from DPI data. The field information acquisition unit 502 can acquire the user agent field information in the HTTP data. In one embodiment, the user agent field information may be obtained by field segmentation. The identifier obtaining unit 503 can obtain an identifier associated with the user agent field information, and the length of the identifier is smaller than that of the user agent field information. The replacement unit 504 can replace the usergent field information with the identification to reduce space occupation of the HTTP data.
The data preprocessing device can replace the Useragent field with the identifier with shorter length, thereby compressing the storage space required by storing the Useragent field information and reducing the burden of big data storage and the cost of data storage.
In one embodiment, the field information acquiring unit 502 may include a judging subunit, an identification extracting subunit, and an identification allocating subunit. The judging subunit can judge whether the matching data has an identifier associated with the Useragent field information; the identification extracting subunit can extract the identification associated with the Useragent field information in the matching data under the condition that the identification associated with the Useragent field information exists; the identification allocation subunit can allocate the associated unique identification to the Useragent field information under the condition that the identification associated with the Useragent field information does not exist, and record the association relationship between the Useragent field information and the identification in the matching data.
The data preprocessing device can assign a unique identifier to UserAgent field information in real time and store it. Each kind of UserAgent field information thus obtains a unique identifier, the probability of a successful match rises as the system is used, and data processing efficiency improves.
In one embodiment, the identifier allocation subunit can further determine whether the existing identifiers have reached the capacity of the current identifier length; if the capacity has not been reached, it assigns to the UserAgent field information a unique identifier of the same length as the existing identifiers; and if the capacity has been reached, it increases the identifier length by the predetermined granularity and then assigns an identifier to the UserAgent field information. The device can thus prevent data processing errors caused by insufficient pre-allocated space and improves the scalability of the system.
In one embodiment, the field information obtaining unit 502 can also obtain information of a Cookie field in HTTP data. As shown in fig. 5, the data preprocessing device may further include a Cookie information sorting unit 505, which is capable of sorting information of a Cookie field into an Id-Mapping format to reduce a space occupation amount of HTTP data. In one embodiment, when data is stored in the Id-Mapping format, URL information accessed by each user may be recorded by using user information (such as a user Id) as an index.
The data preprocessing device can store the information of the Cookie field in the Id-Mapping format, thereby compressing the storage space required for the Cookie field information and reducing the burden and cost of big data storage.
A schematic structural diagram of an embodiment of the data preprocessing apparatus of the present disclosure is shown in fig. 6. The data preprocessing device includes a memory 601 and a processor 602. The memory 601 may be a magnetic disk, flash memory, or any other non-volatile storage medium, and stores the instructions of the corresponding embodiments of the data preprocessing method above. The processor 602 is coupled to the memory 601 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 602 is configured to execute the instructions stored in the memory, which can reduce the burden and cost of big data storage.
In one embodiment, as shown in fig. 7, the data preprocessing apparatus 700 includes a memory 701 and a processor 702. The processor 702 is coupled to the memory 701 by a bus 703. The data preprocessing apparatus 700 may also be connected to an external storage device 705 through a storage interface 704 in order to access external data, and to a network or another computer system (not shown) through a network interface 706. These are not described in detail herein.
In this embodiment, the instructions are stored in the memory and executed by the processor, so that the burden and cost of big data storage can be reduced.
In another embodiment, a computer-readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the corresponding embodiment of the data pre-processing method. As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the specific embodiments may still be modified, or some of their technical features replaced by equivalents; all such modifications and substitutions that do not depart from the spirit of the technical solutions of the present disclosure are intended to fall within the scope of its claims.
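As one illustration of a software implementation, the UserAgent replacement recited in claims 1 to 3 below can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the class name `UserAgentTable`, the function `replace_user_agent`, the record key `user_agent`, the use of bytes as the unit of identifier length, and the sequential numbering of identifiers are all hypothetical choices made for the example. The sketch also does not check that the identifier is shorter than the UserAgent string it replaces.

```python
class UserAgentTable:
    """Hypothetical 'matching data': maps UserAgent strings to short identifiers."""

    def __init__(self, initial_length=1, granularity=1):
        self.length = initial_length    # current identifier length in bytes (assumed unit)
        self.granularity = granularity  # preset growth step in bytes (assumed unit)
        self.ids = {}                   # UserAgent string -> identifier
        self.next_value = 0             # next unused integer value

    def get_or_assign(self, user_agent):
        # An identifier already exists: extract it from the matching data.
        existing = self.ids.get(user_agent)
        if existing is not None:
            return existing
        # The current length can encode 256 ** length distinct values.
        # If that capacity is exhausted, grow the length by the preset granularity.
        if self.next_value >= 256 ** self.length:
            self.length += self.granularity
        identifier = self.next_value.to_bytes(self.length, "big")
        self.next_value += 1
        # Record the association between the UserAgent string and its identifier.
        self.ids[user_agent] = identifier
        return identifier


def replace_user_agent(record, table):
    """Return a copy of an HTTP record with its UserAgent replaced by the identifier."""
    compact = dict(record)
    compact["user_agent"] = table.get_or_assign(record["user_agent"])
    return compact
```

With `initial_length=1`, the first 256 distinct UserAgent strings receive one-byte identifiers; the next new string triggers growth to two bytes. Identifiers issued before the growth keep their original length, so the matching data stays valid for records that were already rewritten.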

Claims (14)

1. A method of data pre-processing, comprising:
acquiring hypertext transfer protocol (HTTP) data;
acquiring UserAgent field information in the HTTP data;
obtaining an identifier associated with the UserAgent field information, including:
determining whether an identifier associated with the UserAgent field information exists in matching data;
if no identifier associated with the UserAgent field information exists, assigning an associated unique identifier to the UserAgent field information, including: determining whether the existing identifiers have reached the capacity of the current identifier length; and if the capacity of the identifier length has been reached, increasing the identifier length by a preset granularity and assigning an identifier to the UserAgent field information;
recording the association between the UserAgent field information and the identifier in the matching data, wherein the length of the identifier is smaller than the length of the UserAgent field information; and
replacing the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.
2. The method of claim 1, wherein the obtaining an identifier associated with the UserAgent field information further comprises:
if an identifier associated with the UserAgent field information exists, extracting the identifier associated with the UserAgent field information from the matching data.
3. The method of claim 2, wherein the assigning an associated unique identifier to the UserAgent field information further comprises:
if the capacity of the identifier length has not been reached, assigning a unique identifier having the same length as the existing identifiers.
4. The method of claim 1, further comprising:
acquiring information of a Cookie field in the HTTP data; and
organizing the information of the Cookie field into an identity mapping (Id-Mapping) format to reduce the space occupied by the HTTP data, wherein the Id-Mapping format uses a user identity identification number (ID) as an index and records the URL information accessed by each user.
5. The method of claim 4, wherein the organizing the information of the Cookie field into an Id-Mapping format comprises:
parsing the information of the Cookie field, and determining whether the stored Cookie information includes the same user information and the same Uniform Resource Locator (URL) information as those in the Cookie field;
if the stored Cookie information includes the same user information and the same URL information as those in the Cookie field, creating a new timestamp to update the matching user information and URL information in the stored Cookie information;
if the stored Cookie information includes the same user information as that in the Cookie field but not the same URL information, creating new stored URL information, indexed by the user information, from the URL information in the Cookie field; and
if the stored Cookie information does not include the same user information as that in the Cookie field, creating new stored user information and URL information from the user information and the URL information in the Cookie field.
6. The method of claim 5, further comprising:
if the information of the Cookie field cannot be parsed successfully, storing the Cookie information in a Cookie table.
7. A data pre-processing apparatus comprising:
a data acquisition unit configured to acquire hypertext transfer protocol (HTTP) data;
a field information acquisition unit configured to acquire UserAgent field information in the HTTP data;
an identifier obtaining unit configured to obtain an identifier associated with the UserAgent field information, a length of the identifier being smaller than a length of the UserAgent field information, including:
a determining subunit configured to determine whether an identifier associated with the UserAgent field information exists in matching data;
an identifier assigning subunit configured to assign an associated unique identifier to the UserAgent field information if no identifier associated with the UserAgent field information exists, including: determining whether the existing identifiers have reached the capacity of the current identifier length; and if the capacity of the identifier length has been reached, increasing the identifier length by a preset granularity and assigning an identifier to the UserAgent field information; the identifier assigning subunit being further configured to record the association between the UserAgent field information and the identifier in the matching data;
a replacement unit configured to replace the UserAgent field information with the identifier to reduce the space occupied by the HTTP data.
8. The apparatus of claim 7, wherein the field information obtaining unit further comprises:
an identifier extraction subunit configured to extract the identifier associated with the UserAgent field information from the matching data if an identifier associated with the UserAgent field information exists.
9. The apparatus of claim 8, wherein the identifier assigning subunit is further configured to:
if the capacity of the identifier length has not been reached, assign a unique identifier having the same length as the existing identifiers.
10. The apparatus of claim 7, wherein
the field information acquisition unit is further configured to acquire information of a Cookie field in the HTTP data;
the apparatus further comprising:
a Cookie information organizing unit configured to organize the information of the Cookie field into an identity mapping (Id-Mapping) format to reduce the space occupied by the HTTP data, wherein the Id-Mapping format uses a user identity identification number (ID) as an index and records the URL information accessed by each user.
11. The apparatus of claim 10, wherein the Cookie information organizing unit is configured to:
parse the information of the Cookie field, and determine whether the stored Cookie information includes the same user information and the same Uniform Resource Locator (URL) information as those in the Cookie field;
if the stored Cookie information includes the same user information and the same URL information as those in the Cookie field, create a new timestamp to update the matching user information and URL information in the stored Cookie information;
if the stored Cookie information includes the same user information as that in the Cookie field but not the same URL information, create new stored URL information, indexed by the user information, from the URL information in the Cookie field; and
if the stored Cookie information does not include the same user information as that in the Cookie field, create new stored user information and URL information from the user information and the URL information in the Cookie field.
12. The apparatus of claim 11, wherein the Cookie information organizing unit is further configured to:
if the information of the Cookie field cannot be parsed successfully, store the Cookie information in a Cookie table.
13. A data pre-processing apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-6 based on instructions stored in the memory.
14. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 6.
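The Cookie handling recited in claims 4 to 6 can likewise be sketched in Python. This is only an illustrative sketch: the claims do not specify how user information and URL information are encoded in the Cookie field, so the cookie layout `uid=<id>; url=<url>`, the function names `parse_cookie` and `update_id_mapping`, and the nested-dictionary layout of the Id-Mapping store are all assumptions made for the example.

```python
import time


def parse_cookie(cookie):
    """Hypothetical parser: expects 'uid=<id>; url=<url>'; returns (uid, url) or None."""
    fields = {}
    for part in cookie.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    if "uid" in fields and "url" in fields:
        return fields["uid"], fields["url"]
    return None


def update_id_mapping(cookie, id_mapping, cookie_table, now=None):
    """Fold one Cookie field into id_mapping: {user ID: {URL: last-seen timestamp}}."""
    now = time.time() if now is None else now
    parsed = parse_cookie(cookie)
    if parsed is None:
        # Parsing failed: keep the raw Cookie information in a separate Cookie table.
        cookie_table.append(cookie)
        return
    uid, url = parsed
    if uid not in id_mapping:
        # Unknown user: create new stored user information and URL information.
        id_mapping[uid] = {url: now}
    elif url not in id_mapping[uid]:
        # Known user, new URL: add the URL under the existing user index.
        id_mapping[uid][url] = now
    else:
        # Known user and known URL: only refresh the timestamp.
        id_mapping[uid][url] = now
```

Because each user ID and each URL is stored once, repeated visits by the same user to the same URL update a single timestamp instead of adding a new record, which is where the space saving in this sketch comes from.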

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711245143.9A CN110019012B (en) 2017-12-01 2017-12-01 Data preprocessing method, data preprocessing device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN110019012A CN110019012A (en) 2019-07-16
CN110019012B true CN110019012B (en) 2021-05-11

Family

ID=67186548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711245143.9A Active CN110019012B (en) 2017-12-01 2017-12-01 Data preprocessing method, data preprocessing device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN110019012B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111190858B (en) * 2019-10-15 2023-07-14 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for storing software information
CN112000982A (en) * 2020-07-31 2020-11-27 青岛海尔科技有限公司 Method and device for processing user application data
CN115905924B (en) * 2022-12-06 2023-08-11 济南亚海凛米网络科技服务有限公司 Data processing method and system based on artificial intelligence Internet of things and cloud platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745383A (en) * 2013-12-27 2014-04-23 北京集奥聚合科技有限公司 Method and system of realizing redirection service based on operator data
CN103873443A (en) * 2012-12-13 2014-06-18 联想(北京)有限公司 Information processing method, local proxy server and network proxy server
CN106682925A (en) * 2015-11-06 2017-05-17 北京奇虎科技有限公司 Method and device for recommending advertisement content
US10262064B2 (en) * 2011-07-29 2019-04-16 Rakuten, Inc. Information processing apparatus, information processing method, information processing program, recording medium having stored therein information processing program

Also Published As

Publication number Publication date
CN110019012A (en) 2019-07-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant