CN117112519A - Data processing method and device - Google Patents


Publication number
CN117112519A
Authority
CN
China
Prior art keywords
target
data
log file
tree
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210534493.1A
Other languages
Chinese (zh)
Inventor
万伟雄
杨慰民
罗卫鸿
郑银云
蔡鸿祥
陈志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Fujian Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Fujian Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Fujian Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202210534493.1A priority Critical patent/CN117112519A/en
Publication of CN117112519A publication Critical patent/CN117112519A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and device, belonging to the field of communication, which can solve the problem of low data processing efficiency in the related art to a certain extent. The method comprises the following steps: acquiring log files; screening out a target log file from the log files; storing target data in the target log file in the form of a B+ tree; and obtaining a target result based on an operation on the target data stored in the form of the B+ tree.

Description

Data processing method and device
Technical Field
The application belongs to the field of communication, and particularly relates to a data processing method and device.
Background
Various operations of users on websites or Applications (APP) can generate a large number of log files, and content extraction and analysis of massive log data in the log files can obtain a large amount of valuable information.
The related art stores valuable data after extracting the valuable information, so that operations can be performed based on the stored data in a subsequent process.
However, the related art manner of operating based on stored data has a problem of low processing efficiency.
Disclosure of Invention
The embodiment of the application provides a data processing method and a data processing device, which can solve the problem of lower data processing efficiency in the related technology to a certain extent.
In a first aspect, an embodiment of the present application provides a data processing method, including:
acquiring a log file;
screening out a target log file from the log files;
storing target data in the target log file in a B+ tree form;
obtaining a target result based on an operation on the target data stored in the form of the B+ tree.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including:
the acquisition module is used for acquiring the log file;
the screening module is used for screening target log files from the log files;
the processing module is used for storing the target data in the target log file in the form of a B+ tree, and for obtaining a target result based on an operation on the target data stored in the form of the B+ tree.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, the memory storing a program or instructions that, when executed by the processor, implement the steps of the data processing method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor implement the steps of the data processing method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement a data processing method according to the first aspect.
In the embodiment of the application, a log file is obtained; a target log file is screened out from the log files; target data in the target log file is stored in the form of a B+ tree; and a target result is obtained based on an operation on the target data stored in the form of the B+ tree. Because a B+ tree supports key lookups and interval (range) lookups without traversing the whole tree, the technical scheme of the application allows the target data to be acquired rapidly from the B+ tree by key, thereby improving the operation speed and shortening the time needed to obtain the target result, and further solving the problem of lower data processing efficiency in the related art to a certain extent.
Drawings
Fig. 1-1 is a schematic diagram of a data processing method according to an embodiment of the present application.
Fig. 1-2 is a flowchart of a data processing method according to an embodiment of the present application.
Fig. 1-3 is a schematic diagram of writing data into a B+ tree according to an embodiment of the present application.
Fig. 1-4 is a schematic diagram illustrating identification of application layer information by a deep packet inspection (Deep Packet Inspection, DPI) device according to an embodiment of the application.
Fig. 2 is a flowchart of a data processing method according to an embodiment of the present application.
Fig. 3 is a flowchart of a data processing method according to an embodiment of the present application.
Fig. 4 is a block diagram of a data processing apparatus according to an embodiment of the present application.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the application may be practiced otherwise than as specifically illustrated or described herein. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
The data processing method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Fig. 1-1 is a schematic diagram of a data processing method according to an embodiment of the present application. Referring to fig. 1-1, the data processing method provided by the present application may relate to a data acquisition process, a data cleansing and distribution process, a business rule matching process, a data reading and calculating process, a data persistence process, an upper layer application and a presentation process.
These several processes are described one by one below.
The data acquisition process is mainly used for acquiring log files of users. The log file may be a file in a fixed data format generated after reorganizing and identifying contents carried by an IP data packet, a transmission control protocol (Transmission Control Protocol, TCP) data stream, a user datagram protocol (User Datagram Protocol, UDP) data stream, or the like, which are transmitted in a network link. The log file contains a large amount of log data.
The data cleansing and distribution process is mainly used for classifying and screening the acquired log files to obtain the target log files that need to be processed in real time, and for sending them to their corresponding storage locations to facilitate subsequent processing. This process has two main functions. The first is to judge whether a log file belongs to a given preset time interval, for example by collecting the log files of the same time interval according to their file names, so as to screen out the target log files used in the data reading and calculating process. The second is to distribute the log files and store them with the preset time interval as the minimum time unit.
The business rule matching process is mainly used for setting a target dimension, so that a target log file can be processed based on the target dimension in the data reading and calculating process. The target dimensions may include single dimensions and composite dimensions. A single dimension may be, for example, at least one of dimensions such as city, service, cell and server IP. A composite dimension is composed of multiple single dimensions, for example city combined with service, or service combined with server IP.
The data reading and calculating process is mainly used for reading a target log file, obtaining target data from it, and storing the target data in the form of a B+ tree and the key indexes of each target dimension, calculated from the target data, in the form of a hash table. The process can be performed in memory. For example, a memory space about twice the size of the target log file can be requested in advance, the target log file is read into the memory and divided into N parts, and N threads are created to process the parts respectively and obtain the target data, which is then stored in the form of a B+ tree. The key index corresponding to at least one target dimension can be obtained from the target data as required, and each target dimension and its corresponding key index are stored in a hash table as a key-value pair.
The data persistence process is mainly used for storing data obtained after processing in the data reading and calculating process in a disk medium and/or generating a database table according to the data obtained after processing in the data reading and calculating process.
The upper layer application and display process is mainly used for realizing data visualization, real-time service dial testing and problem tracking. The data visualization may include a visual presentation of the processing results of the data reading and calculating process, such as presenting the data in the B+ tree and the hash table, or graphically displaying the total province-wide traffic, the current number of users, and so on. The real-time service dial testing may include displaying the key indicators of a selected area and a selected time period according to the target dimension.
After the processing result is visually displayed, subsequent problem tracking is more convenient. Problem tracking can analyze the processing results of the data reading and calculating process from the perspectives of the service, the user plane, the signaling plane and the like, and locate the cause of a problem. From the service perspective, relevant data on the terminal, cell, network element, service category, server IP, error code information and the like can be analyzed comprehensively, and the cause of the problem can be located from the abnormal data. From the user plane perspective, index information such as the TCP success rate, TCP time delay, hypertext transfer protocol (Hypertext Transfer Protocol, HTTP) service success rate, HTTP service time delay and HTTP download rate can be obtained; according to the abnormal indexes in this information, relevant data in dimensions such as terminal, cell, network element, service category, server IP and error code information are analyzed together to locate the cause of the problem. From the signaling plane perspective, index information such as the attach success rate, attach time delay, EPS establishment success rate and EPS establishment time delay can be obtained; according to the abnormal indexes in this information, relevant data in dimensions such as cell, network element and terminal are analyzed together to locate the cause of the problem.
In the data acquisition process, the user's log files may be obtained in a number of ways, such as receiving the log files of the DPI device via a file transfer protocol (File Transfer Protocol, FTP) interface. Because the DPI device can deeply read the content carried by IP data packets to reorganize and identify the application layer information in the open system interconnection (Open System Interconnection, OSI) seven-layer protocol, various kinds of service data are acquired. The subsequent data reading and calculating process can therefore analyze various kinds of service data, which meets the need for diverse service analysis.
Fig. 1-2 is a flowchart of a data processing method according to an embodiment of the present application. The method illustrated in fig. 1-2 may be performed by an electronic device, which may be any of various types of computers. As shown in fig. 1-2, the data processing method provided by the embodiment of the present application may include:
Step 110, obtaining a log file;
In the embodiment of the application, the log file may be a file recording the internet access behavior data of users. The log file may contain a large amount of log data, which may include at least one of: user identification, cell ID, city ID, source address, destination address, source IP, destination IP, source port, destination port, transport protocol, service data, request sending time, download rate, time delay, request success/request failure identification, etc. The service data may include at least one of: application type of the service data, accessed uniform resource locator (Uniform Resource Locator, URL), traffic, etc. The electronic device may obtain the large number of log files generated while users access the internet in various ways, for example by obtaining, through an FTP interface, the log files output by the DPI device after processing, or by obtaining log files of the client through a flash system, and so on.
Step 120, screening out a target log file from the log files;
It will be appreciated that the electronic device acquires a large number of log files within the same time period. The acquired log files may be out of order, so in order to ensure that the electronic device processes the log files of the same time period in time and does not process the same log file repeatedly, the acquired log files need to be screened. In the embodiment of the application, the target log files can be screened out from the log files according to a preset time interval, and the log files can be classified and stored according to the preset time interval. Taking a preset time interval of one minute as an example, when users access web pages the electronic device may acquire a thousand or more log files in one minute; after acquiring them, the electronic device can screen out the target log files belonging to the same minute according to the file names of the log files and store them under the same time directory, which facilitates subsequent processing.
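The screening by file name described above can be sketched as follows. This is a minimal illustration only: the file-name layout `<source>_<YYYYMMDDHHMM>_<sequence>.log`, the function names and the set of already processed files are assumptions made for the example, not details given in the application.

```python
import re
from collections import defaultdict

# Hypothetical file-name layout: <source>_<YYYYMMDDHHMM>_<sequence>.log
NAME_PATTERN = re.compile(r"^(?P<source>\w+?)_(?P<minute>\d{12})_(?P<seq>\d+)\.log$")


def group_by_minute(file_names):
    """Group log file names by the minute stamp embedded in the name."""
    groups = defaultdict(list)
    for name in file_names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed layout
        groups[match.group("minute")].append(name)
    return groups


def screen_target_files(file_names, target_minute, already_processed):
    """Return the files of the target minute that have not been processed yet."""
    candidates = group_by_minute(file_names).get(target_minute, [])
    return sorted(n for n in candidates if n not in already_processed)
```

Grouping by the minute stamp handles out-of-order arrival, and the `already_processed` set is one simple way to avoid processing the same file twice.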
Step 130, storing the target data in the target log file in the form of a B+ tree;
In the embodiment of the application, in order to improve the operation efficiency, the target data in the target log file can be stored in the form of a B+ tree. The target data may include at least one of: user identification, time, service, cell, download rate, time delay, success rate, etc.
The B+ tree contains two types of nodes: internal nodes (also called index nodes) and leaf nodes. An internal node is a non-leaf node; it does not store actual data but only an index, and the data corresponding to the index is stored in the leaf nodes. The keys in an internal node are arranged in ascending order. For a given key in an internal node, all keys in its left subtree are smaller than that key, and all keys in its right subtree are greater than or equal to it. The records in a leaf node are also arranged by key, each leaf node stores pointers to its adjacent leaf nodes, and the leaf nodes are linked in sequence in ascending key order. A parent node stores, as an index, the first key of its right child. If the B+ tree has only one layer, it has only a root node, which is also a leaf node. If the B+ tree has two or more layers, the uppermost layer is the root node, and the root node is an internal node. The leaf nodes are located at the bottom layer of the B+ tree, and data is inserted from the bottom up. For a B+ tree of order m, the root node contains at least one key, and the number of keys in a non-root node is greater than or equal to ⌈m/2⌉ − 1 and less than or equal to m − 1. In the embodiment of the present application, a user identifier, for example an international mobile subscriber identity (International Mobile Subscriber Identity, IMSI), may be used as the index, and at least one of data such as time, service, cell, download rate, time delay and success rate may be used as the data actually stored in the leaf nodes, to generate the B+ tree.
The process of storing the target data in the form of a B+ tree is explained below, taking the insertion process of an order-5 B+ tree as an example.
Each non-root node of an order-5 B+ tree contains at least 2 keys and at most 4 keys. When writing data makes the number of keys in a node exceed 4, the node is split into left and right parts at the middle key, and the middle key is promoted to the parent node and stored there as an index.
As shown in fig. 1-3, the keys 5, 10, 15, 20, 25, 26 and 30 are used as an example.
First, 5, 10, 15 and 20 are inserted into the empty tree in ascending key order. The number of keys in the node is equal to 4, so no split is needed, and the keys 5, 10, 15 and 20 are located in a single leaf node.
Then key 25 is inserted into the leaf node (according to the key size, its insertion position is after key 20). The number of keys in the node is now greater than 4, so the node is split at the middle key 15, which is promoted to the parent node and stored as an index; the left leaf node then holds keys 5 and 10, and the right leaf node holds keys 15, 20 and 25.
Keys 26 and 30 are then inserted (according to the key size, their insertion positions are after key 25). When key 26 is inserted, the number of keys in the right leaf node is equal to 4 and no split is needed. When key 30 is inserted, the number of keys in that node is greater than 4, so it is split again and the middle key 25 is promoted to the parent node and stored as an index.
The keys 5, 10, 15, 20, 25, 26, 30 and the like can be regarded as user identifiers, and the leaf nodes store the data corresponding to each user identifier, such as time, service, cell, download rate, time delay and success rate. It should be noted that the above example is only an explanation; the target data handled in actual processing is much larger, but the basic principle of writing data is the same. Each time a piece of target data is obtained, it can be written into the B+ tree according to the user identifier it contains, so that the target data is stored in the form of a B+ tree. The actual order and number of nodes of the B+ tree can be set according to actual requirements.
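The insertion walk-through above can be reproduced with a short sketch. The class and method names are hypothetical, and the code follows only the split rule described in the text (split a node that holds more than 4 keys at its middle key; a leaf keeps the middle key in its right half, while an internal node moves it up). It is an illustration under those assumptions, not the implementation of the application.

```python
import bisect

ORDER = 5                # order-5 tree
MAX_KEYS = ORDER - 1     # at most 4 keys per node


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.values = []     # leaf only: one record per key
        self.children = []   # internal only: len(keys) + 1 children
        self.next = None     # leaf only: link to the right sibling


class BPlusTree:
    def __init__(self):
        self.root = Node(leaf=True)

    def insert(self, key, value):
        split = self._insert(self.root, key, value)
        if split is not None:            # the root itself was split
            sep, right = split
            new_root = Node(leaf=False)
            new_root.keys = [sep]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key, value):
        if node.leaf:
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                node.values[i] = value   # existing key: overwrite the record
                return None
            node.keys.insert(i, key)
            node.values.insert(i, value)
        else:
            # keys equal to a separator belong to the right subtree
            i = bisect.bisect_right(node.keys, key)
            split = self._insert(node.children[i], key, value)
            if split is None:
                return None
            sep, right = split
            node.keys.insert(i, sep)
            node.children.insert(i + 1, right)
        if len(node.keys) <= MAX_KEYS:
            return None
        return self._split(node)

    def _split(self, node):
        mid = len(node.keys) // 2
        sep = node.keys[mid]
        right = Node(leaf=node.leaf)
        if node.leaf:
            # the middle key stays in the right leaf and is copied upward
            right.keys, node.keys = node.keys[mid:], node.keys[:mid]
            right.values, node.values = node.values[mid:], node.values[:mid]
            right.next, node.next = node.next, right
        else:
            # the middle key moves upward and is removed from this level
            right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
            right.children, node.children = (
                node.children[mid + 1:], node.children[:mid + 1])
        return sep, right

    def leaves(self):
        """Yield the key list of each leaf, following the leaf links."""
        node = self.root
        while not node.leaf:
            node = node.children[0]
        while node is not None:
            yield node.keys
            node = node.next


tree = BPlusTree()
for user_id in (5, 10, 15, 20, 25, 26, 30):
    tree.insert(user_id, {"cell": "hypothetical-cell"})
print(tree.root.keys)        # [15, 25]
print(list(tree.leaves()))   # [[5, 10], [15, 20], [25, 26, 30]]
```

The printed state matches the walk-through: keys 15 and 25 end up in the parent node as indexes, and the records stay in the linked leaf nodes.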
Step 140, obtaining a target result based on an operation on the target data stored in the form of the B+ tree.
The operation on the target data stored in the form of the B+ tree may include at least one of: a lookup operation, a lookup and calculation operation, a deletion operation, a modification operation, and the like, each performed on the target data stored in the form of the B+ tree.
When the operation performed on the target data stored in the form of the B+ tree is a lookup operation, the target result may be the data stored in the corresponding leaf nodes found in the B+ tree using the input data or a data interval as the index. The input data may be, for example, a specific value of a user identifier, or specific values of several user identifiers; the data interval is a range of values. When the operation is a lookup and calculation operation, the target result may be the result of classifying and calculating the data found. Specifically, the data stored in the corresponding leaf nodes is first found in the B+ tree using the input data or the data interval as the index, and the data is then classified and calculated to obtain the target result. When the operation is a deletion operation, the target result may be the result of deleting the data stored in the corresponding leaf nodes found in the B+ tree using the input data or the data interval as the index. When the operation is a modification operation, the target result may be the result of modifying the data stored in the corresponding leaf nodes found in the B+ tree using the input data or the data interval as the index.
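The four kinds of operation can be illustrated with a deliberately simplified stand-in. The sketch below is not a B+ tree: it keeps the keys in a sorted list searched with `bisect`, which mimics only the ordered leaf level and therefore the behavior of point and interval lookups. The class name, field names and the averaging step are assumptions made for the example.

```python
import bisect


class OrderedStore:
    """Sorted keys plus records; a simplified stand-in for B+ tree leaves."""

    def __init__(self):
        self._keys = []      # user identifiers in ascending order
        self._records = {}   # user identifier -> record

    def put(self, key, record):
        if key not in self._records:
            bisect.insort(self._keys, key)
        self._records[key] = record

    def lookup(self, key):
        """Lookup by a single input value."""
        return self._records.get(key)

    def lookup_range(self, low, high):
        """Lookup by data interval: pairs with low <= key <= high, in key order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_right(self._keys, high)
        return [(k, self._records[k]) for k in self._keys[start:end]]

    def delete(self, key):
        if key not in self._records:
            return False
        del self._records[key]
        self._keys.pop(bisect.bisect_left(self._keys, key))
        return True

    def modify(self, key, **changes):
        record = self._records.get(key)
        if record is None:
            return False
        record.update(changes)
        return True


def mean_by_service(pairs, field):
    """Lookup-and-calculate: classify records by service and average one field."""
    sums = {}
    for _, record in pairs:
        total, count = sums.get(record["service"], (0.0, 0))
        sums[record["service"]] = (total + record[field], count + 1)
    return {svc: total / count for svc, (total, count) in sums.items()}
```

A lookup and calculation operation then amounts to, for example, `mean_by_service(store.lookup_range(5, 15), "delay_ms")`: the interval lookup selects the records, and the classification and averaging produce the target result.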
In the embodiment of the application, a log file is obtained; a target log file is screened out from the log files; target data in the target log file is stored in the form of a B+ tree; and a target result is obtained based on an operation on the target data stored in the form of the B+ tree. Because a B+ tree supports key lookups and interval (range) lookups without traversing the whole tree, the technical scheme of the application allows the target data to be acquired rapidly from the B+ tree by key, thereby improving the operation speed and shortening the time needed to obtain the target result, and further solving the problem of lower data processing efficiency in the related art to a certain extent.
Optionally, obtaining the log file includes: log files from the DPI device are received via the FTP interface.
Deep packet inspection (Deep Packet Inspection, DPI) technology is an application layer based traffic detection and control technology. When an IP data packet or a TCP or UDP data stream passes through a system based on DPI technology, the system reorganizes and identifies the application layer information in the OSI seven-layer protocol by deeply reading the content carried by the IP data packet, thereby obtaining the content of the entire application program. Currently, DPI systems can recognize 40 classes of application layer service protocols in total, including the hypertext transfer protocol (Hypertext Transfer Protocol, HTTP), hypertext transfer protocol secure (Hypertext Transfer Protocol over Secure Socket Layer, HTTPS), wireless application protocol (Wireless Application Protocol, WAP), domain name system (Domain Name System, DNS) protocol, internet control message protocol (Internet Control Message Protocol, ICMP), file transfer protocol (File Transfer Protocol, FTP), secure sockets layer (Secure Sockets Layer, SSL) protocol, transport layer security (Transport Layer Security, TLS) protocol, simple mail transfer protocol (Simple Mail Transfer Protocol, SMTP), post office protocol (Post Office Protocol, POP), internet mail access protocol (Internet Mail Access Protocol, IMAP), extensible messaging and presence protocol (Extensible Messaging and Presence Protocol, XMPP), real-time streaming protocol (Real Time Streaming Protocol, RTSP), session initiation protocol (Session Initiation Protocol, SIP), H.323 audio-video transfer protocol, firewall secure session traversal protocol (Protocol for sessions traversal across firewall securely, SOCKS), secure shell (Secure Shell, SSH) protocol, simple network management protocol (Simple Network Management Protocol, SNMP), remote authentication dial-in user service (Remote Authentication Dial In User Service, RADIUS) protocol, telecommunication network protocol (Telecommunication Network Protocol, Telnet), dynamic host configuration protocol (Dynamic Host Configuration Protocol, DHCP), trivial file transfer protocol (Trivial File Transfer Protocol, TFTP), network news transfer protocol (Network News Transfer Protocol, NNTP), network time protocol (Network Time Protocol, NTP), simple network time protocol (Simple Network Time Protocol, SNTP), network configuration protocol (Network Configuration Protocol, NETCONF), remote network monitoring (Remote Network Monitoring, RMON) protocol, common management information protocol (Common Management Information Protocol, CMIP), virtual network console (Virtual Network Console, VNC) protocol, remote control (pcAnywhere) protocol, remote procedure call protocol (Remote Procedure Call Protocol, RPC), real-time transport protocol (Real-time Transport Protocol, RTP), media gateway control protocol (Media Gateway Control Protocol, MGCP), simple traversal of UDP through NATs (Simple Traversal of UDP over NATs, STUN) protocol, internet relay chat (Internet Relay Chat, IRC) protocol, layer 2 tunneling protocol (Layer 2 Tunneling Protocol, L2TP), point-to-point tunneling protocol (Point to Point Tunneling Protocol, PPTP), encapsulating security payload (Encapsulating Security Payload, ESP) protocol, HTTP-based streaming media transmission (HTTP Live Streaming, HLS) protocol, HTTP dynamic streaming (HTTP Dynamic Streaming, HDS) protocol, and the like.
The application layer in the OSI seven-layer protocol specifies the data format of the application program and is implemented by service protocols. When a user accesses the internet, the client first generates an IP data packet or a TCP or UDP data stream according to the OSI seven-layer protocol, then uses it to establish a connection with a server, and accesses the internet after the connection is established. In the embodiment of the application, a DPI device can be arranged in the network link. The DPI device can acquire the IP data packets and TCP or UDP data streams generated while users access the internet, reorganize and identify the data, and generate log files in a fixed data format according to the application layer protocol, so that the electronic device can obtain the log files transmitted by the DPI device via the FTP interface based on the FTP protocol (file transfer protocol). As shown in fig. 1-4, taking the HTTP protocol as an example, after the DPI device obtains an IP data packet, it can identify the Ethernet protocol, the network layer protocol, the transport layer protocol (TCP or UDP) and the service protocol; obtain, from the data carried in these protocols, the source address, destination address, source IP, destination IP, source port, destination port, transport protocol, service data, application type of the service data, accessed URL, traffic, request sending time, download rate, time delay, request success/request failure identifier, and the like; and write the identified data into a log file in the fixed data format according to the service protocol.
Therefore, the DPI system can parse the application layer to obtain service data and generate log files in a fixed data format according to the service protocol, and the electronic device can rapidly receive the log files through a fixed FTP interface.
Optionally, the target data in the target log file is stored in a disk medium in the form of a B+ tree, and/or the target result is stored in the disk medium.
It can be understood that a disk can store data permanently. After the target data is stored according to the above method, the target data in the target log file can be stored in the disk medium in the form of a B+ tree, which avoids data loss and also makes it convenient for a front-end application to read the data and then display and analyze it. In addition, the target result obtained by the operation may be stored in the disk medium according to the user's needs. In this way, by storing the target data in the disk medium in the form of a B+ tree and/or storing the target result in the disk medium, data persistence is achieved and data loss is avoided.
In one embodiment, storing the target data in the target log file in the form of a B+ tree may include the following steps A1-A3. Wherein:
Step A1, loading the screened target log file into a memory;
Step A2, acquiring target data from the target log file;
Step A3, storing the target data into the memory in the form of a B+ tree.
In the embodiment of the application, a sufficiently large memory space can be requested from the operating system. Taking the size of the log files acquired per minute as L GB (where L is a positive number) as an example, and taking additional overhead into account, the requested memory size may preferably be preset to 2L GB or more; depending on the system's hardware resources, it may also be preset to no less than 1.3L GB. The target log file is loaded into the memory and read there to obtain the target data, and the target data is written into a B+ tree, so that it is stored in the memory in the form of a B+ tree.
Thus, the target data is obtained from the target log file in the memory, and the data processing efficiency is further improved based on the characteristic of high memory read-write speed.
In one embodiment, obtaining the target data from the target log file may include the following steps S1-S4. Wherein:
Step S1, dividing the target log file into N sub-files in the memory;
Step S2, creating N threads, wherein each thread corresponds to one sub-file;
Step S3, processing the N sub-files with the N threads to obtain N sub-results;
Step S4, combining the N sub-results to obtain the target data.
Because log files are generated continuously every minute, the log files must be read and parsed quickly in real time. In the embodiment of the application, the target log file is processed in a multithreaded manner. Specifically, the target log file is read into the memory and divided there into N sub-files; N threads are created at the same time, each thread is bound to one sub-file, and each thread processes 1/N of the content of the target log file to obtain one sub-result, where the number of threads is set according to the CPU (Central Processing Unit) capability of the server. After the N threads finish, N sub-results are obtained in total, and they are combined to obtain the target data. To achieve data persistence, after the target log file is processed in the multithreaded manner, the result data can be output in real time to generate a disk file. At the same time, a database table is generated, and the result data is written in real time into a relational database such as a MySQL database.
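Steps S1 to S4 can be sketched as follows. The record layout (four fields separated by `|`), the field names and the function names are assumptions made for the example; the application does not specify the log format. Note also that in CPython, threads illustrate the structure of the scheme but do not by themselves give CPU parallelism for pure-Python parsing.

```python
from concurrent.futures import ThreadPoolExecutor

FIELDS = ("imsi", "cell", "service", "delay_ms")   # hypothetical record layout


def split_into_parts(lines, n):
    """S1: divide the lines into n contiguous parts of nearly equal size."""
    size, extra = divmod(len(lines), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(lines[start:end])
        start = end
    return parts


def parse_part(lines):
    """S3: parse one sub-file; lines with the wrong field count are skipped."""
    records = []
    for line in lines:
        values = line.strip().split("|")
        if len(values) != len(FIELDS):
            continue
        record = dict(zip(FIELDS, values))
        record["delay_ms"] = float(record["delay_ms"])
        records.append(record)
    return records


def extract_target_data(text, n_threads):
    """S1-S4: split the in-memory file, parse the parts in threads, merge."""
    parts = split_into_parts(text.splitlines(), n_threads)
    # S2: one worker thread per part; map() returns results in part order
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        sub_results = list(pool.map(parse_part, parts))
    merged = []                       # S4: combine the N sub-results
    for sub in sub_results:
        merged.extend(sub)
    return merged
```

Splitting on line boundaries keeps each record intact within one part, and merging in part order preserves the original record order.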
Therefore, the target log file is read and analyzed in a multithreading mode, target data is obtained, real-time processing of the target log file can be achieved, and data processing efficiency is further improved.
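Steps S1-S4 can be sketched as follows. This is only an illustration under stated assumptions: the log is already in memory as a list of text lines, and each line has the hypothetical layout `imsi|city|service|success_flag`, which is not specified in the description. A thread pool is used to create the N worker threads.

```python
from concurrent.futures import ThreadPoolExecutor

def split_lines(lines, n):
    """Step S1: divide the in-memory log lines into n contiguous sub-files."""
    size, extra = divmod(len(lines), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(lines[start:end])
        start = end
    return parts

def parse_part(part):
    """Step S3 (one thread): parse one sub-file into a list of records."""
    records = []
    for line in part:
        # Hypothetical field layout, assumed for this sketch only.
        imsi, city, service, ok = line.split("|")
        records.append(
            {"imsi": imsi, "city": city, "service": service, "success": ok == "1"}
        )
    return records

def parse_log(lines, n_threads):
    """Run S1-S4 and return the merged target data."""
    parts = split_lines(lines, n_threads)                      # Step S1
    with ThreadPoolExecutor(max_workers=n_threads) as pool:    # Step S2
        sub_results = list(pool.map(parse_part, parts))        # Step S3
    return [rec for sub in sub_results for rec in sub]         # Step S4
```

`pool.map` returns the sub-results in the order of the sub-files, so the merged data keeps the original line order. In CPython, threads do not run CPU-bound parsing in parallel, so a real deployment in Python might prefer processes; the sketch keeps threads to mirror the description.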
Fig. 2 is a flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 2, the data processing method provided by the embodiment of the present application may include:
step 210, obtaining a log file;
step 220, screening out target log files from the log files;
step 230, storing the target data in the target log file in the form of a B+ tree; acquiring key indexes corresponding to at least one target dimension according to the target data; storing each target dimension and the corresponding key index in a hash table in a key value pair mode; the key is a target dimension, and the value is a key index corresponding to the target dimension;
Multiple target dimensions may be determined from the data of different dimensions contained in the target data, and the target dimensions may include single dimensions and composite dimensions. A single dimension may include at least one of city, cell, service, server IP, network element IP, access point name (Access Point Name, APN), and the like. A composite dimension may include at least one of city service, service server IP, cell service, and the like. It should be noted that multiple composite dimensions may be formed by combining single dimensions, and the actual target dimensions are usually determined by service requirements; the composite dimensions above are only illustrative and should not be considered as limiting the present application. The key indexes may include success rate, time delay, download rate, and the like. In the embodiment of the present application, information such as the city, cell, service, server IP, network element IP, and APN in the target data can be obtained, the target data is divided according to the multiple target dimensions, and the key indexes corresponding to each target dimension are calculated from the target data belonging to that dimension. Each target dimension is then used as a key, the key indexes corresponding to that target dimension are used as the value, and the two are stored in the hash table as a key-value pair.
Taking the target dimensions city, cell, service, server IP, city service, and service server IP as examples, each of these target dimensions may be used as a key. The city, cell, service, server IP, city service, and service server IP dimension key indexes are then obtained from the target data and stored as the values (value) corresponding to the respective keys.
Key (key) | Value (value)
City | City dimension key index
Cell | Cell dimension key index
Service | Service dimension key index
Server IP | Server IP dimension key index
City service | City service dimension key index
Service server IP | Service server IP dimension key index
Table 1
Taking the key "city" as an example, the city dimension key indexes may include the average download rate, the average download time delay, and the ratio of the number of successful requests to the total number of requests (the success rate) of all users in the same city. That is, the target data belonging to the same city is grouped according to the information contained in the target data, and the city dimension key indexes are calculated from the download rate, download time delay, number of successful requests, and total number of requests of the users in that data (for example, by computing averages and ratios). Taking the key "city service" as an example, the city service dimension key indexes may include the average download rate, the average download time delay, and the success rate of all users accessing the same service in the same city. That is, the target data for accessing the same service in the same city is grouped according to the information contained in the target data, and the city service dimension key indexes are calculated from the download rate, download time delay, number of successful requests, and total number of requests of the users in that data. Analysis in other dimensions is similar.
Therefore, the corresponding key indexes can be quickly obtained by taking each target dimension as a key through the hash table, and the whole data of each target dimension is preliminarily known through the key indexes, so that the analysis and evaluation are convenient.
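The hash table of key indexes described above can be sketched with a Python dictionary. The record field names (`city`, `service`, `success`, `rate`, `delay`) and the assumption that each record represents exactly one request are hypothetical and chosen only for this illustration; the description does not fix a record layout.

```python
from collections import defaultdict

def build_kpi_table(records, dimensions):
    """Return a hash table {(dimension, dimension value): key indexes}.

    `dimensions` is a list of field-name tuples, e.g. ("city",) for a single
    dimension or ("city", "service") for a composite dimension.
    """
    acc = defaultdict(lambda: {"n": 0, "ok": 0, "rate": 0.0, "delay": 0.0})
    for rec in records:
        for dim in dimensions:
            key = (dim, tuple(rec[field] for field in dim))
            a = acc[key]
            a["n"] += 1                          # total number of requests
            a["ok"] += 1 if rec["success"] else 0
            a["rate"] += rec["rate"]
            a["delay"] += rec["delay"]
    return {
        key: {
            "success_rate": a["ok"] / a["n"],
            "avg_download_rate": a["rate"] / a["n"],
            "avg_delay": a["delay"] / a["n"],
        }
        for key, a in acc.items()
    }
```

A lookup such as `table[(("city", "service"), ("A", "video"))]` then returns the city service dimension key indexes for that city and service in a single hash access.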
step 240, obtaining a target result based on operations on the target data stored in the form of a B+ tree and on the key-value pairs stored in the hash table.
After the target data is stored in the form of a B+ tree and each target dimension and its key indexes are stored in the hash table as key-value pairs, the target data in the B+ tree and the data in the hash table can be queried, and the total traffic, number of users, key index trends, and the like can be analyzed based on the query results. When an abnormal situation occurs (for example, users in a certain city cannot access web pages normally, or a certain service cannot be used normally), the operation can be performed to obtain a target result. For example, the target data and the key indexes of the target dimensions are compared with preset thresholds, a target result is output according to the comparison result, and the problem is tracked. In this way, the target data can be obtained quickly through the B+ tree, the corresponding data is obtained from the hash table by using a specific dimension in the target data as the key, and the target result is obtained by combining these sources of data, which enables data analysis and display as well as rapid problem location and tracking.
For ease of understanding, the following explanation takes the target dimensions city service and service server IP as an example.
Taking city A as an example, different services in city A, such as service A in city A, can be used as keys, and the key indexes of the "city A service A" dimension are used as the values when processing the target data. Likewise, the server IPs of service A, such as server IP A, server IP B, and server IP C of service A, are used as keys, and the key indexes of the corresponding service A server IP dimensions are used as the values. The poor-quality server IP ratio is then calculated, where poor-quality server IP ratio = number of poor-quality server IPs / number of server IPs. Specifically, taking the success rate in the key indexes as an example, a first threshold may be set; if the success rate of a server when providing service A is lower than the first threshold, its server IP is identified as a poor-quality server IP, and the poor-quality server IP ratio is calculated from the number of poor-quality server IPs and the number of server IPs providing service A.
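The ratio of poor-quality server IPs described above can be computed as in the following sketch. The function name and the input shape (a mapping from server IP to success rate for one service) are assumptions made for illustration.

```python
def poor_quality_ratio(success_rate_by_ip, first_threshold):
    """Return (ratio, poor_ips) for the servers providing one service.

    A server IP counts as poor quality when its success rate is below
    `first_threshold`; the ratio is poor IPs divided by all IPs.
    """
    if not success_rate_by_ip:
        raise ValueError("no server IPs for this service")
    poor_ips = sorted(
        ip for ip, rate in success_rate_by_ip.items() if rate < first_threshold
    )
    return len(poor_ips) / len(success_rate_by_ip), poor_ips
```

Returning the list of poor-quality IPs alongside the ratio lets a later step output the affected server IPs when only part of the servers is abnormal.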
If user A has trouble accessing the Internet, then after the inquiry information of user A is obtained, the identifier of user A is used as the index to look up the target data in the B+ tree, which yields information such as the city, cell, service, download rate, time delay, and success rate of user A. Then, using the city service of user A and the like as keys, the key indexes are obtained from the hash table for analysis and judgment, to determine whether the service is abnormal across the whole city, whether some servers are abnormal, or whether the user equipment is abnormal. Specifically, taking user A using service A in city A as an example, when the success rate in the key indexes is analyzed, the success rate A1 of service A in city A and the poor-quality server IP ratio B1 are obtained, and A1 and B1 are compared with preset thresholds. It can be appreciated that the higher the success rate the better, and the smaller the poor-quality server IP ratio the better. Let the first preset threshold be M1 and the second preset threshold be N1. If A1 < M1 and B1 > N1, the target result output according to the comparison result is that service A in city A is abnormal as a whole. If A1 < M1 and B1 < N1, the target result is that some servers of service A in city A are abnormal, and the corresponding server IPs are output at the same time. If A1 > M1 and B1 < N1, the target result is that the equipment of user A is abnormal.
When the download rate in the key indexes is analyzed, the download rate A2 of service A in city A and the poor-quality server IP ratio B2 are obtained, and A2 and B2 are compared with preset thresholds. It can be appreciated that the higher the download rate the better, and the smaller the poor-quality server IP ratio the better. Let the first preset threshold be M2 and the second preset threshold be N2. If A2 < M2 and B2 > N2, the target result output according to the comparison result is that service A in city A is abnormal as a whole. If A2 < M2 and B2 < N2, the target result is that some servers of service A in city A are abnormal, and the corresponding server IPs are output at the same time. If A2 > M2 and B2 < N2, the target result is that the equipment of user A is abnormal.
When the time delay in the key indexes is analyzed, the time delay A3 of service A in city A and the poor-quality server IP ratio B3 are obtained, and A3 and B3 are compared with preset thresholds. It can be appreciated that the smaller the time delay the better, and the smaller the poor-quality server IP ratio the better. Let the first preset threshold be M3 and the second preset threshold be N3. If A3 > M3 and B3 > N3, the target result output according to the comparison result is that service A in city A is abnormal as a whole. If A3 > M3 and B3 < N3, the target result is that some servers of service A in city A are abnormal, and the corresponding server IPs are output at the same time. If A3 < M3 and B3 < N3, the target result is that the equipment of user A is abnormal.
Service A above is merely an example; it should be understood that other services are analyzed on the same principle as service A.
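The three threshold comparisons above (success rate, download rate, time delay) follow one pattern and can be sketched as a single function. The function name, the result strings, and the `higher_is_better` flag are assumptions introduced for this illustration. The description does not say what to output when the key index is acceptable but the poor-quality server IP ratio exceeds its threshold, or when a value equals its threshold, so the sketch returns "undetermined" in those cases.

```python
def diagnose(kpi, poor_ratio, kpi_threshold, ratio_threshold, higher_is_better=True):
    """Classify an anomaly from one key index (A) and the poor-quality ratio (B).

    Use higher_is_better=True for success rate and download rate, and
    higher_is_better=False for time delay.
    """
    if higher_is_better:
        kpi_bad, kpi_good = kpi < kpi_threshold, kpi > kpi_threshold
    else:
        kpi_bad, kpi_good = kpi > kpi_threshold, kpi < kpi_threshold

    if kpi_bad and poor_ratio > ratio_threshold:
        return "service abnormal as a whole"
    if kpi_bad and poor_ratio < ratio_threshold:
        return "some servers abnormal"
    if kpi_good and poor_ratio < ratio_threshold:
        return "user equipment abnormal"
    return "undetermined"  # case not covered by the description
```

For instance, a success rate of 0.80 against a threshold M1 of 0.95, with a poor-quality ratio of 0.6 against a threshold N1 of 0.3, is classified as the service being abnormal as a whole.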
In the embodiment of the present application, a log file is acquired; a target log file is screened out from the log files; the target data in the target log file is stored in the form of a B+ tree; and a target result is obtained based on an operation on the target data stored in the form of a B+ tree. Because a B+ tree supports both keyword lookup and range (interval) search without traversing all of the stored content, the technical solution of the present application can quickly retrieve the target data from the B+ tree by keyword, which increases the operation speed, shortens the time needed to obtain the target result, and thus alleviates, to a certain extent, the low data processing efficiency of the related art.
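To illustrate why a B+ tree allows keyword lookup and range search without a full traversal, the following is a minimal in-memory sketch. It is not the implementation of the application: the class names, the `order` parameter, and the omission of deletion are all simplifications assumed here. Values live only in the leaves, and the leaves are linked so that a range query walks sideways from the first matching leaf.

```python
from bisect import bisect_left, bisect_right

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # internal nodes only: child nodes
        self.values = []    # leaf nodes only: one value per key
        self.next = None    # leaf nodes only: link to the right sibling

class BPlusTree:
    def __init__(self, order=4):
        self.order = order  # maximum number of keys per node
        self.root = Node(leaf=True)

    def _find_leaf(self, key):
        node = self.root
        while not node.leaf:
            node = node.children[bisect_right(node.keys, key)]
        return node

    def get(self, key):
        """Keyword lookup: descend from the root to one leaf."""
        leaf = self._find_leaf(key)
        i = bisect_left(leaf.keys, key)
        if i < len(leaf.keys) and leaf.keys[i] == key:
            return leaf.values[i]
        return None

    def range_query(self, low, high):
        """Yield (key, value) for low <= key <= high along the leaf chain."""
        leaf = self._find_leaf(low)
        while leaf is not None:
            for k, v in zip(leaf.keys, leaf.values):
                if k > high:
                    return
                if k >= low:
                    yield k, v
            leaf = leaf.next

    def insert(self, key, value):
        split = self._insert(self.root, key, value)
        if split is not None:  # the root was split: grow the tree by one level
            sep, right = split
            new_root = Node(leaf=False)
            new_root.keys = [sep]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key, value):
        """Insert below `node`; return (separator, new right node) on split."""
        if node.leaf:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                node.values[i] = value  # existing key: overwrite
                return None
            node.keys.insert(i, key)
            node.values.insert(i, value)
            if len(node.keys) <= self.order:
                return None
            mid = len(node.keys) // 2
            right = Node(leaf=True)
            right.keys, node.keys = node.keys[mid:], node.keys[:mid]
            right.values, node.values = node.values[mid:], node.values[:mid]
            right.next, node.next = node.next, right
            return right.keys[0], right  # leaf split copies the separator up
        i = bisect_right(node.keys, key)
        split = self._insert(node.children[i], key, value)
        if split is None:
            return None
        sep, child = split
        node.keys.insert(i, sep)
        node.children.insert(i + 1, child)
        if len(node.keys) <= self.order:
            return None
        mid = len(node.keys) // 2
        up = node.keys[mid]  # internal split moves the separator up
        right = Node(leaf=False)
        right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
        right.children, node.children = node.children[mid + 1:], node.children[:mid + 1]
        return up, right
```

In the scenario of the description, the key could be a user identifier and the value that user's record, so that `get` serves a single-user inquiry and `range_query` serves an interval of identifiers.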
Fig. 3 is a flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 3, the data processing method provided by the embodiment of the present application may include:
step 310, receiving a log file from the DPI device through the FTP interface;
step 320, screening out a target log file from the log files;
step 330, storing the target data in the target log file in the form of a B+ tree; acquiring key indexes corresponding to at least one target dimension according to the target data; storing each target dimension and the corresponding key index in a hash table in a key value pair mode; the key is a target dimension, and the value is a key index corresponding to the target dimension;
Here, storing the target data in the target log file in the form of a B+ tree includes: loading the screened target log file into a memory; acquiring target data from the target log file; and storing the target data into the memory in the form of a B+ tree. Obtaining the target data from the target log file includes: dividing the target log file into N sub-files in the memory; creating N threads, wherein each thread corresponds to one sub-file; processing the N sub-files by using the N threads to obtain N sub-results; and merging the N sub-results to obtain the target data.
In the embodiment of the present application, after the target log file is screened out, it is read by multiple threads. The target data in the target log file is stored in a B+ tree indexed by the user identifier (IMSI), and the key indexes calculated from the target data are stored in a hash table as key-value pairs keyed by target dimension.
And step 340, obtaining a target result based on the operation of the target data stored in the form of the B+ tree and the key value pairs stored in the hash table.
Wherein, the target data in the target log file is stored in the disk medium in the form of a B+ tree, and/or the target result is stored in the disk medium.
In the embodiment of the present application, a log file is acquired; a target log file is screened out from the log files; the target data in the target log file is stored in the form of a B+ tree; and a target result is obtained based on an operation on the target data stored in the form of a B+ tree. Because a B+ tree supports both keyword lookup and range (interval) search without traversing all of the stored content, the technical solution of the present application can quickly retrieve the target data from the B+ tree by keyword, which increases the operation speed, shortens the time needed to obtain the target result, and thus alleviates, to a certain extent, the low data processing efficiency of the related art.
It should be noted that, in the data processing method provided in the embodiment of the present application, the execution body may be a data processing apparatus, or a control module in the data processing apparatus for executing the data processing method. In the embodiment of the present application, a data processing device is described by taking a data processing method performed by the data processing device as an example.
Fig. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present application. Referring to fig. 4, the data processing apparatus may include:
an acquisition module 410, configured to acquire a log file;
a screening module 420, configured to screen a target log file from the log files;
a processing module 430, configured to store the target data in the target log file in the form of a B+ tree, and to obtain a target result based on an operation on the target data stored in the form of a B+ tree.
In the embodiment of the present application, a log file is acquired; a target log file is screened out from the log files; the target data in the target log file is stored in the form of a B+ tree; and a target result is obtained based on an operation on the target data stored in the form of a B+ tree. Because a B+ tree supports both keyword lookup and range (interval) search without traversing all of the stored content, the technical solution of the present application can quickly retrieve the target data from the B+ tree by keyword, which increases the operation speed, shortens the time needed to obtain the target result, and thus alleviates, to a certain extent, the low data processing efficiency of the related art.
In one embodiment, in the process of acquiring the log file, the acquisition module 410 is specifically configured to: receive the log file from the deep packet inspection (DPI) device through the file transfer protocol (FTP) interface. Because the DPI system parses the application layer to obtain service data and generates log files in a fixed data format according to the service protocol, the log files can be received quickly through a fixed FTP interface.
In one embodiment, the target data in the target log file is stored in the disk medium in the form of a B+ tree, and/or the target result is stored in the disk medium. In this way, data persistence is achieved and data loss is avoided by storing the target data in the disk medium in the form of a B+ tree, and/or storing the target result in the disk medium.
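Persisting the target result can be sketched as below. The description mentions a disk file and a relational database such as MySQL; to keep the example self-contained, this sketch swaps in Python's built-in `sqlite3` module in place of MySQL and writes a CSV file as the disk file. The table name `kpi`, its two columns, and the function name are hypothetical.

```python
import csv
import sqlite3

def persist_results(rows, csv_path, conn):
    """Write (dimension, success_rate) rows to a disk file and a database table."""
    # Disk file: one header line followed by one line per result row.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["dimension", "success_rate"])
        writer.writerows(rows)
    # Relational table (sqlite3 stands in for MySQL in this sketch).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kpi (dimension TEXT PRIMARY KEY, success_rate REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO kpi (dimension, success_rate) VALUES (?, ?)", rows
    )
    conn.commit()
```

Calling this function once per processing interval would overwrite the file and update the rows for dimensions that already exist in the table.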
In one embodiment, in the process of storing the target data in the target log file in the form of a B+ tree, the processing module 430 is specifically configured to: load the screened target log file into a memory; acquire target data from the target log file; and store the target data into the memory in the form of a B+ tree. Because the target data is obtained from the target log file in the memory, the high read-write speed of the memory further improves data processing efficiency.
In one embodiment, in the process of obtaining the target data from the target log file, the processing module 430 is specifically configured to: dividing the target log file into N subfiles in the memory; creating N threads, wherein each thread corresponds to one sub-file; processing the N sub-files by using the N threads to obtain N sub-results; and merging the N sub-results to obtain target data. Therefore, the target log file is read and analyzed in a multithreading mode, target data is obtained, real-time processing of the target log file can be achieved, and data processing efficiency is further improved.
In one embodiment, after the target data is obtained from the target log file, the processing module 430 is further configured to: acquiring key indexes corresponding to at least one target dimension according to the target data; storing each target dimension and the corresponding key index in a hash table in a key value pair mode; the key is a target dimension, and the value is a key index corresponding to the target dimension. Therefore, the corresponding key indexes can be quickly obtained by taking each target dimension as a key through the hash table, and the whole data of each target dimension is preliminarily known through the key indexes, so that the analysis and evaluation are convenient.
In one embodiment, in the process of obtaining a target result based on the operation on the target data stored in the form of a B+ tree, the processing module 430 is specifically configured to: obtain a target result based on operations on the target data stored in the form of a B+ tree and on the key-value pairs stored in the hash table. In this way, the target data can be obtained quickly through the B+ tree, the corresponding data is obtained from the hash table by using a specific dimension in the target data as the key, and the target result is obtained by combining these sources of data, which enables data analysis and display as well as rapid problem location and tracking.
The data processing device in the embodiment of the application can be a device, or can be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a cell phone, tablet computer, notebook computer, palm computer, vehicle mounted electronic device, wearable device, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., and the non-mobile electronic device may be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and embodiments of the present application are not limited in particular.
The data processing device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiment of the present application is not specifically limited in this respect.
It should be noted that, the data processing apparatus provided in the embodiment of the present application corresponds to the above-mentioned data processing method. The relevant content can refer to the description of the data processing method above, and will not be repeated here.
In addition, as shown in fig. 5, the embodiment of the present application further provides an electronic device 500, where the electronic device 500 includes a processor 510, a memory 520, and a program or an instruction stored in the memory 520 and capable of running on the processor 510, where the program or the instruction implements each process of the above-mentioned data processing method embodiment when executed by the processor 510, and the process can achieve the same technical effect, and for avoiding repetition, a detailed description is omitted herein.
It should be noted that, the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
The embodiment of the application also provides a readable storage medium, on which a program or an instruction is stored, which when executed by a processor, implements each process of the above-mentioned data processing method embodiment, and can achieve the same technical effects, and in order to avoid repetition, the description is omitted here.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium such as a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.
The embodiment of the application further provides a chip, which comprises a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is used for running programs or instructions to realize the processes of the data processing method embodiment, and can achieve the same technical effects, so that repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-level chips, chip systems, or system-on-chip chips, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in the reverse order depending on the functions involved; for example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (10)

1. A method of data processing, the method comprising:
acquiring a log file;
screening out a target log file from the log files;
storing target data in the target log file in a B+ tree form;
obtaining a target result based on an operation on the target data stored in the form of a B+ tree.
2. The method of claim 1, wherein the obtaining the log file comprises: the log file from the deep packet inspection device is received via a file transfer protocol interface.
3. The method of claim 1, wherein the target data in the target log file is stored in a disk medium in the form of a B+ tree and/or the target result is stored in a disk medium.
4. The method of claim 1, wherein storing the target data in the target log file in the form of a B+ tree comprises:
loading the screened target log file into a memory;
acquiring target data from the target log file;
and storing the target data into the memory in the form of a B+ tree.
5. The method of claim 4, wherein the obtaining the target data from the target log file comprises:
dividing the target log file into N subfiles in the memory;
creating N threads, wherein each thread corresponds to one sub-file;
processing the N sub-files by using the N threads to obtain N sub-results;
and merging the N sub-results to obtain target data.
6. The method of claim 4, wherein after the obtaining the target data from the target log file, the method further comprises:
acquiring key indexes corresponding to at least one target dimension according to the target data;
storing each target dimension and the corresponding key index in a hash table in a key value pair mode;
the key is a target dimension, and the value is a key index corresponding to the target dimension.
7. The method of claim 6, wherein the obtaining a target result based on the operation on the target data stored in the form of a B+ tree comprises:
and obtaining a target result based on the operation of the target data stored in the form of the B+ tree and the key value pairs stored in the hash table.
8. A data processing apparatus, comprising:
the acquisition module is used for acquiring the log file;
the screening module is used for screening out a target log file from the log files;
the processing module is used for storing the target data in the target log file in the form of a B+ tree, and for obtaining a target result based on an operation on the target data stored in the form of a B+ tree.
9. The apparatus of claim 8, wherein in the process of storing the target data in the target log file in the form of a B+ tree, the processing module is specifically configured to:
loading the screened target log file into a memory;
acquiring target data from the target log file;
and storing the target data into the memory in the form of a B+ tree.
10. An electronic device comprising a processor and a memory storing a program or instructions which, when executed by the processor, implement the steps in the data processing method according to any one of claims 1-7.
Application number: CN202210534493.1A; priority date: 2022-05-17; filing date: 2022-05-17; title: Data processing method and device; status: Pending; publication: CN117112519A (en)

Publications (1)

Publication Number | Publication Date
CN117112519A (en) | 2023-11-24

Family ID: 88802602

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination