WO2020232880A1

WO2020232880A1 - Data processing method and apparatus, storage medium and terminal device

Info

Publication number: WO2020232880A1
Application number: PCT/CN2019/103039
Authority: WO
Inventors: 孙云雷
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-05-21
Filing date: 2019-08-28
Publication date: 2020-11-26
Also published as: CN110245155A

Abstract

The present application falls within the technical field of computers, and relates in particular to a data processing method and apparatus, a non-volatile computer-readable storage medium and a terminal device. The method comprises: receiving a data packet collected and sent by a preset packet capturing tool, wherein the data packet comprises one or more data records; carrying out format matching on a target record according to a preset regular expression resource library, and determining a data format of the data packet, wherein the regular expression resource library comprises one or more regular expressions, and each regular expression corresponds to a data format; searching a preset data processing rule base for a target processing rule, wherein the target processing rule is a data processing rule corresponding to the data format of the data packet; and respectively processing each data record in the data packet according to the target processing rule to obtain a processed data packet. No manual intervention is needed throughout the process, a large amount of time cost and labor cost is saved, and the efficiency is greatly improved.

Description

Data processing method, device, storage medium and terminal equipment

This application claims priority to Chinese patent application No. 201910423175.6, filed with the Chinese Patent Office on May 21, 2019 and entitled "Data processing method, device, computer-readable storage medium and terminal equipment", the entire contents of which are incorporated herein by reference.

Technical field

This application belongs to the field of computer technology, and in particular relates to a data processing method, device, computer non-volatile readable storage medium, and terminal equipment.

Background technique

With the continuing spread of big data technology, more and more scenarios require massive amounts of data to be analyzed and computed. Before such analysis, the data must be preprocessed and converted into a data format that data analysis tools can readily analyze and compute. At present, this data processing work is mainly done manually. When the amount of data is large, it consumes a great deal of time and labor, and the efficiency is very low.

Technical problem

In view of this, the embodiments of the present application provide a data processing method, device, computer non-volatile readable storage medium, and terminal equipment, to solve the problem that existing manual data processing consumes a great deal of time and labor and is very inefficient.

Technical solutions

The first aspect of the embodiments of the present application provides a data processing method, which may include:

Receiving a data packet collected and sent by a preset packet capture tool;

Performing format matching on a target record according to a preset regular expression resource library, and determining the data format of the data packet;

Searching a preset data processing rule library for a target processing rule;

Processing each data record in the data packet separately according to the target processing rule to obtain a processed data packet.

The second aspect of the embodiments of the present application provides a data processing device, which may include a module for implementing the steps of the above data processing method.

The third aspect of the embodiments of the present application provides a computer non-volatile readable storage medium. The computer non-volatile readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by a processor, the steps of the above data processing method are implemented.

The fourth aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the above data processing method are implemented.

Beneficial effect

Through the embodiment of this application, the data format of the data packet is automatically determined by regular matching, and the data processing is further automatically performed according to the corresponding processing rules, that is, the complete process of data format matching and data processing is realized in a fully automated manner. The entire process does not require any manual intervention, saving a lot of time and labor costs, and greatly improving the efficiency of data processing.

Description of the drawings

FIG. 1 is a flowchart of an embodiment of a data processing method in an embodiment of this application;

FIG. 2 is a schematic flow chart of format matching of data packets;

FIG. 3 is a schematic diagram of setting multiple standby data processing terminals to perform parallel processing on data packets;

FIG. 4 is a schematic flow chart of offloading processing of data packets;

FIG. 5 is a structural diagram of an embodiment of a data processing device in an embodiment of the application;

FIG. 6 is a schematic block diagram of a terminal device in an embodiment of the application.

Embodiments of the invention

Referring to FIG. 1, an embodiment of a data processing method in the embodiment of the present application may include:

Step S101: Receive a data packet collected and sent by a preset packet capture tool.

The packet capture tool is a tool for collecting data transmitted on the network. In this embodiment, the packet capture tool includes but is not limited to tools such as Fiddler and Wireshark.

The packet capture tool packs the collected data into several data packets and sends the data packets to a preset data processing terminal. The data processing terminal is the implementation subject of this embodiment. Each data packet includes one or more data records. The number of data records contained in each data packet can be set according to the actual situation; for example, it can be set to 1000, 2000, 5000, or another value.

The following is a specific example of each data record in a data packet:

c0-e1=string:tsalesApplyCustContact

c0-e4=string:tsalesApplyCust

c0-e6=string:true

c0-e7=string:upd

c0-e8=string:2011068

c0-e9=string: 2

c0-e10=string:1

c0-e11=string:1

c0-e12=string:

c0-e13=string:9

c0-e14=string:tsalesApplyCust

c0-e15=string:1

Among them, each row is a data record.

It should be noted that the data records in a data packet are all collected for the same business scenario, so all data records in the same data packet have the same data format. Data records in different data packets may have the same data format or different data formats. Here, the data format refers to the regular format characteristics of a data record. In the previous example, the data format is: each data record starts with c, followed by one or more decimal digits, then "-e", then one or more decimal digits, then "=string:", then a string consisting of decimal digits or characters (whose length may be 0).

Step S102: Perform format matching on the target record according to a preset regular expression resource library, and determine the data format of the data packet.

The regular expression resource library includes one or more regular expressions, and each regular expression corresponds to a data format. For each business scenario, the regular expression corresponding to its data format can be preset, and the regular expressions of all data formats can be assembled into the regular expression resource library shown in the following table:

Data format	Regular expression
Data format a	Regular expression 1
Data format b	Regular expression 2
Data format c	Regular expression 3
...	...

A regular expression (Regular Expression) is a concept from computer science. It is usually used to retrieve and replace text that conforms to a certain pattern (rule): certain predefined characters, and combinations of these characters, form a "rule string" that expresses a filtering logic for strings. A regular expression is a text pattern that describes one or more strings to be matched when searching text.

When the data processing terminal needs to match the format of a received data packet, it first selects one of the regular expressions from the regular expression resource library and matches it against a data record in the data packet. If the matching succeeds, the data format of the data packet is determined to be the data format that corresponds to that regular expression in the regular expression resource library.

Since all data records in the same data packet have the same data format, it is sufficient to select any one data record (that is, the target record) from the data packet and perform format matching on it; format matching is not required for every data record in the data packet.

If the format matching fails, the next regular expression is selected to perform format matching on the data packet, and the above process is repeated continuously until the format matching succeeds.
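The sequential matching described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `REGEX_LIBRARY` and `detect_format` are hypothetical, the pattern for `format_b` is invented for illustration, and returning `None` when no expression matches is an assumption, since the description does not say what happens when the library is exhausted.

```python
import re

# Hypothetical regular expression resource library: data format -> pattern.
# "format_a" follows the example records in this description;
# "format_b" is an invented pattern used only to show a failed attempt.
REGEX_LIBRARY = {
    "format_b": re.compile(r"^\d{4}-\d{2}-\d{2},"),
    "format_a": re.compile(r"^c[0-9]+-e[0-9]+=string:"),
}


def detect_format(packet, library=REGEX_LIBRARY):
    """Return the data format of a packet, or None if no expression matches."""
    if not packet:
        return None
    # Any record may serve as the target record, because all records
    # in one packet share the same data format.
    target_record = packet[0]
    for data_format, pattern in library.items():
        if pattern.match(target_record):
            return data_format
    # Assumption: an unmatched packet is reported as None.
    return None


packet = ["c0-e1=string:tsalesApplyCustContact", "c0-e12=string:"]
print(detect_format(packet))
```

Only one record is examined per packet, so the cost of format detection depends on the number of expressions tried, not on the packet size.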

It should be noted that the specific content of the above regular expression resource library can be adjusted according to the actual situation. For example, when data in a certain data format no longer needs to be analyzed, the corresponding entry can be removed from the regular expression resource library; when data in a new data format needs to be analyzed, a corresponding entry can be added; and the regular expression corresponding to a given data format can be modified as needed, so that the regular expression resource library remains applicable to the latest business scenarios.

Preferably, the probabilities with which data packets of different data formats occur may differ greatly. For example, data packets in one or a few data formats may account for most of the total number of data packets, while data packets in the other data formats may be few. To reduce the number of matching attempts, format matching can be performed on the data packet according to the process shown in FIG. 2:

In step S1021, the matching success rate of each regular expression in the regular expression resource library is calculated according to the historical matching records in the preset statistical time period.

The statistical period can be set to 1 month, 2 months, 3 months, half a year, one year, or another value according to the actual situation. Because data that is too old has little reference value, it is generally appropriate to set the statistical period to within one year.

The matching success rate is positively correlated with the number of successful matches of the regular expression in the historical matching records: the more successful matches, the higher the matching success rate, and the fewer successful matches, the lower the matching success rate. The historical matching records store the regular expression used each time the format of a data packet was successfully matched. For example, if format matching has been performed on a total of 50 data packets, of which 30 were matched by regular expression 1, 14 by regular expression 2, and 6 by regular expression 3, then regular expression 1 has the highest matching success rate, regular expression 2 the second highest, and regular expression 3 the lowest.

To make the calculation more accurate, in a specific implementation of this embodiment the statistical period can first be divided into T sub-periods, where T is a positive integer whose value can be set according to actual conditions, for example 5, 10, 20, or another value. It should be noted that the larger the value of T, the greater the amount of calculation but the higher the calculation accuracy; the smaller the value of T, the smaller the amount of calculation but the lower the calculation accuracy. The two must be traded off according to the actual situation.

Then, the number of matching successes of each regular expression in the regular expression resource library in each sub-period is counted separately, and the matching success rate of each regular expression in the regular expression resource library is calculated separately according to the following formula :

Here n is the index of the regular expression, 1≤n≤N, where N is the total number of regular expressions in the regular expression resource library; t is the index of the sub-period in chronological order, 1≤t≤T, so that the earlier a sub-period is, the smaller its value of t; MatSucNum_n,t is the number of successful matches of the n-th regular expression in the regular expression resource library in the t-th sub-period; Weight_t is a preset weight coefficient with Weight_t < Weight_t+1, that is, later sub-periods have larger weight coefficients. This is because the closer data is to the current moment, the greater its reference value, and the older the data, the smaller its reference value. For example, data recorded this week obviously reflects current user habits better than data from a few months ago. MatSucRatio_n is the matching success rate of the n-th regular expression in the regular expression resource library.
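The formula itself is not reproduced in this text. One form that is consistent with the definitions above, offered here purely as an assumption and not as the formula of the application, is a weighted count normalized over all expressions: MatSucRatio_n = Σ_t (Weight_t × MatSucNum_n,t) / Σ_m Σ_t (Weight_t × MatSucNum_m,t). A minimal Python sketch of that assumed form follows; the function name and the sample weights are hypothetical, and the sample counts are chosen so that the totals equal the 30, 14, and 6 of the earlier example.

```python
def matching_success_rates(match_counts, weights):
    """Weighted matching success rate per regular expression (assumed form).

    match_counts: expression name -> list of success counts per sub-period,
                  oldest sub-period first (MatSucNum_n,t).
    weights:      one weight per sub-period, strictly increasing (Weight_t).
    """
    if any(earlier >= later for earlier, later in zip(weights, weights[1:])):
        raise ValueError("weights must be strictly increasing")
    for name, counts in match_counts.items():
        if len(counts) != len(weights):
            raise ValueError(f"{name}: expected {len(weights)} sub-periods")
    # Weighted number of successes for each expression.
    weighted = {
        name: sum(w * c for w, c in zip(weights, counts))
        for name, counts in match_counts.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return {name: 0.0 for name in weighted}
    # Normalize so that the rates of all expressions sum to 1.
    return {name: value / total for name, value in weighted.items()}


counts = {
    "regex_1": [10, 10, 10],  # 30 successes in total
    "regex_2": [8, 4, 2],     # 14 successes in total
    "regex_3": [0, 2, 4],     # 6 successes in total
}
rates = matching_success_rates(counts, weights=[1, 2, 3])
print(rates)
```

With increasing weights, an expression whose successes are recent (such as `regex_3` here) receives a larger share than its raw count alone would give it.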

Step S1022, from the regular expression resource library, select a regular expression with the highest matching success rate that has not been selected as a candidate expression.

Step S1023: Use the candidate expression to perform format matching on the target record.

For example, suppose the candidate expression is "^c[0-9]{1,}-e[0-9]{1,}=string:", where ^ denotes the beginning of the line, [0-9] denotes any digit from 0 to 9, and {1,} means at least one occurrence. This regular expression matches data records that start with c, followed by one or more decimal digits, then "-e", then one or more decimal digits, then "=string:", then a string of decimal digits or characters (whose length may be 0). Still taking the data packet mentioned above as an example, any one of its data records can be successfully matched with the candidate expression, so it can be determined that the format matching succeeds. If the target record could not be matched, it would be determined that the format matching fails.
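The behavior of this candidate expression can be checked with a short Python sketch. The helper name `matches_candidate` and the non-matching sample record are hypothetical; the expression and the matching sample records are taken from the description.

```python
import re

# The candidate expression quoted in the description.
CANDIDATE = re.compile(r"^c[0-9]{1,}-e[0-9]{1,}=string:")


def matches_candidate(record):
    """Return True if the record starts with the c<digits>-e<digits>=string: prefix."""
    return CANDIDATE.match(record) is not None


sample_packet = [
    "c0-e1=string:tsalesApplyCustContact",
    "c0-e9=string: 2",
    "c0-e12=string:",  # the trailing string may be empty
]
print([matches_candidate(record) for record in sample_packet])

# Hypothetical record with no digits after "c": the match fails.
print(matches_candidate("c-e1=string:x"))
```

Because the expression constrains only the prefix, whatever follows "=string:" is accepted, including an empty string.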

Step S1024: Determine whether the format matching is successful.

If the format matching fails, return to step S1022 and subsequent steps until the format matching succeeds; if the format matching succeeds, perform step S1025.

Step S1025: Determine the data format corresponding to the candidate expression as the data format of the data packet.

When format matching is performed, the regular expressions are selected from the regular expression resource library in descending order of matching success rate. In this way, the format matching process can generally be completed with fewer matching attempts, which improves the speed of format matching for data packets.
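Steps S1022 to S1025 can be sketched in Python as follows. This is an illustrative sketch only: the function name, the patterns for `format_b` and `format_c`, and the sample success rates are hypothetical, and reporting `None` when every expression fails is an assumption not stated in the description.

```python
import re


def match_by_success_rate(target_record, library, success_rates):
    """Try expressions in descending order of matching success rate.

    library:       data format -> compiled regular expression.
    success_rates: data format -> matching success rate.
    Returns (data_format, attempts); data_format is None if nothing matches.
    """
    ordered = sorted(
        library,
        key=lambda data_format: success_rates.get(data_format, 0.0),
        reverse=True,
    )
    for attempts, data_format in enumerate(ordered, start=1):
        if library[data_format].match(target_record):
            return data_format, attempts
    return None, len(ordered)


library = {
    "format_c": re.compile(r"^<record "),             # hypothetical pattern
    "format_b": re.compile(r"^\d{4}-\d{2}-\d{2},"),   # hypothetical pattern
    "format_a": re.compile(r"^c[0-9]+-e[0-9]+=string:"),
}
rates = {"format_a": 0.60, "format_b": 0.28, "format_c": 0.12}
result = match_by_success_rate("c0-e7=string:upd", library, rates)
print(result)
```

Returning the number of attempts makes it easy to compare this ordering against an arbitrary one on recorded traffic.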

Step S103: Search for the target processing rule in a preset data processing rule library.

The target processing rule is a data processing rule corresponding to the data format of the data packet.

In this embodiment, different data processing rules are adopted for data packets of different data formats, so as to generate data formats that are convenient for subsequent data analysis tools to analyze and compute. The data processing rule corresponding to each data format can be preset, and the data processing rules of all data formats can be assembled into the data processing rule library shown in the following table:

Data format	Data processing rule
Data format a	Data processing rule 1
Data format b	Data processing rule 2
Data format c	Data processing rule 3
...	...

It should be noted that the specific content of the above data processing rule library can be adjusted according to actual conditions, including but not limited to adding, deleting, and modifying data processing rules.
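One simple way to picture the data processing rule library is as a mapping from data format to a processing function. The Python sketch below assumes that representation; the names `RULE_LIBRARY`, `find_target_rule`, `process_packet`, and `rule_for_format_a` are hypothetical, as is the choice to raise an error when no rule is registered for a data format.

```python
def rule_for_format_a(record):
    """Hypothetical rule: split a record at the first '=' into (key, value)."""
    key, _, value = record.partition("=")
    return key, value


# Hypothetical data processing rule library: data format -> rule.
RULE_LIBRARY = {
    "format_a": rule_for_format_a,
}


def find_target_rule(data_format, rule_library=RULE_LIBRARY):
    """Look up the processing rule for a data format (step S103)."""
    try:
        return rule_library[data_format]
    except KeyError:
        # Assumption: a missing rule is treated as an error.
        raise LookupError(f"no processing rule for {data_format!r}") from None


def process_packet(packet, data_format, rule_library=RULE_LIBRARY):
    """Apply the target rule to every record in the packet (step S104)."""
    rule = find_target_rule(data_format, rule_library)
    return [rule(record) for record in packet]


print(process_packet(["c0-e7=string:upd", "c0-e12=string:"], "format_a"))
```

Adding, deleting, or modifying a rule then amounts to updating one entry of the mapping.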

Step S104: Process each data record in the data packet separately according to the target processing rule to obtain a processed data packet.

Take the data packet in the following data format as an example:

c0-e1=string:tsalesApplyCustContact

c0-e4=string:tsalesApplyCust

c0-e6=string:true

c0-e7=string:upd

c0-e8=string:2011068

c0-e9=string: 2

c0-e10=string:1

c0-e11=string:1

c0-e12=string:

c0-e13=string:9

c0-e14=string:tsalesApplyCust

c0-e15=string:1

The corresponding data processing rule can be set as follows: divide each data record into two parts, the first part being the data before the equal sign (c0-e1) and the second part being the data after the equal sign (string:tsalesApplyCustContact); enclose each part in quotation marks and separate the two parts with a colon ("c0-e1":"string:tsalesApplyCustContact"). Finally, separate the data records with commas and enclose the whole in a pair of braces to form a data packet in the following data format:

{"c0-e1":"string:tsalesApplyCustContact","c0-e4":"string:tsalesApplyCust","c0-e6":"string:true","c0-e7":"string:upd","c0-e8":"string:2011068","c0-e9":"string:2","c0-e10":"string:1","c0-e11":"string:1","c0-e12":"string:","c0-e13":"string:9","c0-e14":"string:tsalesApplyCust","c0-e15":"string:1"}

It should be noted that the above is only an example of the data processing rule. In actual use, the data processing rule corresponding to the data format of the data packet to be processed can be set according to the specific scenario, which will not be repeated here.
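The example rule above can be sketched in Python as follows. The function name `records_to_json` is hypothetical. As a substitution for the manual quoting described in the text, the sketch uses the standard `json` module so that special characters are escaped correctly; it also assumes that the keys within one packet are unique.

```python
import json


def records_to_json(packet):
    """Convert 'key=value' records into one compact JSON object string."""
    pairs = {}
    for record in packet:
        # Split at the first '=' only, so the value may itself contain '='.
        key, separator, value = record.partition("=")
        if not separator:
            raise ValueError(f"record has no '=': {record!r}")
        pairs[key] = value
    # Compact separators give "key":"value" pairs joined by commas.
    return json.dumps(pairs, separators=(",", ":"))


packet = [
    "c0-e1=string:tsalesApplyCustContact",
    "c0-e12=string:",
    "c0-e15=string:1",
]
print(records_to_json(packet))
```

The result is valid JSON, so downstream analysis tools can load it with any standard JSON parser.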

Further, in practical applications there may be extreme cases in which a massive number of data packets must be processed, and in such cases the data processing terminal alone would be overloaded. To solve this problem, as shown in FIG. 3, multiple standby data processing terminals can also be set up in this embodiment to process data packets in parallel.

Specifically, after the data format of the data packet is determined in step S102, the total number of data packets waiting to be processed in the data processing terminal may be counted first. If the total number of data packets waiting to be processed is less than or equal to a preset number threshold, processing continues according to the process shown in FIG. 1. The number threshold can be set according to actual conditions, for example 100, 200, 500, or another value. If the total number of data packets waiting to be processed is greater than the number threshold, processing is performed according to the process shown in FIG. 4:

Step S401: Obtain the preset configuration files of each standby data processing terminal, and determine the data format corresponding to each standby data processing terminal according to the configuration file.

Each standby data processing terminal is dedicated to processing data packets of a certain data format, and this correspondence is stored in advance in the configuration file of each standby data processing terminal. The data processing terminal can obtain these configuration files and, based on them, determine the data format corresponding to each standby data processing terminal.

Step S402: Divide each standby data processing terminal into a corresponding data processing cluster.

As shown in FIG. 4, in this embodiment, all data processing terminals are preferably divided into two or more data processing clusters, wherein the data formats corresponding to the standby data processing terminals in the same data processing cluster are all consistent.

Step S403: Select a target cluster corresponding to the data packet.

The data format corresponding to each standby data processing terminal in the target cluster is consistent with the data format of the data packet.

Step S404: Send the data packet to the target cluster for processing.

Since each data processing terminal in the target cluster is dedicated to the data format of the data packet, the data packet can be processed more quickly.
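The threshold check and steps S401 to S404 can be sketched in Python as follows. All names here are hypothetical, including the `data_format` key assumed in each configuration, and representing configuration files as dictionaries is an assumption made to keep the sketch self-contained.

```python
from collections import defaultdict


def build_clusters(terminal_configs):
    """Group standby terminals by the data format in their configuration.

    terminal_configs: terminal id -> configuration dict with a 'data_format' key.
    Returns: data format -> list of terminal ids (one cluster per format).
    """
    clusters = defaultdict(list)
    for terminal_id, config in terminal_configs.items():
        clusters[config["data_format"]].append(terminal_id)
    return dict(clusters)


def choose_route(data_format, pending_count, threshold, clusters):
    """Return ('local', []) at or below the threshold, else ('cluster', terminals)."""
    if pending_count <= threshold:
        return "local", []
    # Assumption: an empty list is returned if no cluster serves this format.
    return "cluster", clusters.get(data_format, [])


configs = {
    "standby-1": {"data_format": "format_a"},
    "standby-2": {"data_format": "format_b"},
    "standby-3": {"data_format": "format_a"},
}
clusters = build_clusters(configs)
print(clusters)
print(choose_route("format_a", pending_count=250, threshold=200, clusters=clusters))
```

Keeping the clusters keyed by data format makes target cluster selection a single lookup.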

Further, the data processing terminal may send a data packet query request to each standby data processing terminal in the target cluster, receive the number of to-be-processed data packets fed back by each standby data processing terminal in the target cluster, then select from the target cluster the standby data processing terminal with the smallest number of to-be-processed data packets as the preferred processing terminal, and allocate the data packet to the preferred processing terminal for processing. The processing performed by the preferred processing terminal is similar to that in step S104; for details, refer to the foregoing description, which is not repeated here.
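Selecting the preferred processing terminal can be sketched in Python as follows. The function name and the `query_pending_count` callable are hypothetical; the callable stands in for the data packet query request and its reply, whose actual transport is not specified in the description. When several terminals report the same smallest number, this sketch picks the first one in cluster order, which is an assumption.

```python
def pick_preferred_terminal(cluster, query_pending_count):
    """Return the terminal in the cluster with the fewest pending data packets.

    cluster:             list of standby terminal ids in the target cluster.
    query_pending_count: callable mapping a terminal id to its pending count.
    """
    if not cluster:
        raise ValueError("target cluster is empty")
    # min() evaluates the key once per terminal, i.e. one query per terminal.
    return min(cluster, key=query_pending_count)


# Hypothetical replies to the query requests.
pending = {"standby-1": 40, "standby-3": 12, "standby-4": 12}
preferred = pick_preferred_terminal(
    ["standby-1", "standby-3", "standby-4"], pending.__getitem__
)
print(preferred)
```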

Through the process shown in FIG. 4, after format matching is performed on each data packet in the data stream, each data packet is distributed to the data processing cluster corresponding to its data format for processing according to the result of the format matching. At this time, each standby data processing terminal in each data processing cluster will simultaneously process data packets in each data format in parallel, thereby improving overall data processing efficiency.

In summary, the embodiment of the application uses regular expression matching to automatically determine the data format of the data packet, and then automatically performs data processing according to the corresponding processing rule. That is, the complete process of data format matching and data processing is realized in a fully automated manner, without any manual intervention, which saves a great deal of time and labor and greatly improves the efficiency of data processing.

Corresponding to the data processing method described in the above embodiment, FIG. 5 shows a structural diagram of an embodiment of a data processing apparatus provided in an embodiment of the present application.

In this embodiment, a data processing device may include:

The data packet receiving module 501 is configured to receive data packets collected and sent by a preset packet capture tool;

The format matching module 502 is configured to perform format matching on the target record according to a preset regular expression resource library, and determine the data format of the data packet;

The processing rule search module 503 is used to search for the target processing rule in a preset data processing rule library;

The data processing module 504 is configured to separately process each data record in the data packet according to the target processing rule to obtain a processed data packet.

Further, the format matching module may include:

The matching success rate calculation unit is configured to calculate the matching success rate of each regular expression in the regular expression resource library according to historical matching records in a preset statistical period;

The candidate expression selection unit is used to select a regular expression with the highest matching success rate that has not been selected as a candidate expression from the regular expression resource library;

A format matching unit, configured to use the candidate expression to perform format matching on the target record;

The first processing unit is configured to, if the format matching fails, return to and execute the step of selecting, from the regular expression resource library, a regular expression with the highest matching success rate that has not yet been selected as a candidate expression, until the format matching succeeds;

The second processing unit is configured to determine the data format corresponding to the candidate expression successfully matched as the data format of the data packet if the format matching is successful.

Further, the matching success rate calculation unit may include:

The sub-period division sub-unit is used to divide the statistical period into T sub-periods, where T is a positive integer;

The frequency counting subunit is used to separately count the number of successful matches of each regular expression in the regular expression resource library in each sub-period;

The matching success rate calculation subunit is used to calculate the matching success rate of each regular expression in the regular expression resource library.

Further, the data processing device may further include:

Data packet number statistics module, used to count the total number of data packets waiting to be processed;

The configuration file obtaining module is configured to, if the total number of data packets waiting to be processed is greater than the preset number threshold, obtain the preset configuration file of each standby data processing terminal, and determine the data format corresponding to each standby data processing terminal according to the configuration file;

The cluster division module is used to divide each standby data processing terminal into the corresponding data processing cluster;

A cluster selection module for selecting a target cluster corresponding to the data packet;

The data packet sending module is used to send the data packet to the target cluster for processing.

Further, the data processing device may further include:

The number query module is configured to send a data packet query request to each standby data processing terminal in the target cluster, and respectively receive the number of data packets to be processed fed back by each standby data processing terminal in the target cluster;

A terminal selection module, configured to select a standby data processing terminal with the smallest number of data packets to be processed from the target cluster as a preferred processing terminal;

The data packet distribution module is used to distribute the data packet to the preferred processing terminal for processing.

Those skilled in the art can clearly understand that, for the convenience and conciseness of description, the specific working process of the devices, modules and units described above can refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

FIG. 6 shows a schematic block diagram of a terminal device according to an embodiment of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown.

In this embodiment, the terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device 6 may include: a processor 60, a memory 61, and computer-readable instructions 62 stored in the memory 61 and executable on the processor 60, such as computer-readable instructions for executing the aforementioned data processing method. When the processor 60 executes the computer-readable instructions 62, the steps in the foregoing embodiments of the data processing method are implemented, such as steps S101 to S104 shown in FIG. 1. Alternatively, when the processor 60 executes the computer-readable instructions 62, the functions of the modules/units in the foregoing device embodiments, such as the functions of the modules 501 to 504 shown in FIG. 5, are realized.

Exemplarily, the computer-readable instructions 62 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 61 and executed by the processor 60 to complete the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 62 in the terminal device 6.

The processor 60 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.

The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, for example, a plug-in hard disk equipped on the terminal device 6, a smart memory card (Smart Media Card, SMC), or a Secure Digital (SD) Card, Flash Card, etc. Further, the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store the computer-readable instructions and other instructions and data required by the terminal device 6. The memory 61 can also be used to temporarily store data that has been output or will be output.

Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above-mentioned functions can be allocated to different functional units and modules as required; that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist alone physically, or two or more units can be integrated into one unit. The above-mentioned integrated units can be realized in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only used to distinguish them from one another, and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the foregoing system, reference may be made to the corresponding process in the foregoing method embodiment, which is not repeated here.

In the above-mentioned embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail or recorded in an embodiment, reference may be made to related descriptions of other embodiments.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above-mentioned embodiments of this application can also be completed by instructing relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a non-volatile computer-readable storage medium.

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed, they may include the processes of the above-mentioned method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application, and should be included within the scope of protection of this application.

Claims

A data processing method, characterized by comprising:

Receiving a data packet collected and sent by a preset packet capture tool, the data packet including one or more data records;

Performing format matching on a target record according to a preset regular expression resource library to determine the data format of the data packet, wherein the regular expression resource library includes one or more regular expressions, each regular expression corresponds to one data format, and the target record is any data record in the data packet;

Searching for a target processing rule in a preset data processing rule library, where the target processing rule is a data processing rule corresponding to the data format of the data packet;

Each data record in the data packet is processed separately according to the target processing rule to obtain a processed data packet.
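
The four steps of this claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the format names, the two regular expressions, the rule functions, and the names `REGEX_LIBRARY`, `RULE_LIBRARY`, `detect_format` and `process_packet` are all hypothetical, and the first record of the packet is used as the target record because the claim allows any record to be chosen.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical regular expression resource library: data format -> pattern.
REGEX_LIBRARY: Dict[str, re.Pattern] = {
    "csv_triplet": re.compile(r"^[^,]+,[^,]+,[^,]+$"),
    "key_value": re.compile(r"^\w+=\S+(;\w+=\S+)*$"),
}

# Hypothetical data processing rule library: data format -> processing rule.
RULE_LIBRARY: Dict[str, Callable[[str], dict]] = {
    "csv_triplet": lambda rec: dict(zip(("a", "b", "c"), rec.split(","))),
    "key_value": lambda rec: dict(p.split("=", 1) for p in rec.split(";")),
}


def detect_format(target_record: str) -> Optional[str]:
    """Return the data format of the first pattern matching the target record."""
    for data_format, pattern in REGEX_LIBRARY.items():
        if pattern.match(target_record):
            return data_format
    return None


def process_packet(packet: List[str]) -> List[dict]:
    """Detect the packet's format from one record, then process every record."""
    if not packet:
        return []
    # Assumption: the first record serves as the target record.
    data_format = detect_format(packet[0])
    if data_format is None:
        raise ValueError("no regular expression matched the target record")
    rule = RULE_LIBRARY[data_format]  # look up the target processing rule
    return [rule(record) for record in packet]


processed = process_packet(["user=alice;age=30", "user=bob;age=41"])
```

Only one record is matched against the library; the format it yields is then applied to the whole packet, which is what lets the rule lookup happen once per packet rather than once per record.
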
The data processing method according to claim 1, wherein performing format matching on the target record according to the preset regular expression resource library to determine the data format of the data packet comprises:

Respectively calculating the matching success rate of each regular expression in the regular expression resource library according to historical matching records in a preset statistical period;

Selecting, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression;

Using the candidate expression to perform format matching on the target record;

If the format matching fails, returning to the step of selecting, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression, until the format matching succeeds;

If the format matching is successful, the data format corresponding to the candidate expression that is successfully matched is determined as the data format of the data packet.
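
The selection loop of this claim can be sketched as follows, assuming the matching success rates have already been computed. The function name `match_by_success_rate`, the example library and the example rates are hypothetical. The claim loops until a match succeeds; returning `None` when every expression has been tried is an added assumption so that the sketch always terminates.

```python
import re
from typing import Dict, Optional, Tuple


def match_by_success_rate(
    target_record: str,
    library: Dict[str, re.Pattern],
    success_rate: Dict[str, float],
) -> Tuple[Optional[str], int]:
    """Try expressions in descending success-rate order.

    Returns the matched data format (or None) and the number of attempts made.
    """
    untried = list(library)
    attempts = 0
    while untried:
        # Candidate: the not-yet-selected expression with the highest rate.
        candidate = max(untried, key=lambda fmt: success_rate.get(fmt, 0.0))
        untried.remove(candidate)
        attempts += 1
        if library[candidate].fullmatch(target_record):
            return candidate, attempts
    return None, attempts  # assumption: give up once the library is exhausted


# Hypothetical library and precomputed success rates.
example_library = {
    "ipv4": re.compile(r"\d+\.\d+\.\d+\.\d+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "hex": re.compile(r"[0-9a-f]+"),
}
example_rates = {"date": 0.6, "ipv4": 0.3, "hex": 0.1}

result = match_by_success_rate("10.0.0.1", example_library, example_rates)
```

Ordering candidates by historical success rate means the formats seen most often are usually identified on the first or second attempt.
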
The data processing method according to claim 2, wherein calculating the matching success rate of each regular expression in the regular expression resource library according to historical matching records within a preset statistical period comprises:

Divide the statistical period into T sub-periods, where T is a positive integer;

Respectively count the number of successful matches of each regular expression in the regular expression resource library in each sub-period;

The matching success rate of each regular expression in the regular expression resource library is calculated according to the following formula:

where n is the sequence number of the regular expression, 1≤n≤N; N is the total number of regular expressions in the regular expression resource library; t is the sequence number of the sub-period in chronological order, 1≤t≤T; MatSucNum_{n,t} is the number of successful matches of the n-th regular expression in the regular expression resource library in the t-th sub-period; Weight_t is a preset weight coefficient, with Weight_t < Weight_{t+1}; and MatSucRatio_n is the matching success rate of the n-th regular expression in the regular expression resource library.
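
The formula itself is not reproduced in this text, so the following is only a sketch of one form consistent with the symbol definitions above. It assumes that MatSucRatio_n is the weighted sum of MatSucNum_{n,t} over the T sub-periods, normalized by the same weighted sum taken over all N regular expressions; that normalization, and the function name `matching_success_rates`, are assumptions rather than the claimed formula.

```python
from typing import List


def matching_success_rates(
    counts: List[List[int]], weights: List[float]
) -> List[float]:
    """Assumed form: weighted count of expression n / weighted count of all.

    counts[n][t] plays the role of MatSucNum_{n,t};
    weights[t] plays the role of Weight_t.
    """
    if any(len(row) != len(weights) for row in counts):
        raise ValueError("each expression needs one count per sub-period")
    # The claim requires Weight_t < Weight_{t+1}: recent sub-periods weigh more.
    if any(a >= b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be strictly increasing")
    weighted = [sum(w * c for w, c in zip(weights, row)) for row in counts]
    total = sum(weighted)
    if total == 0:
        return [0.0] * len(counts)  # no successful matches in the period
    return [value / total for value in weighted]


# Two expressions, two sub-periods: the second matched only recently.
rates = matching_success_rates([[10, 0], [0, 10]], [1.0, 3.0])
```

Because the weights increase with t, two expressions with the same raw number of successes are ranked by how recent those successes are.
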
The data processing method according to any one of claims 1 to 3, wherein after determining the data format of the data packet, it further comprises:

Count the total number of data packets waiting to be processed;

If the total number of data packets waiting to be processed is greater than a preset number threshold, acquiring preset configuration files of each standby data processing terminal, and determining the data format corresponding to each standby data processing terminal according to the configuration file;

Dividing each standby data processing terminal into a corresponding data processing cluster, wherein the data formats corresponding to the standby data processing terminals in the same data processing cluster are all consistent;

Selecting a target cluster corresponding to the data packet, and the data format corresponding to each standby data processing terminal in the target cluster is consistent with the data format of the data packet;

The data packet is sent to the target cluster for processing.
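
The cluster division and target-cluster selection of this claim can be sketched as follows. The configuration layout (a dictionary with a `data_format` key per terminal) and the names `build_clusters` and `select_target_cluster` are hypothetical; the claim only states that each standby terminal's data format is determined from its configuration file.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_clusters(terminal_configs: Dict[str, dict]) -> Dict[str, List[str]]:
    """Group standby terminals so that each cluster shares one data format."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for terminal_id, config in terminal_configs.items():
        # Assumption: the configuration file exposes a "data_format" entry.
        clusters[config["data_format"]].append(terminal_id)
    return dict(clusters)


def select_target_cluster(
    packet_format: str,
    pending_total: int,
    threshold: int,
    terminal_configs: Dict[str, dict],
) -> Optional[List[str]]:
    """Return the matching cluster, or None when no offloading is needed."""
    if pending_total <= threshold:
        return None  # backlog is within the threshold; process locally
    return build_clusters(terminal_configs).get(packet_format, [])


# Hypothetical standby terminals and their configured data formats.
configs = {
    "t1": {"data_format": "json"},
    "t2": {"data_format": "csv"},
    "t3": {"data_format": "json"},
}
target = select_target_cluster("json", pending_total=120, threshold=100,
                               terminal_configs=configs)
```
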
The data processing method according to claim 4, wherein after the data packet is sent to the target cluster for processing, the method further comprises:

Respectively sending a data packet query request to each standby data processing terminal in the target cluster, and respectively receiving the number of to-be-processed data packets fed back by each standby data processing terminal in the target cluster;

Selecting a standby data processing terminal with the smallest number of data packets to be processed from the target cluster as a preferred processing terminal;

The data packet is distributed to the preferred processing terminal for processing.
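
The selection of the preferred processing terminal can be sketched as follows. The query is modelled as a plain callable, `query_pending`, standing in for the data packet query request and its reply; that callable, the function name `choose_preferred_terminal`, and the tie-breaking rule (the first terminal in cluster order wins) are assumptions.

```python
from typing import Callable, List


def choose_preferred_terminal(
    cluster: List[str], query_pending: Callable[[str], int]
) -> str:
    """Pick the terminal reporting the fewest to-be-processed data packets."""
    if not cluster:
        raise ValueError("target cluster is empty")
    # One query per terminal, mirroring the request/feedback exchange.
    pending = {terminal: query_pending(terminal) for terminal in cluster}
    # min() returns the first terminal in cluster order on ties (assumption).
    return min(cluster, key=lambda terminal: pending[terminal])


# Hypothetical queue lengths reported by three terminals.
reported = {"t1": 7, "t3": 2, "t5": 2}
preferred = choose_preferred_terminal(["t1", "t3", "t5"], reported.__getitem__)
```
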
A data processing device, characterized by comprising:

The data packet receiving module is configured to receive a data packet collected and sent by a preset packet capture tool, the data packet including one or more data records;

The format matching module is configured to perform format matching on a target record according to a preset regular expression resource library to determine the data format of the data packet, wherein the regular expression resource library includes one or more regular expressions, each regular expression corresponds to one data format, and the target record is any data record in the data packet;

A processing rule search module, configured to search for a target processing rule in a preset data processing rule library, where the target processing rule is a data processing rule corresponding to the data format of the data packet;

The data processing module is configured to process each data record in the data packet separately according to the target processing rule to obtain a processed data packet.
The data processing device according to claim 6, wherein the format matching module comprises:

The matching success rate calculation unit is configured to calculate the matching success rate of each regular expression in the regular expression resource library according to historical matching records in a preset statistical period;

The candidate expression selection unit is configured to select, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression;

A format matching unit, configured to use the candidate expression to perform format matching on the target record;

The first processing unit is configured to, if the format matching fails, return to the step of selecting, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression, until the format matching succeeds;

The second processing unit is configured to determine the data format corresponding to the candidate expression successfully matched as the data format of the data packet if the format matching is successful.
The data processing device according to claim 7, wherein the matching success rate calculation unit comprises:

The sub-period division sub-unit is used to divide the statistical period into T sub-periods, where T is a positive integer;

The frequency counting subunit is used to separately count the number of successful matches of each regular expression in the regular expression resource library in each sub-period;

The matching success rate calculation subunit is used to calculate the matching success rate of each regular expression in the regular expression resource library according to the following formula:

where n is the sequence number of the regular expression, 1≤n≤N; N is the total number of regular expressions in the regular expression resource library; t is the sequence number of the sub-period in chronological order, 1≤t≤T; MatSucNum_{n,t} is the number of successful matches of the n-th regular expression in the regular expression resource library in the t-th sub-period; Weight_t is a preset weight coefficient, with Weight_t < Weight_{t+1}; and MatSucRatio_n is the matching success rate of the n-th regular expression in the regular expression resource library.
The data processing device according to any one of claims 6 to 8, further comprising:

Data packet number statistics module, used to count the total number of data packets waiting to be processed;

The configuration file obtaining module is configured to obtain the preset configuration file of each standby data processing terminal if the total number of data packets waiting to be processed is greater than the preset number threshold, and to determine the data format corresponding to each standby data processing terminal according to the configuration file;

The cluster division module is used to divide each standby data processing terminal into a corresponding data processing cluster, wherein the data formats corresponding to the standby data processing terminals in the same data processing cluster are all consistent;

A cluster selection module, configured to select a target cluster corresponding to the data packet, and the data format corresponding to each standby data processing terminal in the target cluster is consistent with the data format of the data packet;

The data packet sending module is used to send the data packet to the target cluster for processing.
The data processing device according to claim 9, further comprising:

The number query module is configured to send a data packet query request to each standby data processing terminal in the target cluster, and respectively receive the number of data packets to be processed fed back by each standby data processing terminal in the target cluster;

A terminal selection module, configured to select a standby data processing terminal with the smallest number of data packets to be processed from the target cluster as a preferred processing terminal;

The data packet distribution module is used to distribute the data packet to the preferred processing terminal for processing.
A computer non-volatile readable storage medium, the computer non-volatile readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:

Receiving a data packet collected and sent by a preset packet capture tool, the data packet including one or more data records;

Performing format matching on a target record according to a preset regular expression resource library to determine the data format of the data packet, wherein the regular expression resource library includes one or more regular expressions, each regular expression corresponds to one data format, and the target record is any data record in the data packet;

Searching for a target processing rule in a preset data processing rule library, where the target processing rule is a data processing rule corresponding to the data format of the data packet;

Each data record in the data packet is processed separately according to the target processing rule to obtain a processed data packet.
The computer non-volatile readable storage medium according to claim 11, wherein performing format matching on the target record according to the preset regular expression resource library to determine the data format of the data packet comprises:

Respectively calculating the matching success rate of each regular expression in the regular expression resource library according to historical matching records in a preset statistical period;

Selecting, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression;

Using the candidate expression to perform format matching on the target record;

If the format matching fails, returning to the step of selecting, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression, until the format matching succeeds;

If the format matching is successful, the data format corresponding to the candidate expression that is successfully matched is determined as the data format of the data packet.
The computer non-volatile readable storage medium according to claim 12, wherein calculating the matching success rate of each regular expression in the regular expression resource library according to historical matching records within a preset statistical period comprises:

Divide the statistical period into T sub-periods, where T is a positive integer;

Respectively count the number of successful matches of each regular expression in the regular expression resource library in each sub-period;

The matching success rate of each regular expression in the regular expression resource library is calculated according to the following formula:

where n is the sequence number of the regular expression, 1≤n≤N; N is the total number of regular expressions in the regular expression resource library; t is the sequence number of the sub-period in chronological order, 1≤t≤T; MatSucNum_{n,t} is the number of successful matches of the n-th regular expression in the regular expression resource library in the t-th sub-period; Weight_t is a preset weight coefficient, with Weight_t < Weight_{t+1}; and MatSucRatio_n is the matching success rate of the n-th regular expression in the regular expression resource library.
The computer non-volatile readable storage medium according to any one of claims 11 to 13, characterized in that, after the data format of the data packet is determined, further comprising:

Count the total number of data packets waiting to be processed;

If the total number of data packets waiting to be processed is greater than a preset number threshold, acquiring preset configuration files of each standby data processing terminal, and determining the data format corresponding to each standby data processing terminal according to the configuration file;

Dividing each standby data processing terminal into a corresponding data processing cluster, wherein the data formats corresponding to the standby data processing terminals in the same data processing cluster are all consistent;

Selecting a target cluster corresponding to the data packet, and the data format corresponding to each standby data processing terminal in the target cluster is consistent with the data format of the data packet;

The data packet is sent to the target cluster for processing.
The computer non-volatile readable storage medium according to claim 14, wherein after sending the data packet to the target cluster for processing, further comprising:

Respectively sending a data packet query request to each standby data processing terminal in the target cluster, and respectively receiving the number of to-be-processed data packets fed back by each standby data processing terminal in the target cluster;

Selecting a standby data processing terminal with the smallest number of data packets to be processed from the target cluster as a preferred processing terminal;

The data packet is distributed to the preferred processing terminal for processing.
A terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:

Receiving a data packet collected and sent by a preset packet capture tool, the data packet including one or more data records;

Performing format matching on a target record according to a preset regular expression resource library to determine the data format of the data packet, wherein the regular expression resource library includes one or more regular expressions, each regular expression corresponds to one data format, and the target record is any data record in the data packet;

Searching for a target processing rule in a preset data processing rule library, where the target processing rule is a data processing rule corresponding to the data format of the data packet;

Each data record in the data packet is processed separately according to the target processing rule to obtain a processed data packet.
The terminal device according to claim 16, wherein performing format matching on the target record according to the preset regular expression resource library to determine the data format of the data packet comprises:

Respectively calculating the matching success rate of each regular expression in the regular expression resource library according to historical matching records in a preset statistical period;

Selecting, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression;

Using the candidate expression to perform format matching on the target record;

If the format matching fails, returning to the step of selecting, from the regular expression resource library, the not-yet-selected regular expression with the highest matching success rate as a candidate expression, until the format matching succeeds;

If the format matching is successful, the data format corresponding to the candidate expression that is successfully matched is determined as the data format of the data packet.
The terminal device according to claim 17, wherein the calculation of the matching success rate of each regular expression in the regular expression resource library according to historical matching records within a preset statistical period comprises:

Divide the statistical period into T sub-periods, where T is a positive integer;

Respectively count the number of successful matches of each regular expression in the regular expression resource library in each sub-period;

The matching success rate of each regular expression in the regular expression resource library is calculated according to the following formula:

where n is the sequence number of the regular expression, 1≤n≤N; N is the total number of regular expressions in the regular expression resource library; t is the sequence number of the sub-period in chronological order, 1≤t≤T; MatSucNum_{n,t} is the number of successful matches of the n-th regular expression in the regular expression resource library in the t-th sub-period; Weight_t is a preset weight coefficient, with Weight_t < Weight_{t+1}; and MatSucRatio_n is the matching success rate of the n-th regular expression in the regular expression resource library.
The terminal device according to any one of claims 16 to 18, characterized in that, after determining the data format of the data packet, it further comprises:

Count the total number of data packets waiting to be processed;

If the total number of data packets waiting to be processed is greater than a preset number threshold, acquiring preset configuration files of each standby data processing terminal, and determining the data format corresponding to each standby data processing terminal according to the configuration file;

Dividing each standby data processing terminal into a corresponding data processing cluster, wherein the data formats corresponding to the standby data processing terminals in the same data processing cluster are all consistent;

Selecting a target cluster corresponding to the data packet, and the data format corresponding to each standby data processing terminal in the target cluster is consistent with the data format of the data packet;

The data packet is sent to the target cluster for processing.
The terminal device according to claim 19, characterized in that, after sending the data packet to the target cluster for processing, further comprising:

Respectively sending a data packet query request to each standby data processing terminal in the target cluster, and respectively receiving the number of to-be-processed data packets fed back by each standby data processing terminal in the target cluster;

Selecting a standby data processing terminal with the smallest number of data packets to be processed from the target cluster as a preferred processing terminal;

The data packet is distributed to the preferred processing terminal for processing.