CN107798060B

CN107798060B - Real-time streaming data processing application software feature recognition method

Info

Publication number: CN107798060B
Application number: CN201710833546.9A
Authority: CN
Inventors: 饶翔
Original assignee: Nanjing Axon Science & Technology Co ltd
Current assignee: Nanjing Axon Science & Technology Co ltd
Priority date: 2017-09-15
Filing date: 2017-09-15
Publication date: 2023-06-30
Anticipated expiration: 2037-09-15
Also published as: CN107798060A

Abstract

The invention relates to a feature recognition method for application software in real-time streaming data processing. The method comprises defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and the recognition of feature triplets and application software (APP). A custom M-tree is defined that comprises a root node, character nodes, loop nodes, milestone nodes and APP nodes; corresponding node constraint rules are formulated for these nodes; the tree is then built by the construction algorithm; and finally application software is identified by analysing data triplets against the feature triplets. The invention provides a feature recognition method for real-time streaming data processing that occupies few processor resources, runs fast, scales to large capacity and has outstanding algorithm performance.

Description

Real-time streaming data processing application software feature recognition method

Technical Field

The invention relates to the technical field of information classification, in particular to a real-time streaming data processing application software feature recognition method.

Background

In the big-data era, data is a resource: mining and applying it helps enterprises understand their users better and improve service quality and market competitiveness. Data accumulation is the basis of big-data applications. In the mobile Internet era, users access the Internet mainly through application software (hereinafter referred to as APP). Performing feature recognition on a user's Internet access logs and matching each log to an APP provides the basic data for later batch analysis and prepares the conditions for data mining.

The server address HOST and the request path PATH, together with the terminal's UA field, form a data triplet (HOST, PATH, UA) that can serve as the input for identifying an APP. The APP feature library is the set of organization rules for the data triplets corresponding to specific APPs; each rule is given as a group of static character strings combined with dynamic fuzzy matching. For example, (%.taobao.com, /push/%, %) is a feature triplet, in which the character '%' is a wildcard that matches any sequence of 0 to N characters. UA in this description is the User-Agent, a special string header that lets a server identify the operating system and version, processor type, browser and version, browser rendering engine, browser language, browser plug-ins and the like.
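The '%' wildcard semantics described above can be illustrated with a minimal Python sketch. The function names `wildcard_match` and `triplet_matches` are hypothetical and do not appear in the patent; the sketch translates each pattern into a regular expression purely for illustration and is not the method the invention claims.

```python
import re


def wildcard_match(pattern: str, text: str) -> bool:
    """Return True if `text` matches `pattern`, where '%' stands for 0..N characters."""
    # Escape every literal piece and join the pieces with '.*' (any run of characters).
    regex = ".*".join(re.escape(piece) for piece in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


def triplet_matches(feature, data) -> bool:
    """A data triplet matches a feature triplet when HOST, PATH and UA all match."""
    return all(wildcard_match(f, d) for f, d in zip(feature, data))


# Feature triplet modeled on the example in the text; the data triplets are made up.
feature = ("%.taobao.com", "/push/%", "%")
hit = triplet_matches(feature, ("m.taobao.com", "/push/msg", "Mozilla/5.0"))   # True
miss = triplet_matches(feature, ("m.taobao.com", "/home", "Mozilla/5.0"))      # False
```

Because '%' may match zero characters, the pattern "%" on its own matches any string, including the empty string.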

One APP typically corresponds to N feature triplets, but one feature triplet can correspond to only one APP. When a data triplet satisfies several feature triplets at the same time, the best-matching feature triplet must be used as the recognition result. In practice, the feature triplet with the longest feature string is selected from the set of matching feature triplets as the recognition result. In real projects there are typically more than 1,000 APPs with an average of 10 feature triplets each, giving more than 10,000 feature triplets in total.

Traditionally, identification proceeds as follows: extract the data triplet (HOST, PATH, UA) from each Internet log; traverse all feature triplets, compare them one by one, and cache those that satisfy the matching condition; after the traversal is complete, take the APP corresponding to the cached feature triplet with the longest feature string as the recognition result.
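The traditional traversal described above might be sketched as follows. This is only an illustration under stated assumptions: the names `identify_bruteforce` and `feature_length` are hypothetical, the feature library is assumed to be a list of (feature triplet, APP name) pairs, and "longest feature string" is assumed to mean the number of literal (non-'%') characters across the three elements.

```python
import re


def wildcard_match(pattern, text):
    """'%' in `pattern` matches any run of 0..N characters."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("%"))
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


def feature_length(feature):
    # Assumption: the feature length counts literal characters only.
    return sum(len(element.replace("%", "")) for element in feature)


def identify_bruteforce(data, feature_library):
    """Compare one data triplet against every feature triplet in the library."""
    matched = [
        (feature, app)
        for feature, app in feature_library
        if all(wildcard_match(f, d) for f, d in zip(feature, data))
    ]
    if not matched:
        return None
    # Keep the APP whose matching feature triplet is the longest.
    return max(matched, key=lambda pair: feature_length(pair[0]))[1]


# Hypothetical two-entry feature library.
library = [
    (("%.taobao.com", "/push/%", "%"), "taobao_push"),
    (("%.taobao.com", "%", "%"), "taobao"),
]
result = identify_bruteforce(("m.taobao.com", "/push/x", "UA"), library)  # "taobao_push"
```

Every log is compared against the whole library, so the work per log grows with the number of feature triplets.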

The conventional identification method above has the following disadvantages. It occupies too many CPU clock resources: for example, identifying the APP for 10,000 Internet logs against a feature library of 10,000 triplets requires 10,000 × 10,000 = 100 million matching operations, which is an extremely large amount of computation. In addition, as the data flow or the feature library grows, the number of comparisons grows as the product of the two, so computational efficiency and algorithm performance drop sharply.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and to provide a feature recognition method for application software in real-time streaming data processing that occupies few processor resources, runs fast, scales to large capacity and has outstanding algorithm performance.

In order to achieve the above purpose, the present invention adopts the following technical scheme.

A real-time streaming data processing application software feature recognition method comprises defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and the recognition of feature triplets and application software APP, wherein the defined M-tree comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode; the root node Root is a special milestone node and is the root of the M-tree; a character node CharNode represents a literal character in a feature triplet, that is, any character other than the wildcard '%', and one CharNode contains exactly one character; a loop node LoopNode expresses the wildcard '%', meaning that the node is its own child; a milestone node StoneNode expresses an element boundary in a feature triplet, meaning that its child nodes begin the next triplet element; an APP node AppidNode is a special milestone node, is a leaf node of the whole M-tree, and represents a specific APP;

formulating the node constraint rules specifically comprises: only one root node is allowed in a defined M-tree; the children of the root node Root cannot be milestone nodes StoneNode or APP nodes; the child of a loop node LoopNode cannot be a loop node; the children of a milestone node StoneNode cannot be milestone nodes or APP nodes; an APP node AppidNode must be a leaf node and has no child node of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode among its children; a parent node may have several character nodes as children, but each must correspond to a distinct character, so that no character is repeated.
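The node types and constraint rules above can be sketched as a small Python data structure with a checker. This is a minimal sketch under assumptions: the names `Kind`, `Node` and `violations` are hypothetical, and the self-loop of a LoopNode is treated as implicit rather than stored in the child list.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    ROOT = "Root"
    CHAR = "CharNode"
    LOOP = "LoopNode"
    STONE = "StoneNode"
    APP = "AppidNode"


@dataclass
class Node:
    kind: Kind
    value: str = ""                      # the character for CHAR, the APP name for APP
    children: list = field(default_factory=list)


def violations(node, is_top=True):
    """Return a list of descriptions of the constraint rules broken at or below `node`."""
    found = []
    kinds = [child.kind for child in node.children]
    if is_top and node.kind is not Kind.ROOT:
        found.append("top node is not Root")
    if Kind.ROOT in kinds:
        found.append("more than one root node")
    if node.kind in (Kind.ROOT, Kind.STONE) and (Kind.STONE in kinds or Kind.APP in kinds):
        found.append(f"{node.kind.value} has a StoneNode or AppidNode child")
    if node.kind is Kind.LOOP and Kind.LOOP in kinds:
        found.append("LoopNode has a LoopNode child")
    if node.kind is Kind.APP and node.children:
        found.append("AppidNode is not a leaf")
    if kinds.count(Kind.LOOP) > 1 or kinds.count(Kind.STONE) > 1:
        found.append("more than one LoopNode or StoneNode under one parent")
    chars = [child.value for child in node.children if child.kind is Kind.CHAR]
    if len(chars) != len(set(chars)):
        found.append("duplicate CharNode character under one parent")
    for child in node.children:
        found.extend(violations(child, is_top=False))
    return found
```

Calling `violations(root)` on a well-formed tree returns an empty list; each broken rule adds one description.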

The construction algorithm of the M-tree specifically comprises the following steps:

step one: create the root node and start traversing the full set of feature triplets;

step two: for each feature triplet, set the current node to the root node and traverse HOST, PATH and UA of the triplet in sequence;

step three: for each literal character, look up the character node for that character in the child set of the current node; if none is found, create a new character node and add it to the child set of the current node; then set the current node to that character node;

step four: if the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;

step five: at the end of each non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node, add it to the child set of the current node, and end the traversal of that feature triplet.
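The five construction steps can be sketched in Python as below. This is one possible reading, not the patent's implementation: the names `Node`, `child` and `build_m_tree` are hypothetical, the feature library is assumed to be a list of (feature triplet, APP name) pairs, and an existing loop or milestone node is assumed to be reused rather than created again, so that the "at most one per parent" constraint holds.

```python
class Node:
    def __init__(self, kind, value=None):
        self.kind = kind          # "root", "char", "loop", "stone" or "app"
        self.value = value        # the character for "char", the APP name for "app"
        self.children = []

    def child(self, kind, value=None):
        """Return the existing child with this kind and value, creating it if absent."""
        for existing in self.children:
            if existing.kind == kind and existing.value == value:
                return existing
        node = Node(kind, value)
        self.children.append(node)
        return node


def build_m_tree(feature_library):
    root = Node("root")                                   # step one
    for (host, path, ua), app in feature_library:
        current = root                                    # step two
        for index, element in enumerate((host, path, ua)):
            for ch in element:
                if ch != "%":
                    current = current.child("char", ch)   # step three
                elif current.kind != "loop":
                    current = current.child("loop")       # step four
            if index < 2:
                current = current.child("stone")          # step five: HOST or PATH ends
            else:
                current.child("app", app)                 # step five: UA ends
    return root


# Hypothetical two-entry feature library.
library = [
    (("%.taobao.com", "/push/%", "%"), "taobao_push"),
    (("%.taobao.com", "%", "%"), "taobao"),
]
tree = build_m_tree(library)
```

In this example both feature triplets share the HOST branch, so the tree diverges only after the first milestone node, where one branch continues with the character '/' and the other with a loop node.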

The identification of the feature triples and the application software APP specifically comprises the following steps:

step one: the input is a data triplet (HOST, PATH, UA), and a matching path for each data triplet is searched for in the defined M-tree;

step two: starting from the root node, convert the HOST of the data triplet into a character stream and search from top to bottom until one or more milestone nodes are found;

step three: starting from those milestone nodes, convert the PATH of the data triplet into a character stream and search from top to bottom until the milestone nodes of the next stage are found;

step four: starting from the milestone nodes of the PATH stage, convert the UA of the data triplet into a character stream and search from top to bottom until all APP nodes of the last stage are found;

step five: from all APP nodes that satisfy the feature matching, take the APP node farthest from the root node as the search result.
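The recognition steps can be sketched as a search that tracks every tree node still consistent with the characters read so far. This is a hedged illustration, not the patent's code: the names `Node`, `build_m_tree`, `with_loops` and `identify` are hypothetical, the tree is built with a compact version of the construction steps, and "farthest from the root node" is assumed to mean the greatest node depth.

```python
class Node:
    def __init__(self, kind, value=None, depth=0):
        self.kind, self.value, self.depth = kind, value, depth
        self.children = []

    def child(self, kind, value=None):
        """Return the matching child, creating it one level deeper if absent."""
        for existing in self.children:
            if existing.kind == kind and existing.value == value:
                return existing
        node = Node(kind, value, self.depth + 1)
        self.children.append(node)
        return node


def build_m_tree(feature_library):
    root = Node("root")
    for elements, app in feature_library:
        current = root
        for index, element in enumerate(elements):
            for ch in element:
                if ch != "%":
                    current = current.child("char", ch)
                elif current.kind != "loop":
                    current = current.child("loop")
            current = current.child("stone") if index < 2 else current.child("app", app)
    return root


def with_loops(nodes):
    """Also activate each node's loop child, because '%' may match zero characters."""
    result = set(nodes)
    for node in nodes:
        result.update(c for c in node.children if c.kind == "loop")
    return result


def identify(root, data):
    states = with_loops({root})
    for index, element in enumerate(data):               # HOST, then PATH, then UA
        for ch in element:
            # A loop node consumes the character and stays active (it is its own child).
            following = {n for n in states if n.kind == "loop"}
            for n in states:
                following.update(c for c in n.children if c.kind == "char" and c.value == ch)
            states = with_loops(following)
        boundary = "stone" if index < 2 else "app"
        ends = {c for n in states for c in n.children if c.kind == boundary}
        if not ends:
            return None                                   # no feature triplet matches
        states = with_loops(ends) if index < 2 else ends
    return max(states, key=lambda n: n.depth).value       # APP node farthest from the root


# Hypothetical two-entry feature library.
library = [
    (("%.taobao.com", "/push/%", "%"), "taobao_push"),
    (("%.taobao.com", "%", "%"), "taobao"),
]
tree = build_m_tree(library)
app = identify(tree, ("m.taobao.com", "/push/x", "UA"))   # "taobao_push"
```

Each input character is processed once against the currently active nodes, so the work per log depends on how many branches stay alive rather than on the total number of feature triplets.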

Owing to the above technical scheme, the invention brings the following beneficial technical effects. By defining a custom M-tree comprising a root node, character nodes, loop nodes, milestone nodes and APP nodes, formulating corresponding node constraint rules, building the tree with the construction algorithm, and finally identifying application software by analysing feature triplets, the method achieves strong runtime performance: in testing, twenty thousand Internet logs could be identified per second. The method also scales well: when the number of triplets in the feature library increases, performance is essentially unaffected. Finally, the algorithm performs well under load: when the data flow of Internet logs increases, the hardware resource requirement grows linearly.

Drawings

Fig. 1 is a schematic diagram of the overall structure of the present invention.

Fig. 2 is a schematic flow chart of the M-tree construction algorithm of the present invention.

Fig. 3 is a schematic diagram of the identification flow of the feature triplet and the application software APP of the present invention.

Description of the embodiments

The present invention will be described in further detail with reference to the following schemes and examples.

As shown in Fig. 1, the real-time streaming data processing application software feature recognition method comprises defining an M-tree 1, formulating node constraint rules 2, an M-tree construction algorithm 3, and the recognition 4 of feature triplets and application software APP, wherein the defined M-tree 1 comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode.

The root node Root is a special milestone node and is the root of the M-tree; a character node CharNode represents a literal character in a feature triplet, that is, any character other than the wildcard '%', and one CharNode contains exactly one character; a loop node LoopNode expresses the wildcard '%', meaning that the node is its own child; a milestone node StoneNode expresses an element boundary in a feature triplet, meaning that its child nodes begin the next triplet element; an APP node AppidNode is a special milestone node, is a leaf node of the whole M-tree, and represents a specific APP.

Formulating the node constraint rules 2 specifically comprises: only one root node is allowed in an M-tree; the children of the root node cannot be milestone nodes or APP nodes; the child of a loop node cannot be a loop node; the children of a milestone node cannot be milestone nodes or APP nodes; an APP node must be a leaf node with no child node of any kind; a parent node may have at most one loop node and at most one milestone node among its children; a parent node may have several character nodes as children, but each must correspond to a distinct character, so that no character is repeated.

The M-tree construction algorithm 3 specifically comprises the following steps:

(1) Create the root node and start traversing the full set of feature triplets;

(2) For each feature triplet, set the current node to the root node and traverse HOST, PATH and UA of the triplet in sequence;

(3) For each literal character, look up the character node for that character in the child set of the current node; if none is found, create a new character node and add it to the child set of the current node; then set the current node to that character node;

(4) If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;

(5) At the end of each non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node, add it to the child set of the current node, and end the traversal of that feature triplet.

The identification 4 of the feature triples and the application software APP specifically comprises the following steps:

(1) The input is a data triplet (HOST, PATH, UA), and a matching path for each data triplet is searched for in the defined M-tree;

(2) Starting from the root node, convert the HOST of the data triplet into a character stream and search from top to bottom until one or more milestone nodes are found;

(3) Starting from those milestone nodes, convert the PATH of the data triplet into a character stream and search from top to bottom until the milestone nodes of the next stage are found;

(4) Starting from the milestone nodes of the PATH stage, convert the UA of the data triplet into a character stream and search from top to bottom until all APP nodes of the last stage are found;

(5) From all APP nodes that satisfy the feature matching, take the APP node farthest from the root node as the search result.

The foregoing is merely a specific application example of the present invention, and the protection scope of the present invention is not limited in any way. All technical schemes formed by equivalent transformation or equivalent substitution fall within the protection scope of the invention.

Claims

1. A real-time streaming data processing application software feature recognition method, characterized in that it comprises: defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and the recognition of feature triplets and application software APP, wherein the defined M-tree comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode;

the root node Root is a special milestone node and is the root of the M-tree; a character node CharNode represents a literal character in a feature triplet, that is, any character other than the wildcard '%', and one character node CharNode contains exactly one character; a loop node LoopNode expresses the wildcard '%', meaning that the node is its own child; a milestone node StoneNode expresses an element boundary in a feature triplet, meaning that its child nodes begin the next triplet element; an APP node AppidNode is a special milestone node, is a leaf node of the whole M-tree, and represents a specific APP;

formulating the node constraint rules specifically comprises: only one root node is allowed in a defined M-tree; the children of the root node Root cannot be milestone nodes StoneNode or APP nodes; the child of a loop node LoopNode cannot be a loop node; the children of a milestone node StoneNode cannot be milestone nodes or APP nodes; an APP node AppidNode must be a leaf node and has no child node of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode among its children; several character nodes under the same parent node must each correspond to a distinct character, so that no character is repeated;

step five: at the end of each non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node, add it to the child set of the current node, and end the traversal of that feature triplet;