CN107798060B - Real-time streaming data processing application software feature recognition method - Google Patents
- Publication number
- CN107798060B (application CN201710833546.9A)
- Authority
- CN
- China
- Prior art keywords
- node
- nodes
- milestone
- character
- app
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a feature recognition method for application software in real-time streaming data processing. The method comprises defining an M-tree, formulating node constraint rules, constructing the M-tree, and identifying the application software (APP) from feature triplets. A custom M-tree is defined with a root node, character nodes, loop nodes, milestone nodes and APP nodes; corresponding node constraint rules are formulated; the tree is then built by a construction algorithm; and finally the application software is identified by analysing data triplets against the feature triplets. The invention provides a feature recognition method for real-time streaming data processing that uses few processor resources, runs fast, scales to large feature libraries and has outstanding algorithmic performance.
Description
Technical Field
The invention relates to the technical field of information classification, in particular to a real-time streaming data processing application software feature recognition method.
Background
Data is a resource. In the big-data era, mining and applying data helps enterprises understand their users better and improves service quality and market competitiveness. Data accumulation is the basis of big-data applications. In the mobile Internet era, users go online mainly through application software (hereinafter referred to as APP). Performing feature recognition on a user's Internet access log and matching the log to an APP provides the basic data for later batch analysis and prepares the ground for data mining.
The server address HOST and the request PATH, together with the terminal's UA field, form a data triplet (HOST, PATH, UA) that can serve as the input for identifying an APP. The APP feature library is the set of rules describing which data triplets correspond to a specific APP; each rule is given as a group of static character strings combined with dynamic fuzzy matching. For example, (%.taobao.com, /push/%, %) is a feature triplet, in which the character '%' is a wildcard that matches any sequence of 0 to N characters. UA in this description means User-Agent, a special request header string that lets the server identify the operating system and its version, the processor type, the browser and its version, the browser rendering engine, the browser language, browser plug-ins and so on.
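To make the wildcard semantics concrete, the following is a minimal Python sketch of matching one data triplet against one feature triplet. It assumes that '%' behaves like `.*` in a regular expression and that every element must match in full; the helper names `wildcard_to_regex` and `triplet_matches` and the sample User-Agent value are illustrative and do not come from the patent.

```python
import re

def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a feature string in which '%' matches 0..N arbitrary characters."""
    # Escape the literal pieces and join them with '.*' where '%' stood.
    literal_pieces = (re.escape(piece) for piece in pattern.split("%"))
    return re.compile(".*".join(literal_pieces), re.DOTALL)

def triplet_matches(feature, data) -> bool:
    """True when each data element fully matches the corresponding feature element."""
    return all(
        wildcard_to_regex(f).fullmatch(d) is not None
        for f, d in zip(feature, data)
    )

feature = ("%.taobao.com", "/push/%", "%")
print(triplet_matches(feature, ("m.taobao.com", "/push/msg", "Mozilla/5.0")))   # True
print(triplet_matches(feature, ("m.example.com", "/push/msg", "Mozilla/5.0")))  # False
```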
One APP typically corresponds to N feature triplets, but one feature triplet can correspond to only one APP. When a data triplet satisfies several feature triplets at once, the best-matching feature triplet must be used as the recognition result. In practice, the feature triplet with the longest feature string is selected from the set of matching feature triplets. In real projects there are typically more than 1,000 APPs with an average of 10 feature triplets each, giving more than 10,000 feature triplets.
Traditionally, identification proceeds as follows: extract the data triplet (HOST, PATH, UA) from each Internet access log; traverse all feature triplets, comparing them one by one and caching those that satisfy the matching condition; and, after the traversal is complete, take the APP corresponding to the cached feature triplet with the longest feature string as the recognition result.
This conventional identification method has the following disadvantages. It consumes too many CPU cycles: for example, identifying APPs for 10,000 Internet access logs against a feature library of 10,000 triplets requires 10,000 × 10,000 = 100 million matching operations, which is an extremely large amount of computation. In addition, as the data flow or the feature library grows, computational efficiency drops exponentially and the performance of the algorithm degrades sharply.
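The conventional method and its cost can be sketched as follows. This is an illustrative baseline only: the two-entry feature library, the APP names and the choice of "number of non-wildcard characters" as the measure of feature length are assumptions made for the example.

```python
import re

def element_matches(pattern: str, text: str) -> bool:
    """'%' in the pattern matches 0..N characters; everything else is literal."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("%"))
    return re.fullmatch(regex, text, re.DOTALL) is not None

def literal_length(feature) -> int:
    """Assumed ranking key: count of non-wildcard characters in the triplet."""
    return sum(len(element.replace("%", "")) for element in feature)

def identify_brute_force(data, feature_library):
    """Compare one data triplet with every feature triplet; keep the longest match."""
    best, comparisons = None, 0
    for feature, app in feature_library:
        comparisons += 1
        if all(element_matches(f, d) for f, d in zip(feature, data)):
            if best is None or literal_length(feature) > literal_length(best[0]):
                best = (feature, app)
    return (best[1] if best else None), comparisons

# Hypothetical feature library: (feature triplet, APP name).
library = [
    (("%.taobao.com", "%", "%"), "Taobao"),
    (("%.taobao.com", "/push/%", "%"), "TaobaoPush"),
]
print(identify_brute_force(("m.taobao.com", "/push/msg", "Mozilla/5.0"), library))
# ('TaobaoPush', 2)

# Every log is compared with every feature triplet:
print(10_000 * 10_000)  # 100000000
```

Each log costs one comparison per feature triplet, so the total work is the product of the two sizes.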
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and to provide a feature recognition method for application software in real-time streaming data processing that uses few processor resources, runs fast, scales to large capacity and has outstanding algorithmic performance.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A feature recognition method for application software in real-time streaming data processing comprises defining an M-tree, formulating node constraint rules, constructing the M-tree with a construction algorithm, and identifying the application software APP from feature triplets. The defined M-tree comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode. The root node Root is a special milestone node and is the root of the M-tree. A character node CharNode holds a literal character of a feature triplet, that is, any character other than the wildcard '%'; one CharNode contains exactly one character. A loop node LoopNode expresses the wildcard '%', meaning that the node is its own child. A milestone node StoneNode expresses an element boundary within a feature triplet, meaning that its child nodes begin the next triplet element. An APP node AppidNode is a special milestone node; it is a leaf of the whole M-tree and represents a specific APP.
The node constraint rules are formulated as follows: only one root node is allowed in a defined M-tree; the level directly below the root node Root may not contain a milestone node StoneNode or an APP node; the level directly below a loop node LoopNode may not contain a loop node; the level directly below a milestone node StoneNode may not contain a milestone node or an APP node; an APP node AppidNode must be a leaf node and has no child nodes of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode as children; a parent node may have multiple character nodes as children, but each must correspond to a distinct character, so no character may be repeated among them.
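The node kinds and constraint rules above can be written down as a small data model with a checker. The sketch below is one possible reading, not the patent's implementation: the `Kind` enum, the `Node` dataclass, the list-of-children layout and the `violations` function are all names invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Kind(Enum):
    ROOT = "Root"
    CHAR = "CharNode"
    LOOP = "LoopNode"
    STONE = "StoneNode"
    APPID = "AppidNode"

@dataclass(eq=False)
class Node:
    kind: Kind
    char: Optional[str] = None   # set only on CharNode
    app: Optional[str] = None    # set only on AppidNode
    children: List["Node"] = field(default_factory=list)

def violations(parent: Node) -> List[str]:
    """List the constraint rules broken by the direct children of `parent`."""
    kinds = [child.kind for child in parent.children]
    found = []
    if Kind.ROOT in kinds:
        found.append("a tree has exactly one root, so Root cannot be a child")
    if parent.kind in (Kind.ROOT, Kind.STONE) and (Kind.STONE in kinds or Kind.APPID in kinds):
        found.append("Root/StoneNode cannot be followed directly by a StoneNode or AppidNode")
    if parent.kind is Kind.LOOP and Kind.LOOP in kinds:
        found.append("LoopNode cannot be followed directly by a LoopNode")
    if parent.kind is Kind.APPID and kinds:
        found.append("AppidNode must be a leaf")
    if kinds.count(Kind.LOOP) > 1 or kinds.count(Kind.STONE) > 1:
        found.append("at most one LoopNode and one StoneNode per parent")
    chars = [child.char for child in parent.children if child.kind is Kind.CHAR]
    if len(chars) != len(set(chars)):
        found.append("CharNode children must hold distinct characters")
    return found

ok_root = Node(Kind.ROOT, children=[Node(Kind.CHAR, char="a"), Node(Kind.LOOP)])
bad_root = Node(Kind.ROOT, children=[Node(Kind.STONE)])
print(violations(ok_root))   # []
print(violations(bad_root))  # one rule broken: milestone directly under the root
```

Checking only the direct children of one parent is enough here because every rule in the list is stated in terms of a node and its next level.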
The construction algorithm of the M-tree specifically comprises the following steps:
step one: create the root node and begin traversing the full set of feature triplets;
step two: for each feature triplet, set the current node to the root node and traverse HOST, PATH and UA of the triplet in order;
step three: for each character, look up the matching character node in the child set of the current node; if none is found, create a new character node and add it to the child set of the current node; then set the current node to that character node;
step four: if the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
step five: for a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; for the UA element, create an APP node, add it to the child set of the current node, and end the traversal of this feature triplet.
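The five construction steps can be sketched in Python as below. The `Node` layout (a dict of character children plus single `loop` and `stone` slots), the `depth` field and the sample library are assumptions made for this illustration. The sketch also reuses an existing loop or milestone child rather than always creating a new one, which is how it keeps the "at most one per parent" constraint; the patent text only says "create".

```python
class Node:
    """One M-tree node; kind is 'root', 'char', 'loop', 'stone' or 'app'."""
    def __init__(self, kind, depth=0, app=None):
        self.kind = kind
        self.depth = depth   # distance from the root
        self.app = app       # APP name, set only on 'app' nodes
        self.chars = {}      # character -> CharNode child (one per character)
        self.loop = None     # at most one LoopNode child
        self.stone = None    # at most one StoneNode child (AppidNode after UA)

def build_m_tree(feature_library):
    root = Node("root")                                   # step one
    for (host, path, ua), app in feature_library:
        cur = root                                        # step two
        for index, element in enumerate((host, path, ua)):
            for ch in element:
                if ch == "%":                             # step four
                    if cur.kind != "loop":                # no loop directly under a loop
                        if cur.loop is None:
                            cur.loop = Node("loop", cur.depth + 1)
                        cur = cur.loop
                else:                                     # step three
                    if ch not in cur.chars:
                        cur.chars[ch] = Node("char", cur.depth + 1)
                    cur = cur.chars[ch]
            if index < 2:                                 # step five, HOST or PATH
                if cur.stone is None:
                    cur.stone = Node("stone", cur.depth + 1)
                cur = cur.stone
            else:                                         # step five, UA
                cur.stone = Node("app", cur.depth + 1, app)
    return root

# Hypothetical library: two feature triplets that share the HOST element.
library = [
    (("%.taobao.com", "/push/%", "%"), "TaobaoPush"),
    (("%.taobao.com", "/cart/%", "%"), "TaobaoCart"),
]
root = build_m_tree(library)
node = root.loop
for ch in ".taobao.com":
    node = node.chars[ch]
node = node.stone.chars["/"]
print(sorted(node.chars))  # ['c', 'p']
```

The shared HOST is stored once; the two triplets only diverge after the '/' that starts PATH.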
The identification of the feature triples and the application software APP specifically comprises the following steps:
step one: the input is a data triplet (HOST, PATH, UA); for each data triplet, a matching path is searched for in the defined M-tree;
step two: starting from the root node, convert the HOST of the data triplet into a character stream and search from top to bottom until one or more milestone nodes are found;
step three: starting from those milestone nodes, convert the PATH of the data triplet into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
step four: starting from the milestone nodes of the PATH stage, convert the UA of the data triplet into a character stream and search from top to bottom until all APP nodes of the final stage are found;
step five: from all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
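The recognition steps can be sketched as a walk over a set of candidate nodes, one character at a time. Everything below is an illustrative assumption: the compact `Node` and `build` helpers exist only to make the example self-contained, the set-based search is one way to handle several candidates at once, and "farthest from the root" is read as the greatest node depth.

```python
class Node:
    def __init__(self, kind, depth=0, app=None):
        self.kind, self.depth, self.app = kind, depth, app
        self.chars, self.loop, self.stone = {}, None, None

def build(feature_library):
    """Condensed construction, assumed here only to obtain a tree to search."""
    root = Node("root")
    for triplet, app in feature_library:
        cur = root
        for i, element in enumerate(triplet):
            for ch in element:
                if ch != "%":
                    cur = cur.chars.setdefault(ch, Node("char", cur.depth + 1))
                elif cur.kind != "loop":
                    cur.loop = cur.loop or Node("loop", cur.depth + 1)
                    cur = cur.loop
            if i < 2:
                cur.stone = cur.stone or Node("stone", cur.depth + 1)
                cur = cur.stone
            else:
                cur.stone = Node("app", cur.depth + 1, app)
    return root

def with_loops(nodes):
    """'%' may match zero characters, so a LoopNode child is reachable for free."""
    return nodes | {n.loop for n in nodes if n.loop is not None}

def consume(starts, text):
    """Advance all candidates over one element; return the boundary nodes reached."""
    current = with_loops(set(starts))
    for ch in text:
        step = {n for n in current if n.kind == "loop"}            # loop absorbs ch
        step |= {n.chars[ch] for n in current if ch in n.chars}    # literal match
        current = with_loops(step)
    return {n.stone for n in current if n.stone is not None}

def identify(root, host, path, ua):
    stones = consume({root}, host)     # step two
    stones = consume(stones, path)     # step three
    apps = consume(stones, ua)         # step four
    best = max(apps, key=lambda n: n.depth, default=None)  # step five
    return best.app if best else None

# Hypothetical feature library.
root = build([
    (("%.taobao.com", "%", "%"), "Taobao"),
    (("%.taobao.com", "/push/%", "%"), "TaobaoPush"),
])
print(identify(root, "m.taobao.com", "/push/msg", "Mozilla/5.0"))  # TaobaoPush
print(identify(root, "m.taobao.com", "/cart", "Mozilla/5.0"))      # Taobao
print(identify(root, "example.com", "/", "Mozilla/5.0"))           # None
```

In this sketch the work per log depends on the length of the input strings and on how many candidates stay alive, not directly on the number of feature triplets. Note that depth counts loop and milestone nodes as well as characters, so it only approximates "longest feature string", and ties between equally deep APP nodes are broken arbitrarily.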
By applying the above technical scheme, the invention brings the following beneficial technical effects. A custom M-tree comprising a root node, character nodes, loop nodes, milestone nodes and APP nodes is defined, corresponding node constraint rules are formulated, the tree is built by the construction algorithm, and application software is finally identified by analysing feature triplets. The method has strong operating performance: in testing, twenty thousand Internet access logs were identified per second. It scales well: when the number of triplets in the feature library increases, performance is essentially unaffected. Its algorithmic performance is outstanding: when the data flow of Internet access logs increases, the demand for hardware resources grows linearly.
Drawings
Fig. 1 is a schematic diagram of the overall structure of the present invention.
Fig. 2 is a schematic flow chart of the M-tree construction algorithm of the present invention.
Fig. 3 is a schematic diagram of the identification flow of the feature triplet and the application software APP of the present invention.
Description of the embodiments
The present invention will be described in further detail with reference to the following schemes and examples.
As shown in Fig. 1, the feature recognition method for application software in real-time streaming data processing comprises defining an M-tree 1, formulating node constraint rules 2, an M-tree construction algorithm 3, and identification 4 of feature triplets and the application software APP. The defined M-tree 1 comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode.
The root node Root is a special milestone node and is the root of the M-tree. A character node CharNode holds a literal character of a feature triplet, that is, any character other than the wildcard '%'; one CharNode contains exactly one character. A loop node LoopNode expresses the wildcard '%', meaning that the node is its own child. A milestone node StoneNode expresses an element boundary within a feature triplet, meaning that its child nodes begin the next triplet element. An APP node AppidNode is a special milestone node; it is a leaf of the whole M-tree and represents a specific APP.
Formulating the node constraint rules 2 specifically comprises: only one root node is allowed in an M-tree; the level directly below the root node may not contain a milestone node or an APP node; the level directly below a loop node may not contain a loop node; the level directly below a milestone node may not contain a milestone node or an APP node; an APP node must be a leaf node with no child nodes of any kind; a parent node may have at most one loop node and at most one milestone node as children; a parent node may have multiple character nodes as children, but each must correspond to a distinct character, so no character may be repeated among them.
The M-tree construction algorithm 3 specifically comprises the following steps:
(1) Create the root node and begin traversing the full set of feature triplets;
(2) For each feature triplet, set the current node to the root node and traverse HOST, PATH and UA of the triplet in order;
(3) For each character, look up the matching character node in the child set of the current node; if none is found, create a new character node and add it to the child set of the current node; then set the current node to that character node;
(4) If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
(5) For a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; for the UA element, create an APP node, add it to the child set of the current node, and end the traversal of this feature triplet.
The identification 4 of the feature triples and the application software APP specifically comprises the following steps:
(1) The input is a data triplet (HOST, PATH, UA); for each data triplet, a matching path is searched for in the defined M-tree;
(2) Starting from the root node, convert the HOST of the data triplet into a character stream and search from top to bottom until one or more milestone nodes are found;
(3) Starting from those milestone nodes, convert the PATH of the data triplet into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
(4) Starting from the milestone nodes of the PATH stage, convert the UA of the data triplet into a character stream and search from top to bottom until all APP nodes of the final stage are found;
(5) From all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
The foregoing is merely a specific application example of the present invention and does not limit the scope of protection of the invention in any way. All technical schemes formed by equivalent transformation or equivalent substitution fall within the scope of protection of the invention.
Claims (1)
1. A feature recognition method for application software in real-time streaming data processing, characterized by comprising: defining an M-tree, formulating node constraint rules, a construction algorithm of the M-tree, and identification of feature triplets and the application software APP, wherein the defined M-tree comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode;
the root node Root is a special milestone node and is the root of the M-tree; a character node CharNode holds a literal character of a feature triplet, that is, any character other than the wildcard '%', and one character node CharNode contains exactly one character; a loop node LoopNode expresses the wildcard '%', meaning that the node is its own child; a milestone node StoneNode expresses an element boundary within a feature triplet, meaning that its child nodes begin the next triplet element; an APP node AppidNode is a special milestone node, is a leaf of the whole M-tree, and represents a specific APP;
formulating the node constraint rules specifically comprises: only one root node is allowed in a defined M-tree; the level directly below the root node Root may not contain a milestone node StoneNode or an APP node; the level directly below a loop node LoopNode may not contain a loop node; the level directly below a milestone node StoneNode may not contain a milestone node or an APP node; an APP node AppidNode must be a leaf node and has no child nodes of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode as children; multiple character nodes under the same parent node must each correspond to a distinct character, and no character may be repeated;
the construction algorithm of the M-tree specifically comprises the following steps:
step one: create the root node and begin traversing the full set of feature triplets;
step two: for each feature triplet, set the current node to the root node and traverse HOST, PATH and UA of the triplet in order;
step three: for each character, look up the matching character node in the child set of the current node; if none is found, create a new character node and add it to the child set of the current node; then set the current node to that character node;
step four: if the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
step five: for a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; for the UA element, create an APP node, add it to the child set of the current node, and end the traversal of this feature triplet;
the identification of the feature triplets and the application software APP specifically comprises the following steps:
step one: the input is a data triplet (HOST, PATH, UA); for each data triplet, a matching path is searched for in the defined M-tree;
step two: starting from the root node, convert the HOST of the data triplet into a character stream and search from top to bottom until one or more milestone nodes are found;
step three: starting from those milestone nodes, convert the PATH of the data triplet into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
step four: starting from the milestone nodes of the PATH stage, convert the UA of the data triplet into a character stream and search from top to bottom until all APP nodes of the final stage are found;
step five: from all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710833546.9A CN107798060B (en) | 2017-09-15 | 2017-09-15 | Real-time streaming data processing application software feature recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710833546.9A CN107798060B (en) | 2017-09-15 | 2017-09-15 | Real-time streaming data processing application software feature recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107798060A | 2018-03-13 |
CN107798060B | 2023-06-30 |
Family
ID=61531981
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710833546.9A Active CN107798060B (en) | 2017-09-15 | 2017-09-15 | Real-time streaming data processing application software feature recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107798060B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103051725A (en) * | 2012-12-31 | 2013-04-17 | 华为技术有限公司 | Application identification method, data mining method, device and system |
CN105591973A (en) * | 2015-12-31 | 2016-05-18 | 杭州数梦工场科技有限公司 | Application recognition method and apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN107798060A (en) | 2018-03-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||