CN107798060B - Real-time streaming data processing application software feature recognition method - Google Patents

Info

Publication number
CN107798060B
CN107798060B CN201710833546.9A CN201710833546A CN107798060B CN 107798060 B CN107798060 B CN 107798060B CN 201710833546 A CN201710833546 A CN 201710833546A CN 107798060 B CN107798060 B CN 107798060B
Authority
CN
China
Prior art keywords
node
nodes
milestone
character
app
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710833546.9A
Other languages
Chinese (zh)
Other versions
CN107798060A (en)
Inventor
饶翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Axon Science & Technology Co ltd
Original Assignee
Nanjing Axon Science & Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Axon Science & Technology Co ltd
Priority to CN201710833546.9A
Publication of CN107798060A
Application granted
Publication of CN107798060B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a feature recognition method for application software in real-time streaming data processing. The method comprises defining an M-tree, formulating node constraint rules, constructing the M-tree, and identifying the application software (APP) from feature triples. A custom M-tree is defined with a root node, character nodes, loop nodes, milestone nodes and APP nodes; node constraint rules are formulated for it; the tree is built from the feature triples; and each incoming data triple is then matched against the tree to identify the application software. The method is stated to use few processor resources, run quickly, scale to large feature libraries and deliver outstanding algorithm performance.

Description

Real-time streaming data processing application software feature recognition method
Technical Field
The invention relates to the technical field of information classification, in particular to a real-time streaming data processing application software feature recognition method.
Background
Data is a resource. Mining and applying it in the big-data era helps enterprises understand their users better and improve service quality and market competitiveness. Data accumulation is the basis of big-data applications. In the mobile Internet era, users access the Internet mainly through application software (hereinafter APP). Performing feature recognition on a user's Internet access logs and matching each log to an APP provides the basic data for later batch analysis and prepares the ground for data mining.
The server address HOST and the request PATH, together with the terminal's UA field, form a data triple (HOST, PATH, UA) that can serve as the input for identifying an APP. The APP feature library is the set of rules that map data triples to a specific APP. Each rule is a feature triple that combines static character strings with dynamic fuzzy matching. For example, (%.taobao.com, /push/%, %) is a feature triple, where the character '%' is a wildcard that matches 0 to N characters. UA here is the User-Agent, a special request header string from which a server can identify the operating system and version, processor type, browser and version, browser rendering engine, browser language, browser plug-ins and so on.
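The wildcard semantics described above can be illustrated with a minimal Python sketch. It is not part of the patent: the function names `pattern_to_regex` and `triple_matches` are hypothetical, and translating '%' into a regular expression is just one convenient way to show the matching rule for a single feature triple.

```python
import re

def pattern_to_regex(pattern):
    """Compile one feature element; '%' matches 0..N arbitrary characters."""
    # Every piece between wildcards is literal text, so it is escaped.
    literal_parts = (re.escape(part) for part in pattern.split("%"))
    return re.compile(".*".join(literal_parts), re.DOTALL)

def triple_matches(feature, data):
    """True if HOST, PATH and UA of the data triple all match the feature triple."""
    return all(
        pattern_to_regex(f).fullmatch(d) is not None
        for f, d in zip(feature, data)
    )

feature = ("%.taobao.com", "/push/%", "%")
print(triple_matches(feature, ("m.taobao.com", "/push/msg", "Mozilla/5.0")))
print(triple_matches(feature, ("m.taobao.com", "/pull/msg", "Mozilla/5.0")))
```

The first call matches all three elements; the second fails on PATH because `/pull/msg` does not start with the literal `/push/`.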
One APP typically corresponds to N feature triples, but one feature triple can correspond to only one APP. When a data triple satisfies several feature triples at once, the best-matching feature triple must be used as the recognition result. In practice, the feature triple with the longest feature characters is selected from the set of matching feature triples. In real projects there are typically more than 1,000 APPs with an average of 10 feature triples each, that is, more than 10,000 feature triples.
Traditionally, identification proceeds as follows: extract the data triple (HOST, PATH, UA) from each Internet access log; traverse all feature triples, compare them one by one and cache those that satisfy the matching condition; after the traversal is complete, take the APP corresponding to the cached feature triple with the longest feature characters as the recognition result.
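The traditional traversal can be sketched in Python as follows. This is an illustration only: the names `identify_bruteforce`, `matches` and `literal_length` are hypothetical, the regular-expression matcher is an assumed stand-in for whatever comparison the original systems used, and "longest feature characters" is interpreted here as the total number of non-wildcard characters in the feature triple.

```python
import re

def matches(pattern, text):
    """Match one element; '%' stands for 0..N arbitrary characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, re.DOTALL) is not None

def literal_length(feature):
    """Assumed ranking key: number of non-wildcard characters in the triple."""
    return sum(len(element.replace("%", "")) for element in feature)

def identify_bruteforce(library, data):
    """Compare the data triple with every feature triple, keep the best hit."""
    hits = [
        (feature, app)
        for feature, app in library
        if all(matches(f, d) for f, d in zip(feature, data))
    ]
    if not hits:
        return None
    best_feature, best_app = max(hits, key=lambda hit: literal_length(hit[0]))
    return best_app

library = [
    (("%.taobao.com", "%", "%"), "taobao"),
    (("%.taobao.com", "/push/%", "%"), "taobao-push"),
]
print(identify_bruteforce(library, ("m.taobao.com", "/push/msg", "Mozilla/5.0")))
```

Every call walks the whole library, which is the cost the invention sets out to avoid.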
This conventional identification method has the following disadvantages. It consumes too many CPU cycles: identifying the APPs for 10,000 Internet access logs against a feature library of 10,000 triples requires 10,000 × 10,000 = 100 million matching operations, which is an extremely large amount of computation. In addition, as the data flow or the capacity of the feature library grows, the cost grows with the product of the two, so computational efficiency and algorithm performance fall sharply.
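The count in the example above can be written out as a short worked equation. The symbols are illustrative and do not appear in the source: N_logs is the number of Internet access logs and N_features is the number of feature triples in the library.

```latex
\[
C_{\text{naive}} = N_{\text{logs}} \times N_{\text{features}}
                 = 10^{4} \times 10^{4}
                 = 10^{8}\ \text{triple comparisons}
\]
```

Each triple comparison itself involves up to three wildcard string matches (HOST, PATH and UA), so the number of string operations is higher still.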
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a real-time streaming data processing application software characteristic identification method which has the advantages of less processor resource occupation, high operation speed, large capacity expansion and outstanding algorithm performance.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A real-time streaming data processing application software feature recognition method comprises defining an M-tree, formulating node constraint rules, a construction algorithm of the M-tree, and identification of feature triples and the application software APP. The defined M-tree comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode. The root node Root is a special milestone node and is the root of the M-tree; a character node CharNode holds a literal character of a feature triple, that is, any character other than the wildcard '%', and one CharNode contains exactly one character; a loop node LoopNode expresses the wildcard '%', meaning that the node is its own child; a milestone node StoneNode expresses an element boundary in a feature triple, meaning that its child nodes begin the next triple element; an APP node AppidNode is a special milestone node, is a leaf node of the whole M-tree, and represents a specific APP.
The node constraint rules are as follows: only one root node is allowed in a defined M-tree; the root node Root cannot have a milestone node StoneNode or an APP node as a direct child; a loop node LoopNode cannot have a loop node as a direct child; a milestone node StoneNode cannot have a milestone node or an APP node as a direct child; an APP node AppidNode must be a leaf node and has no child nodes of any kind; a parent node can have at most one loop node LoopNode and at most one milestone node StoneNode as children; a parent node may have several character nodes as children, but each must correspond to a distinct character, so no character may be repeated.
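The five node types and the constraint rules can be sketched as a small Python check. This is a hypothetical illustration, not the patent's implementation: the `Kind` enum, the `Node` class and the `violation` and `attach` helpers are names introduced here, and storing children in a flat list is an assumption made for readability.

```python
from enum import Enum

class Kind(Enum):
    ROOT = "Root"
    CHAR = "CharNode"
    LOOP = "LoopNode"
    STONE = "StoneNode"
    APP = "AppidNode"

class Node:
    def __init__(self, kind, char=None, app=None):
        self.kind = kind      # one of the five node types
        self.char = char      # the single character held by a CharNode
        self.app = app        # the APP represented by an AppidNode
        self.children = []

def violation(parent, child):
    """Return the broken rule as text, or None if the attachment is allowed."""
    if child.kind is Kind.ROOT:
        return "only one root node is allowed"
    if parent.kind is Kind.APP:
        return "an APP node must be a leaf"
    if parent.kind in (Kind.ROOT, Kind.STONE) and child.kind in (Kind.STONE, Kind.APP):
        return "a root or milestone node cannot have a milestone or APP child"
    if parent.kind is Kind.LOOP and child.kind is Kind.LOOP:
        return "a loop node cannot have a loop child"
    for sibling in parent.children:
        if child.kind in (Kind.LOOP, Kind.STONE) and sibling.kind is child.kind:
            return "at most one loop node and one milestone node per parent"
        if (child.kind is Kind.CHAR and sibling.kind is Kind.CHAR
                and sibling.char == child.char):
            return "character nodes under one parent must be unique"
    return None

def attach(parent, child):
    """Add child under parent, refusing any attachment that breaks a rule."""
    reason = violation(parent, child)
    if reason is not None:
        raise ValueError(reason)
    parent.children.append(child)
    return child

root = Node(Kind.ROOT)
loop = attach(root, Node(Kind.LOOP))
print(violation(root, Node(Kind.STONE)))
print(violation(loop, Node(Kind.LOOP)))
```

Both printed values are rule descriptions, since a milestone directly under the root and a loop directly under a loop are forbidden.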
The construction algorithm of the M-tree specifically comprises the following steps:
step one: create the root node and start traversing the full set of feature triples;
step two: for each feature triple, set the current node to the root node and traverse HOST, PATH and UA of the triple in sequence;
step three: for each literal character, look up the character node for that character in the child set of the current node; if it is not found, create a new character node and add it to the child set of the current node; then set the current node to that character node;
step four: if the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
step five: at the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node, add it to the child set of the current node, and end the traversal of this feature triple.
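The five construction steps can be sketched in Python as below. This is a minimal illustration under stated assumptions: the names `Node`, `build_m_tree` and `count_nodes` are hypothetical, the feature library is assumed to be a list of `((host, path, ua), app)` pairs, and an existing loop or milestone child is reused rather than created again so that the "at most one" constraint holds.

```python
class Node:
    def __init__(self, kind, depth=0, app=None):
        self.kind = kind      # "root", "char", "loop", "stone" or "app"
        self.depth = depth    # distance from the root node
        self.app = app        # set only on APP nodes
        self.chars = {}       # character -> character node
        self.loop = None      # at most one loop node child
        self.stone = None     # at most one milestone node child
        self.leaf = None      # APP node child, if any

def build_m_tree(library):
    root = Node("root")                                  # step one
    for (host, path, ua), app in library:
        cur = root                                       # step two
        for index, element in enumerate((host, path, ua)):
            if not element:
                raise ValueError("an empty element would put a milestone "
                                 "directly under a milestone or the root")
            for ch in element:
                if ch == "%":                            # step four
                    if cur.kind != "loop":               # '%%' collapses to '%'
                        if cur.loop is None:
                            cur.loop = Node("loop", cur.depth + 1)
                        cur = cur.loop
                else:                                    # step three
                    if ch not in cur.chars:
                        cur.chars[ch] = Node("char", cur.depth + 1)
                    cur = cur.chars[ch]
            if index < 2:                                # step five, HOST or PATH
                if cur.stone is None:
                    cur.stone = Node("stone", cur.depth + 1)
                cur = cur.stone
            else:                                        # step five, UA
                cur.leaf = Node("app", cur.depth + 1, app)
    return root

def count_nodes(node):
    """Count all nodes in the subtree, to make prefix sharing visible."""
    children = list(node.chars.values())
    children += [c for c in (node.loop, node.stone, node.leaf) if c is not None]
    return 1 + sum(count_nodes(c) for c in children)

library = [
    (("%.taobao.com", "/push/%", "%"), "taobao-push"),
    (("%.taobao.com", "%", "%"), "taobao"),
]
root = build_m_tree(library)
print(count_nodes(root))
```

Because both feature triples share the HOST element, the second triple adds only four nodes below the shared HOST milestone instead of a full second branch.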
The identification of the feature triples and the application software APP specifically comprises the following steps:
step one: the input is a data triple (HOST, PATH, UA), and a matching path for each data triple is searched for in the defined M-tree;
step two: starting from the root node, convert the HOST of the data triple into a character stream and search from top to bottom until one or more milestone nodes are found;
step three: starting from those milestone nodes, convert the PATH of the data triple into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
step four: starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from top to bottom until all APP nodes of the last stage are found;
step five: from all APP nodes that satisfy the feature matching, take the APP node farthest from the root node as the search result.
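The identification steps can be sketched in Python as a search that tracks every tree node still reachable while the characters of HOST, PATH and UA are consumed. This is an assumed reading of the steps, not the patent's code: `Node`, `build`, `walk` and `identify` are hypothetical names, the compact builder is included only to keep the example self-contained, and "farthest from the root" is taken to mean greatest node depth.

```python
class Node:
    def __init__(self, kind, depth=0, app=None):
        self.kind, self.depth, self.app = kind, depth, app
        self.chars, self.loop, self.stone, self.leaf = {}, None, None, None

def build(library):
    """Compact builder: shared prefixes, one loop and one milestone per parent."""
    root = Node("root")
    for feature, app in library:
        cur = root
        for index, element in enumerate(feature):
            for ch in element:
                if ch == "%":
                    if cur.kind != "loop":
                        cur.loop = cur.loop or Node("loop", cur.depth + 1)
                        cur = cur.loop
                else:
                    cur = cur.chars.setdefault(ch, Node("char", cur.depth + 1))
            if index < 2:
                cur.stone = cur.stone or Node("stone", cur.depth + 1)
                cur = cur.stone
            else:
                cur.leaf = Node("app", cur.depth + 1, app)
    return root

def with_loops(nodes):
    # '%' may match zero characters, so a loop child is reachable for free.
    return nodes | {n.loop for n in nodes if n.loop is not None}

def walk(starts, text):
    """Consume text as a character stream, tracking all reachable nodes."""
    current = with_loops(set(starts))
    for ch in text:
        following = set()
        for node in current:
            if node.kind == "loop":
                following.add(node)              # a loop node is its own child
            if ch in node.chars:
                following.add(node.chars[ch])
        current = with_loops(following)
    return current

def identify(root, data):
    host, path, ua = data                                            # step one
    stones = {n.stone for n in walk({root}, host) if n.stone}        # step two
    stones = {n.stone for n in walk(stones, path) if n.stone}        # step three
    leaves = [n.leaf for n in walk(stones, ua) if n.leaf]            # step four
    if not leaves:
        return None
    return max(leaves, key=lambda n: n.depth).app                    # step five

library = [
    (("%.taobao.com", "/push/%", "%"), "taobao-push"),
    (("%.taobao.com", "%", "%"), "taobao"),
]
root = build(library)
print(identify(root, ("m.taobao.com", "/push/msg", "Mozilla/5.0")))
```

In this sketch the work per log depends on the length of the input strings and on how many tree nodes stay reachable, not directly on the number of feature triples. If two matching APP nodes have the same depth, the choice between them is arbitrary here.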
By applying the above technical scheme, the invention brings the following beneficial technical effects. The scheme defines a custom M-tree comprising a root node, character nodes, loop nodes, milestone nodes and APP nodes, formulates the corresponding node constraint rules, constructs the tree, and finally identifies application software by processing the triples against it. The operating performance is strong: in testing, twenty thousand Internet access logs could be identified per second. The scalability is good: when the number of triples in the feature library increases, performance is essentially unaffected. The algorithm performance is outstanding: when the data flow of Internet access logs increases, the demand for hardware resources grows linearly.
Drawings
Fig. 1 is a schematic diagram of the overall structure of the present invention.
FIG. 2 is a schematic flow chart of an algorithm for constructing an M-tree according to the present invention.
Fig. 3 is a schematic diagram of the identification flow of the feature triplet and the application software APP of the present invention.
Description of the embodiments
The present invention will be described in further detail with reference to the following schemes and examples.
As shown in FIG. 1, the real-time streaming data processing application software feature recognition method comprises defining an M-tree 1, formulating node constraint rules 2, a construction algorithm 3 of the M-tree, and identification 4 of feature triples and the application software APP. The defined M-tree 1 comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode.
The root node Root is a special milestone node and is the root of the M-tree; a character node CharNode holds a literal character of a feature triple, that is, any character other than the wildcard '%', and one CharNode contains exactly one character; a loop node LoopNode expresses the wildcard '%', meaning that the node is its own child; a milestone node StoneNode expresses an element boundary in a feature triple, meaning that its child nodes begin the next triple element; an APP node AppidNode is a special milestone node, is a leaf node of the whole M-tree, and represents a specific APP.
Formulating the node constraint rules 2 specifically includes: only one root node is allowed in one M-tree; the root node cannot have a milestone node or an APP node as a direct child; a loop node cannot have a loop node as a direct child; a milestone node cannot have a milestone node or an APP node as a direct child; an APP node must be a leaf node with no child nodes of any kind; a parent node can have at most one loop node and at most one milestone node as children; a parent node may have several character nodes as children, but each must correspond to a distinct character, so no character may be repeated.
The M-tree construction algorithm 3 specifically comprises the following steps:
(1) Create the root node and start traversing the full set of feature triples;
(2) For each feature triple, set the current node to the root node and traverse HOST, PATH and UA of the triple in sequence;
(3) For each literal character, look up the character node for that character in the child set of the current node; if it is not found, create a new character node and add it to the child set of the current node; then set the current node to that character node;
(4) If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
(5) At the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node, add it to the child set of the current node, and end the traversal of this feature triple.
The identification 4 of the feature triples and the application software APP specifically comprises the following steps:
(1) The input is a data triple (HOST, PATH, UA), and a matching path for each data triple is searched for in the defined M-tree;
(2) Starting from the root node, convert the HOST of the data triple into a character stream and search from top to bottom until one or more milestone nodes are found;
(3) Starting from those milestone nodes, convert the PATH of the data triple into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
(4) Starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from top to bottom until all APP nodes of the last stage are found;
(5) From all APP nodes that satisfy the feature matching, take the APP node farthest from the root node as the search result.
The foregoing is merely a specific application example of the invention and does not limit its scope of protection in any way. All technical schemes formed by equivalent transformation or equivalent substitution fall within the scope of protection of the invention.

Claims (1)

1. A real-time streaming data processing application software feature recognition method, characterized by comprising: defining an M-tree, formulating node constraint rules, a construction algorithm of the M-tree, and identification of feature triples and the application software APP, wherein the defined M-tree comprises a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode;
the root node Root is a special milestone node and is the root of the M-tree; a character node CharNode holds a literal character of a feature triple, that is, any character other than the wildcard '%', and one character node CharNode contains exactly one character; a loop node LoopNode expresses the wildcard '%', meaning that the node is its own child; a milestone node StoneNode expresses an element boundary in a feature triple, meaning that its child nodes begin the next triple element; an APP node AppidNode is a special milestone node, is a leaf node of the whole M-tree, and represents a specific APP;
formulating the node constraint rules specifically comprises: only one root node is allowed in a defined M-tree; the root node Root cannot have a milestone node StoneNode or an APP node as a direct child; a loop node LoopNode cannot have a loop node as a direct child; a milestone node StoneNode cannot have a milestone node or an APP node as a direct child; an APP node AppidNode must be a leaf node and has no child nodes of any kind; a parent node can have at most one loop node LoopNode and at most one milestone node StoneNode as children; the several character nodes under one parent node must each correspond to a distinct character, and no character may be repeated;
the construction algorithm of the M-tree specifically comprises the following steps:
step one: creating the root node and starting to traverse the full set of feature triples;
step two: for each feature triple, setting the current node to the root node and traversing HOST, PATH and UA of the triple in sequence;
step three: for each literal character, looking up the character node for that character in the child set of the current node; if it is not found, creating a new character node and adding it to the child set of the current node; then setting the current node to that character node;
step four: if the next character is the wildcard '%' and the current node is a character node or a milestone node, creating a loop node, adding it to the child set of the current node, and setting the current node to the newly added loop node;
step five: at the end of a non-UA element, creating a milestone node, adding it to the child set of the current node, and setting the current node to that milestone node; at the end of the UA element, creating an APP node, adding it to the child set of the current node, and ending the traversal of this feature triple;
the identification of the feature triples and the application software APP specifically comprises the following steps:
step one: the input is a data triple (HOST, PATH, UA), and a matching path for each data triple is searched for in the defined M-tree;
step two: starting from the root node, converting the HOST of the data triple into a character stream and searching from top to bottom until one or more milestone nodes are found;
step three: starting from those milestone nodes, converting the PATH of the data triple into a character stream and searching from top to bottom until the milestone nodes of the next stage are found;
step four: starting from the milestone nodes of the PATH stage, converting the UA of the data triple into a character stream and searching from top to bottom until all APP nodes of the last stage are found;
step five: from all APP nodes that satisfy the feature matching, taking the APP node farthest from the root node as the search result.
CN201710833546.9A 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method Active CN107798060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710833546.9A CN107798060B (en) 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710833546.9A CN107798060B (en) 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method

Publications (2)

Publication Number Publication Date
CN107798060A (en) 2018-03-13
CN107798060B (en) 2023-06-30

Family

ID=61531981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710833546.9A Active CN107798060B (en) 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method

Country Status (1)

Country Link
CN (1) CN107798060B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103051725A (en) * 2012-12-31 2013-04-17 华为技术有限公司 Application identification method, data mining method, device and system
CN105591973A (en) * 2015-12-31 2016-05-18 杭州数梦工场科技有限公司 Application recognition method and apparatus

Also Published As

Publication number Publication date
CN107798060A (en) 2018-03-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant