CN107798060A - Real-time streaming data processing application software feature recognition method - Google Patents
Real-time streaming data processing application software feature recognition method
- Publication number
- CN107798060A
- Authority
- CN
- China
- Prior art keywords
- node
- character
- triple
- milestone
- app
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a real-time streaming data processing application software feature recognition method, comprising defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and the identification of feature triples and application software (APP). A custom M-tree is defined that contains a root node, character nodes, loop nodes, milestone nodes and APP nodes, and corresponding node constraint rules are formulated for it. The tree is then constructed, and application software is finally identified by analyzing and processing the feature triples. The invention can provide a real-time streaming data processing application software feature recognition method that occupies few processor resources, computes quickly, scales to a large capacity and has outstanding algorithmic performance.
Description
Technical field
The present invention relates to the technical field of information classification, and in particular to a real-time streaming data processing application software feature recognition method.
Background art
In the big data era, data is a resource. Mining and applying data helps enterprises understand their users better and improve service quality and market competitiveness. Data accumulation is the foundation of big data applications. In the mobile Internet era, users mainly go online through application software (hereinafter APP). Performing feature recognition on users' internet access logs and matching the logs to APPs provides the basic data for later batch analysis and is a precondition for data mining.
The server address HOST and the request PATH, together with the terminal UA field, form a data triple (HOST, PATH, UA), which can serve as the input element for identifying an APP. The APP feature library is the set of organization rules of the data triples that correspond to specific APPs. Each rule is given as a group of static strings combined with dynamic fuzzy matching. For example, (%.taobao.com, /push/%, %) is a feature triple, in which the character '%' is a wildcard that stands for 0 to N characters. UA in the description above is short for User-Agent, a special header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins and so on.
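The wildcard semantics described above can be illustrated with a minimal Python sketch. The helper names `like_to_regex` and `triple_matches` and the sample log values are assumptions made for illustration; the source does not prescribe how a single feature triple is compared with a data triple.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    # Every '%' stands for 0..N arbitrary characters; all other characters are literal.
    parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(parts), re.DOTALL)

def triple_matches(feature, data) -> bool:
    # A data triple matches only if HOST, PATH and UA all match their patterns.
    return all(like_to_regex(f).fullmatch(d) is not None
               for f, d in zip(feature, data))

feature = ("%.taobao.com", "/push/%", "%")
print(triple_matches(feature, ("m.taobao.com", "/push/msg", "Mozilla/5.0")))  # True
print(triple_matches(feature, ("m.taobao.com", "/pull/msg", "Mozilla/5.0")))  # False
```

Translating each pattern to a regular expression keeps the sketch short; it is not meant to reflect the data structure the invention itself uses.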
One APP usually corresponds to N feature triples, but one feature triple can correspond to only one APP. When a data triple satisfies several feature triples at the same time, the best-matching feature triple must be taken as the recognition result. In practice, this means selecting, from the set of matching feature triples, the one with the longest feature-character length. In real projects there are typically more than 1,000 APPs with an average of 10 feature triples each, giving more than 10,000 feature triples.
The traditional method performs identification as follows: for each internet log, extract the data triple (HOST, PATH, UA); traverse all feature triples, compare them one by one, and cache the feature triples that satisfy the matching condition; after the traversal is complete, take the APP corresponding to the feature triple with the longest feature characters as the recognition result.
The traditional recognition method described above has the following drawbacks. It consumes excessive CPU clock resources: for example, if 10,000 internet logs require APP identification and the feature library holds 10,000 triples, then 10,000 × 10,000 = 100,000,000 matching operations are needed, which is a large amount of computation. When data traffic or the capacity of the feature library gradually increases, computational efficiency declines exponentially and algorithm performance drops sharply.
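The traditional procedure can be sketched as a linear scan. This is a hedged illustration only: the feature library contents, the APP names, and the helpers `matches`, `literal_length` and `identify_brute_force` are hypothetical, and "feature-character length" is assumed here to mean the number of non-wildcard characters in the triple.

```python
import re
from typing import Optional, Tuple

# Hypothetical feature library: (HOST, PATH, UA) pattern -> APP name.
FEATURES = {
    ("%.taobao.com", "/push/%", "%"): "taobao-push",
    ("%.taobao.com", "%", "%"): "taobao",
    ("%.example.com", "%", "%"): "example",
}

def matches(pattern: str, text: str) -> bool:
    # '%' matches 0..N characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, text, re.DOTALL) is not None

def literal_length(feature) -> int:
    # Assumed measure of "feature-character length": non-wildcard characters.
    return sum(len(p.replace("%", "")) for p in feature)

def identify_brute_force(data) -> Tuple[Optional[str], int]:
    best, comparisons = None, 0
    for feature, app in FEATURES.items():      # every feature triple is visited
        comparisons += 1
        if all(matches(p, t) for p, t in zip(feature, data)):
            if best is None or literal_length(feature) > literal_length(best[0]):
                best = (feature, app)
    return (best[1] if best else None), comparisons

print(identify_brute_force(("m.taobao.com", "/push/msg", "Mozilla/5.0")))
# ('taobao-push', 3)
```

Each log costs one comparison per feature triple, so L logs against F feature triples cost L × F comparisons, which is where the figure of 10,000 × 10,000 = 100,000,000 comes from.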
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a real-time streaming data processing application software feature recognition method that occupies few processor resources, computes quickly, scales to a large capacity and has outstanding algorithmic performance.
To achieve the above object, the present invention adopts the following technical solution.
A real-time streaming data processing application software feature recognition method comprises defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and the identification of feature triples and application software APP. The defined M-tree contains a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode.
The root node Root is a special milestone node and is the root of the M-tree. A character node CharNode is a literal character in a feature triple, that is, any character other than the wildcard '%'; one CharNode contains exactly one character. A loop node LoopNode expresses the wildcard '%', which means that a child of this node may be the node itself. A milestone node StoneNode expresses an element boundary within a feature triple, which means that the children of this node begin the next triple element. An APP node AppidNode is a special milestone node; it is a leaf node of the whole M-tree and represents one specific APP.
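The five node types can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions: the `Kind` enumeration, the `Node` fields and the `add` helper are illustrative names, and only the node type names Root, CharNode, LoopNode, StoneNode and AppidNode come from the source.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Kind(Enum):
    ROOT = "Root"            # special milestone node, root of the M-tree
    CHAR = "CharNode"        # one literal (non-wildcard) character
    LOOP = "LoopNode"        # the wildcard '%'
    STONE = "StoneNode"      # boundary between two triple elements
    APPID = "AppidNode"      # leaf that names one APP

@dataclass
class Node:
    kind: Kind
    char: Optional[str] = None   # set only for CharNode
    app: Optional[str] = None    # set only for AppidNode
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

# Hand-built path for the HOST pattern "%.a": Root -> LoopNode -> '.' -> 'a' -> StoneNode
root = Node(Kind.ROOT)
loop = root.add(Node(Kind.LOOP))
dot = loop.add(Node(Kind.CHAR, char="."))
a = dot.add(Node(Kind.CHAR, char="a"))
stone = a.add(Node(Kind.STONE))
```

A single class with a `kind` tag is used here only for brevity; separate classes per node type would serve equally well.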
Formulating the node constraint rules specifically includes the following: an M-tree is allowed only one root node; a milestone node StoneNode or an APP node may not appear directly below the root node Root; a loop node may not appear directly below a loop node LoopNode; a milestone node or an APP node may not appear directly below a milestone node StoneNode; an APP node AppidNode must be a leaf node with no child of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode as children; a parent node may have several character nodes as children, but each must correspond to a distinct character, and no character may be repeated.
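The constraint rules lend themselves to a recursive check. The following Python sketch is an assumption-laden illustration: the `Node` layout, the string tags for node kinds and the `violations` function are hypothetical, and the rule "at most one" is read as applying separately to loop nodes and to milestone nodes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    kind: str                      # "root", "char", "loop", "stone" or "app"
    char: Optional[str] = None     # literal character, for "char" nodes only
    children: List["Node"] = field(default_factory=list)

def violations(node: Node) -> List[str]:
    """Return a description of every constraint broken at or below `node`."""
    found = []
    kinds = [c.kind for c in node.children]
    if "root" in kinds:
        found.append("a second root node appears inside the tree")
    if node.kind in ("root", "stone") and ("stone" in kinds or "app" in kinds):
        found.append(f"{node.kind} has a milestone or APP child")
    if node.kind == "loop" and "loop" in kinds:
        found.append("loop node has a loop child")
    if node.kind == "app" and node.children:
        found.append("APP node is not a leaf")
    if kinds.count("loop") > 1 or kinds.count("stone") > 1:
        found.append("more than one loop or milestone child")
    chars = [c.char for c in node.children if c.kind == "char"]
    if len(chars) != len(set(chars)):
        found.append("duplicate character among siblings")
    for child in node.children:
        found.extend(violations(child))
    return found

# Root -> 'a' -> StoneNode -> '/' -> StoneNode -> LoopNode -> AppidNode
good = Node("root", children=[Node("char", "a", [Node("stone", children=[
    Node("char", "/", [Node("stone", children=[
        Node("loop", children=[Node("app")])])])])])])
bad = Node("root", children=[Node("stone")])
print(violations(good))  # []
print(violations(bad))   # ['root has a milestone or APP child']
```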
The construction algorithm of the M-tree specifically comprises the following steps:
Step 1: Create the root node and begin traversing the full set of feature triples;
Step 2: For each feature triple, set the current node to the root node and traverse the HOST, PATH and UA elements of the triple in order;
Step 3: For each character, look up the character node for the same character in the child set of the current node; if none exists, create a new character node and add it to the child set of the current node; then set the current node to that character node;
Step 4: If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
Step 5: At the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node and add it to the child set of the current node, which ends the traversal of this feature triple.
The identification of feature triples and application software APP specifically comprises the following steps:
Step 1: The input parameter is a data triple (HOST, PATH, UA); for each data triple, begin searching for its path in the defined M-tree;
Step 2: Starting from the root node, convert the HOST of the data triple into a character stream and search from top to bottom until one or more milestone nodes are found;
Step 3: Starting from these milestone nodes, convert the PATH of the data triple into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
Step 4: Starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from top to bottom until all APP nodes of the final stage are found;
Step 5: From all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
Owing to the above technical solution, the present invention brings the following beneficial technical effects. By defining a custom M-tree that contains a root node, character nodes, loop nodes, milestone nodes and APP nodes, formulating corresponding node constraint rules for it, constructing the tree, and finally analyzing and processing the feature triples, the technical solution identifies application software. It offers strong computing performance: in testing, a single core could recognize 200,000 internet logs per second. It scales well: performance is essentially unaffected when the number of triples in the feature library increases. Its algorithmic performance is also outstanding: when internet log traffic increases, the demand for hardware resources grows linearly.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall structure of the present invention.
Fig. 2 is a flowchart of the M-tree construction algorithm of the present invention.
Fig. 3 is a flowchart of the identification of feature triples and application software APP of the present invention.
In the figures: 1. defining the M-tree; 2. formulating the node constraint rules; 3. M-tree construction algorithm; 4. identification of feature triples and application software APP.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, a real-time streaming data processing application software feature recognition method comprises defining an M-tree 1, formulating node constraint rules 2, an M-tree construction algorithm 3, and identification 4 of feature triples and application software APP. The defined M-tree 1 contains a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode.
The root node Root is a special milestone node and is the root of the M-tree. A character node CharNode is a literal character in a feature triple, that is, any character other than the wildcard '%'; one CharNode contains exactly one character. A loop node LoopNode expresses the wildcard '%', which means that a child of this node may be the node itself. A milestone node StoneNode expresses an element boundary within a feature triple, which means that the children of this node begin the next triple element. An APP node AppidNode is a special milestone node; it is a leaf node of the whole M-tree and represents one specific APP.
Formulating the node constraint rules 2 specifically includes the following: an M-tree is allowed only one root node; a milestone node or an APP node may not appear directly below the root node; a loop node may not appear directly below a loop node; a milestone node or an APP node may not appear directly below a milestone node; an APP node must be a leaf node with no child of any kind; a parent node may have at most one loop node and at most one milestone node as children; a parent node may have several character nodes as children, but each must correspond to a distinct character, and no character may be repeated.
The construction algorithm 3 of the M-tree specifically comprises the following steps:
(1) Create the root node and begin traversing the full set of feature triples;
(2) For each feature triple, set the current node to the root node and traverse the HOST, PATH and UA elements of the triple in order;
(3) For each character, look up the character node for the same character in the child set of the current node; if none exists, create a new character node and add it to the child set of the current node; then set the current node to that character node;
(4) If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
(5) At the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node and add it to the child set of the current node, which ends the traversal of this feature triple.
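The construction steps can be sketched in Python as follows. The `Node` layout and the functions `insert` and `build` are illustrative assumptions. The sketch also reuses an existing loop or milestone child instead of always creating a new one, which is an interpretation chosen to respect the rule that a parent has at most one of each; the source text only says "create".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    kind: str                                  # "root", "char", "loop", "stone" or "app"
    app: Optional[str] = None                  # APP name, for "app" nodes only
    chars: Dict[str, "Node"] = field(default_factory=dict)  # CharNode children by character
    loop: Optional["Node"] = None              # at most one LoopNode child
    stone: Optional["Node"] = None             # at most one StoneNode child
    leaf: Optional["Node"] = None              # AppidNode child

def insert(root: Node, feature, app: str) -> None:
    cur = root                                 # step (2): start at the root
    for index, element in enumerate(feature):  # HOST, PATH, UA in order
        for ch in element:
            if ch == "%":                      # step (4): wildcard
                if cur.kind != "loop":         # never put a loop below a loop
                    if cur.loop is None:
                        cur.loop = Node("loop")
                    cur = cur.loop
            else:                              # step (3): literal character
                if ch not in cur.chars:
                    cur.chars[ch] = Node("char")
                cur = cur.chars[ch]
        if index < 2:                          # step (5): HOST or PATH finished
            if cur.stone is None:
                cur.stone = Node("stone")
            cur = cur.stone
        else:                                  # step (5): UA finished
            cur.leaf = Node("app", app=app)

def build(features) -> Node:
    root = Node("root")                        # step (1)
    for feature, app in features.items():
        insert(root, feature, app)
    return root

root = build({("%.a.com", "/p/%", "%"): "A", ("%.a.com", "%", "%"): "B"})
```

Because both sample triples share the HOST pattern `%.a.com`, they share every node up to the first milestone node and only diverge below it, which is what keeps the tree compact as the feature library grows.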
The identification 4 of feature triples and application software APP specifically comprises the following steps:
(1) The input parameter is a data triple (HOST, PATH, UA); for each data triple, begin searching for its path in the defined M-tree;
(2) Starting from the root node, convert the HOST of the data triple into a character stream and search from top to bottom until one or more milestone nodes are found;
(3) Starting from these milestone nodes, convert the PATH of the data triple into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
(4) Starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from top to bottom until all APP nodes of the final stage are found;
(5) From all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
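The search steps can be sketched end to end in Python. Several points are assumptions rather than details from the source: the search keeps a set of simultaneously active nodes so that a loop node can absorb any number of characters, distance from the root is taken to be the node depth recorded at insertion time, and the names `Node`, `child`, `insert`, `closure` and `identify` are hypothetical.

```python
class Node:
    def __init__(self, kind, depth=0, app=None):
        self.kind, self.depth, self.app = kind, depth, app
        self.chars, self.loop, self.stone, self.leaf = {}, None, None, None

def child(cur, slot, kind, app=None):
    """Return the single child stored in `slot`, creating it if absent."""
    if getattr(cur, slot) is None:
        setattr(cur, slot, Node(kind, cur.depth + 1, app))
    return getattr(cur, slot)

def insert(root, feature, app):
    cur = root
    for index, element in enumerate(feature):
        for ch in element:
            if ch != "%":
                if ch not in cur.chars:
                    cur.chars[ch] = Node("char", cur.depth + 1)
                cur = cur.chars[ch]
            elif cur.kind != "loop":
                cur = child(cur, "loop", "loop")
        # HOST and PATH end in a milestone node, UA ends in an APP node.
        cur = child(cur, "stone", "stone") if index < 2 else child(cur, "leaf", "app", app)

def closure(nodes):
    """Also activate LoopNode children, since '%' may match zero characters."""
    return set(nodes) | {n.loop for n in nodes if n.loop is not None}

def identify(root, data):
    active = closure({root})
    for index, element in enumerate(data):          # steps (2) to (4)
        for ch in element:
            step = {n.chars[ch] for n in active if ch in n.chars}
            step |= {n for n in active if n.kind == "loop"}   # loop absorbs ch
            active = closure(step)
        slot = "stone" if index < 2 else "leaf"
        active = closure({getattr(n, slot) for n in active
                          if getattr(n, slot) is not None})
    best = max(active, key=lambda n: n.depth, default=None)   # step (5)
    return best.app if best else None

root = Node("root")
insert(root, ("%.taobao.com", "/push/%", "%"), "taobao-push")
insert(root, ("%.taobao.com", "%", "%"), "taobao")
print(identify(root, ("m.taobao.com", "/push/msg", "Mozilla/5.0")))  # taobao-push
```

In this sketch the work per log depends on the length of the data triple and on how many branches stay active, not on the total number of feature triples, which is consistent with the scalability the method aims for.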
The above is only a specific application example of the present invention and does not limit the scope of protection of the present invention in any way. All technical solutions formed by equivalent transformation or equivalent replacement fall within the scope of protection of the present invention.
Claims (3)
1. A real-time streaming data processing application software feature recognition method, characterized in that it comprises defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and the identification of feature triples and application software APP, wherein the defined M-tree contains a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode;
the root node Root is a special milestone node and is the root of the M-tree; the character node CharNode is a literal character in a feature triple, that is, any character other than the wildcard '%', and one character node CharNode contains exactly one character; the loop node LoopNode expresses the wildcard '%', which means that a child of this node may be the node itself; the milestone node StoneNode expresses an element boundary within a feature triple, which means that the children of this node begin the next triple element; the APP node AppidNode is a special milestone node, is a leaf node of the whole M-tree, and represents one specific APP;
formulating the node constraint rules specifically includes: an M-tree is allowed only one root node; a milestone node StoneNode or an APP node may not appear directly below the root node Root; a loop node may not appear directly below a loop node LoopNode; a milestone node or an APP node may not appear directly below a milestone node StoneNode; an APP node AppidNode must be a leaf node with no child of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode as children; a parent node may have several character nodes as children, but each must correspond to a distinct character, and no character may be repeated.
2. The real-time streaming data processing application software feature recognition method according to claim 1, characterized in that the construction algorithm of the M-tree specifically comprises the following steps:
Step 1: Create the root node and begin traversing the full set of feature triples;
Step 2: For each feature triple, set the current node to the root node and traverse the HOST, PATH and UA elements of the triple in order;
Step 3: For each character, look up the character node for the same character in the child set of the current node; if none exists, create a new character node and add it to the child set of the current node; then set the current node to that character node;
Step 4: If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child set of the current node, and set the current node to the newly added loop node;
Step 5: At the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node and add it to the child set of the current node, which ends the traversal of this feature triple.
3. The real-time streaming data processing application software feature recognition method according to claim 1, characterized in that the identification of feature triples and application software APP specifically comprises the following steps:
Step 1: The input parameter is a data triple (HOST, PATH, UA); for each data triple, begin searching for its path in the defined M-tree;
Step 2: Starting from the root node, convert the HOST of the data triple into a character stream and search from top to bottom until one or more milestone nodes are found;
Step 3: Starting from these milestone nodes, convert the PATH of the data triple into a character stream and search from top to bottom until the milestone nodes of the next stage are found;
Step 4: Starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from top to bottom until all APP nodes of the final stage are found;
Step 5: From all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710833546.9A CN107798060B (en) | 2017-09-15 | 2017-09-15 | Real-time streaming data processing application software feature recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710833546.9A CN107798060B (en) | 2017-09-15 | 2017-09-15 | Real-time streaming data processing application software feature recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107798060A (en) | 2018-03-13 |
CN107798060B (en) | 2023-06-30 |
Family
ID=61531981
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710833546.9A Active CN107798060B (en) | 2017-09-15 | 2017-09-15 | Real-time streaming data processing application software feature recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107798060B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103051725A (en) * | 2012-12-31 | 2013-04-17 | 华为技术有限公司 | Application identification method, data mining method, device and system |
CN105591973A (en) * | 2015-12-31 | 2016-05-18 | 杭州数梦工场科技有限公司 | Application recognition method and apparatus |
-
2017
- 2017-09-15 CN CN201710833546.9A patent/CN107798060B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN107798060B (en) | 2023-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200364033A1 (en) | API Specification Generation | |
CN103138981B (en) | A kind of social network analysis method and apparatus | |
US8655805B2 (en) | Method for classification of objects in a graph data stream | |
Nourian et al. | Demystifying automata processing: GPUs, FPGAs or Micron's AP? | |
CN109902274A (en) | A kind of method and system converting json character string to thrift binary stream | |
US20100083194A1 (en) | System and method for finding connected components in a large-scale graph | |
CN107566150A (en) | Handle the method and physical node of cloud resource | |
CN109697456A (en) | Business diagnosis method, apparatus, equipment and storage medium | |
US11797534B2 (en) | Efficient SQL-based graph random walk | |
CN104778164B (en) | Detection repeats URL method and device | |
US20220237220A1 (en) | Template generation using directed acyclic word graphs | |
CN106126383A (en) | A kind of log processing method and device | |
Matsuda et al. | Extension of graph-based induction for general graph structured data | |
Graham et al. | Finding and visualizing graph clusters using pagerank optimization | |
CN103927325B (en) | A kind of method and device classified to URL | |
CN110598417B (en) | Software vulnerability detection method based on graph mining | |
CN102684997A (en) | Classification method, classification device, training method and training device of communication messages | |
CN105808729B (en) | Academic big data analysis method based on adduction relationship between paper | |
CN106599241A (en) | Big data visual management method for GIS software | |
CN107688594B (en) | The identifying system and method for risk case based on social information | |
US20190197161A1 (en) | Program synthesis for query optimization | |
CN112069305A (en) | Data screening method and device and electronic equipment | |
CN107798060A (en) | A kind of real time streaming data handles application features recognition methods | |
CN112231481A (en) | Website classification method and device, computer equipment and storage medium | |
CN105550240B (en) | A kind of method and device of recommendation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||