CN107798060A - Real-time streaming data processing application software feature recognition method - Google Patents

Real-time streaming data processing application software feature recognition method

Info

Publication number
CN107798060A
Authority
CN
China
Prior art keywords
node
character
triple
milestone
app
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710833546.9A
Other languages
Chinese (zh)
Other versions
CN107798060B (en)
Inventor
饶翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING AXON TECHNOLOGY Co Ltd
Original Assignee
NANJING AXON TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING AXON TECHNOLOGY Co Ltd
Priority to CN201710833546.9A
Publication of CN107798060A
Application granted
Publication of CN107798060B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a real-time streaming data processing application software feature recognition method, comprising defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and identification of feature triples and application software APP. A custom M-tree is defined, containing a root node, character nodes, loop nodes, milestone nodes and APP nodes, and corresponding node constraint rules are formulated for it; the tree is then built with the construction algorithm, and finally application software is identified by analyzing and processing the feature triples. The invention provides a real-time streaming data processing application software feature recognition method that uses few processor resources, computes quickly, has a large expansion capacity and has outstanding algorithm performance.

Description

Real-time streaming data processing application software feature recognition method
Technical field
The present invention relates to the technical field of information classification, and in particular to a real-time streaming data processing application software feature recognition method.
Background
In the big data era, data is a resource; mining and applying data helps enterprises understand their users better and improve service quality and market competitiveness. Data accumulation is the foundation of big data applications. In the mobile Internet era, users mainly go online through application software (hereinafter APP). Performing feature recognition on users' internet access logs and matching them to APPs provides the basic data for later batch analysis and is a precondition for data mining.
The server address HOST and the request PATH, together with the terminal UA field, form a data triple (HOST, PATH, UA) that can serve as the input element for identifying an APP. An APP feature library describes the organization rules of the data triples corresponding to specific APPs, given as a set of static strings combined with dynamic fuzzy matching. For example, (%.taobao.com, /push/%, %) is a feature triple, in which the character '%' is a wildcard representing zero to N characters. UA in the above description is short for User-Agent, a special string header that lets the server identify the operating system and version, processor type, browser and version, browser rendering engine, browser language, browser plug-ins and so on in use.
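To make the '%' wildcard concrete, the following is a minimal Python sketch of matching one data triple against one feature triple such as (%.taobao.com, /push/%, %). The function names `wildcard_to_regex` and `triple_matches`, the host names other than the taobao pattern, and the UA string are illustrative assumptions, and translating '%' into a regular expression is only one possible way to express "zero to N characters".

```python
import re

def wildcard_to_regex(pattern):
    """Compile a feature string in which '%' stands for zero or more characters."""
    # Escape the literal pieces so that '.' and similar characters match themselves.
    literal_parts = [re.escape(part) for part in pattern.split("%")]
    return re.compile(".*".join(literal_parts), re.DOTALL)

def triple_matches(feature, data):
    """True if every element of the data triple matches the feature triple."""
    return all(
        wildcard_to_regex(f).fullmatch(d) is not None
        for f, d in zip(feature, data)
    )

feature = ("%.taobao.com", "/push/%", "%")  # (HOST, PATH, UA)
print(triple_matches(feature, ("m.taobao.com", "/push/msg", "Mozilla/5.0")))   # True
print(triple_matches(feature, ("m.example.com", "/push/msg", "Mozilla/5.0")))  # False
```

Note that "%.taobao.com" requires at least the leading dot, so a bare "taobao.com" would not match under this reading.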
An APP usually corresponds to N feature triples, but a feature triple can correspond to only one APP. When a data triple satisfies several feature triples at once, the best-matching feature triple must be taken as the recognition result. In practice, this means picking, from the set of matching feature triples, the one with the longest feature character length. In real projects there are typically more than 1,000 APPs with about 10 feature triples each on average, so there are more than 10,000 feature triples.
The traditional method identifies APPs as follows: for each internet access log, extract the data triple (HOST, PATH, UA); traverse all feature triples, compare them one by one, and cache the feature triples that satisfy the matching condition; after the traversal is complete, take the APP corresponding to the feature triple with the longest feature characters as the recognition result.
The traditional method has the following disadvantages. It consumes excessive CPU clock resources: for example, identifying APPs for 10,000 internet access logs against a feature library of 10,000 triples requires 10,000 × 10,000 = 100,000,000 match operations, which is a large amount of computation. As data traffic or feature library capacity gradually increases, the number of comparisons grows with the product of the two, so computational efficiency falls steeply and algorithm performance degrades sharply.
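The traditional approach described above can be sketched as follows. This is an illustrative baseline only: the names `matches`, `literal_length` and `identify_brute_force` and the APP identifiers in the usage note are assumptions, and "longest feature characters" is read here as the number of non-wildcard characters in the triple. Every log is compared with every feature triple, which is where the 10,000 × 10,000 comparisons come from.

```python
import re

def matches(feature, data):
    """Compare one data triple with one feature triple; '%' is a wildcard."""
    for f, d in zip(feature, data):
        regex = ".*".join(re.escape(part) for part in f.split("%"))
        if re.fullmatch(regex, d, re.DOTALL) is None:
            return False
    return True

def literal_length(feature):
    """Count the non-wildcard characters of a feature triple (assumed ranking key)."""
    return sum(len(element.replace("%", "")) for element in feature)

def identify_brute_force(feature_library, data):
    """Scan the whole library, keep every hit, then return the longest hit's APP."""
    candidates = [
        (feature, app_id)
        for feature, app_id in feature_library
        if matches(feature, data)
    ]
    if not candidates:
        return None
    best_feature, best_app = max(candidates, key=lambda c: literal_length(c[0]))
    return best_app
```

For example, with a library containing (("%.taobao.com", "%", "%"), "taobao") and (("%.taobao.com", "/push/%", "%"), "taobao_push"), a log with HOST m.taobao.com and PATH /push/msg matches both entries, and the second is chosen because it has more literal characters.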
Summary of the invention
The purpose of the present invention is to overcome the deficiencies of the prior art and to provide a real-time streaming data processing application software feature recognition method that uses few processor resources, computes quickly, has a large expansion capacity and has outstanding algorithm performance.
To achieve the above purpose, the present invention adopts the following technical solution.
A real-time streaming data processing application software feature recognition method comprises defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and identification of feature triples and application software APP. The defined M-tree contains a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode. The root node Root is a special milestone node and is the root of the M-tree. A character node CharNode holds a literal character of a feature triple, that is, a character other than the wildcard '%'; each CharNode contains exactly one character. A loop node LoopNode expresses the wildcard '%', which means that the child of the node may be the node itself. A milestone node StoneNode expresses an element boundary within a feature triple, which means that the children of the node begin the next triple element. An APP node AppidNode is a special milestone node and a leaf of the whole M-tree, representing one specific APP.
Formulating the node constraint rules specifically includes: a defined M-tree may have only one root node; a milestone node StoneNode or an APP node may not appear directly below the root node Root; a loop node may not appear directly below a loop node LoopNode; a milestone node or an APP node may not appear directly below a milestone node StoneNode; an APP node AppidNode must be a leaf, with no child node of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode below it; a parent node may have multiple character nodes below it, but each must correspond to a distinct character, and no character may be repeated.
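The node types and constraint rules above can be expressed as a small data structure that refuses illegal children. The following Python sketch is one possible encoding, not the patent's implementation: the `Kind` enum, the `FORBIDDEN` table and the `add_child` method are hypothetical names, and the rule about loop and milestone nodes is read as "at most one of each per parent".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Kind(Enum):
    ROOT = "Root"
    CHAR = "CharNode"
    LOOP = "LoopNode"
    STONE = "StoneNode"
    APP = "AppidNode"

# Child kinds that may not appear directly below each parent kind.
FORBIDDEN = {
    Kind.ROOT: {Kind.ROOT, Kind.STONE, Kind.APP},
    Kind.CHAR: {Kind.ROOT},
    Kind.LOOP: {Kind.ROOT, Kind.LOOP},
    Kind.STONE: {Kind.ROOT, Kind.STONE, Kind.APP},
    Kind.APP: set(Kind),  # an APP node is always a leaf
}

@dataclass
class Node:
    kind: Kind
    char: Optional[str] = None    # set only on character nodes
    app_id: Optional[str] = None  # set only on APP nodes
    children: list = field(default_factory=list)

    def add_child(self, child):
        """Attach a child, raising ValueError if a constraint rule is broken."""
        if child.kind in FORBIDDEN[self.kind]:
            raise ValueError(f"{child.kind.value} not allowed under {self.kind.value}")
        if child.kind in (Kind.LOOP, Kind.STONE):
            # Assumed reading: at most one loop node and one milestone node per parent.
            if any(c.kind is child.kind for c in self.children):
                raise ValueError(f"second {child.kind.value} under one parent")
        if child.kind is Kind.CHAR:
            if child.char is None or len(child.char) != 1 or child.char == "%":
                raise ValueError("a CharNode holds exactly one literal character")
            if any(c.kind is Kind.CHAR and c.char == child.char for c in self.children):
                raise ValueError(f"duplicate character {child.char!r}")
        self.children.append(child)
        return child
```

Keeping the forbidden parent/child pairs in one table makes each rule easy to compare with the text; a production structure would more likely index character children by character for constant-time lookup.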
The M-tree construction algorithm specifically comprises the following steps:
Step 1: Create the root node and begin traversing the full set of feature triples;
Step 2: For each feature triple, set the current node to the root node and traverse the elements HOST, PATH and UA of the triple in order;
Step 3: For each character, look up the identical character node in the child set of the current node; if there is none, create a new character node and add it to the child set of the current node; then set the current node to that character node;
Step 4: If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child node set of the current node, and set the current node to the newly added loop node;
Step 5: At the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node and add it to the child set of the current node, which ends the traversal of this feature triple.
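The five construction steps can be sketched in Python as below. This is a hedged illustration rather than the patented code: the `Node` fields, `insert_triple` and `build` are hypothetical names, existing loop and milestone children are reused so that a parent never holds two of them, consecutive '%' characters collapse into one loop node, and every triple element is assumed to be non-empty.

```python
class Node:
    """One M-tree node; the field layout is an illustrative assumption."""
    def __init__(self, kind, char=None, app_id=None):
        self.kind = kind      # "root", "char", "loop", "stone" or "app"
        self.char = char      # literal character, set only on character nodes
        self.app_id = app_id  # APP identifier, set only on APP nodes
        self.chars = {}       # character -> CharNode child (one per character)
        self.loop = None      # at most one LoopNode child
        self.stone = None     # at most one StoneNode child
        self.apps = []        # AppidNode leaves

def insert_triple(root, triple, app_id):
    """Steps 2 to 5: add one feature triple (HOST, PATH, UA) to the tree."""
    node = root                                   # Step 2: start at the root
    for index, element in enumerate(triple):      # HOST, PATH, UA in order
        for ch in element:
            if ch == "%":
                # Step 4: a wildcard becomes a loop node below the current node.
                if node.kind != "loop":
                    if node.loop is None:
                        node.loop = Node("loop")
                    node = node.loop
            else:
                # Step 3: reuse the matching character node or create it.
                if ch not in node.chars:
                    node.chars[ch] = Node("char", char=ch)
                node = node.chars[ch]
        if index < 2:
            # Step 5: HOST and PATH end with a milestone node.
            if node.stone is None:
                node.stone = Node("stone")
            node = node.stone
        else:
            # Step 5: UA ends with an APP leaf.
            node.apps.append(Node("app", app_id=app_id))

def build(feature_library):
    """Step 1: create the root and traverse the full set of feature triples."""
    root = Node("root")
    for triple, app_id in feature_library:
        insert_triple(root, triple, app_id)
    return root
```

Because feature triples that share a prefix share nodes, the tree grows with the number of distinct prefixes rather than with the raw number of triples.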
The identification of feature triples and application software APP specifically comprises the following steps:
Step 1: The input parameter is a data triple (HOST, PATH, UA); for each data triple, begin searching for its path in the defined M-tree;
Step 2: Starting from the root node, convert the HOST of the data triple into a character stream and search from the top down until one or more milestone nodes are found;
Step 3: Starting from these milestone nodes, convert the PATH of the data triple into a character stream and search from the top down until the milestone nodes of the next stage are found;
Step 4: Starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from the top down until all APP nodes of the final stage are found;
Step 5: From all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
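The recognition steps can be sketched as a search that keeps a set of active nodes while consuming the HOST, PATH and UA characters in turn. The Python below is a minimal illustration under stated assumptions: all class and function names are hypothetical, the compact `build` helper is included only to make the example self-contained, a loop node is treated as matching zero or more characters, and "farthest from the root" is measured as node depth.

```python
class Node:
    """One M-tree node; field names are illustrative, not from the patent."""
    def __init__(self, kind, app_id=None):
        self.kind = kind      # "root", "char", "loop", "stone" or "app"
        self.app_id = app_id  # set only on APP nodes
        self.chars = {}       # character -> CharNode child
        self.loop = None      # at most one LoopNode child
        self.stone = None     # at most one StoneNode child
        self.apps = []        # AppidNode leaves

def build(feature_library):
    """Compact tree construction, included only so the sketch runs on its own."""
    root = Node("root")
    for triple, app_id in feature_library:
        node = root
        for index, element in enumerate(triple):
            for ch in element:
                if ch == "%":
                    if node.kind != "loop":
                        node.loop = node.loop or Node("loop")
                        node = node.loop
                else:
                    node = node.chars.setdefault(ch, Node("char"))
            if index < 2:
                node.stone = node.stone or Node("stone")
                node = node.stone
            else:
                node.apps.append(Node("app", app_id))
    return root

def closure(states):
    """Also activate loop children, because '%' may match zero characters."""
    out = set(states)
    for node, depth in states:
        if node.loop is not None:
            out.add((node.loop, depth + 1))
    return out

def consume(states, text):
    """Advance every active (node, depth) pair through one triple element."""
    states = closure(states)
    for ch in text:
        nxt = set()
        for node, depth in states:
            if node.kind == "loop":
                nxt.add((node, depth))                 # the loop absorbs ch
            if ch in node.chars:
                nxt.add((node.chars[ch], depth + 1))   # literal match
        states = closure(nxt)
    return states

def identify(root, data):
    """Steps 1 to 5: return the APP of the deepest matching APP node, or None."""
    states = {(root, 0)}
    for index, element in enumerate(data):             # HOST, PATH, UA
        states = consume(states, element)
        if index < 2:                                  # cross the milestone nodes
            states = {(n.stone, d + 1) for n, d in states if n.stone is not None}
    hits = [(d + 1, app.app_id) for n, d in states for app in n.apps]
    return max(hits, key=lambda h: h[0])[1] if hits else None
```

With the hypothetical library [(("%.taobao.com", "%", "%"), "taobao"), (("%.taobao.com", "/push/%", "%"), "taobao_push")], `identify(build(library), ("m.taobao.com", "/push/msg", "Mozilla/5.0"))` reaches both APP nodes and returns "taobao_push", the deeper one. Each input character is processed once against the active set, so the work per log depends on the log's length and the number of simultaneously active branches, not on the total number of feature triples.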
By applying the above technical solution, the present invention brings the following beneficial technical effects. The solution defines a custom M-tree containing a root node, character nodes, loop nodes, milestone nodes and APP nodes, formulates corresponding node constraint rules for it, builds the tree with the construction algorithm, and finally identifies application software by analyzing and processing the feature triples. It has strong computing performance: in testing, a single core could identify 200,000 internet access logs per second. It also scales well: when the triple capacity of the feature library increases, performance is essentially unaffected. Its algorithm performance is also outstanding: when internet access log traffic increases, the demand for hardware resources grows linearly.
Brief description of the drawings
Figure 1 is an overall structural diagram of the present invention.
Figure 2 is a flow chart of the M-tree construction algorithm of the present invention.
Figure 3 is a flow chart of the identification of feature triples and application software APP of the present invention.
In the figures: 1. define M-tree; 2. formulate node constraint rules; 3. M-tree construction algorithm; 4. identification of feature triples and application software APP.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
As shown in Figure 1, a real-time streaming data processing application software feature recognition method comprises defining an M-tree 1, formulating node constraint rules 2, an M-tree construction algorithm 3, and identification of feature triples and application software APP 4. The defined M-tree 1 contains a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode.
The root node Root is a special milestone node and is the root of the M-tree. A character node CharNode holds a literal character of a feature triple, that is, a character other than the wildcard '%'; each CharNode contains exactly one character. A loop node LoopNode expresses the wildcard '%', which means that the child of the node may be the node itself. A milestone node StoneNode expresses an element boundary within a feature triple, which means that the children of the node begin the next triple element. An APP node AppidNode is a special milestone node and a leaf of the whole M-tree, representing one specific APP.
Formulating the node constraint rules 2 specifically includes: an M-tree may have only one root node; a milestone node or an APP node may not appear directly below the root node; a loop node may not appear directly below a loop node; a milestone node or an APP node may not appear directly below a milestone node; an APP node must be a leaf, with no child node of any kind; a parent node may have at most one loop node and at most one milestone node below it; a parent node may have multiple character nodes below it, but each must correspond to a distinct character, and no character may be repeated.
The M-tree construction algorithm 3 specifically comprises the following steps:
(1) Create the root node and begin traversing the full set of feature triples;
(2) For each feature triple, set the current node to the root node and traverse the elements HOST, PATH and UA of the triple in order;
(3) For each character, look up the identical character node in the child set of the current node; if there is none, create a new character node and add it to the child set of the current node; then set the current node to that character node;
(4) If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child node set of the current node, and set the current node to the newly added loop node;
(5) At the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node and add it to the child set of the current node, which ends the traversal of this feature triple.
The identification 4 of feature triples and application software APP specifically comprises the following steps:
(1) The input parameter is a data triple (HOST, PATH, UA); for each data triple, begin searching for its path in the defined M-tree;
(2) Starting from the root node, convert the HOST of the data triple into a character stream and search from the top down until one or more milestone nodes are found;
(3) Starting from these milestone nodes, convert the PATH of the data triple into a character stream and search from the top down until the milestone nodes of the next stage are found;
(4) Starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from the top down until all APP nodes of the final stage are found;
(5) From all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
The above is only a specific application example of the present invention and does not limit the scope of protection of the present invention in any way. All technical solutions formed by equivalent transformation or equivalent replacement fall within the scope of protection of the present invention.

Claims (3)

1. A real-time streaming data processing application software feature recognition method, characterized in that it comprises defining an M-tree, formulating node constraint rules, an M-tree construction algorithm, and identification of feature triples and application software APP, the defined M-tree containing a root node Root, character nodes CharNode, loop nodes LoopNode, milestone nodes StoneNode and APP nodes AppidNode;
the root node Root is a special milestone node and is the root of the M-tree; the character node CharNode holds a literal character of a feature triple, that is, a character other than the wildcard '%', and each character node CharNode contains exactly one character; the loop node LoopNode expresses the wildcard '%', which means that the child of the node may be the node itself; the milestone node StoneNode expresses an element boundary within a feature triple, which means that the children of the node begin the next triple element; the APP node AppidNode is a special milestone node and a leaf of the whole M-tree, representing one specific APP;
formulating the node constraint rules specifically includes: a defined M-tree may have only one root node; a milestone node StoneNode or an APP node may not appear directly below the root node Root; a loop node may not appear directly below the loop node LoopNode; a milestone node or an APP node may not appear directly below the milestone node StoneNode; the APP node AppidNode must be a leaf, with no child node of any kind; a parent node may have at most one loop node LoopNode and at most one milestone node StoneNode below it; a parent node may have multiple character nodes below it, but each must correspond to a distinct character, and no character may be repeated.
2. The real-time streaming data processing application software feature recognition method according to claim 1, characterized in that the M-tree construction algorithm specifically comprises the following steps:
Step 1: Create the root node and begin traversing the full set of feature triples;
Step 2: For each feature triple, set the current node to the root node and traverse the elements HOST, PATH and UA of the triple in order;
Step 3: For each character, look up the identical character node in the child set of the current node; if there is none, create a new character node and add it to the child set of the current node; then set the current node to that character node;
Step 4: If the next character is the wildcard '%' and the current node is a character node or a milestone node, create a loop node, add it to the child node set of the current node, and set the current node to the newly added loop node;
Step 5: At the end of a non-UA element, create a milestone node, add it to the child set of the current node, and set the current node to that milestone node; at the end of the UA element, create an APP node and add it to the child set of the current node, which ends the traversal of this feature triple.
3. The real-time streaming data processing application software feature recognition method according to claim 1, characterized in that the identification of feature triples and application software APP specifically comprises the following steps:
Step 1: The input parameter is a data triple (HOST, PATH, UA); for each data triple, begin searching for its path in the defined M-tree;
Step 2: Starting from the root node, convert the HOST of the data triple into a character stream and search from the top down until one or more milestone nodes are found;
Step 3: Starting from these milestone nodes, convert the PATH of the data triple into a character stream and search from the top down until the milestone nodes of the next stage are found;
Step 4: Starting from the milestone nodes of the PATH stage, convert the UA of the data triple into a character stream and search from the top down until all APP nodes of the final stage are found;
Step 5: From all APP nodes that satisfy the feature match, take the APP node farthest from the root node as the search result.
CN201710833546.9A 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method Active CN107798060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710833546.9A CN107798060B (en) 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710833546.9A CN107798060B (en) 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method

Publications (2)

Publication Number Publication Date
CN107798060A (en) 2018-03-13
CN107798060B (en) 2023-06-30

Family

ID=61531981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710833546.9A Active CN107798060B (en) 2017-09-15 2017-09-15 Real-time streaming data processing application software feature recognition method

Country Status (1)

Country Link
CN (1) CN107798060B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103051725A (en) * 2012-12-31 2013-04-17 华为技术有限公司 Application identification method, data mining method, device and system
CN105591973A (en) * 2015-12-31 2016-05-18 杭州数梦工场科技有限公司 Application recognition method and apparatus

Also Published As

Publication number Publication date
CN107798060B (en) 2023-06-30

Similar Documents

Publication Publication Date Title
US20200364033A1 (en) API Specification Generation
CN103138981B (en) A kind of social network analysis method and apparatus
US8655805B2 (en) Method for classification of objects in a graph data stream
Nourian et al. Demystifying automata processing: GPUs, FPGAs or Micron's AP?
CN109902274A (en) A kind of method and system converting json character string to thrift binary stream
US20100083194A1 (en) System and method for finding connected components in a large-scale graph
CN107566150A (en) Handle the method and physical node of cloud resource
CN109697456A (en) Business diagnosis method, apparatus, equipment and storage medium
US11797534B2 (en) Efficient SQL-based graph random walk
CN104778164B (en) Detection repeats URL method and device
US20220237220A1 (en) Template generation using directed acyclic word graphs
CN106126383A (en) A kind of log processing method and device
Matsuda et al. Extension of graph-based induction for general graph structured data
Graham et al. Finding and visualizing graph clusters using pagerank optimization
CN103927325B (en) A kind of method and device classified to URL
CN110598417B (en) Software vulnerability detection method based on graph mining
CN102684997A (en) Classification method, classification device, training method and training device of communication messages
CN105808729B (en) Academic big data analysis method based on adduction relationship between paper
CN106599241A (en) Big data visual management method for GIS software
CN107688594B (en) The identifying system and method for risk case based on social information
US20190197161A1 (en) Program synthesis for query optimization
CN112069305A (en) Data screening method and device and electronic equipment
CN107798060A (en) A kind of real time streaming data handles application features recognition methods
CN112231481A (en) Website classification method and device, computer equipment and storage medium
CN105550240B (en) A kind of method and device of recommendation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant