CN110929107A - Method, system, device and storage medium for analyzing network access log - Google Patents

Publication number
CN110929107A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911009697.8A
Other languages
Chinese (zh)
Inventor
张毅
符伟彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ai Media Guangzhou Number Of Poly Information Consulting Ltd By Share Ltd
Original Assignee
Ai Media Guangzhou Number Of Poly Information Consulting Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-10-23
Filing date
2019-10-23
Publication date
2020-03-27
Application filed by Ai Media Guangzhou Number Of Poly Information Consulting Ltd By Share Ltd
Priority to CN201911009697.8A
Publication of CN110929107A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a method, system, device and storage medium for analyzing a network access log. The method comprises the following steps: acquiring url information from a log record and acquiring a host character string according to the url information; and performing polling matching on the host character string using a dictionary tree built for reverse-order query, then acquiring the corresponding application information according to the matching result. Because the host character string is matched by polling a reverse-order dictionary tree, the backtracking problem of polling regular-expression matching is avoided, the matching speed of the host part is greatly improved, and the analysis speed of the network access log is indirectly improved. The invention can be widely applied in computer data processing technology.

Description

Method, system, device and storage medium for analyzing network access log
Technical Field
The present invention relates to computer data processing technology, and in particular, to a method, system, apparatus, and storage medium for analyzing a network access log.
Background
In a hadoop environment, a large number of network access log records are produced every day. Each record holds information such as the url, access time, ip and user-agent of a user accessing an application (websites, apps and the like). The purpose of the records is to let the system analyze which applications users access in each time period. The principle is that the application a user is using, and the operation performed within it, are derived from features such as the url and user-agent of the access. Because information such as the url is of string type and the daily data volume is large, the information needs to be compressed as much as possible, so the system needs to encode user tag information, hit application tags, and the like.
To number the application records accessed by users, the current main schemes are: 1. using hql directly, matching rule by rule with regular-expression commands; 2. using a MapReduce script. When a large number of applications need to be matched, the first scheme is not advisable. The second scheme generally also works rule by rule, looping over the rules with Java's built-in regular-expression classes; because the matching rules are diverse and Java's regular-expression engine backtracks, the matching speed is unstable.
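As a point of comparison, rule-by-rule regular-expression polling of the kind described above might look like the following minimal Python sketch. The patterns are written in the style of the matching rules given later in this document; the application numbers 1 to 3 and the name match_by_polling are hypothetical and serve only as illustration. The scheme described in the source uses Java's regular-expression classes, not Python.

```python
import re

# Hypothetical rule table: (application number, host pattern). Illustrative only.
RULES = [
    (1, r".*\.snssdk\.com"),
    (2, r".*\.uczd\.cn"),
    (3, r".*\.news\.qq\.com"),
]
COMPILED = [(app_number, re.compile(pattern)) for app_number, pattern in RULES]


def match_by_polling(host):
    """Try every rule in turn; the work per record grows with the number of rules."""
    return [app_number for app_number, pattern in COMPILED if pattern.fullmatch(host)]
```

Every record is tested against every pattern, so the cost per record depends on both the rule count and how each pattern behaves on the given host.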
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method, system, apparatus, and storage medium for analyzing a network access log, which can perform matching stably and quickly.
The first technical solution adopted by the invention is:
A method of analyzing a network access log, comprising the following steps:
acquiring url information from a log record, and acquiring a host character string according to the url information;
performing polling matching on the host character string using a dictionary tree built for reverse-order query, and acquiring corresponding application information according to the matching result.
Further, the step of acquiring url information from the log record and acquiring a host character string according to the url information specifically comprises:
reading the url information in the log record by running MapReduce, and acquiring the host character string according to the url information.
Further, the step of performing polling matching on the host character string using the dictionary tree built for reverse-order query and acquiring corresponding application information according to the matching result specifically comprises the following steps:
performing matching queries with the dictionary tree starting from the last character of the host character string;
after a mark of the host character string is identified according to the preset rule matching configuration file, obtaining a target number from the host character string;
sending the host character string to the corresponding matching number list according to the target number for polling, to obtain the application information corresponding to the host character string.
Further, the data structure of the dictionary tree includes an array of child nodes.
Further, the url information further includes a uri path feature, a request parameter feature and a user-agent feature.
The second technical solution adopted by the invention is:
A system for analyzing a network access log, comprising:
a feature acquisition module, configured to acquire url information from a log record and acquire a host character string according to the url information;
a matching query module, configured to perform polling matching on the host character string using a dictionary tree built for reverse-order query and acquire corresponding application information according to the matching result.
Further, the feature acquisition module is specifically configured to read the url information in the log record by running MapReduce, and acquire the host character string according to the url information.
Further, the matching query module includes:
a matching query unit, configured to perform matching queries with the dictionary tree starting from the last character of the host character string;
a mark identification unit, configured to obtain a target number from the host character string after a mark of the host character string is identified according to the preset rule matching configuration file;
a number polling unit, configured to send the host character string to the corresponding matching number list according to the target number for polling, to obtain the application information corresponding to the host character string.
The third technical solution adopted by the invention is:
An apparatus for analyzing a network access log, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The fourth technical solution adopted by the invention is:
A storage medium having stored therein processor-executable instructions which, when executed by a processor, perform the method described above.
The beneficial effects of the invention are: the host character string is matched by polling a dictionary tree built for reverse-order query, which avoids the backtracking problem of polling regular-expression matching, greatly improves the matching speed of the host part, and indirectly improves the analysis speed of the network access log.
Drawings
FIG. 1 is a flow chart of the steps of a method of analyzing a network access log of the present invention;
FIG. 2 is a block diagram of a system for analyzing a network access log according to the present invention.
Detailed Description
As shown in FIG. 1, this embodiment provides a method for analyzing a network access log, including the following steps:
S1, acquiring url information from a log record, and acquiring a host character string according to the url information;
S2, performing polling matching on the host character string using a dictionary tree built for reverse-order query, and acquiring corresponding application information according to the matching result.
Step S1 specifically includes: reading the url information in the log record by running MapReduce, and acquiring the host character string according to the url information.
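The host extraction in step S1 could be sketched as below. This is a minimal Python illustration using the standard library rather than the MapReduce job of the source; the function name extract_host and the handling of urls logged without a scheme are assumptions.

```python
from urllib.parse import urlsplit


def extract_host(url):
    """Return the host part of a logged url, or an empty string if none is found."""
    # urlsplit only recognizes the host when a scheme or a leading "//" is present,
    # so urls logged without a scheme are given a "//" prefix first (assumption).
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlsplit(url).hostname or ""
```

For example, extract_host("http://sport.qq.com/a?b=1") yields the host without path, query or port, which is the string passed on to step S2.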
Step S2 specifically includes steps S21 to S23:
S21, performing matching queries with the dictionary tree starting from the last character of the host character string;
S22, after a mark of the host character string is identified according to the preset rule matching configuration file, obtaining a target number from the host character string;
S23, sending the host character string to the corresponding matching number list according to the target number for polling, to obtain the application information corresponding to the host character string.
In this embodiment, the url features of each application, including the host feature, the uri path feature, the request parameter feature and the user-agent feature, are analyzed in advance, and a rule matching configuration file is generated and uploaded to the hadoop environment. Log records are read by running MapReduce. Based on the rule matching configuration file, the character string of the host part is handled with a dictionary tree in which the characters of each rule are inserted starting from the end of the rule, for example if a certain host is ". SP. […] This is because, for most application rules, the front part of the host in the url information is fuzzy while the rear part is definite. For example, a certain application's rule is "*.sport.qq.com", where "*" represents the uncertain front part and ".sport.qq.com" is the definite part; a host is considered matched once the definite part of the rule has been matched.
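The reverse insertion order described above can be sketched as follows, assuming rules are written with a leading "*" for the fuzzy front part. The nested-dict layout, the "*" marker key and the names insertion_order and build_reverse_trie are illustrative choices rather than anything taken from the source, and the application number 7 used in the usage note is hypothetical.

```python
def insertion_order(rule):
    """Characters of the definite part of a rule, in the order they are inserted."""
    definite = rule.lstrip("*")      # drop the fuzzy leading part
    return list(reversed(definite))  # insert from the tail of the rule


def build_reverse_trie(rules):
    """Build a nested-dict dictionary tree from (application number, rule) pairs."""
    root = {}
    for app_number, rule in rules:
        node = root
        for ch in insertion_order(rule):
            node = node.setdefault(ch, {})
        # The "*" key marks that any prefix may precede the suffix ending here.
        node.setdefault("*", []).append(app_number)
    return root
```

Calling build_reverse_trie([(7, "*.sport.qq.com")]) produces a single path m, o, c, ".", q, q, ".", t, r, o, p, s, "." whose last node carries the "*" marker with application number 7.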
Specifically, so that the information for each character can be fetched quickly during matching, and because the range of characters in a host is small and their ASCII codes are consecutive, arrays are chosen for the child nodes. The data structure of a tree node includes: a character, a number list and a child node array. The character range considered is the 94 characters with decimal ASCII codes 33 to 126, so the number of nodes in each layer is a multiple of 94: a node that has any children holds an array of 94 child slots.
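One possible shape for such a node is sketched below in Python. The field names (char, app_numbers, children) and the lazy allocation of the 94-slot array are assumptions made for illustration; the source states only that a node holds a character, a number list and a child node array.

```python
from dataclasses import dataclass, field
from typing import List, Optional

FIRST, LAST = 33, 126        # decimal ASCII range of the characters considered
WIDTH = LAST - FIRST + 1     # 94 slots per child array


@dataclass
class TrieNode:
    char: str
    app_numbers: List[int] = field(default_factory=list)
    children: Optional[List[Optional["TrieNode"]]] = None  # allocated on first child

    def child(self, ch, create=False):
        """Return the child for ch, optionally creating it; None if absent."""
        index = ord(ch) - FIRST
        if not 0 <= index < WIDTH:
            return None                      # character outside the range considered
        if self.children is None:
            if not create:
                return None
            self.children = [None] * WIDTH   # a node with children always has 94 slots
        if self.children[index] is None and create:
            self.children[index] = TrieNode(ch)
        return self.children[index]
```

Looking up a child is a single array access by index, which is the reason given in the text for preferring arrays over a searched collection.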
Specifically, during matching, the host part is first cut out of the url in the log record. Then, starting from the tail of the host character string, the dictionary tree is polled in the manner of looking up a word in a dictionary. Whenever the node object corresponding to "*" among the sibling nodes or next-layer nodes of the character being polled is not empty, the corresponding application is considered matched, and its application number is added to the set of matched application numbers; polling continues until the head of the string is reached. Since this embodiment mainly speeds up the host matching part, once the host matching step has produced the set of matched application numbers, further screening can be performed on other features (such as the uri path feature, request parameter feature and user-agent feature) using existing technical means.
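The tail-first polling described above might be sketched as follows. This is a simplified Python illustration, not the source's implementation: the Node class, the apps list standing in for the non-empty "*" node, the rule syntax with a leading "*", and the application numbers 1 and 2 are all assumptions.

```python
class Node:
    def __init__(self):
        self.children = {}   # next characters, keyed by character
        self.apps = []       # application numbers whose "*" marker sits at this node


def build(rules):
    """Insert each rule's definite part from its tail towards its head."""
    root = Node()
    for app_number, rule in rules:
        node = root
        for ch in reversed(rule.lstrip("*")):
            node = node.children.setdefault(ch, Node())
        node.apps.append(app_number)
    return root


def match_host(root, host):
    """Walk the host from its last character, collecting every rule passed on the way."""
    matched = set()
    node = root
    for ch in reversed(host):
        node = node.children.get(ch)
        if node is None:
            break                     # no rule continues with this character
        matched.update(node.apps)     # a "*" marker here means the rule matches
    return matched
```

With the hypothetical rules [(1, "*.sport.qq.com"), (2, "*.qq.com")], the host "nba.sport.qq.com" collects application 2 after ".qq.com" and application 1 after ".sport.qq.com" in a single pass; the resulting set can then be screened by the other features.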
By constructing a dictionary tree for reverse-order query, this embodiment avoids polling regular-expression matching altogether, and with it the backtracking problem of regular-expression matching; the matching speed for a single record is at least doubled. Assuming there are 1000 rules, with polling regular-expression matching the matching time of each record is unstable and the worst case cannot be estimated. With the method of this embodiment, since a host is conventionally not very long, mostly under 50 characters, in the worst case about 50 characters need to be polled for one record, and even that case arises only when the matching rules have a corresponding character at every layer. The method of this embodiment therefore greatly improves the matching speed, reduces the computing load on the system, correspondingly improves the service processing speed, and avoids wasting computing resources.
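The cost argument above can be written down roughly as follows. This is an illustrative estimate rather than a formula from the source: R denotes the number of rules, h the host character string, t_i(h) the time regular expression i takes on h, and c the cost of one array lookup, all of which are notation introduced here.

```latex
T_{\mathrm{regex}}(h) = \sum_{i=1}^{R} t_i(h),
\qquad
T_{\mathrm{trie}}(h) \le c \cdot |h|
```

With R = 1000 and |h| at most about 50, as assumed in the text, the dictionary tree needs at most about 50 lookups per record regardless of R, whereas each t_i(h) depends on the pattern and can grow quickly when the regular-expression engine backtracks.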
Specific Embodiment
This embodiment provides a method for analyzing a network access log, which comprises the following steps:
Step 1: the following matching rules are preset: "*.snssdk.com", "*.uczd.cn", "*.news.qq.com", and a rule matching configuration file is obtained;
Step 2: part of the dictionary tree data generated from the rule matching configuration file is as follows:
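[Figure BDA0002243832440000041: partial data of the dictionary tree generated from the rule matching configuration file; image not reproduced]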
The dictionary tree above omits the effect of loading other rules. In the omitted part, if a rule exists for a given character, the object at the corresponding position of the array is not empty; otherwise it is empty. For example, if no rule in the first layer corresponds to "a", the object at the position for "a" in the array is empty. The array subscript of a character is obtained by subtracting 33 from the decimal ASCII value of the character, 33 being the decimal ASCII value of the first character in the range considered.
Step 3: assume the host of the url in the analyzed log is "api.[…]". The characters are matched one by one starting from the tail; because the corresponding object in the next node is not empty, the application number corresponding to the "snssdk" rule is recorded. Polling continues until "a" has been reached with no other rule matching, which finally yields the application number for the host "api.[…]".
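The three steps can be traced end to end with the sketch below, which uses 94-slot child arrays indexed by ASCII value minus 33. Several details are assumptions: the host "api.snssdk.com" is a hypothetical stand-in because the host in the source text is truncated, the application numbers 101 to 103 are invented, the rules are assumed to start with "*", and every node allocates its child array eagerly for brevity.

```python
BASE = 33    # decimal ASCII value of the first character considered
WIDTH = 94   # characters 33..126


class Node:
    def __init__(self):
        self.children = [None] * WIDTH  # one slot per character in the range
        self.apps = []                  # numbers of rules whose "*" starts here


def slot(ch):
    """Array subscript of a character: its ASCII value minus 33."""
    return ord(ch) - BASE


def insert(root, rule, app_number):
    """Insert a rule from its tail; rules are assumed to use characters 33..126 only."""
    node = root
    for ch in reversed(rule.lstrip("*")):
        i = slot(ch)
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.apps.append(app_number)


def match(root, host):
    """Poll the tree from the tail of the host and return the matched numbers."""
    found, node = [], root
    for ch in reversed(host):
        i = slot(ch)
        if not 0 <= i < WIDTH:
            break                       # character outside the range considered
        node = node.children[i]
        if node is None:
            break                       # empty slot: no rule continues here
        found.extend(node.apps)
    return found


root = Node()
for number, rule in [(101, "*.snssdk.com"), (102, "*.uczd.cn"), (103, "*.news.qq.com")]:
    insert(root, rule, number)

print(match(root, "api.snssdk.com"))  # [101]
```

The walk consumes "m", "o", "c", ".", "k", "d", "s", "s", "n", "s", "." and collects 101 at that node; the next character "i" finds an empty slot, so polling stops there in this sketch.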
As shown in fig. 2, this embodiment further provides a system for analyzing a network access log, including:
a feature acquisition module, configured to acquire url information from a log record and acquire a host character string according to the url information;
a matching query module, configured to perform polling matching on the host character string using a dictionary tree built for reverse-order query and acquire corresponding application information according to the matching result.
As a further preferred embodiment, the feature acquisition module is specifically configured to read the url information in the log record by running MapReduce, and acquire the host character string according to the url information.
As a further preferred embodiment, the matching query module includes:
a matching query unit, configured to perform matching queries with the dictionary tree starting from the last character of the host character string;
a mark identification unit, configured to obtain a target number from the host character string after a mark of the host character string is identified according to the preset rule matching configuration file;
a number polling unit, configured to send the host character string to the corresponding matching number list according to the target number for polling, to obtain the application information corresponding to the host character string.
The system for analyzing the network access log according to the embodiment of the invention can execute the method for analyzing the network access log provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The present embodiment further provides an apparatus for analyzing a network access log, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The device for analyzing the network access log according to the embodiment of the invention can execute the method for analyzing the network access log provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The present embodiments also provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method as described above.
The storage medium of this embodiment may execute the method for analyzing a network access log provided in the method embodiment of the present invention, may execute any combination of the implementation steps of the method embodiment, and has corresponding functions and advantageous effects of the method.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method of analyzing a network access log, comprising the steps of:
acquiring url information from a log record, and acquiring a host character string according to the url information;
performing polling matching on the host character string using a dictionary tree built for reverse-order query, and acquiring corresponding application information according to the matching result.
2. The method for analyzing a network access log according to claim 1, wherein the step of obtaining url information in a log record and obtaining a host character string according to the url information includes:
reading the url information in the log record by running MapReduce, and acquiring the host character string according to the url information.
3. The method for analyzing the network access log according to claim 1, wherein the step of performing matching query on host character strings by using a dictionary tree of reverse query and acquiring corresponding application information according to a matching query result specifically comprises the following steps:
performing matching queries with the dictionary tree starting from the last character of the host character string;
after a mark of the host character string is identified according to the preset rule matching configuration file, obtaining a target number from the host character string;
sending the host character string to the corresponding matching number list according to the target number for polling, to obtain the application information corresponding to the host character string.
4. The method of claim 3, wherein the data structure of the trie comprises an array of child nodes.
5. The method of analyzing a network access log of claim 1, wherein the url information further includes uri path feature, request parameter feature and user-agent feature.
6. A system for analyzing a network access log, comprising:
a feature acquisition module, configured to acquire url information from a log record and acquire a host character string according to the url information;
a matching query module, configured to perform polling matching on the host character string using a dictionary tree built for reverse-order query and acquire corresponding application information according to the matching result.
7. The system for analyzing a network access log according to claim 6, wherein the feature obtaining module is specifically configured to read url information in a log record by operating MapReduce, and obtain a host character string according to the url information.
8. The system for analyzing logs of network access of claim 6, wherein the match query module comprises:
a matching query unit, configured to perform matching queries with the dictionary tree starting from the last character of the host character string;
a mark identification unit, configured to obtain a target number from the host character string after a mark of the host character string is identified according to the preset rule matching configuration file;
a number polling unit, configured to send the host character string to the corresponding matching number list according to the target number for polling, to obtain the application information corresponding to the host character string.
9. An apparatus for analyzing a network access log, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a method of analyzing a network access log as recited in any of claims 1-5.
10. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method of any one of claims 1-5.
CN201911009697.8A 2019-10-23 2019-10-23 Method, system, device and storage medium for analyzing network access log Pending CN110929107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911009697.8A CN110929107A (en) 2019-10-23 2019-10-23 Method, system, device and storage medium for analyzing network access log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911009697.8A CN110929107A (en) 2019-10-23 2019-10-23 Method, system, device and storage medium for analyzing network access log

Publications (1)

Publication Number Publication Date
CN110929107A true CN110929107A (en) 2020-03-27

Family

ID=69849170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911009697.8A Pending CN110929107A (en) 2019-10-23 2019-10-23 Method, system, device and storage medium for analyzing network access log

Country Status (1)

Country Link
CN (1) CN110929107A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102316099A (en) * 2011-07-28 2012-01-11 中国科学院计算机网络信息中心 Network fishing detection method and apparatus thereof
CN108549679A (en) * 2018-04-03 2018-09-18 国家计算机网络与信息安全管理中心 File extension fast matching method and device for URL analysis systems
CN110222238A (en) * 2019-04-30 2019-09-10 上海交通大学 The querying method and system of character string and identifier biaxial stress structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Cheng et al.: "Keyword ciphertext fuzzy search scheme based on a secure dictionary tree" (基于安全字典树的关键词密文模糊搜索方案) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115878924A (en) * 2021-09-27 2023-03-31 小沃科技有限公司 Data processing method, device, medium and electronic equipment based on double dictionary trees
CN115878924B (en) * 2021-09-27 2024-03-12 小沃科技有限公司 Data processing method, device, medium and electronic equipment based on double dictionary trees

Similar Documents

Publication Publication Date Title
US20220035775A1 (en) Data field extraction model training for a data intake and query system
CN108090064B (en) Data query method and device, data storage server and system
CN107257390B (en) URL address resolution method and system
KR20120106978A (en) Methods and apparatuses for reducing power consumption in a pattern recognition processor
CN110768875A (en) Application identification method and system based on DNS learning
WO2020199603A1 (en) Server vulnerability detection method and apparatus, device, and storage medium
US20150128280A1 (en) Network service interface analysis
KR102587776B1 (en) System and method for managing connectivity in a scalable cluster
CN112347165A (en) Log processing method and device, server and computer readable storage medium
CN110851136A (en) Data acquisition method and device, electronic equipment and storage medium
CN111897828A (en) Data batch processing implementation method, device, equipment and storage medium
CN111803917A (en) Resource processing method and device
CN111585963A (en) Data acquisition method, system and storage medium
CN114579533A (en) Method and device for acquiring user activity index, electronic equipment and storage medium
CN110929107A (en) Method, system, device and storage medium for analyzing network access log
CN111078975B (en) Multi-node incremental data acquisition system and acquisition method
CN110795915B (en) Method, system, device and computer readable storage medium for modifying xml files in batches
CN112269726A (en) Data processing method and device
CN115495462A (en) Batch data updating method and device, electronic equipment and readable storage medium
CN111124883A (en) Test case library introduction method, system and equipment based on tree form
CN106250440B (en) Document management method and device
CN114416741A (en) KV data writing and reading method and device based on multi-level index and storage medium
CN113051333A (en) Data processing method and device, electronic equipment and storage medium
CN105190598A (en) Resource reference classification
CN114697271A (en) Method and device for determining data flow label and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200327