CN110929107A - Method, system, device and storage medium for analyzing network access log
Info
- Publication number
- CN110929107A
- Authority
- CN
- China
- Prior art keywords
- matching
- character string
- host character
- network access
- acquiring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9027—Trees
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Debugging And Monitoring (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention discloses a method, a system, a device and a storage medium for analyzing a network access log. The method comprises the following steps: acquiring url information in a log record, and acquiring a host character string according to the url information; and performing polling matching on the host character string by using a dictionary tree queried in reverse order, and acquiring corresponding application information according to the matching result. Because the host character string is matched by polling a dictionary tree queried in reverse order, the backtracking problem of rule-by-rule regular-expression matching is avoided, the matching speed of the host part is greatly improved, and the analysis speed of the network access log is improved as a result. The invention can be widely applied in computer data processing technology.
Description
Technical Field
The present invention relates to computer data processing technology, and in particular, to a method, system, apparatus, and storage medium for analyzing a network access log.
Background
In a hadoop environment, a large number of network access log records are written every day. Each record contains information such as the url, access time, ip and user-agent of a user accessing an application (a website, an app, and so on). The purpose of the records is to let the system analyze which applications users access in each time period. The principle is to derive the application a user is using, and the operation performed within it, from features such as the url and user-agent of the access. Because information such as the url is of character string type and the daily data volume is large, the information needs to be compressed as far as possible, so the system needs to encode (number) items such as user tag information and hit application tags.
To number the application records accessed by users, the main current schemes are: 1. using hql directly, matching the rules one by one with regular-expression commands; 2. using a MapReduce script. When a large number of applications need to be matched, the first scheme is not advisable. The second scheme generally also works rule by rule, looping over the rules with Java's built-in regular-expression classes; because the matching rules are diverse and Java's regular-expression engine backtracks, the matching speed is unstable.
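The rule-by-rule scheme criticized here can be pictured with a minimal sketch. It uses Python's `re` module rather than Java's regular-expression classes, and the rule table, the application numbers and the name `match_by_polling` are illustrative assumptions, not part of the source:

```python
import re

# Hypothetical rule table: application number -> host pattern.
RULES = {
    101: re.compile(r".*\.snssdk\.com"),
    102: re.compile(r".*\.uczd\.cn"),
    103: re.compile(r".*\.news\.qq\.com"),
}

def match_by_polling(host):
    """Test the host against every rule in turn (the baseline scheme)."""
    return [app for app, pattern in RULES.items() if pattern.fullmatch(host)]
```

Every log record pays for one regular-expression evaluation per rule, so the cost per record grows with the size of the rule table and depends on how each pattern behaves on the given input.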
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method, system, apparatus, and storage medium for analyzing a network access log, which can perform matching stably and quickly.
The first technical scheme adopted by the invention is as follows:
a method of analyzing a network access log, comprising the steps of:
acquiring url information in a log record, and acquiring a host character string according to the url information;
performing polling matching on the host character string by using a dictionary tree queried in reverse order, and acquiring corresponding application information according to the matching result.
Further, the step of acquiring url information in the log record and acquiring a host character string according to the url information specifically includes:
reading the url information in the log record by running MapReduce, and acquiring the host character string according to the url information.
Further, the step of performing polling matching on the host character string by using the dictionary tree queried in reverse order and acquiring corresponding application information according to the matching result specifically includes the following steps:
performing a matching query with the dictionary tree, starting from the last character of the host character string;
after the mark of the host character string is identified according to a preset rule matching configuration file, acquiring a target number from the host character string;
sending the host character string, according to the target number, to the corresponding matching number list for polling, to obtain the application information corresponding to the host character string.
Further, the data structure of the dictionary tree includes an array of child nodes.
Further, the url information further includes uri path features, request parameter features, and user-agent features.
The second technical scheme adopted by the invention is as follows:
a system for analyzing a network access log, comprising:
a feature acquisition module, configured to acquire url information in a log record and acquire a host character string according to the url information;
a matching query module, configured to perform polling matching on the host character string by using a dictionary tree queried in reverse order, and acquire corresponding application information according to the matching result.
Further, the feature acquisition module is specifically configured to read the url information in the log record by running MapReduce, and acquire the host character string according to the url information.
Further, the matching query module includes:
a matching query unit, configured to perform a matching query with the dictionary tree, starting from the last character of the host character string;
a mark identification unit, configured to acquire a target number from the host character string after the mark of the host character string is identified according to a preset rule matching configuration file;
a number polling unit, configured to send the host character string, according to the target number, to the corresponding matching number list for polling, to obtain the application information corresponding to the host character string.
The third technical scheme adopted by the invention is as follows:
an apparatus to analyze a network access log, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The fourth technical scheme adopted by the invention is as follows:
a storage medium having stored therein processor-executable instructions for performing the method as described above when executed by a processor.
The invention has the following beneficial effects: the host character string is matched by polling a dictionary tree queried in reverse order, which avoids the backtracking problem of rule-by-rule regular-expression matching, greatly improves the matching speed of the host part, and thereby improves the analysis speed of the network access log.
Drawings
FIG. 1 is a flow chart of the steps of a method of analyzing a network access log of the present invention;
FIG. 2 is a block diagram of a system for analyzing a network access log according to the present invention.
Detailed Description
As shown in FIG. 1, the present embodiment provides a method for analyzing a network access log, including the following steps:
S1, acquiring url information in the log record, and acquiring a host character string according to the url information;
S2, performing polling matching on the host character string by using a dictionary tree queried in reverse order, and acquiring corresponding application information according to the matching result.
Step S1 specifically includes: reading the url information in the log record by running MapReduce, and acquiring the host character string according to the url information.
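The host-extraction part of step S1 might look like the following minimal sketch. It assumes the url field has already been split out of the log record; the function name `extract_host` and the handling of scheme-less urls are assumptions made for illustration, and the MapReduce job that would call it per record is not shown:

```python
from urllib.parse import urlsplit

def extract_host(url: str) -> str:
    """Return the lower-cased host part of a url, or "" if none can be found."""
    url = url.strip()
    # urlsplit only recognizes a host after "//", so add it for
    # scheme-less values such as "api.example.com/path".
    if "://" not in url:
        url = "//" + url
    try:
        return urlsplit(url).hostname or ""
    except ValueError:
        # Raised for malformed values such as unbalanced IPv6 brackets.
        return ""
```

In a mapper, the returned string would be the input to the dictionary tree matching of step S2.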
Step S2 specifically includes steps S21 to S23:
S21, performing a matching query with the dictionary tree, starting from the last character of the host character string;
S22, after the mark of the host character string is identified according to a preset rule matching configuration file, acquiring a target number from the host character string;
S23, sending the host character string, according to the target number, to the corresponding matching number list for polling, to obtain the application information corresponding to the host character string.
In this embodiment, the url features of each application, including the host feature, the uri path feature, the request parameter feature and the user-agent feature, are analyzed in advance, and a rule matching configuration file is generated and uploaded to the hadoop environment. Log records are read by running MapReduce and matched against the rule matching configuration file. For the character string of the host part, this embodiment adopts a dictionary tree in which the characters of each rule are inserted starting from the end of the rule; for example, a host rule such as "*.sport.qq.com" is inserted from its last character backward. This is because, for most application rules, the front of the host part of the url is fuzzy while the rear part is definite. For example, in an application rule "*.sport.qq.com", "*" represents the uncertain front part and ".sport.qq.com" is the definite part; once the definite part of the host has been matched, the rule is considered matched.
Specifically, arrays are chosen so that the information corresponding to a character can be obtained quickly during matching: the range of characters that can appear in a host is small, and ASCII character codes are consecutive numbers, so a character maps directly to an array subscript. The data structure of a tree node includes: the character, a number list, and a child node array. The characters considered are the 94 characters with decimal ASCII codes 33 to 126, so a node that has any children holds a child array of exactly 94 slots, and the number of slots in each layer is a multiple of 94.
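The node layout described above can be sketched as follows. The class and function names (`TrieNode`, `slot_of`, `ensure_child`) are hypothetical, and allocating the 94-slot array only when the first child is added is one reading of "if there are children, 94 children must exist":

```python
from typing import List, Optional

FIRST_CODE = 33                      # lowest ASCII code considered ('!')
LAST_CODE = 126                      # highest ASCII code considered ('~')
SLOTS = LAST_CODE - FIRST_CODE + 1   # 94 slots per child array

def slot_of(ch: str) -> int:
    """Map a character to its array subscript: ASCII code minus 33."""
    code = ord(ch)
    if not FIRST_CODE <= code <= LAST_CODE:
        raise ValueError(f"character {ch!r} is outside ASCII 33..126")
    return code - FIRST_CODE

class TrieNode:
    def __init__(self, char: str = "") -> None:
        self.char = char                      # character held by this node
        self.app_numbers: List[int] = []      # number list for rules ending here
        self.children: Optional[List[Optional["TrieNode"]]] = None

    def child(self, ch: str) -> Optional["TrieNode"]:
        """Constant-time lookup of the child for ch, or None if absent."""
        if self.children is None:
            return None
        return self.children[slot_of(ch)]

    def ensure_child(self, ch: str) -> "TrieNode":
        """Return the child for ch, allocating the 94-slot array if needed."""
        if self.children is None:
            self.children = [None] * SLOTS
        i = slot_of(ch)
        if self.children[i] is None:
            self.children[i] = TrieNode(ch)
        return self.children[i]
```

The array trades memory for speed: each lookup is a single subscript operation with no hashing or comparison of characters.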
Specifically, during matching, the host part is first extracted from the url recorded in the log. Then, starting from the tail of the extracted character string, the dictionary tree is traversed in the manner of looking a word up in a dictionary. Whenever the node object for "*" among the sibling nodes or next-layer nodes of the character just polled is not empty, the corresponding application is considered matched, and its application number is added to the set of matched application numbers; polling continues in this way until the head of the string is reached. Since this embodiment mainly speeds up the host matching part, once host matching has produced the set of matched application numbers, further screening can be performed according to the other features (such as the uri path feature, request parameter feature and user-agent feature), which can be implemented with existing technical means.
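A minimal sketch of this tail-first matching follows, under stated assumptions: the names `Node`, `insert` and `match` are hypothetical, application numbers are stored on the node where a reversed rule ends (the "*" node for wildcard rules), rules are assumed to contain only characters in ASCII 33..126, and the walk stops early once no deeper node exists instead of polling all the way to the head:

```python
OFFSET, WIDTH = 33, 94           # printable ASCII 33..126
STAR = ord("*") - OFFSET         # array slot of the wildcard character

class Node:
    __slots__ = ("app_numbers", "children")

    def __init__(self):
        self.app_numbers = []    # applications whose rule ends at this node
        self.children = None     # 94-slot list, allocated on first insert

def insert(root, rule, app_number):
    """Insert a rule such as "*.sport.qq.com" starting from its last character."""
    node = root
    for ch in reversed(rule):
        if node.children is None:
            node.children = [None] * WIDTH
        i = ord(ch) - OFFSET
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.app_numbers.append(app_number)

def match(root, host):
    """Walk the host from its tail and collect matched application numbers."""
    matched = set()
    node = root
    for ch in reversed(host):
        i = ord(ch) - OFFSET
        if node.children is None or not 0 <= i < WIDTH or node.children[i] is None:
            return matched       # no rule continues with this character
        node = node.children[i]
        star = node.children[STAR] if node.children else None
        if star is not None:     # a "*" node in the next layer: rule matched
            matched.update(star.app_numbers)
    matched.update(node.app_numbers)   # rule without "*" covering the whole host
    return matched
```

For example, after `insert(root, "*.sport.qq.com", 7)` on a fresh `Node()`, the host "live.sport.qq.com" yields the set containing 7, while "sport.qq.com" yields an empty set because no "." precedes it. The number 7 is an arbitrary illustrative application number.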
By constructing a dictionary tree queried in reverse order, this embodiment avoids rule-by-rule regular-expression matching altogether, and with it the backtracking problem of regular-expression matching; the matching speed for a single record is at least doubled. Suppose there are 1000 rules. With rule-by-rule regular-expression matching, the matching time for each record is unstable and the worst case cannot be estimated. With the method of this embodiment, since hosts are conventionally not long (most are under 50 characters), the worst case polls about 50 characters per record, and that worst case arises only when the rules provide a matching character at every layer. The method of this embodiment therefore greatly improves the matching speed, reduces the load on the system, correspondingly improves the service processing speed, and reduces the consumption of computing resources.
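The comparison can be written as a rough cost bound. The symbols are notation introduced here rather than in the source: $R$ is the number of rules, $c_r(h)$ the cost of evaluating rule $r$ against host $h$, and $|h|$ the length of the host.

```latex
T_{\text{regex}}(h) = \sum_{r=1}^{R} c_r(h),
\qquad
T_{\text{trie}}(h) \le |h| \ \text{node steps}
```

With $R = 1000$ and $|h| \le 50$, the dictionary tree performs at most 50 node steps per record regardless of $R$, whereas the regular-expression total grows with $R$ and each $c_r(h)$ depends on how much the pattern backtracks.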
DETAILED DESCRIPTION OF EMBODIMENT (S) OF INVENTION
The embodiment provides a method for analyzing a network access log, which comprises the following steps:
Step 1: the following matching rules are preset: "*.snssdk.com", "*.uczd.cn" and "*.news.qq.com", and a rule matching configuration file is obtained;
Step 2: part of the dictionary tree data generated from the rule matching configuration file is as follows:
The dictionary tree above omits the effect of loading other rules. In the omitted part, if a rule exists for the corresponding character, the object at the corresponding array position is not empty; otherwise it is empty. For example, if no rule in the first layer corresponds to "a", the object at the array position for "a" is empty. The array subscript for a character is obtained by subtracting 33 from the character's decimal ASCII value, 33 being the decimal ASCII value of the first character in the range under consideration.
Step 3: suppose the host of the url in the analyzed log begins with "api." and ends with "snssdk.com". Polling starts from the tail of the host. After the characters of "snssdk.com" and the preceding "." have been matched in sequence, the "*" object in the next node is not empty, so the application number corresponding to the rule "*.snssdk.com" is obtained. Polling then continues until "a", where no other rule matches, and the application number for this host is finally obtained.
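The three steps can be traced with a compact sketch. It uses nested dicts instead of the 94-slot arrays purely for brevity; the application numbers 101 to 103, the sample host "api.snssdk.com" and the names `build` and `match` are illustrative assumptions, and the walk stops as soon as no deeper node exists:

```python
# Rules from the first step; the application numbers are hypothetical.
RULES = {"*.snssdk.com": 101, "*.uczd.cn": 102, "*.news.qq.com": 103}

def build(rules):
    """Insert every rule into the tree from its last character backward."""
    root = {}
    for rule, app in rules.items():
        node = root
        for ch in reversed(rule):
            node = node.setdefault(ch, {})
        # "apps" is a multi-character key, so it cannot clash with a character.
        node["apps"] = node.get("apps", []) + [app]
    return root

def match(root, host):
    """Return (matched suffix, application number) pairs, walking from the tail."""
    hits, node = [], root
    for consumed, ch in enumerate(reversed(host), start=1):
        node = node.get(ch)
        if node is None:
            break                      # no rule continues past this character
        star = node.get("*")
        if star is not None:           # "*" object in the next layer is not empty
            hits.extend((host[-consumed:], app) for app in star.get("apps", []))
    return hits

TRIE = build(RULES)
```

Calling `match(TRIE, "api.snssdk.com")` consumes "m", "o", "c", ".", "k", "d", "s", "s", "n", "s" and "." in turn, finds the "*" object after the final ".", and returns `[(".snssdk.com", 101)]`.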
As shown in FIG. 2, this embodiment further provides a system for analyzing a network access log, including:
a feature acquisition module, configured to acquire url information in a log record and acquire a host character string according to the url information;
a matching query module, configured to perform polling matching on the host character string by using a dictionary tree queried in reverse order, and acquire corresponding application information according to the matching result.
Further, as a preferred embodiment, the feature acquisition module is specifically configured to read the url information in the log record by running MapReduce, and acquire the host character string according to the url information.
Further, as a preferred embodiment, the matching query module includes:
a matching query unit, configured to perform a matching query with the dictionary tree, starting from the last character of the host character string;
a mark identification unit, configured to acquire a target number from the host character string after the mark of the host character string is identified according to a preset rule matching configuration file;
a number polling unit, configured to send the host character string, according to the target number, to the corresponding matching number list for polling, to obtain the application information corresponding to the host character string.
The system for analyzing the network access log according to the embodiment of the invention can execute the method for analyzing the network access log provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The present embodiment further provides an apparatus for analyzing a network access log, including:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The device for analyzing the network access log according to the embodiment of the invention can execute the method for analyzing the network access log provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The present embodiments also provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method as described above.
The storage medium of this embodiment may execute the method for analyzing a network access log provided in the method embodiment of the present invention, may execute any combination of the implementation steps of the method embodiment, and has corresponding functions and advantageous effects of the method.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A method of analyzing a network access log, comprising the steps of:
acquiring url information in a log record, and acquiring a host character string according to the url information;
performing polling matching on the host character string by using a dictionary tree queried in reverse order, and acquiring corresponding application information according to the matching result.
2. The method for analyzing a network access log according to claim 1, wherein the step of acquiring url information in a log record and acquiring a host character string according to the url information specifically comprises:
reading the url information in the log record by running MapReduce, and acquiring the host character string according to the url information.
3. The method for analyzing a network access log according to claim 1, wherein the step of performing polling matching on the host character string by using a dictionary tree queried in reverse order and acquiring corresponding application information according to the matching result specifically comprises the following steps:
performing a matching query with the dictionary tree, starting from the last character of the host character string;
after the mark of the host character string is identified according to a preset rule matching configuration file, acquiring a target number from the host character string;
sending the host character string, according to the target number, to the corresponding matching number list for polling, to obtain the application information corresponding to the host character string.
4. The method for analyzing a network access log according to claim 3, wherein the data structure of the dictionary tree comprises an array of child nodes.
5. The method of analyzing a network access log of claim 1, wherein the url information further includes uri path feature, request parameter feature and user-agent feature.
6. A system for analyzing a network access log, comprising:
a feature acquisition module, configured to acquire url information in a log record and acquire a host character string according to the url information;
a matching query module, configured to perform polling matching on the host character string by using a dictionary tree queried in reverse order, and acquire corresponding application information according to the matching result.
7. The system for analyzing a network access log according to claim 6, wherein the feature acquisition module is specifically configured to read the url information in the log record by running MapReduce, and acquire the host character string according to the url information.
8. The system for analyzing a network access log according to claim 6, wherein the matching query module comprises:
a matching query unit, configured to perform a matching query with the dictionary tree, starting from the last character of the host character string;
a mark identification unit, configured to acquire a target number from the host character string after the mark of the host character string is identified according to a preset rule matching configuration file;
a number polling unit, configured to send the host character string, according to the target number, to the corresponding matching number list for polling, to obtain the application information corresponding to the host character string.
9. An apparatus for analyzing a network access log, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of analyzing a network access log as recited in any one of claims 1-5.
10. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911009697.8A CN110929107A (en) | 2019-10-23 | 2019-10-23 | Method, system, device and storage medium for analyzing network access log |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911009697.8A CN110929107A (en) | 2019-10-23 | 2019-10-23 | Method, system, device and storage medium for analyzing network access log |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110929107A true CN110929107A (en) | 2020-03-27 |
Family
ID=69849170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911009697.8A Pending CN110929107A (en) | 2019-10-23 | 2019-10-23 | Method, system, device and storage medium for analyzing network access log |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110929107A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102316099A (en) * | 2011-07-28 | 2012-01-11 | 中国科学院计算机网络信息中心 | Network fishing detection method and apparatus thereof |
CN108549679A (en) * | 2018-04-03 | 2018-09-18 | 国家计算机网络与信息安全管理中心 | File extension fast matching method and device for URL analysis systems |
CN110222238A (en) * | 2019-04-30 | 2019-09-10 | 上海交通大学 | The querying method and system of character string and identifier biaxial stress structure |
Non-Patent Citations (1)
Title |
---|
Zhang Cheng et al.: "Fuzzy search scheme for keyword ciphertext based on a secure dictionary tree" * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115878924A (en) * | 2021-09-27 | 2023-03-31 | 小沃科技有限公司 | Data processing method, device, medium and electronic equipment based on double dictionary trees |
CN115878924B (en) * | 2021-09-27 | 2024-03-12 | 小沃科技有限公司 | Data processing method, device, medium and electronic equipment based on double dictionary trees |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200327 |