CN117932175A - Data analysis method, device and storage medium - Google Patents


Info

Publication number
CN117932175A
Authority
CN
China
Prior art keywords
data
analysis
analyzed
rule
breadth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410303383.3A
Other languages
Chinese (zh)
Inventor
蔡中兴
黄有福
廖艺咪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Panyu Polytechnic
Original Assignee
Guangzhou Panyu Polytechnic
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Panyu Polytechnic
Priority to CN202410303383.3A
Publication of CN117932175A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data analysis method, a device and a storage medium. The data analysis method comprises the following steps: S101: acquiring data to be analyzed, and identifying the data to be analyzed based on a configuration file corresponding to the data to be analyzed; S102: performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result to obtain product information; S103: performing depth analysis by using a depth rule base based on the product information, and storing the depth analysis result. Because the data to be analyzed is identified by means of the configuration file, data with different characteristics can be identified by adding or modifying configuration files without modifying the analysis engine. This makes it easier for the analysis engine to analyze the data using the rule bases, improves the efficiency of data analysis, and reduces the difficulty and cost of analysis work.

Description

Data analysis method, device and storage medium
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a data analysis method, a device, and a storage medium.
Background
With the development of internet of things and artificial intelligence, more types of smart devices access the internet, and these smart devices or applications on the smart devices generate a large amount of electronic data (e.g., text, graphics, images, voice (audio), video, etc. data). The electronic data contains a great deal of information useful for human development and technical development, and can promote improvement and progress of production and life.
However, these data differ from the data structures or formats used in conventional databases, so they cannot be parsed directly with the parsing tools associated with conventional databases, and a corresponding parsing engine needs to be set up to parse them. A parsing engine can only identify data with specific characteristics, yet the characteristics of electronic data often change with factors such as the communication protocol, the equipment and the application. The parsing engine therefore has to be adjusted for each set of data characteristics before it can identify the data to be analyzed, which greatly increases the difficulty and cost of data analysis work.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a data analysis method, equipment and a storage medium, which solve the problems that an analysis engine is required to be correspondingly adjusted according to data characteristics for analyzing electronic data at present, the analysis work difficulty is high and the analysis cost is high.
In order to solve the above problems, the invention adopts the following technical scheme: a data parsing method, comprising: S101: acquiring data to be analyzed, and identifying the data to be analyzed based on a configuration file corresponding to the data to be analyzed; S102: performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result to obtain product information, wherein the breadth analysis comprises at least one of URL analysis, host analysis, ip analysis, application size class analysis and ua product analysis; S103: performing depth analysis by using a depth rule base based on the product information, and storing the depth analysis result.
Further, the identifying the data to be parsed based on the configuration file corresponding to the data to be parsed includes: determining configuration files according to the protocol information of the data to be analyzed, wherein the configuration files corresponding to different protocol information are different; and identifying the file name and the data content of the data to be analyzed according to the configuration file, wherein the file name is determined based on at least one of a data interface, a data version and a protocol field subscript value corresponding to the data to be analyzed, and the data content comprises at least one of URL data, host data, ip data, application data and ua product data.
Further, the content includes URL data, and the performing, according to the identification result, the breadth resolution on the data to be resolved by using a breadth rule base includes: keyword matching is carried out on the URL data of the data to be analyzed and the URL rules in the breadth rule base; and determining an analysis rule corresponding to the URL data in the URL rules according to the keyword matching result, and acquiring an analysis result of the URL data by utilizing the analysis rule, wherein the analysis result comprises the product information of the data to be analyzed.
Further, the content includes host data, and the performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result includes: and extracting host data in the data to be analyzed, and acquiring domain name information by using a domain name rule table and a domain name suffix rule table in the breadth rule base based on a matching result.
Further, the performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result includes: identifying first data in the data to be analyzed, and determining a rule table corresponding to the first data in the breadth rule base, wherein the first data comprises any one of ip data and application size data; and generating a matching formula corresponding to the rule table according to the first data, and acquiring an analysis result corresponding to the first data from the rule table through the matching formula.
Further, the performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result includes: extracting ua data in the data to be analyzed according to the identification result, and removing noise in the ua data by using a ua noise rule table in the breadth rule table to generate first ua data; and carrying out keyword matching on the first ua data by adopting a preset algorithm and a ua keyword rule table in the breadth rule table, and obtaining an analysis result according to the matched keywords.
Further, the performing depth analysis by using a depth rule base based on the product information includes: and acquiring depth rules corresponding to the data to be analyzed in the depth rule base based on the product information, and carrying out depth analysis on the data to be analyzed through the depth rules, wherein the depth analysis comprises at least one of content analysis and action analysis.
Further, the performing the depth parsing on the data to be parsed by the depth rule includes: if the content analysis is determined, extracting keywords in the data to be analyzed by using the regular expression corresponding to the depth rule, and generating an analysis result based on the keywords; and if the action analysis is determined, acquiring a rule matched with the data to be analyzed in the depth rule table, extracting keywords in the data to be analyzed by using the rule, and generating an analysis result based on the keywords.
Based on the same inventive concept, the invention also proposes an electronic device comprising a processor, a memory storing a computer program for use by the processor for implementing the data parsing method as described above.
Based on the same inventive concept, the present invention also proposes a computer-readable storage medium storing program data used to perform the data parsing method as described above.
The beneficial effects of the application are as follows: a data analysis method is provided in which the data to be analyzed is acquired and identified based on the configuration file corresponding to the data to be analyzed, breadth analysis is performed on the data to be analyzed by using the breadth rule base according to the identification result to obtain the product information, depth analysis is performed by using the depth rule base according to the product information, and the depth analysis result is stored.
Drawings
FIG. 1 is a flow chart of an embodiment of a data parsing method according to the present invention;
FIG. 2 is a flowchart of another embodiment of a data parsing method according to the present invention;
Fig. 3 is a block diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present application with reference to specific examples. The application may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit of the present application. It is noted that the various embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be combined with one another without conflict, and the structural components or functional modules therein may be arranged and designed in a variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments that a person of ordinary skill in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
Referring to fig. 1-2, fig. 1 is a flowchart illustrating an embodiment of a data parsing method according to the present invention; FIG. 2 is a flowchart of a data parsing method according to another embodiment of the present invention. The data analysis method of the present invention will be described in detail with reference to fig. 1 to 2.
In this embodiment, the data parsing method includes:
S101: Acquiring the data to be analyzed, and identifying the data to be analyzed based on the configuration file corresponding to the data to be analyzed.
Optionally, the data to be analyzed may be internet access data of a target object, and the data analysis method may be executed by a big data platform. The big data platform may be connected to a gateway used by the target object, and collects and analyzes the data to be analyzed through the gateway. To increase data processing speed, the big data platform may use Flink (a stream computing framework) to execute the data analysis method and achieve rapid analysis of data. Flink uses a stream processing engine to process real-time data streams. It supports an event-driven processing model and can process data in the order in which the data to be analyzed arrives, so as to ensure the consistency and accuracy of the data. In addition to real-time stream processing, Flink also provides powerful batch processing capabilities: it supports converting batch data into data streams and uses the stream processing engine for efficient parallel processing. If stream processing encounters problems and cannot run normally, the batch processing mode can be used as a temporary replacement.
In one embodiment, the data analysis method is executed by a big data platform that collects data on the internet access of devices used by the target object, including but not limited to data of various protocols such as http, https and dns. When collecting data, the big data platform may obtain processing timeliness information for the data (such as the application scenario and source information of the data). If it is determined that the processing timeliness information does not meet the timeliness requirement, the data may be recorded as batch data, which is analyzed at a preset time interval or once its volume reaches a preset data volume. If the processing timeliness information meets the timeliness requirement, a preset distributed data processing framework (such as the Flink framework) is called to distribute the data to the corresponding analysis engine, so that the data is processed rapidly and the effectiveness and timeliness of data analysis are improved.
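The routing decision in this embodiment can be sketched as follows. This is a minimal illustration only: the `Record` and `Router` names, the boolean `realtime` flag standing in for the timeliness check, and the in-memory lists standing in for the stream engine and the batch job are all assumptions, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    payload: str
    realtime: bool  # hypothetical flag: True if the timeliness requirement is met


@dataclass
class Router:
    batch_size: int = 3  # assumed "preset data volume"
    batch: List[Record] = field(default_factory=list)
    streamed: List[str] = field(default_factory=list)       # stand-in for stream dispatch
    flushed: List[List[str]] = field(default_factory=list)  # stand-in for batch jobs

    def submit(self, rec: Record) -> None:
        if rec.realtime:
            # Timely data goes straight to the (simulated) stream engine.
            self.streamed.append(rec.payload)
        else:
            # Other data is buffered and analyzed once enough has accumulated.
            self.batch.append(rec)
            if len(self.batch) >= self.batch_size:
                self.flush()

    def flush(self) -> None:
        # Also callable from a timer to model the "preset time interval" trigger.
        if self.batch:
            self.flushed.append([r.payload for r in self.batch])
            self.batch.clear()
```

A timer calling `flush()` would cover the interval-based trigger; only the volume-based trigger is wired in here.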
Optionally, identifying the data to be parsed based on the configuration file corresponding to the data to be parsed includes: determining configuration files according to protocol information of data to be analyzed, wherein the configuration files corresponding to different protocol information are different; and identifying the file name and the data content of the data to be analyzed according to the configuration file, wherein the file name is determined based on at least one of a data interface, a data version and a protocol field subscript value corresponding to the data to be analyzed, and the data content comprises at least one of URL data, host data, ip data, application data and ua product data.
Optionally, a configuration file may be generated according to the data protocol type of the data to be parsed, and the data protocol types corresponding to different configuration files may be recorded. After receiving the analysis task, acquiring a configuration file corresponding to the analysis task, and identifying various data in the data to be analyzed based on the configuration file.
In one embodiment, a data source (such as the data interface through which the data to be analyzed is acquired) and the data protocol used by the data generated by that source (such as a request-response protocol) are determined based on the analysis task, and the configuration file for identifying the data to be analyzed is determined according to the data protocol, so that the file name and the data content of the data to be analyzed are identified through the configuration file. Specifically, the file name of the data to be analyzed is a-field-4g-209-#18.properties; the file name is identified based on the configuration file, and the individual items of data in the file name are identified. For example, it is determined from the file name that the accessed data interface is 4g, the data version is 123, and the protocol field subscript value is 18.
Optionally, in order to identify data of different protocols, the configuration file further includes configuration items, where the configuration items are divided into four types, including configuration items for setting a protocol code, configuration items for setting a protocol name, configuration items for setting a required attribute index, and configuration items for setting a custom output index value. And the configuration items corresponding to different protocol data are different.
In one embodiment, for data to be analyzed that uses the HTTP protocol, the configuration items may be as follows: 18_103protocol.code=103 sets the protocol code; 18_103protocol.name=http sets the protocol name; 18_103input.output.num=0,1,1,1,2,3,4,5,6,7,8,9,10 sets the custom output index values; and 18_103input.num.[attribute name]=59 sets the subscript value of the data corresponding to the specified attribute. For example, 18_103input.num.url=59 means that the url attribute has a subscript value of 59 in the http protocol data.
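To illustrate how such configuration items could drive field identification without touching the engine, here is a small sketch. The exact key spellings (`18_103protocol.code`, `18_103input.num.url`) are assumptions reconstructed from this embodiment, and the function names are hypothetical.

```python
def parse_config(lines):
    """Parse 'key=value' lines into a dict, skipping blank lines and '#' comments."""
    items = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        items[key.strip()] = value.strip()
    return items


def attribute_indices(items, prefix):
    """Collect attribute -> subscript from keys shaped like '<prefix>input.num.<attr>'."""
    marker = prefix + "input.num."
    return {
        key[len(marker):]: int(value)
        for key, value in items.items()
        if key.startswith(marker)
    }


def extract(fields, indices, attr):
    """Pick the field of a split protocol record at the configured subscript."""
    return fields[indices[attr]]
```

Supporting a new protocol then amounts to supplying another set of `key=value` lines; the three functions stay unchanged.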
S102: Performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result, and obtaining product information.
Optionally, the breadth resolution includes at least one of URL resolution, host resolution, ip resolution, application size class resolution, and ua product resolution. Product information (such as product id of the data source generating the data to be parsed) representing the source of the data to be parsed is obtained through the parsing mode, so that the next parsing operation can be conveniently executed.
Optionally, the breadth parsing is to identify product information of the data to be parsed based on a breadth rule base (the breadth rule base stores various parsing rules by which the data to be parsed is parsed), thereby determining the provenance of the data to be parsed.
In one embodiment, the data to be analyzed is a user's access data, and breadth analysis identifies the websites, applications, applets, external devices, data interfaces and the like that the user accessed. Specifically, the data to be analyzed includes a domain name field: com; a url field: https://v.xx.com/x/cover/abcdef; and an ip port field: 99.99.99.99:22. Different fields in the data to be analyzed are matched against different rule bases in the breadth rule base to obtain the identification result; for example, the domain name field is matched through the domain name rule base.
Optionally, when the content of the data to be analyzed includes URL data and the analysis mode used is URL analysis, performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result, including: keyword matching is carried out on URL data of the data to be analyzed and URL rules in the breadth rule base; and determining an analysis rule corresponding to the URL data in the URL rule according to the keyword matching result, and acquiring an analysis result of the URL data by utilizing the analysis rule, wherein the analysis result comprises product information of the data to be analyzed.
Optionally, each URL rule is provided with a corresponding keyword, which can be used to characterize the category of the URL rule. And determining the URL rule corresponding to the data to be analyzed based on the category.
In one embodiment, the keywords corresponding to the URL rules are stored in an AC (Aho-Corasick) dictionary tree, and the AC algorithm matches the URL data against the keywords in the tree. If the keywords of only one URL rule match the URL data, that URL rule is determined as the analysis rule for the URL data. If several URL rules have keywords that match the URL data, the number of matching keywords is obtained for each of those URL rules, and the URL rule with the most matching keywords is determined as the analysis rule for the URL data.
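The selection of the rule with the most matched keywords can be sketched as below. For brevity the sketch uses plain substring search in place of an AC automaton (which would find all keywords in a single pass over the URL); the rule identifiers and the `match_url_rule` name are hypothetical.

```python
def match_url_rule(url, rules):
    """Return the id of the rule with the most keywords found in `url`, or None.

    `rules` maps a rule id to its list of keywords. Substring search is a
    simple stand-in for the AC dictionary tree described in the text.
    On a tie, the rule that appears first in `rules` wins (an assumption).
    """
    best_id, best_hits = None, 0
    for rule_id, keywords in rules.items():
        hits = sum(1 for kw in keywords if kw in url)
        if hits > best_hits:
            best_id, best_hits = rule_id, hits
    return best_id
```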
Optionally, to improve parsing efficiency, the URL rule includes priority information, and based on the priority information, the URL rule is divided into two parts of high priority and low priority. When keyword matching is carried out, the high-priority URL rule is firstly used for keyword matching with the data to be analyzed, and if the high-priority URL rule is not successfully matched, the low-priority URL rule is used for matching.
In one embodiment, the URL data to be analyzed is: http://szextshort.abc.dd.com/aaa/mmt
The keyword rules for the URL data comprise abc.dd.com, mmtls-1-prod 1; bbb-1-prod 2. During analysis, based on priority, the keywords are first spliced to obtain abc.dd.com, mmtls, bbb, and this spliced key is used to search for the analysis rule corresponding to the URL data. If no rule is found, the number of keywords is reduced, so that abc.dd.com, mmtls is used to search the URL rules, and so on, until the URL rule representing product id_1 is found, so that the returned result is product id_1. During keyword matching, if none of the spliced keyword combinations matches a URL rule, each keyword is matched individually to obtain the matching URL rule, and the product information corresponding to that URL rule is determined as the analysis result of the URL data. After the analysis result is obtained, depth analysis is performed on the URL data according to the analysis result.
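The priority ordering and the progressive shortening of the spliced key can be sketched as follows. The comma separator used for splicing, the dict-based rule tables and both function names are assumptions made for illustration.

```python
def lookup_by_splicing(keywords, rule_table):
    """Try the longest spliced key first, then shorter prefixes, then single keywords."""
    for n in range(len(keywords), 0, -1):
        key = ",".join(keywords[:n])  # assumed splice separator
        if key in rule_table:
            return rule_table[key]
    # No spliced combination matched: fall back to matching each keyword alone.
    for kw in keywords:
        if kw in rule_table:
            return rule_table[kw]
    return None


def lookup_with_priority(keywords, high_priority, low_priority):
    """Consult the high-priority rules first and the low-priority rules only on a miss."""
    result = lookup_by_splicing(keywords, high_priority)
    if result is not None:
        return result
    return lookup_by_splicing(keywords, low_priority)
```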
Alternatively, if the content of the URL data includes host data, host analysis may also be performed by the host data. Performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result, wherein the method comprises the following steps: and extracting host data in the data to be analyzed, and acquiring domain name information by using a domain name rule table and a domain name suffix rule table in the breadth rule base based on the matching result.
Optionally, the domain name rule table is stored as a HashMap (a hash-table-based implementation of the Map interface) keyed by domain name, and the domain name suffix rule table stores domain name suffixes in a HashSet.
Optionally, in the matching process, the domain name rule table is used first, and the domain name suffix rule table is used only if that match is unsuccessful. Specifically, during host analysis, the host data is matched against the domain name rule table, and if the match succeeds, the matching domain name in the domain name rule table is determined as the domain name corresponding to the host data. If the match is unsuccessful, the host data is truncated using a recursive algorithm, and the truncated data is repeatedly matched against the domain name suffix rule table until a domain name suffix in that table is matched.
In one embodiment, the host data is www.baidu.com, the domain name rule table contains www.aa.com-prod1, and the domain name suffix rule table contains bb.com-prod2. During analysis, the domain name rule www.aa.com is used first, and a match yields the analysis result prod1. If no result is obtained, the host data is recursively truncated and matched against the domain name suffix rule table, and a match yields prod2.
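A minimal sketch of the two-stage host lookup follows. It assumes the suffix table maps each suffix to a product (the text describes a HashSet of suffixes, so the mapping is an illustrative simplification), and it truncates by dropping the leftmost label on each recursive call; the function names are hypothetical.

```python
def resolve_host(host, domain_rules, suffix_rules):
    """Exact domain match first; otherwise recursively truncate and match suffixes."""
    if host in domain_rules:
        return domain_rules[host]
    return _match_suffix(host, suffix_rules)


def _match_suffix(host, suffix_rules):
    if host in suffix_rules:
        return suffix_rules[host]
    # Drop the leftmost label ("a.b.com" -> "b.com") and try again.
    _, dot, rest = host.partition(".")
    if not dot or not rest:
        return None  # nothing left to truncate
    return _match_suffix(rest, suffix_rules)
```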
Optionally, when the data to be analyzed includes ip data or application size data, performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result, including: identifying first data in the data to be analyzed, and determining a rule table corresponding to the first data in the breadth rule base, wherein the first data comprises any one of ip data and application size data; and generating a matching formula corresponding to the rule table according to the first data, and acquiring an analysis result corresponding to the first data from the rule table through the matching formula.
Optionally, the rule table corresponding to the ip data is an ip rule table, which stores its rules in a HashMap keyed by ip and port. When the ip data is analyzed, the server ip (server_ip) and server port (server_port) of the ip data are obtained and matched against the keys in the ip rule table, and the product information of the ip rule corresponding to the matched key is used as the analysis result of the ip data.
In one embodiment, the server_ip and server_port data are 99.99.99.99 and 22, and the rule data in the ip rule table is 99.99.99.99,22-1400038-1-087F828DBC901EC86BA506F77B2CCB7. The key 99.99.99.99:22, assembled from server_ip + server_port, matches the rule successfully, and the result 1400038 is returned.
Optionally, the rule table corresponding to the application size class data (the application's major class and subclass) is an application size class rule table, which stores its rules in a HashMap keyed by "application major class_application subclass". When the application size class data is analyzed, the App Type (short for application type, a parameter that designates the type of the application) and the App Sub-Type in the application size class data are spliced, the spliced data is matched against the rule table, and the analysis result is obtained from the matched rule.
In one embodiment, the App Type and App Sub-Type values in the application size class data are 20 and 7, and the rule table data used for matching is 20-7-1004283-1000. During analysis, the key 20_7 spliced from App Type and App Sub-Type is matched against the rule; the key stored in the HashMap for the above rule is 20_7, so the match succeeds and the result 1004283 is returned.
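Both the ip lookup and the application size class lookup follow the same pattern: build a key from the record (the "matching formula") and look it up in the corresponding hash table. A sketch under that reading, with hypothetical field and function names and the sample values from these embodiments:

```python
def ip_key(record):
    """Matching key for the ip rule table: server_ip + ':' + server_port."""
    return f"{record['server_ip']}:{record['server_port']}"


def app_key(record):
    """Matching key for the application size class table: App Type + '_' + App Sub-Type."""
    return f"{record['app_type']}_{record['app_sub_type']}"


# Rule tables reduced to key -> product id for illustration.
IP_RULES = {"99.99.99.99:22": "1400038"}
APP_RULES = {"20_7": "1004283"}


def lookup(record, key_builder, table):
    """Build the key for this kind of data and fetch the result, or None on a miss."""
    return table.get(key_builder(record))
```

Because each lookup is a single hash probe, adding a rule only means adding a table entry.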
Optionally, when the data to be analyzed includes ua data, performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result, including: extracting ua data in the data to be analyzed according to the identification result, and removing noise in the ua data by using an ua noise rule table in a breadth rule table to generate first ua data; and carrying out keyword matching on the first ua data by adopting a preset algorithm and a ua keyword rule table in a breadth rule table, and obtaining an analysis result according to the matched keywords. The noise in the ua data is the noise ua.
Optionally, the ua keyword rule table stores the keywords associated with ua data in an AC dictionary tree, and the ua noise rule table stores noise ua entries in a HashSet. During analysis, it is first determined whether the ua data is noise or includes noise. If the ua data is itself noise, analysis of the ua data stops; if the ua data includes noise, the noise is removed. After noise is removed, keywords in the ua data are extracted, and the extracted keywords are matched against the keywords in the ua keyword rule table using the AC algorithm. If several keywords in the ua data match successfully, the order in which the matched keywords appear in the ua data is obtained, and the keyword that appears last is taken as the analysis result.
In one embodiment, the ua data is Mozilla/5.0 (Linux; Android 11; ABR-AL60 Build/HUAWEIABR-AL60; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/86.0.4240.99 XWEB/4317 MMWEBSDK/20220903 Mobile Safari/537.36 MMWEBID/2049 MicroMessenger/8.0.28.2240(0x28001CE9) WeChat/arm64 Weix, and the rule table data used for matching is 1000962-WeChat, C65232-Chrome. The keywords of the ua data are matched against the rule table data to obtain the product id corresponding to the ua data. Here WeChat is the last keyword matched, so the product id is determined from the WeChat keyword.
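The noise handling and the "last matched keyword wins" rule can be sketched as follows. Substring search stands in for the AC algorithm, a single `noise` set is used both for whole-string noise and for noise fragments, and the `parse_ua` name is hypothetical; these are simplifying assumptions.

```python
def parse_ua(ua, noise, keyword_to_product):
    """Return the product id of the keyword that appears last in `ua`, or None."""
    if ua in noise:
        return None  # the whole ua string is noise: stop analysis
    for fragment in noise:
        ua = ua.replace(fragment, "")  # strip noise fragments contained in the ua
    last_pos, product = -1, None
    for keyword, product_id in keyword_to_product.items():
        pos = ua.rfind(keyword)  # position of the keyword's last occurrence
        if pos > last_pos:
            last_pos, product = pos, product_id
    return product
```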
S103: Performing depth analysis by using a depth rule base based on the product information, and storing the depth analysis result.
Optionally, depth analysis is performed after the breadth analysis of the data to be analyzed is determined to be successful. Depth analysis identifies the specific content corresponding to the data to be analyzed, for example determining from the data which operations a user performed on, or which content the user accessed from, a merchant or specific product in an APP, or a streamer or live broadcast in an APP, such as which pictures or music were searched for and which advertisements or links were clicked.
Optionally, breadth analysis is considered successful when the product id corresponding to the data to be analyzed is obtained through breadth analysis, where the product id may be obtained through ua analysis, or through URL analysis and refer_url analysis.
Optionally, the depth rule table is cached in a HashMultimap, in which one key of the depth rule table corresponds to a plurality of depth rules. The keys fall into three types when stored: 1. MatchType (match type) is host: host is used as the key; 2. MatchType is prod: rootHost + prodId is used as the key; 3. MatchType is ip: prodId is used as the key. The word segmentation rule table stores its keywords in an AC dictionary tree.
Optionally, when the depth rule is obtained through the product information, the depth rule base corresponding to the product information is obtained, and host and rootHost + prodId in the data to be analyzed are used as keys to match against the depth rule base. If nothing in the depth rule table is matched, prodId in the data to be analyzed is matched against the depth rule base to obtain the depth rule.
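The keyed lookup with fallback can be sketched with a dict of lists standing in for a HashMultimap. Plain string concatenation for rootHost + prodId, the rule dictionaries and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(rules):
    """Group depth rules by key so that one key can map to several rules."""
    index = defaultdict(list)
    for rule in rules:
        index[rule["key"]].append(rule)
    return index


def find_depth_rules(index, host, root_host, prod_id):
    """Try host, then rootHost + prodId, then prodId alone; return the first hit."""
    for key in (host, f"{root_host}{prod_id}", prod_id):
        if key in index:
            return index[key]
    return []
```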
In one embodiment, obtaining the depth rules corresponding to the data to be analyzed includes: 1. matching with host/root_host/url of the data to be analyzed as the key, and obtaining the matched depth rules; 2. matching with root_host/ua product id in the data to be analyzed as the key, obtaining the matched depth rules, and filtering out of them the depth rules already obtained in step 1; 3. matching with refer_host/refer_root_host/refer_prod_id in the data to be analyzed as the key against the keys in the depth rule base to obtain the depth rules. After the depth rules are matched, content analysis or action analysis is performed according to the matched depth rule.
Optionally, performing depth analysis on the data to be analyzed through the depth rules includes: if content analysis is determined, extracting keywords from the data to be analyzed using the regular expression corresponding to the depth rule, and generating an analysis result based on the keywords; if action analysis is determined, obtaining the rule in the depth rule table that matches the data to be analyzed, extracting keywords from the data to be analyzed using that rule, and generating an analysis result based on the keywords.
In one embodiment, for the data to be analyzed f.video.aa.com|http://f.video.weibocdn.com/u0/TSIwXYLDgx080KoYObQY010412002w9g0E010.mp4label=hevc_dash_hd&template=540x960.28.0&media_id=4835331149856846, the corresponding depth rule in the depth rule base is f.video.aa.com-host-123456-content-video & media_id=([0-9]{16}) 1-123456-G9000079024 }, in the product-microblog body. The host data of the data to be analyzed matches this depth rule, and the rule type of the depth rule is content, so the keyword 4835331149856846 is extracted with the regular expression and output to the kw field of the depth model. Content analysis is performed through the kw field. Some business scenarios also require the kwadd field, which is used in the following two cases.
1. When the regular expression matches multiple keywords, for example the four matched keywords 1, 2, 3 and 4, keyword 1 is output to the kw field, and keywords 2, 3 and 4 are concatenated with the separators between them removed, so that 234 is output to the kwadd field for content analysis.
2. When a search word is matched, the matched keyword is segmented using the word segmentation rule table after the match succeeds, and the resulting segmentation code is also added to the kwadd field. For example, if the matched keyword is bidi car and the word segmentation lexicon contains a rule CA00 that matches bidi car, CA00 is placed in the kwadd field.
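The content-analysis step and the two uses of the additional field described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `parse_content` is hypothetical, the additional field is returned as a list because the text does not say how the two cases are combined, and a plain substring lookup stands in for the AC dictionary tree used by the word segmentation rule table. The example URL is a shortened, hypothetical variant of the one in the text.

```python
import re


def parse_content(url, pattern, segment_codes=None):
    """Extract keywords from url; return the primary and additional fields."""
    match = re.search(pattern, url)
    if match is None:
        return None
    groups = [g for g in match.groups() if g is not None]
    if not groups:
        return None
    kw = groups[0]  # the first keyword goes to the primary field
    kwadd = []
    if len(groups) > 1:
        # Case 1: remaining keywords are joined without separators.
        kwadd.append("".join(groups[1:]))
    # Case 2: codes of segmentation rules that occur in the keyword.
    for word, code in (segment_codes or {}).items():
        if word in kw:
            kwadd.append(code)
    return {"kw": kw, "kwadd": kwadd}


video = parse_content(
    "http://f.video.aa.com/u0/x.mp4?label=hevc_dash_hd&media_id=4835331149856846",
    r"media_id=([0-9]{16})",
)
multi = parse_content("a=1&b=2&c=3&d=4", r"a=(\d)&b=(\d)&c=(\d)&d=(\d)")
search = parse_content("q=bidi car", r"q=(.+)", {"bidi car": "CA00"})
```

Here `video` carries only the 16-digit media id, `multi` shows the "1 plus 234" split from case 1, and `search` shows a segmentation code being appended as in case 2.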
Optionally, during action analysis, the string rules in the depth rule table are matched in a loop. If a string rule is found whose strings are all contained in the url data of the data to be analyzed, that string rule is determined to be the depth rule matching the data to be analyzed.
In one embodiment, the data to be analyzed is extshort.abc.dd.com|http://extshort.abc.dd.com/mmtls/4e6ffe97, and the depth rule is extshort.bbb.cc.com-host to 1000962-action to send messages and voice to extshort.abc.com/mmtls/-D0037-3761-03. The host data in the data to be analyzed matches the above rule, and the rule type of the rule is action, so whether the analysis succeeds is determined by checking whether the url data contains the character string extshort.aaa.qq.com/mmtls.
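The containment check used in action analysis can be sketched as follows. The rule structure, the function name `match_action_rule`, and the sample values are assumptions for illustration; in particular, the host names in the sample are harmonized to a single hypothetical value so that the example is self-consistent.

```python
def match_action_rule(url, action_rules):
    """Return the first rule whose required strings all occur in url."""
    for rule in action_rules:
        strings = rule["strings"]
        # An empty string list never matches, to avoid accidental hits.
        if strings and all(s in url for s in strings):
            return rule
    return None


# Hypothetical action rules.
action_rules = [
    {"strings": ["extshort.abc.dd.com/other/"], "action": "unrelated action"},
    {"strings": ["extshort.abc.dd.com/mmtls/"], "action": "send messages and voice"},
]
hit = match_action_rule("http://extshort.abc.dd.com/mmtls/4e6ffe97", action_rules)
miss = match_action_rule("http://extshort.abc.dd.com/login", action_rules)
```

Because a rule may list several strings, `all(...)` expresses the requirement that the URL contain every one of them before the rule is accepted.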
Optionally, in one embodiment, action analysis can also be used to obtain location information related to the data to be analyzed (for example, obtaining the user's current location from the user's access data). If the data to be analyzed conforms to the http protocol and its URL data is not null, location information can be obtained. The rule table used to obtain location information is a location rule table, which is stored in a HashMultimap with host data as the key. During analysis, the host data of the data to be analyzed is matched against the location rules in the location rule table. If no location rule is matched, matching is retried using the rootHost of the data to be analyzed. The matching result includes at least one location rule. The matched rules are traversed, and the regular expression in each rule is applied to the URL data of the data to be analyzed to extract the corresponding keywords. Two keywords are matched, namely the longitude and the latitude.
In one embodiment, the data to be analyzed is api.abc.com|http://api.abc.com/locations/v1/cities/geoposition/search.jsonq=39.872224,116.332627&apikey=7f8c4da3ce9849ffb2134f075201c45a&language=zh, and the location rule used for matching is api.abc.com-q= (\d\d+), (\d+) -2-1 Gd-G0018-latitude and longitude-5924F5A50BF511EDA1F8506F77B2CCB7. During analysis, the domain name keyword api.abc.com of the data to be analyzed matches the corresponding location rule, the regular expression in the location rule yields a longitude of 116.332627 and a latitude of 39.872224, and the analysis result is then output.
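The location lookup with its host-then-rootHost fallback can be sketched as follows. Several details are assumptions: the function name `parse_location`, the rule dictionary layout, the regular expression `q=(\d+\.\d+),(\d+\.\d+)` (the expression printed in the rule above appears damaged, so this is only a plausible stand-in), and the convention that the first captured value is the latitude, as in the example query string. The sample URL is a shortened, hypothetical variant.

```python
import re


def parse_location(host, root_host, url, location_rules):
    """Return latitude and longitude extracted from url, or None."""
    if not url:  # location is only extracted when URL data is present
        return None
    # Try the full host first, then fall back to the root host.
    rules = location_rules.get(host) or location_rules.get(root_host) or []
    for rule in rules:
        match = re.search(rule["pattern"], url)
        if match and len(match.groups()) == 2:
            latitude, longitude = match.groups()
            return {"latitude": float(latitude), "longitude": float(longitude)}
    return None


# Hypothetical location rule table keyed by host.
location_rules = {
    "api.abc.com": [{"pattern": r"q=(\d+\.\d+),(\d+\.\d+)"}],
}
url = "http://api.abc.com/locations/v1/cities/geoposition/search.json?q=39.872224,116.332627&language=zh"
by_host = parse_location("api.abc.com", "abc.com", url, location_rules)
by_root = parse_location("x.api.abc.com", "api.abc.com", url, location_rules)
```

Requiring exactly two captured groups mirrors the statement that a location rule yields two keywords, one for each coordinate.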
The invention has the following advantages:
1. Flink technology effectively improves analysis efficiency and performance when analyzing massive data, reaching 9200 records/second/core, which reduces resource consumption and saves machine nodes and cost.
2. The amount of data that can be analyzed increases and the classification becomes finer: the number of APPs that can be analyzed is doubled, and the number of recognized content IDs, search words, and official-account (public number) actions increases by more than 10 times.
3. Based on distributed computing technology, Flink uses multiple concurrent tasks and reasonable resource allocation to guarantee timely processing. Analysis timeliness improves from daily updates and analysis in the past to real-time updates and analysis.
Beneficial effects: the data analysis method of the application acquires the data to be analyzed, identifies the data to be analyzed based on the configuration file corresponding to it, performs breadth analysis on the data using the breadth rule base according to the identification result to obtain product information, performs depth analysis on the data using the depth rule base according to the product information, and stores the depth analysis result.
In an alternative embodiment, an electronic device is provided. As shown in FIG. 3, the electronic device 4000 includes a processor 4001 and a memory 4003, where the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path for transferring information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bold line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
Memory 4003 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used to store a computer program for executing an embodiment of the present application, and execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
The electronic device may be any electronic product that can perform man-machine interaction with an object, for example a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an interactive Internet Protocol Television (IPTV), a smart wearable device, and the like.
The electronic device may also include a network device and/or an object device. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network in which the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of data parsing, the method comprising:
S101: acquiring data to be analyzed, and identifying the data to be analyzed based on a configuration file corresponding to the data to be analyzed;
S102: performing breadth analysis on the data to be analyzed by using a breadth rule base according to the identification result to obtain product information, wherein the breadth analysis comprises at least one of URL analysis, host analysis, ip analysis, application size analysis and ua product analysis;
S103: performing depth analysis using a depth rule base based on the product information, and storing a depth analysis result.
2. The data parsing method according to claim 1, wherein the identifying the data to be parsed based on the configuration file corresponding to the data to be parsed includes:
Determining configuration files according to the protocol information of the data to be analyzed, wherein the configuration files corresponding to different protocol information are different;
And identifying the file name and the data content of the data to be analyzed according to the configuration file, wherein the file name is determined based on at least one of a data interface, a data version and a protocol field subscript value corresponding to the data to be analyzed, and the data content comprises at least one of URL data, host data, ip data, application data and ua product data.
3. The data parsing method according to claim 2, wherein the content includes URL data, and the performing the breadth parsing on the data to be parsed using a breadth rule base according to the recognition result includes:
Keyword matching is carried out on the URL data of the data to be analyzed and the URL rules in the breadth rule base;
And determining an analysis rule corresponding to the URL data in the URL rules according to the keyword matching result, and acquiring an analysis result of the URL data by utilizing the analysis rule, wherein the analysis result comprises the product information of the data to be analyzed.
4. The data parsing method according to claim 2, wherein the content includes host data, and the performing the breadth parsing on the data to be parsed using a breadth rule base according to the recognition result includes:
And extracting host data in the data to be analyzed, and acquiring domain name information by using a domain name rule table and a domain name suffix rule table in the breadth rule base based on a matching result.
5. The data parsing method according to claim 2, wherein said performing the breadth parsing on the data to be parsed using the breadth rule base according to the recognition result includes:
Identifying first data in the data to be analyzed, and determining a rule table corresponding to the first data in the breadth rule base, wherein the first data comprises any one of ip data and application size data;
And generating a matching formula corresponding to the rule table according to the first data, and acquiring an analysis result corresponding to the first data from the rule table through the matching formula.
6. The data parsing method according to claim 2, wherein said performing the breadth parsing on the data to be parsed using the breadth rule base according to the recognition result includes:
Extracting ua data in the data to be analyzed according to the identification result, and removing noise in the ua data by using a ua noise rule table in a breadth rule table to generate first ua data;
and carrying out keyword matching on the first ua data by adopting a preset algorithm and a ua keyword rule table in the breadth rule table, and obtaining an analysis result according to the matched keywords.
7. The data parsing method according to claim 1, wherein said performing depth parsing using a depth rule base based on said product information comprises:
And acquiring depth rules corresponding to the data to be analyzed in the depth rule base based on the product information, and carrying out depth analysis on the data to be analyzed through the depth rules, wherein the depth analysis comprises at least one of content analysis and action analysis.
8. The data parsing method according to claim 7, wherein the performing the deep parsing on the data to be parsed by the depth rule includes:
If the content analysis is determined, extracting keywords in the data to be analyzed by using the regular expression corresponding to the depth rule, and generating an analysis result based on the keywords;
And if the action analysis is determined, acquiring a rule matched with the data to be analyzed in the depth rule table, extracting keywords in the data to be analyzed by using the rule, and generating an analysis result based on the keywords.
9. An electronic device comprising a processor, a memory storing a computer program for use by the processor in implementing the data parsing method according to any of claims 1-8.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program data for performing the data parsing method according to any one of claims 1-8.
CN202410303383.3A 2024-03-18 2024-03-18 Data analysis method, device and storage medium Pending CN117932175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410303383.3A CN117932175A (en) 2024-03-18 2024-03-18 Data analysis method, device and storage medium

Publications (1)

Publication Number Publication Date
CN117932175A true CN117932175A (en) 2024-04-26

Family

ID=90757758

Country Status (1)

Country Link
CN (1) CN117932175A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination