CN106599160B - Content rule library management system and coding method thereof - Google Patents

Publication number
CN106599160B
Authority
CN
China
Prior art keywords
content
webpage
rule
module
rules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201611121969.XA
Other languages
Chinese (zh)
Other versions
CN106599160A (en
Inventor
胡庆勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netcommander Technology Beijing Co ltd
Original Assignee
Netcommander Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netcommander Technology Beijing Co ltd
Priority to CN201611121969.XA
Publication of CN106599160A
Application granted
Publication of CN106599160B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data

Abstract

The invention relates to a dictionary coding method for a content rule base, in which the dictionary of the content rule base is set as a 20-digit dictionary coding system. The invention also discloses a content rule base management system comprising: a content rule base visualization management module, a URL data preprocessing and classification module, a key application APP/website tracking module, a webpage restoration module, a webpage crawler module, a webpage content analysis module and a content rule base. The advantage of the invention is that it can analyze and translate, at large scale and from a full perspective, the behavior logs of users accessing the mobile internet, thereby forming a holographic knowledge graph of mobile internet users and supporting various subsequent content analysis applications.

Description

Content rule library management system and coding method thereof
Technical Field
The invention relates to the technical field of data processing, in particular to a content rule base management system and a coding method thereof.
Background
Telecom operators obtain the raw internet signaling data of their customers through optical splitting and, after first-stage DPI identification, output internet access logs synthesized as xDR records. In general, however, the data produced by this first-stage analysis is not fine-grained enough, the first stage cannot carry too many rules, and its analysis dimensions are inflexible. Enhanced DPI analysis is therefore required to further strengthen APP identification, webpage classification, keyword analysis, the knowledge base system and the like, so as to support various subsequent content analysis applications. As to how such extremely large and complicated data can be translated and labeled as information carrying deep semantic content, the prior art only states requirements on the data results to be produced, and has the following disadvantages with respect to how those results are to be achieved:
1. only relatively shallow content can be translated;
2. it relies almost entirely on manual labeling;
3. only a small amount of sample data can be labeled manually;
4. changes in the source data structure cannot be discovered quickly;
5. no complete solution, model or algorithm is provided for how such data results are to be achieved.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art by providing a content rule base management system and a coding method thereof, which analyze and translate, at large scale and from a full perspective, the behavior logs of users accessing the mobile internet, so as to form a holographic knowledge graph of mobile internet users.
In order to achieve the purpose, the invention discloses the following technical scheme:
a dictionary coding method for a content rule base sets the dictionary of the content rule base as a 20-digit dictionary coding system supporting a 5-level label system, wherein the first-level classification is the field and occupies 3 digits, the second-level classification is the industry and occupies 4 digits, the third-level classification is the application and occupies 5 digits, the fourth-level classification is the column and occupies 4 digits, and the fifth-level classification is the search content, metadata or extracted content type and occupies 4 digits;
the first digit of the fourth-level classification is an identifier and can only be 0 or 1, wherein 0 represents a column and 1 represents a behavior; the first digit of the fifth-level classification is 0 for search, 1 for metadata and 2 for extraction; for the metadata type, a code beginning with 13 denotes an ID; for the extraction type, the second digit is 0 for text, 1 for floating point, 2 for date and 3 for ID;
a code whose 20 digits are all 0 (00000000000000000000) represents an unknown application.
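The coding scheme above can be illustrated with a minimal Python sketch. It assumes that each position is a decimal digit and that the five levels are simply concatenated in order; the function names `encode` and `decode` and the dictionary keys are hypothetical and are not part of the patent text.

```python
# Digit widths of the five levels: 3 + 4 + 5 + 4 + 4 = 20 decimal digits.
LEVELS = (("field", 3), ("industry", 4), ("application", 5),
          ("column", 4), ("content", 4))
UNKNOWN = "0" * 20  # all zeros: unknown application

LEVEL4_KIND = {"0": "column", "1": "behavior"}
LEVEL5_KIND = {"0": "search", "1": "metadata", "2": "extraction"}
EXTRACT_TYPE = {"0": "text", "1": "float", "2": "date", "3": "id"}


def encode(field, industry, application, column, content):
    """Build the 20-digit code from five non-negative integers."""
    values = (field, industry, application, column, content)
    parts = []
    for (name, width), value in zip(LEVELS, values):
        if not 0 <= value < 10 ** width:
            raise ValueError(f"{name}={value} does not fit in {width} digits")
        parts.append(str(value).zfill(width))
    code = "".join(parts)
    if code[12] not in LEVEL4_KIND:  # first digit of level 4 may only be 0 or 1
        raise ValueError("level-4 code must start with 0 or 1")
    return code


def decode(code):
    """Split a 20-digit code into its levels and interpret the flag digits."""
    if len(code) != 20 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected exactly 20 decimal digits")
    result, pos = {}, 0
    for name, width in LEVELS:
        result[name] = code[pos:pos + width]
        pos += width
    content = result["content"]
    kind = LEVEL5_KIND.get(content[0])
    if kind == "extraction":
        value_type = EXTRACT_TYPE.get(content[1])
    elif kind == "metadata" and content[1] == "3":
        # The text only states that a metadata code beginning with 13 is an ID.
        value_type = "id"
    else:
        value_type = None
    result.update(unknown=(code == UNKNOWN),
                  level4_kind=LEVEL4_KIND.get(result["column"][0]),
                  level5_kind=kind, value_type=value_type)
    return result
```

For example, `encode(1, 2, 3, 1005, 2300)` yields a code whose fourth level starts with 1 (a behavior) and whose fifth level starts with 23 (an extracted ID); the numeric values themselves are made up for illustration.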
The invention also discloses a content rule base management system, which applies the above coding method and comprises:
a content rule base visualization management module, used for adding, deleting, querying and modifying entries of the rule base, for monitoring the state of each module, and for visually extracting content rules from sample data;
a URL data preprocessing and classification module, which extracts the URLs requiring deep content analysis from the users' internet access logs, imports them into a sample database, and cleans them into sample data for use by rule analysts, wherein the extracted content comprises application rules, column rules, search rules, metadata rules, noise rules and metadata;
a key application APP/website tracking module, wherein a key application is an application whose content metadata needs to be deeply extracted and is defined under content rules - application rules; the key application tracking module is started through task management - key applications, and its output is the various application URLs under sample data - known applications, which data analysts use to further extract various content rules;
a webpage restoration module, wherein the content obtained from deep URL analysis is organized into a number of restoration rules that are provided to the webpage restoration module; the module uses the webpage restoration rules defined in the metadata rules, and the restored webpages are provided to the webpage content analysis module;
a webpage crawler module, used for crawling the relevant webpage content from the Internet based on the data processed by the crawler URL generation module, for use by the subsequent webpage content analysis module;
a webpage content analysis module, used for establishing a content metadata rule base by extracting rules from webpage content and mapping the extracted content to content metadata;
a content rule base comprising the following rules: APP application rules, noise rules, APP column/behavior rules, APP search keyword extraction rules, APP content metadata rules and webpage restoration rules.
Furthermore, the preprocessing performed by the URL data preprocessing and classification module comprises checking the integrity of the URL, preprocessing the URL protocol, preprocessing the URL suffix, applying configurable user-defined preprocessing rules, and producing summary statistics of the processed data; data not identified by any rule is output as unknown data and entered into the rule base source data input table for analysis and for formulating rule classifications.
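The preprocessing steps just listed can be sketched roughly as follows. This is only an illustration under stated assumptions: the allowed schemes, the list of static-resource suffixes, the choice to prepend `http://` to scheme-less log entries, and the names `preprocess` and `summarize` are all hypothetical, and "known" is reduced to a simple host lookup.

```python
from collections import Counter
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}                       # assumed protocol rule
NOISE_SUFFIXES = (".js", ".css", ".png", ".jpg", ".gif")  # assumed suffix rule


def preprocess(url, known_hosts):
    """Return (bucket, cleaned_url) for one logged URL."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url          # assumption: logs may omit the scheme
    parts = urlsplit(url)
    if not parts.hostname:             # integrity check: a host is required
        return "incomplete", url
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return "protocol", url
    if parts.path.lower().endswith(NOISE_SUFFIXES):
        return "suffix", url
    bucket = "known" if parts.hostname in known_hosts else "unknown"
    return bucket, url


def summarize(urls, known_hosts):
    """Count URLs per bucket and collect the unknown ones for analysts."""
    stats, unknown = Counter(), []
    for raw in urls:
        bucket, cleaned = preprocess(raw, known_hosts)
        stats[bucket] += 1
        if bucket == "unknown":
            unknown.append(cleaned)
    return stats, unknown
```

The `unknown` list plays the role of the data that no rule identifies and that is handed back for analysis and rule formulation.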
Further, the analysis performed by the webpage content analysis module includes determining the update state of the webpage/APP, identifying the webpage/APP character encoding, acquiring the webpage/APP title, and acquiring the webpage/APP content.
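A minimal sketch of these four analysis steps is given below, using only the Python standard library. The approach is an assumption for illustration: the encoding is guessed from a `charset=` declaration in the first bytes of the page, the update state is judged by comparing a SHA-256 digest with a previously stored one, and the names `analyze` and `_TitleTextParser` are hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TitleTextParser(HTMLParser):
    """Collects the <title> text and the visible text outside script/style."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._in_title = False
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())


def analyze(raw, previous_digest=None):
    """Analyze raw page bytes: encoding, title, content and update state."""
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.I)
    encoding = match.group(1).decode("ascii").lower() if match else "utf-8"
    try:
        html = raw.decode(encoding, errors="replace")
    except LookupError:                # unknown charset name: fall back
        encoding = "utf-8"
        html = raw.decode(encoding, errors="replace")
    parser = _TitleTextParser()
    parser.feed(html)
    parser.close()
    digest = hashlib.sha256(raw).hexdigest()
    return {"encoding": encoding, "title": parser.title.strip(),
            "content": " ".join(parser.text), "digest": digest,
            "updated": digest != previous_digest}
```

Passing the `digest` of an earlier crawl as `previous_digest` lets the caller skip pages whose bytes have not changed.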
Furthermore, the webpage crawler module supports Chinese and other multi-byte encodings and supports Unicode.
Further, the webpage crawler module comprises:
a network connection component, used for connecting to the Internet and establishing network communication;
an analysis layer, used for parsing the crawled webpage content, webpage headers, webpage encodings and files in various formats to acquire the required information, and also supporting content analysis of APPs;
and a base layer, comprising a proxy server load balancing component, a thread pool component, a URL filtering component, a URL storage component, a crawled content storage component, an anti-crawling strategy component, a webpage update state verification component and a simulated webpage login component.
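To make the division of labor in the base layer more concrete, the following sketch combines a URL filtering component, a thread pool component and a crawled content store. It is an assumption-laden illustration, not the patented design: the network connection is abstracted into a caller-supplied `fetch` function, and the names `UrlFilter` and `crawl` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


class UrlFilter:
    """Accepts a URL once, and only if its host is on the allow list."""

    def __init__(self, allowed_hosts):
        self.allowed_hosts = set(allowed_hosts)
        self.seen = set()

    def accept(self, url):
        host = urlsplit(url).hostname
        if host not in self.allowed_hosts or url in self.seen:
            return False
        self.seen.add(url)
        return True


def crawl(urls, fetch, url_filter, max_workers=4):
    """Fetch the accepted URLs in a thread pool; return {url: page}."""
    todo = [u for u in urls if url_filter.accept(u)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetch, todo))   # map preserves input order
    return dict(zip(todo, pages))
```

In a real deployment `fetch` would wrap the network connection component (proxy selection, login simulation and so on); here any callable taking a URL and returning page content will do.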
Further, the content rule base supports a multi-level classification system, extension of the classification system, and classification system mapping; supports content expansion by crawling Internet content with a crawler; supports modeling with machine learning text mining algorithms and content expansion according to the classification results predicted by the model; and supports a Chinese word segmentation dictionary base.
Further, the content rule base automatically optimizes the parameter configuration of the rule periodically and gives an optimization suggestion.
The invention discloses a content rule library management system, which has the following beneficial effects:
1. The granularity of data analysis is fine, meeting market-facing business support requirements in different scenarios.
2. The user log data of the operator is translated into user behavior data with service meaning and potential business value.
3. Based on the subdivision of the service data, massive big data is turned into local, semantic small data about users in different industries.
4. A model is provided that is suitable for large-scale processing of data by machines.
5. The solution is the last link of preprocessing before operator data is applied in industry.
6. Manually processed data is converted into data that can be processed through human-computer interaction.
7. After rule processing and encoding, the data size relative to the source data is reduced by a ratio of about 20:1, which greatly reduces the cost of subsequent analysis.
Drawings
FIG. 1 is a schematic view of a frame structure according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The core of the invention is to provide a semantic content rule tag algorithm model, comprising a semantic content rule base model, a semantic tag system coding scheme, and an algorithm, model and coding system for labeling and translating the Internet HTTP log data of users, held in the enterprise data centers of telecom operators, into semantic text data.
Please refer to fig. 1.
A dictionary coding method for a content rule base sets the dictionary of the content rule base as a 20-digit dictionary coding system supporting a 5-level label system, wherein the first-level classification is the field and occupies 3 digits, the second-level classification is the industry and occupies 4 digits, the third-level classification is the application and occupies 5 digits, the fourth-level classification is the column and occupies 4 digits, and the fifth-level classification is the search content, metadata or extracted content type and occupies 4 digits;
the first digit of the fourth-level classification is an identifier and can only be 0 or 1, wherein 0 represents a column and 1 represents a behavior; the first digit of the fifth-level classification is 0 for search, 1 for metadata and 2 for extraction; for the metadata type, a code beginning with 13 denotes an ID; for the extraction type, the second digit is 0 for text, 1 for floating point, 2 for date and 3 for ID;
a code whose 20 digits are all 0 (00000000000000000000) represents an unknown application.
The invention also discloses a content rule base management system, which applies the above coding method and comprises:
a content rule base visualization management module, which is a WEB management tool used for adding, deleting, querying and modifying entries of the rule base, for monitoring the state of each module, and for visually extracting content rules from sample data;
Table 1: Content rule base visualization management description
[Table 1 appears as images BDA0001174476760000061, BDA0001174476760000071 and BDA0001174476760000081 in the original publication.]
a URL data preprocessing and classification module, which extracts the URLs requiring deep content analysis from the users' internet access logs, imports them into a sample database, and cleans them into sample data for use by rule analysts, wherein the extracted content comprises application rules, column rules, search rules, metadata rules, noise rules and metadata. The input of the URL classification module is the sampled DPI URLs, which are cleaned into sample data using rule bases such as the application rules, column rules, search rules, metadata rules, noise rules and metadata, for use by rule analysts. The module supports checking the integrity of the URL, preprocessing the URL protocol, preprocessing the URL suffix, and configuring user-defined preprocessing rules; it also supports summary statistics of the processed data and of unknown data: data not identified by any rule is output as unknown data and entered into the rule base source data input table for analysis and for formulating rule classifications.
a key application APP/website tracking module, wherein a key application is an application whose content metadata needs to be deeply extracted and is defined under content rules - application rules; the key application tracking module is started through task management - key applications, and its output is the various application URLs under sample data - known applications, which data analysts use to further extract various content rules;
a webpage restoration module, wherein the content obtained from deep URL analysis is organized into a number of restoration rules that are provided to the webpage restoration module; one restoration rule can cover many webpages and is not subject to the various restrictions that websites impose on crawlers. The module uses the webpage restoration rules defined in the metadata rules, and the restored webpages are provided to the webpage content analysis module;
a webpage crawler module, used for crawling the relevant webpage content from the Internet based on the data processed by the crawler URL generation module, for use by the subsequent webpage content analysis module; the webpage crawler module meets the following technical requirements:
it supports Chinese and other multi-byte encodings and supports Unicode. The crawler supports network connection through proxy servers, supports encrypted connections to proxy servers, supports load balancing according to server network conditions when multiple proxy servers are available, supports local DNS caching through the crawler's DNS cache, supports management of crawling threads, supports optimization of the crawl queue according to crawl volume, supports crawling specified content of specified websites, and supports setting the time periods during which the crawler crawls.
a webpage content analysis module, used for establishing a content metadata rule base by extracting rules from webpage content and mapping the extracted content to content metadata;
a content rule base comprising the following rules: APP application rules, noise rules, APP column/behavior rules, APP search keyword extraction rules, APP content metadata rules and webpage restoration rules.
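How the rule types in the content rule base might be applied to a single URL can be sketched as follows. The rule contents (host names, path patterns, the `q` query parameter), the order of application and the function name `label` are all hypothetical; the patent names the rule types but does not specify their format.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Hypothetical rule base: every entry below is made up for illustration.
RULES = {
    "noise": [re.compile(r"\.(?:js|css|png|jpg|gif)$", re.I)],
    "application": {"www.autohome.com.cn": "Autohome"},
    "column": {"www.autohome.com.cn": [(re.compile(r"^/spec/"), "model page")]},
    "search": {"www.autohome.com.cn": "q"},   # query parameter holding keywords
}


def label(url, rules=RULES):
    """Apply noise, application, column and search-keyword rules in turn."""
    parts = urlsplit(url)
    if any(p.search(parts.path) for p in rules["noise"]):
        return {"noise": True}
    app = rules["application"].get(parts.hostname)
    result = {"noise": False, "application": app,
              "column": None, "keyword": None}
    if app is None:                    # unknown application: nothing more to do
        return result
    for pattern, name in rules["column"].get(parts.hostname, []):
        if pattern.search(parts.path):
            result["column"] = name
            break
    param = rules["search"].get(parts.hostname)
    values = parse_qs(parts.query).get(param) if param else None
    if values:
        result["keyword"] = values[0]
    return result
```

URLs for which `application` comes back as `None` correspond to the unknown data that the preprocessing module hands to analysts.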
The content rule base supports label rule definitions with the following classification structure:
1) Field
2) Industry
3) Application
4.1) Column/behavior
4.2) Search
4.3) Content (metadata)
5) Content attributes
4.4) Extraction
Example: Automobile (1) - Automobile (2) - Autohome (3) - brandID-3985, model amb AMGCLA-grade AMG (4) - price: 62.8 ten-thousand, i.e. 628,000 (5)
The content rule base has the following characteristics:
it supports a multi-level classification system;
it supports extension of the classification system;
it supports classification system mapping;
it supports content expansion by crawling Internet content with a crawler;
it supports modeling with machine learning text mining algorithms and content expansion according to the classification results predicted by the model;
it supports a Chinese word segmentation dictionary base:
1. it supports Chinese word segmentation technology;
2. it supports expansion of the Chinese word segmentation dictionary;
3. it supports dictionary expansion by crawling professional websites with a crawler.
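The patent does not say which segmentation algorithm is used. As one simple illustration of dictionary-based Chinese word segmentation with an expandable dictionary, the sketch below uses forward maximum matching, a technique chosen here purely for the example; the function name `segment` is hypothetical.

```python
def segment(text, dictionary, max_len=None):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    if max_len is None:
        max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Expanding the dictionary, for instance with terms crawled from a professional website, directly changes the segmentation: once a new word is added to the set, later calls to `segment` will recognize it as one token.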
A rule optimization algorithm model automatically optimizes the parameter configuration of the rules at regular intervals, gives optimization suggestions, and outputs them to the to-be-confirmed traffic content identification rule base.
As a specific embodiment, the preprocessing performed by the URL data preprocessing and classification module comprises checking the integrity of the URL, preprocessing the URL protocol, preprocessing the URL suffix, applying configurable user-defined preprocessing rules, and producing summary statistics of the processed data; data not identified by any rule is output as unknown data and entered into the rule base source data input table for analysis and for formulating rule classifications.
As a specific embodiment, the analysis performed by the webpage content analysis module includes determining the update state of the webpage/APP, identifying the webpage/APP character encoding, acquiring the webpage/APP title, and acquiring the webpage/APP content.
As a specific embodiment, the webpage crawler module supports Chinese and other multi-byte encodings and supports Unicode.
As a specific embodiment, the webpage crawler module comprises:
a network connection component, used for connecting to the Internet and establishing network communication;
an analysis layer, used for parsing the crawled webpage content, webpage headers, webpage encodings and files in various formats to acquire the required information, and also supporting content analysis of APPs;
and a base layer, comprising a proxy server load balancing component, a thread pool component, a URL filtering component, a URL storage component, a crawled content storage component, an anti-crawling strategy component, a webpage update state verification component and a simulated webpage login component.
As a specific embodiment, the content rule base supports a multi-level classification system, extension of the classification system, and classification system mapping; supports content expansion by crawling Internet content with a crawler; supports modeling with machine learning text mining algorithms and content expansion according to the classification results predicted by the model; and supports a Chinese word segmentation dictionary base.
As a specific embodiment, the content rule base automatically optimizes the parameter configuration of the rule periodically and gives an optimization suggestion.
Based on the internet access log data of users held in the enterprise data center, the invention realizes content analysis, configuration, management and visualization, self-optimization of the content rule base and other functions by means of URL filtering, rule base matching, crawling, content restoration, text data mining and the like, while also taking into account requirements such as system security, load balancing and data backup.
The invention uses crawler technology combined with DPI restoration technology, together with process-based management of the label rule base, to deeply identify customers' mobile internet access behavior, accessed content and accessed applications, thereby providing fine-grained support for the analysis of customer data and service data; combined with data integration, data modeling, data mining, data cleaning and other means, it builds big data and in-depth user insight analysis capabilities for subdivided industries.
The relationship between the content rule base of the invention and peripheral systems is as follows:
Relationship to stream processing: the content rule base needs to provide an internet repository for use by stream processing. Stream processing processes the signaling xDR data and then provides it to the internet behavior analysis module.
Relationship to the data asset management platform: the internet behavior analysis uses the label rule base for processing, and both the data processing procedures and the stored data results are managed as data assets by the data asset management platform.
Relationship to customer label management on the big data operation support platform: the crawler and content rule base analysis system needs to provide customer internet attribute data to the customer label management module, which generates customer internet preference labels.
The system structure adopted by the invention fully takes into account the basic conditions of the operator's internet behavior data sources; the rule base is established in a manner essentially similar to the way internet behavior data is generated, which makes rule establishment more reasonable and effective.
Integrity of
1) The system has back-end data acquisition and data capture modules.
2) The system has a rule base management system matched with back-end data processing.
Degree of maturity
Rule base content has been established in fields such as automobiles, insurance and securities. The rules take into account the characteristics of internet behavior log data sources, and the finally determined rule models were obtained through thorough discussion with clients in the relevant industries, continuous improvement and optimization, and practical testing in those industries.
Extensibility and ease of use
The system architecture of the product adopts a layered and hierarchical processing structure, each layer can configure a corresponding physical server according to the size of data volume and the scale of a rule base required to be established, and the system is a flexible extensible system. The management of the rule base and the calling of the related system module can be based on WEB visual operation, and the deployment and the use of the system are greatly facilitated.
Stability of
The whole application software system can work continuously, 7×24 hours, without interruption, and failures raise timely alarms. The application system should have a thorough detection function to ensure the accuracy of data such as service control data. Detection at each link of the system is managed in a closed loop, and a detection system relatively independent of the other functional modules is established to verify the accuracy of the data. The application system should provide automatic or manual recovery measures so that normal operation can be restored quickly when an error occurs. The application software shall avoid consuming so many system resources that the system crashes.
Advancement of
The rule base is established in a cold start, sampling, training and iteration mode, is parallel to the big data internet behavior deep analysis system, and can be seamlessly embedded into the big data internet behavior analysis system of the enterprise. The crawler and rule base management system can meet the requirements of business analysis and big data mining in the fastest mode and the shortest time in the face of the ever-changing content of the Internet.
Novelty
Based on the network characteristics of an operator, a webpage content restoration module is innovatively introduced to serve as one of core function modules for rule base construction, the restoration module has the greatest advantage that the establishment process of a rule base can be shortened, and the establishment cost and time of the rule base are greatly saved by utilizing the network owned by the operator. Moreover, the data validity of the rule base is also ensured.
Openness of
Based on the requirement of actual online behavior analysis, the rule base is established according to different levels of application, application behaviors, application columns, application contents, application content metadata, noise and the like, and a semantic dictionary classification rule is established at the same time, so that the requirements of behavior analysis and basic data of future user behavior portraits are completely covered.
The establishment of the content rules of the invention is complex work with a long process; in addition to supporting tools, a clear management process and operating procedure are needed to guarantee it. The establishment of the rule base is divided into two stages: 1) cold start; 2) iterative optimization.
1. Cold start
Any new APP or website that requires deep content tagging inevitably has a cold start process that requires human intervention. The initial rule base for a cold start is initiated by a rule base maintenance person.
2. Iteration
Content rule base recognition rate: greater than 95% for the primary classification and greater than 85% for the secondary classification. Based on these target figures, when processing the URLs of a specific APP or website, the sampled data is used for training and iteration until the unknown URLs output are less than 15%; new sampled data is then brought in, training is repeated, and the results are added to the rule label base.
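The iteration described above can be pictured as a loop that keeps adding rules until the share of unknown URLs in the sample drops below the 15% threshold. The sketch below is a simplification under stated assumptions: a rule is just a host name, a URL is recognized when its host is in the rule set, and `most_common_host` is a hypothetical stand-in for the analyst or training step that proposes the next rule.

```python
from collections import Counter
from urllib.parse import urlsplit


def most_common_host(unknown_urls):
    """Hypothetical stand-in for the analyst: propose the host that
    accounts for the most unknown URLs."""
    hosts = Counter(urlsplit(u).hostname for u in unknown_urls)
    return hosts.most_common(1)[0][0] if hosts else None


def iterate(sample, rules, propose_rule=most_common_host,
            threshold=0.15, max_rounds=100):
    """Add rules until the unknown-URL ratio falls below the threshold."""
    rules = set(rules)
    if not sample:
        return rules, 0.0
    ratio = 1.0
    for _ in range(max_rounds):
        unknown = [u for u in sample if urlsplit(u).hostname not in rules]
        ratio = len(unknown) / len(sample)
        if ratio < threshold:
            break
        new_rule = propose_rule(unknown)
        if new_rule is None or new_rule in rules:
            break                      # no further progress possible
        rules.add(new_rule)
    return rules, ratio
```

Once the loop stops, a fresh batch of sampled URLs would be passed through `iterate` again with the enlarged rule set, mirroring the repeated training on new sample data.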
Table two: content rule base establishment procedure
Figure BDA0001174476760000151
The foregoing is only a preferred embodiment of the present invention and is not limiting thereof; it should be noted that, although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that the technical solutions described in the above embodiments can be modified, and some or all of the technical features can be equivalently replaced; and the modifications and the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A content rule base management system, characterized in that the system applies a dictionary coding system in which the dictionary of the content rule base is set to 20 digits and which supports a 5-level label system, wherein the first-level classification is the field and occupies 3 digits, the second-level classification is the industry and occupies 4 digits, the third-level classification is the application and occupies 5 digits, the fourth-level classification is the column and occupies 4 digits, and the fifth-level classification is the search content, metadata or extracted content type and occupies 4 digits; the first digit of the fourth-level classification is an identifier and can only be 0 or 1, wherein 0 represents a column and 1 represents a behavior; the first digit of the fifth-level classification is 0 for search, 1 for metadata and 2 for extraction; for the metadata type, a code beginning with 13 denotes an ID; for the extraction type, the second digit is 0 for text, 1 for floating point, 2 for date and 3 for ID; a code whose 20 digits are all 0 (00000000000000000000) represents an unknown application; the system comprising:
a content rule base visualization management module, used for adding, deleting, querying and modifying entries of the rule base, for monitoring the state of each module, and for visually extracting content rules from sample data;
a URL data preprocessing and classification module, which extracts the URLs requiring deep content analysis from the users' internet access logs, imports them into a sample database, and cleans them into sample data for use by rule analysts, wherein the extracted content comprises application rules, column rules, search rules, metadata rules, noise rules and metadata;
a key application APP/website tracking module, wherein a key application is an application whose content metadata needs to be deeply extracted and is defined under content rules - application rules; the key application tracking module is started through task management - key applications, and its output is the various application URLs under sample data - known applications, which data analysts use to further extract various content rules;
a webpage restoration module, wherein the content obtained from deep URL analysis is organized into a number of restoration rules that are provided to the webpage restoration module; the module uses the webpage restoration rules defined in the metadata rules, and the restored webpages are provided to the webpage content analysis module;
a webpage crawler module, used for crawling the relevant webpage content from the Internet based on the data processed by the crawler URL generation module, for use by the subsequent webpage content analysis module;
a webpage content analysis module, used for establishing a content metadata rule base by extracting rules from webpage content and mapping the extracted content to content metadata;
a content rule base comprising the following rules: APP application rules, noise rules, APP column/behavior rules, APP search keyword extraction rules, APP content metadata rules and webpage restoration rules.
2. The system of claim 1, wherein the preprocessing performed by the URL data preprocessing and classification module includes preprocessing the integrity of the URL, the URL protocol, and the suffix name of the URL; configuring statistical summaries from the definition of the preprocessing rules and the processing of the data; and outputting data not recognized by any rule as unknown data into the rule base source data input table for analysis and rule classification.
3. The system of claim 1, wherein the analysis performed by the webpage content analysis module comprises determining the update status of web/APP applications, identifying web/APP application codes, obtaining web/APP application titles, and obtaining web/APP application content.
4. The system of claim 1, wherein the web crawler module supports Chinese and multi-byte encodings and supports Unicode encoding.
5. The content rule base management system of claim 4, wherein the web crawler module comprises:
a network connection component, configured to connect to the Internet and establish network communication;
an analysis layer, configured to analyze crawled webpage content, webpage headers, webpage codes and files in various formats to obtain the required information, and also supporting content analysis of APPs;
and a base layer comprising a proxy server load balancing component, a thread pool component, a URL filtering component, a URL storage component, a crawled content storage component, an anti-crawling strategy component, a webpage update status verification component and a webpage simulated login component.
6. The system of claim 5, wherein the content rule base supports a multi-level classification system; supports extension of the classification system; supports classification system mapping; supports content expansion by having a crawler crawl Internet content; supports modeling through machine-learning text mining algorithms, with content expansion performed according to the classification results predicted by the model; and supports a Chinese word segmentation dictionary base.
7. The system of claim 6, wherein the content rule base automatically optimizes the parameter configuration of the rules and periodically gives optimization suggestions.
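The 20-position dictionary code recited in claim 1 can be illustrated with a minimal Python sketch. It assumes each position holds one decimal digit (the claim uses values such as 2 and 3), and it reads "begins with 13" as metadata (1) followed by ID (3); both readings are assumptions. The names `LEVELS`, `split_code` and `describe` are hypothetical and do not come from the patent.

```python
# Level layout from claim 1: name and width in digits (3 + 4 + 5 + 4 + 4 = 20).
LEVELS = (("field", 3), ("industry", 4), ("application", 5),
          ("column", 4), ("content_type", 4))
CODE_LENGTH = sum(width for _, width in LEVELS)
UNKNOWN_APPLICATION = "0" * CODE_LENGTH  # all zeros: unknown application

LEVEL4_KIND = {"0": "column", "1": "behavior"}
LEVEL5_KIND = {"0": "search", "1": "metadata", "2": "extraction"}
EXTRACTION_TYPE = {"0": "text", "1": "float", "2": "date", "3": "id"}


def split_code(code):
    """Split a 20-digit dictionary code into its five level fields."""
    if len(code) != CODE_LENGTH or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected {CODE_LENGTH} decimal digits, got {code!r}")
    parts, pos = {}, 0
    for name, width in LEVELS:
        parts[name] = code[pos:pos + width]
        pos += width
    return parts


def describe(code):
    """Decode the level-4 and level-5 flag digits described in claim 1."""
    parts = split_code(code)
    parts["unknown_application"] = code == UNKNOWN_APPLICATION
    if parts["unknown_application"]:
        return parts
    level4_flag = parts["column"][0]
    if level4_flag not in LEVEL4_KIND:
        raise ValueError("first digit of level 4 must be 0 or 1")
    parts["level4_kind"] = LEVEL4_KIND[level4_flag]
    kind_digit, type_digit = parts["content_type"][0], parts["content_type"][1]
    parts["level5_kind"] = LEVEL5_KIND.get(kind_digit)  # None if unrecognized
    if parts["level5_kind"] == "extraction":
        parts["value_type"] = EXTRACTION_TYPE.get(type_digit)
    elif parts["level5_kind"] == "metadata" and type_digit == "3":
        # Assumption: "begins with 13" means metadata (1) + ID (3).
        parts["value_type"] = "id"
    return parts
```

For example, `describe("00100020000310042200")` splits the code into field `001`, industry `0002`, application `00003`, column `1004` and content type `2200`, and reports a behavior entry whose extracted value is a date. The sample digits are invented for illustration only.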
CN201611121969.XA 2016-12-08 2016-12-08 Content rule library management system and coding method thereof Expired - Fee Related CN106599160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611121969.XA CN106599160B (en) 2016-12-08 2016-12-08 Content rule library management system and coding method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611121969.XA CN106599160B (en) 2016-12-08 2016-12-08 Content rule library management system and coding method thereof

Publications (2)

Publication Number Publication Date
CN106599160A CN106599160A (en) 2017-04-26
CN106599160B true CN106599160B (en) 2020-06-02

Family

ID=58598488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611121969.XA Expired - Fee Related CN106599160B (en) 2016-12-08 2016-12-08 Content rule library management system and coding method thereof

Country Status (1)

Country Link
CN (1) CN106599160B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107257390B (en) * 2017-05-27 2020-10-09 北京思特奇信息技术股份有限公司 URL address resolution method and system
CN107193995A (en) * 2017-06-08 2017-09-22 网帅科技(北京)有限公司 A kind of position classifying rules base management system and its coding method
CN107391597B (en) * 2017-06-30 2020-08-07 北京航空航天大学 Multivariate data acquisition method and system
CN107948266A (en) * 2017-11-17 2018-04-20 武汉绿色网络信息服务有限责任公司 The processing method and system of HTTP uplink traffics in asymmetric routed environment
CN109766501B (en) * 2019-01-14 2021-08-17 北京搜狗科技发展有限公司 Crawler protocol management method and device and crawler system
CN110765120A (en) * 2019-10-28 2020-02-07 陈贞辉 Information management method for power grid engineering cost
CN110826007B (en) * 2019-12-04 2022-07-05 杭州安恒信息技术股份有限公司 Column updating date determining method, device and equipment and readable storage medium
CN111813890B (en) * 2020-07-22 2021-12-07 江苏宏创信息科技有限公司 Policy portrait AI modeling system and method based on big data
CN111949803A (en) * 2020-08-21 2020-11-17 深圳供电局有限公司 Method, device and equipment for detecting network abnormal user based on knowledge graph
CN115730020B (en) * 2022-11-22 2023-10-10 哈尔滨工程大学 Automatic driving data monitoring method and monitoring system based on MySQL database log analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102354283A (en) * 2011-09-20 2012-02-15 天津智康医疗科技有限公司 Method for constructing rule base and method for checking data by utilizing rule base
CN103970898A (en) * 2014-05-27 2014-08-06 重庆大学 Method and device for extracting information based on multistage rule base

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011127049A1 (en) * 2010-04-07 2011-10-13 Liveperson, Inc. System and method for dynamically enabling customized web content and applications

Also Published As

Publication number Publication date
CN106599160A (en) 2017-04-26

Similar Documents

Publication Publication Date Title
CN106599160B (en) Content rule library management system and coding method thereof
CN112749284B (en) Knowledge graph construction method, device, equipment and storage medium
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
CN106844640B (en) Webpage data analysis processing method
US9104709B2 (en) Cleansing a database system to improve data quality
CN107257390B (en) URL address resolution method and system
KR20220091676A (en) Apparatus and Method for Building Unstructured Cyber Threat Information Big-data, Method for Analyzing Unstructured Cyber Threat Information
CN110321437B (en) Corpus data processing method and device, electronic equipment and medium
CN107341399A (en) Assess the method and device of code file security
Chen et al. Bert-log: Anomaly detection for system logs based on pre-trained language model
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN113076735A (en) Target information acquisition method and device and server
US11601339B2 (en) Methods and systems for creating multi-dimensional baselines from network conversations using sequence prediction models
CN112000929A (en) Cross-platform data analysis method, system, equipment and readable storage medium
WO2016093839A1 (en) Structuring of semi-structured log messages
CN113918794A (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
CN107527289B (en) Investment portfolio industry configuration method, device, server and storage medium
US11295078B2 (en) Portfolio-based text analytics tool
CN115051863A (en) Abnormal flow detection method and device, electronic equipment and readable storage medium
CN110888977A (en) Text classification method and device, computer equipment and storage medium
CN112199573B (en) Illegal transaction active detection method and system
CN114492576A (en) Abnormal user detection method, system, storage medium and electronic equipment
Ma et al. API prober–a tool for analyzing web API features and clustering web APIs
CN111951079A (en) Credit rating method and device based on knowledge graph and electronic equipment
Do et al. Some Research Issues of Harmful and Violent Content Filtering for Social Networks in the Context of Large-Scale and Streaming Data with Apache Spark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200602

Termination date: 20211208