CN106599160B - Content rule library management system and coding method thereof

Info

Publication number: CN106599160B
Application number: CN201611121969.XA
Authority: CN
Inventors: 胡庆勇
Original assignee: Netcommander Technology Beijing Co ltd
Current assignee: Netcommander Technology Beijing Co ltd
Priority date: 2016-12-08
Filing date: 2016-12-08
Publication date: 2020-06-02
Anticipated expiration: 2036-12-08
Also published as: CN106599160A

Abstract

The invention relates to a dictionary coding method for a content rule base, in which the dictionary of the content rule base is set as a 20-digit dictionary coding system. The invention also discloses a content rule base management system comprising: a content rule base visualization management module, a URL data preprocessing and classification module, a key application APP/website tracking module, a webpage restoration module, a webpage crawler module, a webpage content analysis module and a content rule base. The advantage of the invention is that the behavior logs of users accessing the mobile internet can be analyzed and translated on a large scale and from a full perspective, forming a holographic knowledge map of mobile internet users and supporting subsequent content analysis applications.

Description

Content rule library management system and coding method thereof

Technical Field

The invention relates to the technical field of data processing, in particular to a content rule base management system and a coding method thereof.

Background

Telecom operators obtain the original internet signaling data of their customers through optical splitting and, after first-stage DPI identification, output internet logs synthesized as xDR records. In general, however, the data produced by this first-stage analysis is not fine-grained enough, the rules cannot carry much load, and the analysis dimensions are inflexible. Enhanced DPI analysis is therefore required, further strengthening APP identification, webpage classification, keyword analysis, the knowledge base system and similar aspects so as to support subsequent content analysis applications. As to how such extremely large and complex data can be translated and labeled as information containing deep semantic content, the prior art only states requirements for the data results to be obtained; regarding how to achieve those results, it has the following disadvantages:

1. only relatively shallow content can be translated;

2. labeling relies almost entirely on manual work;

3. only a small amount of sample data can be labeled manually;

4. changes in the source data structure cannot be discovered quickly;

5. no complete solution, model or algorithm is provided for how such data results are to be achieved.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a content rule base management system and a coding method thereof, which analyze and translate the behavior logs of users accessing the mobile internet on a large scale and from a full perspective, so as to form a holographic knowledge map of mobile internet users.

To achieve this purpose, the invention discloses the following technical solution:

A dictionary coding method for a content rule base sets the dictionary of the content rule base as a 20-digit dictionary coding system supporting a 5-level label system, wherein the first-level classification is the field and occupies 3 digits, the second-level classification is the industry and occupies 4 digits, the third-level classification is the application and occupies 5 digits, the fourth-level classification is the column and occupies 4 digits, and the fifth-level classification is the search content, metadata or extraction content type and occupies 4 digits;

the first digit of the fourth-level classification is an identifier and can only be 0 or 1, wherein 0 represents a column and 1 represents a behavior; the first digit of the fifth-level classification is 0 for search, 1 for metadata and 2 for extraction; for the metadata type, the code begins with 13, where 3 represents ID; for the extraction type, the second digit is 0 for text, 1 for floating point, 2 for date and 3 for ID;

the 20-digit code 00000000000000000000 represents an unknown application.
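The layout of the code described above can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the implementation of the invention: the code is treated as a string of 20 decimal digits (an assumption based on the digit values 2 and 3 used in the text), the names `LEVEL_WIDTHS`, `split_code` and `describe` are hypothetical, and the metadata sub-coding is left undecoded because the text does not specify it fully.

```python
# Hypothetical sketch: split a 20-digit rule code into its five level fields.
# Widths follow the text: field 3, industry 4, application 5, column 4, type 4.
LEVEL_WIDTHS = [("field", 3), ("industry", 4), ("application", 5),
                ("column", 4), ("content_type", 4)]
UNKNOWN_APPLICATION = "0" * 20  # the all-zero code means "unknown application"

FIFTH_LEVEL_KIND = {"0": "search", "1": "metadata", "2": "extraction"}
EXTRACTION_VALUE_TYPE = {"0": "text", "1": "floating point",
                         "2": "date", "3": "ID"}


def split_code(code: str) -> dict:
    """Cut the code into its five fixed-width level fields."""
    if len(code) != 20 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected exactly 20 decimal digits")
    parts, pos = {}, 0
    for name, width in LEVEL_WIDTHS:
        parts[name] = code[pos:pos + width]
        pos += width
    if parts["column"][0] not in "01":
        raise ValueError("fourth-level identifier digit must be 0 or 1")
    return parts


def describe(code: str) -> dict:
    """Interpret the identifier digits of the fourth and fifth levels."""
    if code == UNKNOWN_APPLICATION:
        return {"unknown_application": True}
    parts = split_code(code)
    info = dict(parts, unknown_application=False)
    info["level4_kind"] = "column" if parts["column"][0] == "0" else "behavior"
    kind = FIFTH_LEVEL_KIND.get(parts["content_type"][0], "unassigned")
    info["level5_kind"] = kind
    if kind == "extraction":
        # Only the extraction type has a fully specified second digit.
        info["value_type"] = EXTRACTION_VALUE_TYPE.get(
            parts["content_type"][1], "unassigned")
    return info
```

For instance, `describe("00100020000310052100")` would report a behavior at the fourth level and an extraction of a floating-point value at the fifth level.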

The invention also discloses a content rule base management system, which applies the above coding method and comprises:

a content rule base visualization management module, used for adding, deleting, querying and modifying entries in the rule base, while also monitoring the state of each module and providing visual operations for extracting content rules from sample data;

a URL data preprocessing and classification module, which extracts URLs requiring deep content analysis from user internet logs, imports them into a sample database, and cleans them into sample data for rule analysts to use, wherein the extracted content comprises application rules, column rules, search rules, metadata rules and noise rules;

a key application APP/website tracking module, wherein a key application is an application requiring deep extraction of content metadata and is defined under Content Rules - Application Rules; the key application tracking module is started through Task Management - Key Applications, and its output is Sample Data - the various application URLs of known applications, which data analysts use to further extract the various content rules;

a webpage restoration module, wherein the content obtained from deep URL analysis is organized into a number of restoration rules that are provided to the webpage restoration module; the restored webpages are provided to the webpage content analysis module; the module uses the webpage restoration rules defined in the metadata rules;

a webpage crawler module, used for crawling related webpage content from the Internet based on the data processed by the crawler URL generation module, for use by the subsequent webpage content analysis module;

a webpage content analysis module, used for establishing a content metadata rule base by extracting rules from webpage content and mapping the extracted content to content metadata;

a content rule base comprising the following rules: APP application rules, noise rules, APP column action rules, APP search keyword extraction rules, APP content metadata rules and webpage restoration rules.

Further, the preprocessing performed by the URL data preprocessing and classification module comprises URL integrity preprocessing, URL protocol preprocessing and URL suffix preprocessing, with configurable user-defined preprocessing rules and summary statistics of the processed data; data not identified by any rule is output as unknown data and enters the rule base source data input table for analysis and for formulating rule classifications.
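The three preprocessing checks and the summary statistics can be sketched as follows. This is a minimal illustration, assuming that integrity means "has a host", that only HTTP and HTTPS are kept, and that static-resource suffixes are rejected; the constants `ALLOWED_SCHEMES` and `SKIPPED_SUFFIXES` and the function names are hypothetical and would in practice come from configuration.

```python
from collections import Counter
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}                          # assumed protocol rule
SKIPPED_SUFFIXES = (".jpg", ".png", ".gif", ".css", ".js")   # assumed suffix rule


def preprocess(url: str) -> str:
    """Return the preprocessing outcome for one logged URL."""
    candidate = url.strip()
    if "://" not in candidate:
        # Assumption: log entries may omit the scheme, so HTTP is supplied.
        candidate = "http://" + candidate
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return "incomplete"           # malformed URL, e.g. a broken IPv6 host
    if not parts.hostname:
        return "incomplete"           # integrity check: no host present
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return "protocol_rejected"    # protocol check
    if parts.path.lower().endswith(SKIPPED_SUFFIXES):
        return "suffix_rejected"      # suffix check
    return "kept"


def summarize(urls) -> Counter:
    """Summary statistics of the preprocessing outcomes."""
    return Counter(preprocess(u) for u in urls)
```

A call such as `summarize(log_urls)` then gives the count of kept and rejected URLs per reason, which corresponds to the statistical summary mentioned in the text.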

Further, the content analyzed by the webpage content analysis module includes judging the update state of the webpage/APP application, identifying the webpage/APP application encoding, acquiring the webpage/APP application title, and acquiring the webpage/APP application content.
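The four analysis steps named above (update state, encoding, title, content) might look like the following sketch. It is an illustration only: comparing a SHA-256 digest to detect updates, reading the charset from the first 2048 bytes, and the names `detect_charset` and `analyze_page` are all assumptions, not details given by the invention.

```python
import hashlib
import re
from html.parser import HTMLParser


class _TitleAndText(HTMLParser):
    """Collects the <title> text and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def detect_charset(raw: bytes, default: str = "utf-8") -> str:
    """Look for a charset declaration near the top of the page."""
    match = re.search(rb'charset=["\']?([A-Za-z0-9_\-]+)', raw[:2048])
    return match.group(1).decode("ascii").lower() if match else default


def analyze_page(raw: bytes, previous_digest=None) -> dict:
    digest = hashlib.sha256(raw).hexdigest()
    charset = detect_charset(raw)
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:               # declared charset unknown to Python
        charset = "utf-8"
        text = raw.decode(charset, errors="replace")
    parser = _TitleAndText()
    parser.feed(text)
    parser.close()
    return {"updated": digest != previous_digest, "digest": digest,
            "charset": charset, "title": parser.title.strip(),
            "content": " ".join(parser.text_parts)}
```

Storing the returned digest and passing it back on the next visit lets the caller skip pages whose bytes have not changed.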

Further, the webpage crawler module supports Chinese and multi-byte encodings and supports Unicode encoding.

Further, the web crawler module comprises:

a network connection component, used for connecting to the Internet and establishing network communication;

an analysis layer, used for analyzing the crawled webpage content, webpage headers, webpage encodings and files in various formats to acquire the required information, and also supporting content analysis of APPs;

and a basic layer, comprising a proxy server load balancing component, a thread pool component, a URL filtering component, a URL storage component, a crawled content storage component, an anti-crawling strategy component, a webpage update state verification component and a webpage simulated login component.

Further, the content rule base supports a multi-level classification system; supports extension of the classification system; supports classification system mapping; supports content expansion by crawling Internet content with a crawler; supports modeling with machine-learning text-mining algorithms, with content expansion according to the classification results predicted by the model; and supports a Chinese word segmentation dictionary base.

Further, the content rule base periodically and automatically optimizes the parameter configuration of the rules and gives optimization suggestions.

The invention discloses a content rule library management system, which has the following beneficial effects:

1. The granularity of data analysis is fine, so market-facing business support requirements in different scenarios can be met.

2. The operator's user log data is translated into user behavior data with service meaning and potential business value.

3. Based on the subdivision of service data, massive big data is turned into local, semantic small data about users in different industries.

4. A model is provided that is suitable for large-scale machine processing of data.

5. The solution is the last link of preprocessing before operator data is applied in industry.

6. Data that was processed manually is converted into data that can be processed through human-computer interaction.

7. After rule processing and encoding, the data is reduced to about one twentieth of the source data size (a ratio of about 20:1), which greatly reduces the cost of subsequent processing and analysis.

Drawings

FIG. 1 is a schematic view of a frame structure according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The core of the invention is to provide a semantic content rule tag algorithm model, comprising a semantic content rule base model, a semantic tag system coding, and an algorithm, model and coding system for labeling and translating the internet HTTP log data of users in a telecom operator's enterprise data center into semantic text data.

Please refer to FIG. 1.

The 20-digit code 00000000000000000000 represents an unknown application.

The content rule base visualization management module is a WEB management tool used for adding, deleting, querying and modifying entries in the rule base, while also monitoring the state of each module and providing visual operations for extracting content rules from sample data.

Table 1: Content rule base visualization management description

The URL data preprocessing and classification module extracts URLs requiring deep content analysis from user internet logs, imports them into a sample database, and cleans them into sample data for rule analysts to use, wherein the extracted content comprises application rules, column rules, search rules, metadata rules and noise rules. The URL classification module takes the sampled DPI URLs as input and cleans them into sample data using rule bases such as the application rules, column rules, search rules, metadata rules and noise rules, for use by rule analysts. The module supports URL integrity preprocessing, URL protocol preprocessing, URL suffix preprocessing and configurable user-defined preprocessing rules, and supports summary statistics of processed data and unknown data: data not identified by any rule is output as unknown data and enters the rule base source data input table for analysis and for formulating rule classifications.
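The cleaning step, in which each sampled URL is matched against the rule bases and anything unmatched is routed to unknown data, can be sketched as below. The rule table is entirely hypothetical: the regular expressions, the order of evaluation and the host `example.com` are assumptions chosen for illustration, not rules from the invention.

```python
import re
from collections import defaultdict

# Hypothetical rule table, evaluated in order: (rule base name, pattern).
RULES = [
    ("noise", re.compile(r"\.(?:gif|png|css|js)(?:\?|$)")),
    ("search", re.compile(r"[?&](?:q|keyword)=")),
    ("column", re.compile(r"^https?://[^/]+/(?:news|forum)/")),
    ("application", re.compile(r"^https?://(?:www\.)?example\.com/")),
]


def classify(url: str) -> str:
    """Return the first rule base that recognizes the URL, else 'unknown'."""
    for name, pattern in RULES:
        if pattern.search(url):
            return name
    return "unknown"


def clean(urls) -> dict:
    """Group sampled URLs by rule base; 'unknown' holds unidentified URLs."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[classify(url)].append(url)
    return dict(buckets)
```

The `unknown` bucket plays the role of the data that enters the rule base source data input table, where analysts would inspect it and formulate new rules.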

The webpage restoration module: the content obtained from deep URL analysis is organized into a number of restoration rules that are provided to the webpage restoration module. One restoration rule can cover many webpages and is not subject to the various restrictions that websites impose on crawlers. The restored webpages are provided to the webpage content analysis module, using the webpage restoration rules defined in the metadata rules.

The webpage crawler module is used for crawling related webpage content from the Internet based on the data processed by the crawler URL generation module, for use by the subsequent webpage content analysis module. The web crawler module meets the following technical requirements:

It supports Chinese and multi-byte encodings as well as Unicode encoding. The crawler supports network connection through proxy servers, including encrypted proxy connections; when several proxy servers are available, it supports load balancing according to the network condition of the servers; it supports local DNS caching through the crawler's DNS cache; it supports management of the crawling threads; it supports optimizing the crawl queue according to crawl volume; it supports crawling specified content of specified websites; and it supports setting the time period during which the crawler crawls.
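Two of these requirements, proxy load balancing by network condition and a configurable crawl time period, can be sketched briefly. The sketch assumes that "network condition" is summarized as a measured latency per proxy and that the crawl period may wrap past midnight; `pick_proxy` and `in_crawl_window` are hypothetical names.

```python
from datetime import time


def pick_proxy(latencies_ms: dict) -> str:
    """Choose the proxy with the lowest measured latency (assumed metric)."""
    if not latencies_ms:
        raise ValueError("no proxy servers available")
    return min(latencies_ms, key=latencies_ms.get)


def in_crawl_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls in the configured crawl period [start, end)."""
    if start <= end:
        return start <= now < end
    # The period wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end
```

A scheduler could call `in_crawl_window` before dispatching each batch and `pick_proxy` before opening each connection; other balancing strategies, such as weighted round robin, would fit the same interface.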

The content rule base supports the following label rule definitions of the classification structure:

1) Field

2) Industry

3) Application

4.1) Column/behavior

4.2) Search

4.3) Content (metadata)

5) Content attributes

4.4) Extraction

Example: Automobile (1) - Automobile (2) - Autohome (3) - brand ID 3985, model AMG CLA-class AMG (4) - price: 628,000 (5)

The content rule base has the following characteristics:

Supports a multi-level classification system

Supports extension of the classification system

Supports classification system mapping

Supports content expansion by crawling internet content with crawlers

Supports modeling with machine-learning text-mining algorithms and content expansion according to the classification results predicted by the model

Supports a Chinese word segmentation dictionary base:

1. Supports Chinese word segmentation technology

2. Supports expansion of the Chinese word segmentation dictionary

3. Supports dictionary expansion by crawling professional websites with the crawler

The rule optimization algorithm model automatically optimizes the parameter configuration of the rules at regular intervals, gives optimization suggestions, and outputs them to the traffic content identification rule base pending confirmation.

As a specific embodiment, the preprocessing performed by the URL data preprocessing and classification module includes URL integrity preprocessing, URL protocol preprocessing and URL suffix preprocessing, with configurable user-defined preprocessing rules and summary statistics of the processed data; data not identified by any rule is output as unknown data and enters the rule base source data input table for analysis and rule classification.

As a specific embodiment, the content analyzed by the webpage content analysis module includes determining the update state of a webpage/APP application, identifying the webpage/APP application encoding, obtaining the webpage/APP application title, and obtaining the webpage/APP application content.

As a specific embodiment, the webpage crawler module supports Chinese and multi-byte encodings and supports Unicode encoding.

As a specific embodiment, the web crawler module includes the network connection component, the analysis layer and the basic layer described above.

As a specific embodiment, the content rule base supports a multi-level classification system; supports extension of the classification system; supports classification system mapping; supports content expansion by crawling Internet content with a crawler; supports modeling with machine-learning text-mining algorithms, with content expansion according to the classification results predicted by the model; and supports a Chinese word segmentation dictionary base.

As a specific embodiment, the content rule base periodically and automatically optimizes the parameter configuration of the rules and gives optimization suggestions.

Based on the internet user log data of the enterprise data center, the invention realizes content analysis, configuration, management and visualization, self-optimization of the content rule base and other functions through URL filtering, rule base matching, crawling, content restoration, text data mining and similar methods, while also taking into account requirements such as system security, load balancing and data backup.

The invention uses crawler technology combined with DPI restoration technology, together with process management of the label rule base, to deeply identify customers' mobile internet access behavior, accessed content and accessed applications, thereby providing fine-grained support for the analysis of customer data and service data; combined with data integration, data modeling, data mining, data cleaning and similar means, it builds big data and deep user insight analysis capabilities for segmented industries.

The relationship between the content rule base of the invention and the peripheral systems is as follows:

Relationship to stream processing: the content rule base needs to provide an internet repository for stream processing to use. Stream processing processes the signaling xDR data and then provides it to the internet behavior analysis module.

Relationship to the data asset management platform: the label rule base is used for processing during internet behavior analysis, and both the data processing procedure and the stored data results are managed as data assets by the data asset management platform.

Relationship to customer label management on the big data operation support platform: the crawler and content rule base analysis system needs to provide customer internet attribute data to the customer label management module, which generates customer internet preference labels.

The system structure adopted by the invention fully takes into account the basic conditions of the operator's internet behavior data sources; the rule base is established in a manner essentially similar to the way the internet behavior data itself is generated, which makes rule establishment more reasonable and effective.

Completeness

1) It has back-end data acquisition and data capture modules.

2) It has a rule base management system matched to the back-end data processing.

Maturity

Rule base content has been established in fields such as automobiles, insurance and securities. The rules were formulated with the characteristics of the internet behavior log data source in mind, and the final rule models were obtained after thorough discussion with customers in the relevant industries, continuous improvement and optimization, and practical testing in those industries.

Extensibility and ease of use

The system architecture of the product adopts a layered and hierarchical processing structure, each layer can configure a corresponding physical server according to the size of data volume and the scale of a rule base required to be established, and the system is a flexible extensible system. The management of the rule base and the calling of the related system module can be based on WEB visual operation, and the deployment and the use of the system are greatly facilitated.

Stability

The whole application software system can work continuously, 7×24 hours, without interruption, and failures trigger timely alarms. The application system should have comprehensive detection functions to ensure the accuracy of data such as service control data. Detection at every stage of the system is managed in a closed loop, and a detection system relatively independent of the other functional modules is established to verify data accuracy. The application system should provide automatic or manual recovery measures so that normal operation can be restored quickly after an error. The application software must avoid consuming so many system resources that the system crashes.

Advancement

The rule base is established through cold start, sampling, training and iteration; it runs in parallel with the big data internet behavior deep analysis system and can be seamlessly embedded into an enterprise's big data internet behavior analysis system. Faced with the ever-changing content of the Internet, the crawler and rule base management system can meet the needs of business analysis and big data mining in the fastest way and the shortest time.

Novelty

Based on the network characteristics of the operator, a webpage content restoration module is innovatively introduced as one of the core functional modules for rule base construction. The greatest advantage of the restoration module is that it shortens the rule base establishment process: by using the network the operator already owns, the cost and time of establishing the rule base are greatly reduced. The validity of the rule base data is also ensured.

Openness

Based on the requirements of actual internet behavior analysis, the rule base is established at different levels, such as application, application behavior, application column, application content, application content metadata and noise, and semantic dictionary classification rules are established at the same time, fully covering the needs of behavior analysis and the basic data for future user behavior portraits.

The establishment of content rules in the invention is a complex, long-running task; besides supporting tools, clear management processes and operating procedures are needed to guarantee it. The establishment of the rule base is divided into two stages: 1) cold start; 2) iterative optimization.

1. Cold start

Any new APP or website that requires deep content tagging inevitably has a cold start process that requires human intervention. The initial rule base for a cold start is initiated by a rule base maintenance person.

2. Iteration

Content rule base recognition rate indicators: greater than 95% for the first-level classification and greater than 85% for the second-level classification. Based on these indicators, during URL processing for a specific APP or website, the sampled data is used for training and iteration until the proportion of URLs output as unknown is less than 15%; new sampled data is then brought in, training is repeated, and the results are added to the rule tag library.
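The iteration described above can be read as a loop that keeps adding rules until the unknown rate on the current sample falls below 15%. The sketch below is a simplified illustration: rules are modeled as plain URL prefixes, and `propose_rule` is a hypothetical callback standing in for the analyst (or a learned model) who derives a new rule from the unknown URLs.

```python
def iterate_until_covered(sample_urls, rules, propose_rule,
                          threshold=0.15, max_rounds=100):
    """Add rules until the share of unknown URLs drops below `threshold`.

    `rules` is a list of URL prefixes (a simplifying assumption);
    `propose_rule(unknown_urls)` returns one new prefix per round.
    Returns the final rule list and the final unknown rate.
    """
    rules = list(rules)
    for _ in range(max_rounds):
        unknown = [u for u in sample_urls
                   if not any(u.startswith(r) for r in rules)]
        rate = len(unknown) / len(sample_urls) if sample_urls else 0.0
        if rate < threshold:
            return rules, rate
        rules.append(propose_rule(unknown))
    raise RuntimeError("unknown rate still at or above threshold")
```

Once the loop returns, a fresh sample would be drawn and the same procedure repeated with the enlarged rule list, matching the "access new sampling data, repeat training" step.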

Table 2: Content rule base establishment procedure

The foregoing is only a preferred embodiment of the present invention and is not limiting thereof; it should be noted that, although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that the technical solutions described in the above embodiments can be modified, and some or all of the technical features can be equivalently replaced; and the modifications and the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A content rule library management system, characterized in that the system applies a dictionary coding system which sets the dictionary of the content rule library as 20 digits and supports a 5-level label system, wherein the first-level classification is the field and occupies 3 digits, the second-level classification is the industry and occupies 4 digits, the third-level classification is the application and occupies 5 digits, the fourth-level classification is the column and occupies 4 digits, and the fifth-level classification is the search content, metadata or extraction content type and occupies 4 digits; the first digit of the fourth-level classification is an identifier and can only be 0 or 1, wherein 0 represents a column and 1 represents a behavior; the first digit of the fifth-level classification is 0 for search, 1 for metadata and 2 for extraction; for the metadata type, the code begins with 13, where 3 represents ID; for the extraction type, the second digit is 0 for text, 1 for floating point, 2 for date and 3 for ID; the 20-digit code 00000000000000000000 represents an unknown application; the system comprising:

2. The system of claim 1, wherein the preprocessing performed by the URL data preprocessing and classification module includes URL integrity preprocessing, URL protocol preprocessing and URL suffix preprocessing, with configurable user-defined preprocessing rules and summary statistics of the processed data, and wherein data not identified by any rule is output as unknown data and enters the rule base source data input table for analysis and rule classification.

3. The system of claim 1, wherein the content analyzed by the webpage content analysis module comprises determining the update state of webpage/APP applications, identifying webpage/APP application encodings, obtaining webpage/APP application titles, and obtaining webpage/APP application content.

4. The system of claim 1, wherein the web crawler module supports Chinese and multi-byte encodings and supports Unicode encoding.

5. The content rule base management system of claim 4, wherein the web crawler module comprises:

6. The system of claim 5, wherein the content rule base supports a multi-level classification system; supports extension of the classification system; supports classification system mapping; supports content expansion by crawling Internet content with a crawler; supports modeling with machine-learning text-mining algorithms, with content expansion according to the classification results predicted by the model; and supports a Chinese word segmentation dictionary base.

7. The system of claim 6, wherein the content rule base periodically and automatically optimizes the parameter configuration of the rules and gives optimization suggestions.