CN114398535A - User role-oriented intelligent network specific information acquisition system and interaction method

Info

Publication number: CN114398535A
Application number: CN202210046499.4A
Authority: CN
Inventors: 钱夔; 孙瑞彬; 潘昱辰; 徐浩; 韩国辉; 陈晓琳
Original assignee: Nanjing Institute of Technology
Current assignee: Nanjing Institute of Technology
Priority date: 2022-01-17
Filing date: 2022-01-17
Publication date: 2022-04-26

Abstract

The invention discloses a user role-oriented network specific information intelligent acquisition system and an interaction method. The task understanding module obtains the user's target webpage and the information elements the user is concerned with through a human-computer interaction interface and generates a task requirement expression; the data acquisition module performs distributed incremental crawling of the target webpage with the support of the network agent module; the data aggregation enhancement module performs adaptive matching and aggregation according to the requirement expression generated by the task understanding module; and finally the data storage module stores the user-specific information by category and in multiple forms. Through human-computer interaction understanding, the invention acquires user role-oriented network specific information quickly and accurately, reduces the processing time of massive information, improves aggregation timeliness, and meets the user's autonomous and personalized requirements.

Description

User role-oriented intelligent network specific information acquisition system and interaction method

Technical Field

The invention relates to the technical field of intelligent information acquisition, in particular to a user role-oriented network specific information intelligent acquisition system and an interaction method.

Background

With the advent of the big data era, data has penetrated almost every industry and field and has gradually become a strategically important asset. However, its huge volume and variety also lead to low value density. Valuable information is buried in large amounts of data, and collecting, analyzing and judging massive data manually is clearly unrealistic, so many research institutions currently use crawler technology to acquire the relevant data.

In existing web crawler technology, the URL (Uniform Resource Locator) of the target webpage is first obtained and a crawler task is issued; all data of the webpage is then downloaded and converted into a character string; finally, data cleaning, preprocessing and parsing are performed to obtain useful information. Because the data volume is large and the collection frequency is high, crawler technology has focused on improving crawling efficiency and on solving the load balancing problem of web crawlers in a distributed environment. Patent CN201710372282.1 discloses a distributed crawler system and a periodic incremental crawling method, in which a coordination component periodically imports tasks into a distributed URL task queue to implement periodic incremental crawling. Patent CN202011618649.1 discloses a data processing method, system and cloud platform based on web crawlers, which obtains web crawler instructions input by a user, obtains target crawler data corresponding to the target web information and a crawl object set, and stores the target crawler data in target distributed storage nodes, so as to improve the reliability of data storage and the integrity of crawled data during large-scale crawling.

However, the webpage data acquired by existing crawler technology is still far from the data the user actually needs. In practice, a crawler engineer is responsible for crawling the relevant webpage data across the whole network, and a data analyst then completes data cleaning, data analysis and data visualization in combination with the user's business requirements. As a result, a data user cannot use a crawler system directly to acquire specific information quickly and accurately, and the system cannot automatically acquire specific information across the whole network according to the user's role and data requirements.

In view of the actual situation of the data industry, the main problems of existing crawler technology in the field of user role-oriented intelligent acquisition of network specific information are as follows:

1. To meet data use requirements, a crawler engineer and a data analyst are needed as intermediaries to process the data; the system cannot serve the user directly in an end-to-end manner, so the user cannot obtain information interactively on demand.

2. The data acquisition system does not adapt automatically to the user's intended use of the data, and it is difficult to locate the information the user needs accurately and quickly in a large amount of crawled webpage data.

3. Most data crawling systems are too complex to deploy and operate, and their computing, storage and network resource costs are too high.

4. Acquisition is not targeted: all content on a webpage is crawled, which wastes resources.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a user role-oriented network specific information intelligent acquisition system and an interaction method, which bring the data directly to the end user and achieve accurate acquisition of personalized information from massive data through intelligent interaction.

In order to achieve the purpose, the invention adopts the following technical scheme:

the user role-oriented network specific information intelligent acquisition system comprises: a task understanding module, a data acquisition module, a network agent module and a data aggregation enhancement module;

the task understanding module is used for generating a requirement expression after acquiring a target webpage given by a user and information acquisition requirements;

the data acquisition module is respectively connected with the task understanding module and the network agent module and is used for performing distributed incremental data crawling on a target webpage in the task understanding module under the support of an agent access IP address provided by the network agent module to finish data acquisition;

the network agent module is used for providing the data acquisition module with agent access (proxy) IP addresses;

the data aggregation enhancement module is respectively connected with the task understanding module, the data acquisition module and the data storage module, and is used for parsing and expanding the data collected by the data acquisition module, performing adaptive matching, aggregation and sorting of the expanded data against the requirement expression generated in the task understanding module, and storing the result in the data storage module.

In order to optimize the technical scheme, the specific measures adopted further comprise:

Further, the task understanding module is used for acquiring the target webpage to be retrieved by the user and the information acquisition requirement, and judging whether the information acquisition request sent by the user is explicit; if the information acquisition requirement is explicitly given, that is, the data table header field names to be retrieved are given, a corresponding requirement expression is generated directly from those header field names; if the information acquisition requirement is not explicitly given, that is, no data table header field names are given and only a generalized keyword description of the requirement is provided, semantic recognition is performed on the keywords, the keywords are inferred and expanded, and a corresponding requirement expression is generated from the expanded keywords.

Furthermore, the data acquisition module comprises a crawling controller, a data wrapper, an intelligent responder and a task buffer;

the crawling controller is used for accessing the target webpage under the support of the network agent module, analyzing the source code and extracting and downloading webpage content;

the data wrapper is used for safely wrapping the data of the webpage content downloaded by the crawling controller so as to ensure the completeness and integrity of the data;

the intelligent responder is used for further adjusting and parsing the wrapped data, namely parsing the data with the XML path language, cascading style sheets and regular expressions, and adjusting the encoding of the webpage data to avoid garbled characters;

and the task buffer is used for temporarily storing the data after the adjustment and parsing are finished, and passing the data to the Pipeline component in batches once a set amount has accumulated, to complete data storage.

Further, the network agent module is used for providing different resource pools of agent access IP addresses, divided into a level 1 agent and a level 2 agent, wherein the agent access IP addresses in the level 1 agent are specific addresses reserved for target webpages whose access is easily restricted, and the agent access IP addresses in the level 2 agent are conventional addresses used for target webpages that are open, that is, whose access is not restricted;

the network agent module is also used for judging whether access to the currently accessed target webpage is restricted; if it is restricted, an agent access IP address in the level 1 agent is used; if not, an agent access IP address in the level 2 agent is used;

the network agent module is also used for, if the target webpage starts a self-protection mechanism while being accessed through an agent access IP address in the level 1 agent, stopping the access request of the current agent access IP address and transferring the task of that request to another agent access IP address in the level 1 agent;

the network agent module is also used for, if the target webpage starts a self-protection mechanism while being accessed through an agent access IP address in the level 2 agent, stopping the access request of the current agent access IP address and transferring the task of that request to another agent access IP address in the level 2 agent.

Furthermore, the data aggregation enhancement module is used for finding, among the data collected in the library, the data associated with the content expressed by the requirement expression by combining multi-language translation, sentence entity recognition and semantic association analysis, and performing adaptive matching, aggregation and sorting.

Further, a user role-oriented method for intelligent acquisition of and interaction with network specific information is provided, comprising:

S1: acquiring the target webpage to be retrieved by the user and the information acquisition requirement, and judging whether the information acquisition request sent by the user is explicit;

if the information acquisition requirement is explicitly given, that is, the data table header field names to be retrieved are given, a corresponding requirement expression is generated directly from those header field names;

if the information acquisition requirement is not explicitly given, that is, no data table header field names are given and only a generalized keyword description of the requirement is provided, performing semantic recognition on the keywords, inferring and expanding the keywords, and generating a corresponding requirement expression from the expanded keywords;

S2: accessing the target webpage to be retrieved by the user through network multi-agent scheduling control, parsing the source code, and extracting and downloading the webpage content;

securely wrapping the downloaded data content, and adjusting and parsing it after the wrapping is finished;

after the adjustment and parsing are finished, temporarily storing the data and, once a set amount has accumulated, passing it to the Pipeline component in batches to complete data storage;

S3: finding, among the data collected in the library, the data associated with the content expressed by the requirement expression by combining multi-language translation, sentence entity recognition and semantic association analysis, and performing adaptive matching, aggregation and sorting;

S4: storing the matched, aggregated and sorted content.

Further, in step S2, the specific content of the network multi-agent scheduling control is as follows:

judging whether access to the currently accessed target webpage is restricted; if it is restricted, an agent access IP address in the level 1 agent is used; if not, an agent access IP address in the level 2 agent is used.

Further, while the target webpage is being accessed through an agent access IP address in the level 1 agent, if the target webpage starts a self-protection mechanism, the access request of the current agent access IP address is stopped and the task of that request is transferred to another agent access IP address in the level 1 agent;

while the target webpage is being accessed through an agent access IP address in the level 2 agent, if the target webpage starts a self-protection mechanism, the access request of the current agent access IP address is stopped and the task of that request is transferred to another agent access IP address in the level 2 agent.

Further, in step S2, after the packaging is completed, the downloaded content is adjusted and analyzed, where the specific content of the adjustment and analysis includes analyzing the XML path language, the cascading style sheet, and the regular expression, and adjusting to prevent the webpage data from being scrambled.
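The adjustment and parsing step can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `decode_page` and `extract_fields`, the candidate encoding list and the sample page are assumptions made for illustration. The sketch uses only the standard library, so the XPath query runs on a well-formed snippet through `xml.etree.ElementTree` (which supports a limited XPath subset), and CSS selectors are omitted because the standard library has no CSS selector engine.

```python
import re
import xml.etree.ElementTree as ET


def decode_page(raw, encodings=("utf-8", "gb18030")):
    """Try each candidate encoding in turn to avoid garbled characters."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the page readable and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def extract_fields(raw):
    """Parse one downloaded page with an XPath query and a regular expression."""
    text = decode_page(raw)
    root = ET.fromstring(text)  # assumes well-formed markup
    title = root.findtext(".//h1[@class='title']")  # XPath-style lookup
    date = re.search(r"\d{4}-\d{2}-\d{2}", text)    # regular-expression lookup
    return {"title": title, "date": date.group(0) if date else None}


# Hypothetical page served in a Chinese legacy encoding rather than UTF-8.
sample = (
    '<html><body><h1 class="title">价格</h1>'
    "<p>Updated 2022-01-17</p></body></html>"
).encode("gbk")
fields = extract_fields(sample)
```

Real pages are rarely well-formed XML, so a production crawler would normally use a tolerant HTML parser; the fallback order of encodings here is likewise only one plausible choice.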

The invention has the beneficial effects that:

1. The invention solves the "last mile" problem of the traditional data acquisition mode. It replaces the complex process in which the user submits a data analysis requirement, a crawler engineer acquires the data and a data analyst cleans and processes it, and achieves fast and accurate acquisition of user role-oriented network specific information through human-computer interaction understanding.

2. The invention can improve user stickiness and reduce the resource consumption of existing data acquisition. Specifically, the crawler no longer crawls data from the whole network; instead, the task understanding module first decomposes the crawler task into specific information requirements, and the task requirement expression reduces pressures such as crawler resource consumption and mass data storage, bringing the crawler task closer to the user's final, specific information requirements and improving practicability.

3. The invention can provide more accurate and deeper information acquisition. After crawling the webpage data of the user's resource pool, a traditional crawler system generally performs only common tag processing based on HTML tags, structure analysis and the like, and cannot understand the content. After crawling the target webpage data, the invention uses multilingual analysis and processing technology as required to provide automatic multilingual translation enhancement, and performs adaptive matching and aggregation according to the requirement expression generated by the task understanding module, based on technologies such as word segmentation, entity recognition and association analysis.

4. Compared with a common crawler system, the invention acquires information more intelligently and efficiently. The data aggregation enhancement module, combining the information elements the user is concerned with, performs specific information crawling oriented to semantic understanding and is implemented on a distributed computing framework, which reduces the processing time of massive information and improves aggregation timeliness.

5. The invention supports batch warehousing of crawled data and saves a large amount of engineering time. The task buffer holds the generated data temporarily in an in-memory queue and stores it in batches once a set amount has accumulated, which relieves the strain on crawler IP resources and the task pressure.

6. The invention is more flexible to configure. The task information requirements can be obtained directly through the human-computer interaction interface, or inferred from the user's intention by semantic recognition. The generated task requirement expression can use a default configuration template or a user-defined template, and can also be recommended and matched automatically from the user's historical information, thereby meeting the user's autonomous and personalized requirements.

7. The invention reduces crawler resource consumption and also maximizes the utilization efficiency of IP addresses. At present the anti-crawler strategies of target websites are quite mature, and a crawler system needs a certain amount of IP resources to acquire large amounts of data regularly. The invention uses a multi-agent optimization strategy that, through longitudinal grading and transverse optimization, solves problems such as restricted access and banned IP addresses and improves address utilization.

Drawings

FIG. 1 is a schematic diagram of the connection relationship of the overall system structure of the present invention.

FIG. 2 is a schematic overall flow diagram of the present invention.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings.

As shown in fig. 1, the invention provides a user role-oriented network specific information intelligent acquisition system, which comprises a task understanding module, a data acquisition module, a network agent module, a data aggregation enhancement module and a data storage module.

The task understanding module obtains, through a human-computer interaction interface, the target webpage given by the user and the information elements the user is concerned with, and generates a task requirement expression; the data acquisition module performs distributed incremental crawling of the target webpage with the support of the proxy IP addresses provided by the network agent module; the data aggregation enhancement module performs adaptive matching and aggregation according to the requirement expression generated by the task understanding module; and finally the data storage module stores the user-specific information in a hierarchical, classified and personalized manner.

The task understanding module receives the user's information acquisition request and judges how the information elements are described. Where the user explicitly gives data table header field names, the given information is taken directly as the task information requirement. Where the user gives only a generalized crawling requirement that is not sufficiently explicit, the task understanding module performs intention reasoning by semantic recognition and decomposes and converts the crawling task into task information requirements. Both paths end in a task requirement expression, so the system always focuses on the information content the user cares about most and avoids crawling the data of the whole webpage. The task requirement expression can meet different needs: it can use a default configuration template or a user-defined template, and it can also be recommended and matched automatically from the user's historical information.
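The two paths of the task understanding module can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented method: the `RequirementExpression` structure, the `build_requirement` function and the `KEYWORD_EXPANSIONS` table are hypothetical, and a fixed lookup table stands in for the semantic recognition and intention reasoning that the text describes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical expansion table standing in for semantic recognition.
KEYWORD_EXPANSIONS = {
    "price": ["price", "cost", "discount"],
    "company": ["company name", "address", "legal representative"],
}


@dataclass
class RequirementExpression:
    target_url: str
    fields: List[str] = field(default_factory=list)
    inferred: bool = False  # True when the fields came from keyword expansion


def build_requirement(
    target_url: str,
    header_fields: Optional[List[str]] = None,
    keywords: Optional[List[str]] = None,
) -> RequirementExpression:
    """Build a requirement expression for one information acquisition request."""
    if header_fields:
        # Explicit request: use the header field names exactly as given.
        return RequirementExpression(target_url, list(header_fields), False)
    expanded: List[str] = []
    for keyword in keywords or []:
        # Unknown keywords are kept as they are rather than dropped.
        for term in KEYWORD_EXPANSIONS.get(keyword.lower(), [keyword]):
            if term not in expanded:
                expanded.append(term)
    return RequirementExpression(target_url, expanded, True)


explicit = build_requirement("https://example.com", header_fields=["title", "date"])
inferred = build_requirement("https://example.com", keywords=["Price", "weather"])
```

Keeping the `inferred` flag on the expression lets later stages treat expanded fields more cautiously than fields the user named directly; whether the original system records this distinction is not stated in the source.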

The data acquisition module comprises a crawling controller, a task buffer, a data wrapper and an intelligent responder. Through scheduling, the crawling controller performs source code parsing and network multi-agent scheduling control on the user's target webpage resource pool from the task understanding module. After receiving the address of a webpage to be crawled, the crawling controller adds an initial task to the queue of the task buffer and downloads the webpage content once the task starts. After downloading, the crawling controller passes the downloaded content to the data wrapper, which wraps the data to ensure its security and integrity. The wrapped data is returned to the crawling controller, which schedules the intelligent responder to parse it; the intelligent responder supports parsing modes such as XPath (XML Path Language), CSS (Cascading Style Sheets) and Re (regular expressions), and adapts to the page encoding, which resolves the problem of garbled characters in Chinese webpages. Each parsed data item is sent to the task buffer, which holds the data temporarily in an in-memory queue and passes it to the Pipeline in batches once a set amount has accumulated. This overcomes the limitation that existing acquisition frameworks can only store massive data item by item and not in batches, and setting an upper limit on the accumulated data prevents the system congestion caused by too much data being held in memory.
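The batching behaviour of the task buffer can be sketched as below. The class name `TaskBuffer`, the parameters `batch_size` and `max_items`, and the convention that `put` returns `False` when the upper limit is reached are all assumptions for illustration; the source states only that data is passed on in batches after a set amount accumulates and that an upper limit prevents congestion.

```python
from collections import deque


class TaskBuffer:
    """In-memory queue that hands parsed items to a pipeline in batches."""

    def __init__(self, pipeline, batch_size=100, max_items=1000):
        if batch_size > max_items:
            raise ValueError("batch_size must not exceed max_items")
        self._pipeline = pipeline      # callable that receives a list of items
        self._batch_size = batch_size  # accumulated amount that triggers a flush
        self._max_items = max_items    # upper limit on items held in memory
        self._queue = deque()

    def put(self, item):
        """Queue one item; return False if the upper limit has been reached."""
        if len(self._queue) >= self._max_items:
            return False  # the caller should pause crawling and retry later
        self._queue.append(item)
        if len(self._queue) >= self._batch_size:
            self.flush()
        return True

    def flush(self):
        """Send everything queued as one batch; return the number stored."""
        if not self._queue:
            return 0
        batch = list(self._queue)
        try:
            self._pipeline(batch)
        except Exception:
            return 0  # storage failed: keep the items queued for a later retry
        self._queue.clear()
        return len(batch)

    def __len__(self):
        return len(self._queue)
```

With a healthy pipeline the queue never grows past `batch_size`; the upper limit matters only when storage falls behind or fails, which is when holding unbounded data in memory would otherwise congest the system.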

The network agent module provides different IP address resource pools and uses a multi-agent optimization strategy to solve problems such as restricted access and banned IP addresses, improving crawler efficiency. The multi-agent optimization strategy consists of longitudinal grading and transverse optimization. Longitudinal grading divides the IP addresses into a level 1 agent and a level 2 agent: the level 1 agent provides a small number of specific IP addresses reserved for cases where access to the target website is restricted when a conventional IP address is used, and the level 2 agent provides easily obtained IP addresses for cases where access to the target website is not restricted. Websites restrict abnormal IP addresses to some extent in order to prevent excessive access; the level 1 addresses are uncommon, not easily obtained and seldom used for access, so they can avoid being restricted, whereas the level 2 addresses are common and are used for websites that impose no restriction. Transverse optimization provides dynamic proxy IP replacement within each level and forwards access requests to different proxy IP addresses, so that usable IP addresses remain available under the target website's protection mechanism.
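A minimal Python sketch of this two-level strategy follows. The names `TwoLevelProxyPool`, `BlockedError` and `fetch_with_rotation` are hypothetical, as is the `max_attempts` limit; the `fetch` callable is supplied by the caller and is assumed to raise `BlockedError` when the site's protection mechanism triggers. Level selection corresponds to longitudinal grading, and rotation within the chosen level corresponds to transverse optimization.

```python
class BlockedError(Exception):
    """Raised by the fetch callable when the site's protection triggers."""


class TwoLevelProxyPool:
    def __init__(self, level1, level2):
        if not level1 or not level2:
            raise ValueError("each level needs at least one proxy address")
        self._pools = {1: list(level1), 2: list(level2)}
        self._cursor = {1: 0, 2: 0}

    @staticmethod
    def select_level(restricted):
        # Longitudinal grading: restricted sites use the reserved level 1 pool.
        return 1 if restricted else 2

    def current(self, level):
        return self._pools[level][self._cursor[level]]

    def rotate(self, level):
        # Transverse optimization: move to the next address in the same level.
        self._cursor[level] = (self._cursor[level] + 1) % len(self._pools[level])
        return self.current(level)


def fetch_with_rotation(pool, url, restricted, fetch, max_attempts=3):
    """Fetch url, switching proxies within one level whenever it is blocked."""
    level = pool.select_level(restricted)
    proxy = pool.current(level)
    for _ in range(max_attempts):
        try:
            return fetch(url, proxy)
        except BlockedError:
            proxy = pool.rotate(level)  # hand the task to another address
    raise BlockedError(f"all {max_attempts} attempts were blocked for {url}")
```

Note that a blocked request never escalates from level 2 to level 1 in this sketch, which matches the text: the task is transferred only to other addresses in the same level.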

After the data acquisition module crawls the target webpage data, the data aggregation enhancement module uses multilingual analysis and processing technology as required to provide automatic multilingual translation enhancement, and performs adaptive matching and aggregation according to the requirement expression generated by the task understanding module, based on technologies such as word segmentation, entity recognition and association analysis. The adaptive matching and aggregation are implemented on a distributed computing framework: local matching tasks are completed in parallel at the map end, and processing such as merging and sorting the matching results is performed at the reduce end, thereby completing deep and accurate acquisition of the user's specific information.
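The map and reduce split can be sketched on a single machine as follows. This is an illustration only: a thread pool stands in for the distributed computing framework, a simple count of term occurrences stands in for word segmentation, entity recognition and association analysis, and the names `map_match`, `reduce_merge` and `match_and_aggregate` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def map_match(partition, terms):
    """Map end: score each record in one partition against the terms."""
    lowered = [term.lower() for term in terms]
    hits = []
    for record in partition:
        text = record.lower()
        score = sum(text.count(term) for term in lowered)
        if score > 0:
            hits.append((score, record))
    return hits


def reduce_merge(partial_results):
    """Reduce end: merge the partial hit lists and sort by descending score."""
    merged = [hit for part in partial_results for hit in part]
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return merged


def match_and_aggregate(partitions, terms, workers=4):
    """Run the map step on all partitions in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda part: map_match(part, terms), partitions))
    return reduce_merge(partials)


partitions = [
    ["Price of steel rises", "Weather report"],
    ["steel price and steel cost", "no match here"],
]
ranked = match_and_aggregate(partitions, ["steel", "price"])
```

Each partition is scored independently, so the map step needs no shared state; only the reduce step sees all results, which is what allows the matching work to be spread across nodes.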

The data storage module supports various types of databases, such as MySQL, Redis, MongoDB and in-memory databases, can store and share data by topic according to user role, and provides a unified data access service for other systems to call.

The principle of the network specific information intelligent acquisition interaction method for the user role is described in detail below with reference to a flowchart, as shown in fig. 2.

Step 1, the user sends an information acquisition request, and the task understanding module judges whether the information elements are explicit. If the data table header field names are explicitly given, the task requirement expression is generated directly; if the information fields are not explicit and only a generalized requirement such as a task description is provided, the task understanding module performs intention reasoning and then generates the task requirement expression.

Step 2, the data acquisition module performs source code parsing and network multi-agent scheduling control on the user's target webpage resource pool from the task understanding module, buffers tasks, encapsulates the request function, securely wraps the crawled data, and completes the adaptive handling of the encoding.

Step 3, under the control and scheduling of the data acquisition module, the network agent module performs network agent optimization and provides dynamic proxy IP replacement; it judges whether access to the target website is restricted, starting the level 1 agent service if it is restricted and the level 2 agent service if it is not.

Step 4, after the data acquisition module crawls the target webpage data, the data aggregation enhancement module performs automatic multilingual translation enhancement and semantic processing based on word segmentation, entity recognition, association analysis and the like, and performs adaptive matching and aggregation on a distributed computing framework according to the requirement expression generated by the task understanding module; local matching tasks are completed in parallel at the map end, and processing such as merging and sorting the matching results is performed at the reduce end, thereby completing deep and accurate acquisition of the user's specific information.

Step 5, the data storage module stores and shares data by topic according to the user's role, and provides a unified data access service for other systems to call.

Note that the following is additionally explained:

1. Purpose of source code parsing and the processing that follows it

The webpage source code comprises HTML, the webpage content and a small amount of JavaScript and CSS; the source code is parsed in order to extract the webpage content. After the content has been extracted, the required content is picked out according to custom rules, using techniques such as keyword matching and character positioning.

2. What the tasks in the task queue specifically are

They are the data downloading tasks performed on the webpages in the resource pool.

3. The concept of Pipeline and its role

Pipeline is the name of a component. In the crawler module, its main function is to write each data item returned by parsing to a persistence target such as a database or a file; that is, data passed into the Pipeline is written to the database.
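The role of the Pipeline component described in note 3 can be sketched with Python's built-in `sqlite3` module. The class name `SqlitePipeline`, the `items` table and its `url` and `title` columns are assumptions for illustration; the source lists MySQL, Redis, MongoDB and in-memory databases as supported stores, and SQLite is used here only because it needs no external service.

```python
import sqlite3


class SqlitePipeline:
    """Writes batches of parsed items to a database table."""

    def __init__(self, conn):
        self._conn = conn
        # Hypothetical schema: one row per parsed item.
        conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)")

    def process_batch(self, batch):
        """Insert one batch in a single transaction; return the rows written."""
        with self._conn:  # commits on success, rolls back on an exception
            self._conn.executemany(
                "INSERT INTO items (url, title) VALUES (:url, :title)", batch
            )
        return len(batch)


conn = sqlite3.connect(":memory:")
pipeline = SqlitePipeline(conn)
written = pipeline.process_batch(
    [
        {"url": "https://example.com/a", "title": "Page A"},
        {"url": "https://example.com/b", "title": "Page B"},
    ]
)
stored = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Writing a whole batch inside one transaction is what makes batch storage cheaper than item-by-item storage: the database commits once per batch instead of once per item.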

It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" used in the present invention are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments to the relative relationships they describe, without substantive change to the technical content, are also regarded as within the implementable scope of the invention.

The above is only a preferred embodiment of the present invention. The protection scope of the invention is not limited to the above embodiment; all technical solutions that fall under the idea of the invention belong to its protection scope. It should be noted that, for those skilled in the art, modifications and refinements made without departing from the principle of the invention are also regarded as within the protection scope of the invention.

Claims

1. A system for intelligently acquiring network specific information oriented to user roles, characterized by comprising: a task understanding module, a data acquisition module, a network agent module and a data aggregation enhancement module;

2. The system for intelligently acquiring network specific information oriented to user roles according to claim 1, wherein the task understanding module is configured to acquire the target webpage to be retrieved by the user and the information acquisition requirement, and to determine whether the information acquisition request sent by the user is explicit; if the information acquisition requirement is explicitly given, that is, the data table header field names to be retrieved are given, a corresponding requirement expression is generated directly from those header field names; if the information acquisition requirement is not explicitly given, that is, no data table header field names are given and only a generalized keyword description of the requirement is provided, semantic recognition is performed on the keywords, the keywords are inferred and expanded, and a corresponding requirement expression is generated from the expanded keywords.

3. The system for intelligently acquiring network specific information oriented to user roles of claim 1, wherein the data acquisition module comprises a crawling controller, a data wrapper, an intelligent responder and a task buffer;

4. The system for intelligent acquisition of network specific information for user roles according to claim 3,

the network agent module is used for providing different agent access IP address resource pools and is divided into a level 1 agent and a level 2 agent, wherein the agent access IP address in the level 1 agent is a specific agent access IP address which is specially used for the condition that a target webpage is easy to limit during access, and the agent access IP address in the level 2 agent is a conventional agent access IP address which is used for the condition that the target webpage is in an open state, namely the access is not limited;

5. The system for intelligent acquisition of network specific information for user roles according to claim 3,

the data aggregation enhancement module is used for finding out data associated with the content expressed by the demand expression by combining multiple modes of multi-language translation, sentence entity recognition and semantic association analysis on the data collected in the library, and performing self-adaptive matching, aggregation and sequencing.

6. The method for interacting network specific information intelligently based on the system as claimed in any one of claims 1-5, comprising

S4: storing the matched, aggregated and sorted content.

7. The method for intelligently acquiring and interacting network specific information oriented to user roles as claimed in claim 6, wherein in step S2, the specific contents of the network multi-agent scheduling control are as follows:

8. The method of claim 7, wherein the user-specific information is provided in a form of a web page,

in the process of accessing a target webpage by using an agent access IP address in the level 1 agent, if the target webpage starts a self-protection mechanism, stopping an agent access request of the current agent for accessing the IP address, and transferring the task of the agent access request to other agent access IP addresses in the level 1 agent;

9. The method of claim 6, wherein the downloaded content is adjusted and parsed after the packaging is completed in step S2, and the specific content of the adjustment and parsing includes parsing the XML path language, the cascading style sheet, and the regular expression, and adjusting to prevent the web page data from being scrambled.