Adapter system for simulating website access
Technical Field
The invention relates to the technical field of web information acquisition, and in particular to an adapter system that simulates access to websites and logs in to them automatically.
Background
In information statistics work, it is often necessary to visit a number of websites and obtain identifiable information from them (for example, the daily visit count of a given website), with certain requirements on the timeliness and frequency of the visits. Logging in to many websites manually is costly in time and inefficient, so specific software is usually adopted to log in automatically.
For example, patent application 201210579372.5 discloses a method, an apparatus and an operating application platform for realizing login to a third-party application service website. In that method, when a user terminal accesses a third-party application service website, the operating application platform receives a first URL request sent by the third-party application service website; account login information for the third-party application service website is acquired locally according to the first URL request; and the account login information is returned to the third-party application service website by calling a first predetermined function containing the account login information, whereupon the third-party application service website provides a quick-login page for the user terminal. That invention spares the user the trouble of entering account information again, realizes quick login to the third-party application service website and improves the user's operating efficiency.
However, that method records login information through a third-party website and therefore carries a potential safety hazard, which makes it difficult to apply. Patent application 201610147571.7 discloses a website login method and apparatus in which, when a first website receives a website-jump trigger, it acquires a first token stored in a cookie of the browser where the first website runs; the first token comprises the website identifier of one second website on the second-website list page indicated by the jump trigger, together with a device fingerprint representing the environment in which secret-free proxy login to the second website was set up. According to the device fingerprint, the first website determines that the current operating environment is the same as the environment in which secret-free proxy login to the second website was configured, and obtains a second token corresponding to the first token; the first website then sends a secret-free login request to the second website according to the website identifier in the first token, carrying a third token that comprises the second token, and logs in to the second website without a password once the second website verifies the second token successfully. That application improves the security of the first website's secret-free login to the second website. Because the login information is recorded in the form of tokens, login safety and reliability are provided, but batch login cannot be realized, and when many websites must be logged in, the problems of low login speed and high time cost remain.
Disclosure of Invention
Aiming at the above defects in the prior art, the primary object of the invention is to provide a simulated website-access adapter system which integrates ASP.NET and HTTP-related technologies, realizes batch automatic website login, greatly saves time and cost, and improves login efficiency.
Another object of the present invention is to provide an adapter system for simulating website access which uses an intermediate adapter that obtains website information in batches to acquire login information and perform automatic login; the system is easy to implement and can be widely applied to website login with existing browsers.
To achieve the above object, the present invention is achieved as follows.
An adapter system for simulating access to a website is characterized in that its overall functional architecture comprises the following 5 modules: an automatic login module, a designated page capturing module, a page analyzing module, a data extraction module and a DB access module. The automatic login module extracts the relevant information from the messages of an HTTP request process by simulating that process, and fills these parameters into the simulation to realize automatic login. After login succeeds, the designated page capturing module captures the designated page: it organizes the page data, packages the data, carries out a simulated HTTP request process, and proceeds to the next step once a response is obtained. After page capture is finished, the captured designated page is stored as an HTML file, and the page analyzing module and the data extraction module analyze the tags in the HTML file and extract the data contained in them. After analysis is finished, the relevant processing is carried out in the data extraction module, and the specified information is obtained and stored through the DB access module, with the related serialization operations carried out so that the information can be retrieved at any time thereafter.
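The five-module flow described above can be sketched as a small pipeline. The sketch below is in Python purely for illustration (the invention itself is implemented in C#); the page markup, the fake cookie and the in-memory "DB" are stand-ins invented for the example, not part of the patent's code.

```python
import re

def auto_login(credentials):
    # Automatic login module: simulate the HTTP login request and keep the
    # session state (here, a made-up cookie) for subsequent requests.
    return {"cookie": "session=%s" % credentials["UserName"]}

def grab_page(session, url):
    # Designated page capturing module: fetch the page while logged in and
    # save it as HTML (here, simply return stand-in markup).
    return "<html><body><span id='visits'>1024</span></body></html>"

def parse_and_extract(html):
    # Page analyzing + data extraction modules: locate the specified tag
    # and pull out the data inside it.
    m = re.search(r"<span id='visits'>(\d+)</span>", html)
    return int(m.group(1)) if m else None

def store(db, site, value):
    # DB access module: persist the extracted figure for later retrieval.
    db[site] = value

db = {}
session = auto_login({"UserName": "alice", "UserPass": "secret"})
html = grab_page(session, "http://example.com/stats")
store(db, "example.com", parse_and_extract(html))
```

Each function stands in for one module; in the real system the capture step performs an actual simulated HTTP request and the store step writes to a database table.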
The key parts of the adapter system that need to be implemented are as follows: 1. acquiring the request process of each designated page that needs HTTP simulation and, through the abstract factory design pattern, effectively and uniformly managing the HTTP request message information of all requested pages, so that the adapter can capture the required pages; 2. after capturing the specified page, extracting the page to be acquired, analyzing the HTML page elements, and obtaining the information in the specified elements.
Therefore, the two modules, namely the automatic login module and the designated page capturing module, need to organize the information of the HTTP request messages into a database, which facilitates future extension and modification of the information related to the websites to be accessed. After the HTTP message information and the data are organized, the data are packaged and supplied to an HTMLhelper class to carry out a simulated HTTP request process; once a response is obtained, the next step of processing is performed. The automatic login module needs to save the current page and keep the login state, and the designated page capturing module needs to save the current designated page; all of these files are saved in HTML format.
Further, the automatic login module and the designated page capturing module need to collect the network data packets generated while accessing the login page and the designated page, and to parse out of those packets the parameters required by the page's HTTP request process.
Further, the parameters required in the page's HTTP request process include, but are not limited to: the URL of the requested page, the jump URL, the cookie, the POST data, the User-Agent, the Content-Type and the Host.
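As an illustration only, the parameters listed above can be organized into one request profile per website, which is what the adapter would later serialize into the database. The sketch is Python rather than the patent's C#, and every field value here is invented for the example.

```python
# One website's collected request parameters (all values are made up).
request_profile = {
    "url": "http://example.com/login",          # URL of the requested page
    "jump_url": "http://example.com/home",      # jump URL after login
    "headers": {
        "Cookie": "sid=abc123",
        "User-Agent": "Mozilla/5.0",
        "Content-Type": "application/x-www-form-urlencoded",
        "Host": "example.com",
    },
    "postdata": "UserName=alice&UserPass=secret",  # POST data
}

def missing_fields(profile):
    # Verify that every parameter the simulated request needs is present,
    # returning the set of missing field names.
    required = {"url", "jump_url", "headers", "postdata"}
    return required - set(profile)
```

A validation step like `missing_fields` makes it easy to spot an incompletely captured website before the adapter attempts a simulated request.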
Furthermore, the automatic login module and the designated page capturing module uniformly manage the differentiated data by means of the abstract factory design pattern, which enables the adapter program to schedule the simulated HTTP request process uniformly.
Furthermore, the abstract factory in these modules handles the data as follows. The differences among the data stem from the POST data attached to the HTTP request: for a POST-mode request, the data exist in the POST parameters of the request header, while for a GET-mode request, the data exist directly in the URL of the request. These data are collectively called postdata, and consist mainly of the user name and password information in the login module, and the capture dates and channel numbers to be submitted for the specified pages. Concrete products fill in the data by implementing the getValue() method of the abstract class ITAG, which facilitates the extension of differentiated data; the data are managed in the database in the form of tags, and the tags are parsed in the adapter program by TAGManager. These tagged data are organized in the database as follows: UserName=[TAG_USER]&UserPass=[TAG_PWD].
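The tag substitution that TAGManager performs on such templates might look like the following sketch. It is a hypothetical Python analogue (the patent's TAGManager is a C# class), using a regular expression to swap each `[TAG_...]` label for its runtime value, as the description above indicates.

```python
import re

def fill_tags(template, values):
    # Replace every [TAG_X] label in the stored postdata template with
    # values["TAG_X"]; unknown labels are left untouched so that missing
    # data is easy to spot during debugging.
    return re.sub(r"\[(TAG_\w+)\]",
                  lambda m: values.get(m.group(1), m.group(0)),
                  template)

# Template as organized in the database, filled at request time.
template = "UserName=[TAG_USER]&UserPass=[TAG_PWD]"
postdata = fill_tags(template, {"TAG_USER": "alice", "TAG_PWD": "secret"})
```

Keeping the templates in the database and filling them at request time is what lets new websites be supported by adding rows rather than code.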
The class TAGManager.cs processes the TAG-labeled data in the database. After the automatic login and the capture of the specified page data have been handled, an HTTP request is simulated in the adapter program to acquire the relevant information.
Furthermore, after the page is captured, the captured specified page is stored as an HTML file, and the page analysis and data extraction module analyzes the tags in the HTML file and extracts the data in the tags.
Further, the path for extracting a tag can be obtained by parsing down to the specified tag with the HtmlAgilityPack tester tool, after which the content inside the tag is retrieved.
Furthermore, for the page's activation-amount data, the path parsed with the HtmlAgilityPack tester tool undergoes the relevant processing in the data extraction module once analysis is finished, and the result is stored in the corresponding table of the database in the DB access module.
The invention adopts C# and SQL Server 2008 technologies to realize a website for batch information acquisition and management, automatically performing batch login and batch jumps to the designated pages so as to facilitate the work of information collectors. The adapter relies on three stages of processing: the login module simulates the process of logging in to the websites and extracts the information needed during login for each website, which is obtained mainly from the HTTP messages of the login request; the request and response messages are analyzed and the information needed by the simulated login process is extracted. After login is finished, the login state is kept, the system jumps to the page designated for information acquisition, the target page is analyzed, and the specified information is obtained. The data extraction module then processes the specified information, stores the result in the DB and carries out the related serialization operations.
The invention automatically captures a group of URL links to obtain the designated web pages and extracts the required data with a web page analysis tool. Using the C# language, the designated websites are accessed by simulating an HTTP request process, the web pages are captured automatically, and the pages are analyzed. During automatic capture, the HTTP request message information required for all the web pages is produced by means of the factory pattern from software design, serialized into a database and managed uniformly, so that web pages can be captured automatically in batches. The captured web pages are analyzed with the Html Agility Pack library, the required data are extracted, and the data are stored in the corresponding database. The method effectively saves the time required to visit the URLs and acquire the data manually, realizes daily automatic web page capture and data extraction, and lays a foundation for later maintenance, management and analysis of the data.
Drawings
FIG. 1 is a block diagram of a system in which the present invention is implemented.
FIG. 2 is a UML diagram of the abstract factory for differentiated data implemented by the present invention.
FIG. 3 is a core UML diagram of the automatic login and designated page capturing modules implemented by the present invention.
Detailed Description
In order to more clearly describe the present invention, the present invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the overall functional architecture of the simulated website-access adapter system implemented by the present invention comprises the following 5 modules: an automatic login module, a designated page capturing module, a page analyzing module, a data extraction module and a DB access module. The automatic login module extracts the relevant information from the messages of an HTTP request process by simulating that process, and fills these parameters into the simulation to realize automatic login. After login succeeds, the designated page capturing module captures the designated page: it organizes the page data, packages the data through the factory pattern, carries out a simulated HTTP request process, and proceeds to the next step once a response is obtained. After page capture is finished, the captured designated page is stored as an HTML file, and the page analyzing module and the data extraction module analyze the tags in the HTML file and extract the data in them. After analysis is finished, the relevant processing is carried out in the data extraction module and the result is stored in the database through the DB access module so that it can be retrieved at any time thereafter.
In these two modules, the information of the HTTP request messages is organized into a database, which facilitates future extension and modification of the information related to the websites to be accessed. After the designated page capturing module organizes the data, the data are packaged by means of the simple factory pattern and supplied to an HTMLhelper class to simulate the HTTP request process; once a response is obtained, the next step of processing is carried out. The automatic login module needs to save the current page and keep the login state, and the designated page capturing module needs to save the current designated page. These files are all saved in HTML format.
In the automatic login and designated page capturing modules, the network data packets generated while accessing the login page and the designated page are collected (for example with the Firebug tool), and the parameters required by the page's HTTP request process are parsed from these packets (including the URL of the requested page, the jump URL, the cookie, the POST data, the User-Agent, the Content-Type, the Host and so on). The adapter program uses the two HTTP request methods GET and POST, and because website developers configure these parameters inconsistently, the differentiated data arising from the parameter differences in the HTTP request process are managed uniformly by means of the abstract factory design pattern. Through an abstract factory class, different kinds of products can be manufactured according to the differentiated data types, meeting the adapter program's need to schedule the data uniformly. This part is one of the keys of the design of the two modules, and it enables the adapter program to schedule the simulated HTTP request process uniformly.
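The GET/POST difference described above (data in the URL versus data in the request body) can be made concrete with a short sketch. This is illustrative Python, not the adapter's C# code, and the URLs and field names are invented for the example.

```python
from urllib.parse import urlencode

def build_request(method, url, postdata):
    # For a GET-mode request the postdata travels in the URL's query
    # string; for a POST-mode request it travels in the request body.
    query = urlencode(postdata)
    if method == "GET":
        return {"url": url + "?" + query, "body": None}
    return {"url": url, "body": query}

get_req = build_request("GET", "http://example.com/stats",
                        {"date": "2023-01-01", "channel": "7"})
post_req = build_request("POST", "http://example.com/login",
                         {"UserName": "alice", "UserPass": "secret"})
```

Because both cases reduce to the same encoded key-value string, a single postdata template can serve either request method, which is what makes uniform management by the abstract factory possible.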
The UML diagram of the abstract factory in these modules is shown in FIG. 2. The differences among the data stem from the POST data appended to the HTTP request: for a POST-mode request these data exist in the POST parameters of the request header, while for a GET-mode request they exist directly in the URL of the request. They are collectively referred to as postdata, and consist mainly of the user name and password information in the login module, together with the dates to be submitted when capturing the specified pages (inconsistent date formats may lead to several date products) and the channel numbers. ITAG is the abstract class of the abstract factory, and each concrete product below it implements the getValue() method of the abstract class in order to fill in the data, which facilitates the extension of differentiated data: for example, when a new date format appears, a date class implementing the abstract factory class ITAG can be added directly. The data are managed in the database in the form of tags, which are parsed in the adapter program (by regular expressions) through the class TAGManager. These tagged data are organized in the database as follows: UserName=[TAG_USER]&UserPass=[TAG_PWD].
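The ITAG structure described above can be sketched as follows. This is a hedged Python analogue of the C# design in FIG. 2: the class names besides ITag's getValue() idea are invented for illustration, and two date-format products show how a new format is supported by adding one class.

```python
from abc import ABC, abstractmethod
from datetime import date

class ITag(ABC):
    # Abstract product of the factory: every differentiated piece of
    # postdata supplies its value through getValue().
    @abstractmethod
    def getValue(self):
        ...

class UserTag(ITag):
    def __init__(self, name):
        self.name = name
    def getValue(self):
        return self.name

class IsoDateTag(ITag):
    # One date-format product: yyyy-mm-dd.
    def __init__(self, d):
        self.d = d
    def getValue(self):
        return self.d.strftime("%Y-%m-%d")

class CompactDateTag(ITag):
    # A second date-format product: yyyymmdd. Supporting a new site's
    # format only requires adding another class like this one.
    def __init__(self, d):
        self.d = d
    def getValue(self):
        return self.d.strftime("%Y%m%d")

# The adapter asks each product for its value and never needs to know
# which concrete format it is dealing with.
tags = {
    "TAG_USER": UserTag("alice"),
    "TAG_DATE": IsoDateTag(date(2023, 1, 1)),
}
filled = {k: t.getValue() for k, t in tags.items()}
```

Swapping IsoDateTag for CompactDateTag changes the submitted date format without touching the scheduling code, which is the extension property the paragraph above claims for the abstract factory.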
The class TAGManager.cs processes the TAG-labeled data in the database. After the automatic login and the capture of the specified page data have been handled, an HTTP request is simulated in the adapter program to acquire the relevant information; the UML diagram of the concrete implementation of the two modules is shown in FIG. 3.
After the preceding page capture is completed, the captured designated page is stored as an HTML file, and the page analyzing and data extraction modules analyze the tags in the HTML file and extract the data in them. The path for extracting a tag can be obtained by parsing down to the specified tag with the HtmlAgilityPack tester tool, after which the content inside the tag is retrieved. The page's activation-amount data need to be analyzed, and the path is parsed with the HtmlAgilityPack tester tool. After analysis is finished, the relevant processing is carried out in the data extraction module and the data are stored in the corresponding table of the database in the DB access module.
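As a rough stand-in for the HtmlAgilityPack step, the following Python sketch walks the saved HTML file's tags and pulls the text out of the element holding the activation figure. The markup and the `id` attribute are made up for the example; the real system selects the element via the path produced by the tester tool.

```python
from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    # Collect the text inside the first element whose id matches target_id.
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""
    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.inside = True
    def handle_endtag(self, tag):
        self.inside = False
    def handle_data(self, data):
        if self.inside:
            self.text += data

# Stand-in for a captured page containing the activation figure.
html = "<table><tr><td id='activations'>3572</td></tr></table>"
parser = TagExtractor("activations")
parser.feed(html)
activation_count = int(parser.text)
```

The extracted integer is what the data extraction module would then hand to the DB access module for storage in the corresponding table.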
The website adapter system needs to import the HtmlAgilityPack library.
In a word, the invention realizes automatic capture of a group of URL links to obtain the designated web pages and extracts the required data with a web page analysis tool. Using the C# language, the designated websites are accessed by simulating an HTTP request process, the web pages are captured automatically, and the pages are analyzed. During automatic capture, the HTTP request message information required for all the web pages is produced by means of the factory pattern from software design, serialized into a database and managed uniformly, so that web pages can be captured automatically in batches. The captured web pages are analyzed with the Html Agility Pack library, the required data are extracted, and the data are stored in the corresponding database. The method effectively saves the time required to visit the URLs and acquire the data manually, realizes daily automatic web page capture and data extraction, and lays a foundation for later maintenance, management and analysis of the data.
The above disclosure covers only a few specific embodiments of the present invention, but the present invention is not limited thereto; any variations conceivable by those skilled in the art are intended to fall within the scope of the present invention.