CN107679168B - Target website content acquisition method based on java platform

Info

Publication number: CN107679168B
Application number: CN201710905213.2A
Authority: CN
Inventors: 何祥利; 周华; 宋小厚
Original assignee: Linewell Software Co Ltd
Current assignee: Linewell Software Co Ltd
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2021-04-20
Anticipated expiration: 2037-09-29
Also published as: CN107679168A

Abstract

The invention discloses a target website content acquisition method based on the Java platform, intended to improve the efficiency of acquiring website content. In the method, after a user turns on an automatic website content acquisition switch, a function instance is generated and used to read a jar configuration file imported by the user, the jar configuration file comprising: a number of threads, a data source address, and a template. A corresponding number of worker threads is instantiated according to the number of threads set in the jar configuration file. The worker threads each request the data source address set in the jar configuration file and acquire from it target website content that satisfies a search rule, the search rule comprising keywords that the user wants to search for. The target website content is filled into the template set in the jar configuration file to form streaming document data, and the streaming document data is stored in a streaming document material library so that the user can search the library for matching material content of the target website.

Description

Target website content acquisition method based on java platform

Technical Field

The invention relates to the technical field of computers, in particular to a target website content acquisition method based on a java platform.

Background

Websites need rich content to attract visitors, so a large amount of material related to the website's topic must be accumulated, especially for websites that provide comprehensive information services. Portal websites, for example, need to provide service content that users can browse. Further requirements may call for the website to provide documents for visitors to download and view.

In the prior art, website material has to be compiled and entered manually. This is especially necessary in the initial stage of a project, yet it consumes a great deal of manpower and material resources. As the project progresses and visitor traffic grows, higher demands are placed on the timeliness and quantity of information and documents. Meeting these demands in the traditional way inevitably creates a backlog (the "barrier lake" effect), and the project is forced to commit more resources to cope. Acquisition of website content in the prior art is therefore inefficient.

Disclosure of Invention

The invention aims to provide a target website content acquisition method based on a java platform, which is used for improving the acquisition efficiency of website content.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention provides a target website content acquisition method based on a java platform, which comprises the following steps:

after a user starts an automatic website content acquisition switch, reading a jar configuration file imported by the user according to a generated function instance, wherein the jar configuration file comprises: the number of threads, a data source address and a template;

instantiating a corresponding number of working threads according to the number of threads set by the jar configuration file;

respectively requesting a data source address set by the jar configuration file by using the corresponding number of working threads, and acquiring target website content meeting a search rule from the data source address, wherein the search rule comprises: a keyword that the user requires to search;

filling the target website content into a template set by the jar configuration file to form streaming document data;

and storing the streaming document data into a streaming document material library so that the user searches the material content of the matched target website from the streaming document material library.

After the technical scheme is adopted, the technical scheme provided by the invention has the following advantages:

According to the embodiment of the invention, a user can turn on the automatic website content acquisition switch, whereupon a plurality of worker threads are started according to the jar configuration file. Each worker thread requests the data source address and acquires from it the target website content that satisfies the search rule, that is, content matching the keywords the user wants to search for. The target website content is stored as streaming document data by filling it into a template, and the streaming document data is stored in a streaming document material library. The material library is thus updated automatically, and the user can search it to find material content matching the target website. The embodiment of the invention replaces manual collection and editing of website content and the manual production and entry of streaming documents into the material library. By combining retrieval, capture, and streaming file construction technologies, the technical scheme lets the user filter and capture content of a specified target website by setting keywords, and stores the content that satisfies the search rule in the material library as streaming documents according to an output template defined by the user.

Drawings

FIG. 1 is a schematic flowchart of a target website content obtaining method based on a java platform according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an implementation flow of capturing content of a specified website to construct streaming document data storage according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a target website content obtaining method based on a java platform, which is used for improving the obtaining efficiency of website content.

In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the embodiments described below are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one skilled in the art from the embodiments given herein are intended to be within the scope of the invention.

The terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

The details are described below.

The embodiment of the invention provides a technique for capturing the content of a specified website to construct a streaming document data store. It is implemented mainly on the Java technology platform: a user project can obtain the claimed functionality by importing the jar packaged by this technique and configuring the corresponding files in the user's own project. Referring to FIG. 1 and FIG. 2, the target website content acquisition method based on the Java platform provided by the invention may include the following steps:

101. After a user starts an automatic website content acquisition switch, reading a jar configuration file imported by the user according to a generated function instance, wherein the jar configuration file comprises: number of threads, data source address, and template.

In the embodiment of the invention, an automatic website content acquisition switch can be provided on an output interface of the server, and the user decides whether to turn it on. After the user triggers the switch, the automatic acquisition flow is executed according to the method provided by the embodiment. When the switch is triggered, a function instance is generated first (function instantiation), and the function instance is used to read the jar (Java archive) configuration file imported by the user. The jar configuration file can be a jar package: classes written in advance by a user are packaged into the jar, so that once the jar is imported into the user project, the classes, attributes, and methods it contains can be used directly. A jar file can be used across platforms, and after several files are combined into one jar file, only one request to the remote server is needed to retrieve them. Because the jar format is compressed, all the data can also be obtained in a shorter time. The jar configuration file may specifically include: the number of threads, the data source address, and the template. The content of the jar configuration file can be determined flexibly according to the scenario and is not limited here.
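Step 101 can be illustrated with a minimal sketch. The description does not specify the format of the configuration carried by the jar, so the sketch assumes a Java properties file; the class name `CrawlerConfig` and the keys `crawler.threads`, `crawler.sourceUrl`, and `crawler.templatePath` are hypothetical names chosen for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder for the three settings named in the description:
// number of threads, data source address, and template location.
final class CrawlerConfig {
    final int threadCount;
    final String sourceUrl;
    final String templatePath;

    private CrawlerConfig(int threadCount, String sourceUrl, String templatePath) {
        this.threadCount = threadCount;
        this.sourceUrl = sourceUrl;
        this.templatePath = templatePath;
    }

    // Parses properties-format text. The key names are assumptions,
    // not taken from the patent.
    static CrawlerConfig parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int threads = Integer.parseInt(props.getProperty("crawler.threads", "1").trim());
        if (threads < 1) {
            throw new IllegalArgumentException("crawler.threads must be at least 1");
        }
        String url = props.getProperty("crawler.sourceUrl");
        String template = props.getProperty("crawler.templatePath");
        if (url == null || template == null) {
            throw new IllegalArgumentException("sourceUrl and templatePath are required");
        }
        return new CrawlerConfig(threads, url.trim(), template.trim());
    }
}
```

In a real user project the text would more likely be loaded from the class path (for example with `Class.getResourceAsStream`) than passed in as a string; the string form simply keeps the sketch self-contained.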

In some embodiments of the present invention, before the user starts the website content automatic acquisition switch in step 101, the method provided in the embodiments of the present invention further includes the following steps:

acquiring jar configuration files imported into user engineering by a user;

putting the jar configuration file into the engineering path, and configuring a data source address, a template injection attribute and a template of the user engineering according to the jar configuration file;

and typesetting the template by using the template injection attribute, and updating the template to the template address.

In the embodiment of the invention, the automatic downloading of target website content can be carried out through a user project. The user project is an engineering project; in the embodiment it may specifically be a Java project. The deployment steps are as follows. First, the user imports into the project the jar configuration file integrated by this method; the jar configuration file can include httpclient and the jsoup parser, and these basic jars can be downloaded from a server. The httpclient is mainly used to initiate requests and crawl webpage data; jsoup parses webpage information, is used in much the same way as a Java version of jQuery, and allows webpage information to be parsed flexibly. Next, the configuration file is placed under the project path, and the user configures attributes such as the crawl depth and breadth, the data source address, the template injection attribute, and the template. The template is the document content required by the user; the template injection attribute means that variables are defined in the template and data is filled in according to the mapping of those variables. The template is then typeset according to the template injection attribute set in the previous step and placed at the address specified in that step, namely the storage address of the template. Finally, the function code module that starts the method is introduced into the user project, and the corresponding configured operations are executed when it is called.

102. And instantiating a corresponding number of working threads according to the number of threads set by the jar configuration file.

In the embodiment of the invention, the user sets the number of threads through the jar configuration file, and a corresponding number of worker threads is instantiated according to that preset number. For example, if the user sets N threads through the jar configuration file, N worker threads are instantiated, that is, N worker threads are started simultaneously.
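One plausible way to realize step 102 is a fixed-size thread pool from `java.util.concurrent`. The sketch below is an assumption about how the workers could be organized, not the patented implementation: the class `WorkerPool`, the method `fetchAll`, and the `fetch` function passed in (standing in for the real page request) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class WorkerPool {
    // Runs `fetch` over every address using at most `threadCount` worker
    // threads and returns the results in the same order as `urls`.
    static List<String> fetchAll(int threadCount, List<String> urls,
                                 Function<String, String> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetch.apply(url);
                pending.add(pool.submit(task));
            }
            List<String> pages = new ArrayList<>();
            for (Future<String> future : pending) {
                pages.add(future.get()); // blocks until that worker finishes
            }
            return pages;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown(); // lets the worker threads exit once the queue drains
        }
    }
}
```

A fixed pool creates its threads lazily, up to `threadCount`; if the N threads must all exist before any request is issued, they could instead be created and started explicitly with `new Thread(...)`.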

103. Using the working threads to respectively request the data source address set by the jar configuration file, and acquiring target website content meeting a search rule from the data source address, wherein the search rule comprises: a keyword that the user requires to search.

In the embodiment of the present invention, each worker thread requests the specified data source address, where the data source address is the Uniform Resource Locator (URL) of the target website. The data source address is accessed by the worker threads simultaneously, and target website content that satisfies a preset search rule is acquired from it. The search rule is a matching rule for capturing webpage data, namely a regular expression, and includes at least the keywords that the user wants to search for; that is, the target website content acquired in step 103 is content that matches those keywords.

In some embodiments of the invention, the keywords are configured in jar configuration files; alternatively, the keywords are entered into the foreground page by the user. The user can import the keywords to be searched through the jar configuration file, and the user can also input the keywords to be searched through a foreground page provided by the server, and the specific implementation process is not limited.
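Since the search rule is described as a regular expression built around the user's keywords, it can be sketched with `java.util.regex`. The class `SearchRule` and its methods are hypothetical, and case-insensitive matching is an assumption; the description does not say how keywords are compared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class SearchRule {
    private final Pattern pattern;

    // Builds one alternation such as \Qbudget\E|\QC++\E from the keywords.
    // Pattern.quote keeps characters like '+' or '.' literal.
    SearchRule(List<String> keywords) {
        if (keywords.isEmpty()) {
            throw new IllegalArgumentException("at least one keyword is required");
        }
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile(alternation,
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // True when the text contains at least one keyword.
    boolean matches(String text) {
        return pattern.matcher(text).find();
    }

    // Keeps only the passages that contain at least one keyword.
    List<String> filter(List<String> passages) {
        List<String> kept = new ArrayList<>();
        for (String passage : passages) {
            if (matches(passage)) {
                kept.add(passage);
            }
        }
        return kept;
    }
}
```

The keyword list can come from either source the description allows: read from the configuration file or submitted from the foreground page.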

In this embodiment of the present invention, the obtaining of the content of the target website meeting the search rule from the data source address in step 103 may specifically include the following steps:

acquiring webpage data from a data source address by using an httpclient request configured in user engineering;

and filtering js codes and resource type data of the target website from the webpage data by using a jsoup parser configured in user engineering to obtain the target website content meeting the search rule.

The httpclient is mainly used to initiate requests and crawl webpage data; jsoup parses webpage information, is used in much the same way as a Java version of jQuery, and allows webpage information to be parsed flexibly. jsoup is a Java HTML parser that can directly parse a URL or HTML text. It provides a convenient API for extracting and manipulating data through DOM, CSS, and jQuery-like methods. For example, the obtained data may be processed with jsoup through DOM-like operations to form an HTML page, and irrelevant characters are filtered out, including the website's JS code, HTML tags, and resource-type data such as CSS, JS, and picture resources. What remains is the page content that matches the keywords configured in the configuration file or entered in the foreground page.
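The filtering step can be illustrated without external dependencies. The sketch below deliberately swaps jsoup for plain regular expressions so that it runs on the standard library alone; it is a crude stand-in that is not robust against arbitrary HTML, and a project following the description would use the jsoup DOM API instead. The class `HtmlFilter` and the method `visibleText` are hypothetical names.

```java
import java.util.regex.Pattern;

final class HtmlFilter {
    // (?is): case-insensitive, and '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Removes JS code, CSS, comments, and all remaining tags (including
    // <img> and <link> resource references), then normalizes whitespace.
    static String visibleText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last so that
        // "&amp;lt;" decodes to the literal text "&lt;".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

The resulting plain text is what a keyword-based search rule would then be applied to.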

Further, in some embodiments of the present invention, the acquiring, by using an httpclient request configured in user engineering, web page data from a data source address includes:

and acquiring webpage data from a data source address by using an httpclient request based on a depth-first retrieval strategy or a breadth-first retrieval strategy configured in user engineering.

A breadth-first or depth-first content capture process is carried out according to the configured algorithm priority. Depth refers to the number of levels of sub-URLs beneath the target address, and breadth refers to fetching the URLs at the same level. The user selects the specific capture strategy through the jar configuration file.
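The difference between the two retrieval strategies comes down to how the frontier of pending URLs is consumed: as a queue for breadth-first, as a stack for depth-first. The following sketch shows this with a depth limit. The class `CrawlOrder`, the `Strategy` enum, and the `linksOf` function (which stands in for fetching a page and extracting its links) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class CrawlOrder {
    enum Strategy { BREADTH_FIRST, DEPTH_FIRST }

    private static final class Node {
        final String url;
        final int depth;
        Node(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // Returns the order in which URLs would be visited, following links
    // at most `maxDepth` levels below the start address.
    static List<String> visitOrder(String start, int maxDepth, Strategy strategy,
                                   Function<String, List<String>> linksOf) {
        Deque<Node> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> order = new ArrayList<>();
        frontier.addLast(new Node(start, 0));
        seen.add(start);
        while (!frontier.isEmpty()) {
            // FIFO consumption gives breadth-first, LIFO gives depth-first.
            Node current = strategy == Strategy.BREADTH_FIRST
                    ? frontier.pollFirst()
                    : frontier.pollLast();
            order.add(current.url);
            if (current.depth >= maxDepth) {
                continue; // do not expand links below the depth limit
            }
            List<String> links = new ArrayList<>(linksOf.apply(current.url));
            if (strategy == Strategy.DEPTH_FIRST) {
                Collections.reverse(links); // so the first link is explored first
            }
            for (String link : links) {
                if (seen.add(link)) { // skip URLs that are already scheduled
                    frontier.addLast(new Node(link, current.depth + 1));
                }
            }
        }
        return order;
    }
}
```

Because a URL is marked as seen when it is scheduled, the depth-first order matches a recursive traversal on tree-shaped sites; on sites with cross links between branches the two can differ slightly.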

104. And filling the target website content into a template set by the jar configuration file to form streaming document data.

In the embodiment of the present invention, the user imports a jar configuration file and configures a template through it. When the template has been updated, the updated template is downloaded from the template address. The template is then filled with the target website content acquired in step 103; filling the template means that the program writes the captured target website data into the template. Once the template carries the target website content, streaming document data is formed; for example, the streaming document data may be a Word document.
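Template filling by variable mapping can be sketched as placeholder substitution. The `${name}` placeholder syntax, the class `TemplateFiller`, and the choice to leave unknown variables untouched are all assumptions made for illustration; the description only states that the template defines variables and that data is filled in according to their mapping. The sketch produces plain text; writing an actual Word file would need a document library and is not shown.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateFiller {
    // Matches placeholders such as ${title}; the syntax is an assumption.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z0-9_.]+)\\}");

    // Replaces each placeholder with its mapped value. Placeholders with
    // no mapping are left as they are so that gaps stay visible.
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            String replacement = value != null ? value : matcher.group(0);
            // quoteReplacement stops '$' and '\' in the content from being
            // interpreted as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Here the map keys play the role of the template injection attributes, and the map values are the captured page content.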

105. And storing the streaming document data into a streaming document material library so that the user can search the material content of the matching target website from the streaming document material library.

In the embodiment of the invention, after the content under the specified page tags has been filled into the template to form streaming document data, the streaming document data is stored. By analyzing the target website and capturing the specified content in this way, a streaming document material library for the target website is built up. The streaming document material library can be, for example, a Word document material library, and the user searches it for material content matching the target website.

In some embodiments of the invention, a library of streaming document materials comprises: a local database, or a cloud database. The implementation mode of the streaming document material library depends on different application scenarios, so that local storage and cloud storage can be realized.
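The store-and-search behaviour of the material library can be sketched with an in-memory map. This is only a stand-in for the local or cloud database mentioned above; the class `MaterialLibrary` and its methods are hypothetical, and case-insensitive substring search is an assumed matching rule.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// In-memory stand-in for the streaming document material library.
final class MaterialLibrary {
    // Document name -> document text, kept in insertion order.
    private final Map<String, String> documents = new LinkedHashMap<>();

    // Synchronized because several worker threads may store documents.
    synchronized void store(String name, String content) {
        documents.put(name, content);
    }

    // Returns the names of all documents whose text contains the keyword.
    synchronized List<String> search(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            if (entry.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Swapping the map for a database table would leave the two operations unchanged: one write per generated document and one keyword query per user search.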

For example, when a user turns on the function switch, a function instance is created. The instance reads the configuration file under the class path to initialize itself and instantiates the specified number of worker threads according to the thread count set in the configuration file. Each thread requests the specified data source address, and a breadth-first or depth-first content capture process is carried out according to the configured algorithm priority. The httpclient requests the specified target URL to obtain webpage data, and the specified content is then filtered out according to the configured search rules, such as keywords. The obtained data is processed with jsoup through DOM-like operations to form an HTML page, and irrelevant characters are filtered out, including the website's JS code, HTML tags, and resource-type data (CSS, JS, and picture resources), leaving the page content that matches the keywords configured in the configuration file or entered in the foreground page. The content under the specified page tags is filled into the template to form streaming data, which is written to a local file or stored directly in a database according to the user's settings. Together these steps form a complete working mechanism of analysis, retrieval, capture, and storage.

It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus necessary general hardware, and may also be implemented by special hardware including special integrated circuits, special CPUs, special memories, special components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions may be various, such as analog circuits, digital circuits, or dedicated circuits. However, the implementation of a software program is a more preferable embodiment for the present invention. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk of a computer, and includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the above embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the above embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A target website content obtaining method based on a java platform is characterized by comprising the following steps:

2. The java platform based targeted website content obtaining method as claimed in claim 1, wherein before the user activates the website content automatic obtaining switch, the method further comprises:

acquiring jar configuration files imported into user engineering by a user;

putting the jar configuration file into an engineering path, and configuring a data source address, a template injection attribute and a template of the user engineering according to the jar configuration file;

3. The java platform based targeted website content obtaining method as claimed in claim 2, wherein said obtaining targeted website content meeting the search rule from the data source address comprises:

acquiring webpage data from the data source address by using the httpclient request configured in the user engineering;

and filtering js codes and resource type data of the target website from the webpage data by using a jsoup parser configured in the user engineering to obtain the target website content meeting the search rule.

4. The java platform based target website content obtaining method as claimed in claim 3, wherein the obtaining of the webpage data from the data source address by using the httpclient request configured in the user engineering comprises:

and acquiring webpage data from the data source address by using the httpclient request based on a depth-first retrieval strategy or a breadth-first retrieval strategy configured in the user engineering.

5. The java platform based target website content obtaining method as claimed in any one of claims 1 to 4, wherein the keywords are configured in the jar configuration file; or, the keyword is input into a foreground page by the user.

6. The java platform based targeted website content acquisition method as claimed in any one of claims 1 to 4, wherein the streaming document material library comprises: a local database, or a cloud database.