CN107622125B - Information crawling method and device and electronic equipment

Info

Publication number: CN107622125B
Application number: CN201710903327.3A
Authority: CN
Inventors: 卓雷; 杨奇川; 胡长健
Original assignee: Lenovo Beijing Ltd
Current assignee: Lenovo Beijing Ltd
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2020-02-21
Anticipated expiration: 2037-09-29
Also published as: CN107622125A

Abstract

The invention discloses an information processing method and device and electronic equipment. The method comprises: receiving a first crawling target, the first crawling target comprising a target website and first key information; expanding the first crawling target by using a pre-established target database to obtain a second crawling target containing the first crawling target, the second crawling target comprising a crawling website set and second key information; and performing a crawling operation based on the second crawling target. The method and the device can improve the pertinence of information crawling while reducing manual participation.

Description

Information crawling method and device and electronic equipment

Technical Field

The invention relates to the technical field of information crawling, in particular to an information crawling method and device and electronic equipment.

Background

With the development of internet technology and service technology, the amount of internet information is huge, and an information crawling technology is provided for conveniently acquiring content meeting requirements from network information.

Currently, in the field of artificial intelligence, information crawling methods can be divided into breadth-first methods and depth-first methods. A depth-first method requires the crawling path to be configured through prior analysis, which demands a large amount of manual analysis and configuration work; moreover, the website style and webpage structure of a crawling object change over time, so manual maintenance and updating are also needed. A breadth-first method starts from seed links, continuously extracts new links from the currently accessed page and adds them to the seed link list, thereby expanding the set of links to be fetched and gradually widening the capture range until information from the whole network is captured.
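As a non-limiting illustration of the breadth-first method described above, the following minimal sketch walks a small in-memory link graph. The graph `LINKS`, the host names and the function name `breadth_first_crawl` are assumptions made for illustration only; a real crawler would fetch and parse pages instead of reading a dictionary.

```python
from collections import deque

# Hypothetical link graph standing in for real pages (assumption, not from the source).
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["d.example"],
    "c.example": ["d.example", "a.example"],
    "d.example": [],
}


def breadth_first_crawl(seeds, links, max_pages=100):
    """Visit pages in breadth-first order, adding newly found links to the frontier."""
    frontier = deque(seeds)   # links waiting to be visited
    seen = set(seeds)         # links already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return visited
```

Without the `max_pages` bound the capture range keeps growing toward the whole link graph, which is the lack of pertinence that the background section points out.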

In conclusion, how to improve the pertinence of information crawling on the basis of reducing manual participation as much as possible becomes a technical problem to be solved urgently in the field.

Disclosure of Invention

In view of the above, the invention provides an information crawling method and apparatus, and an electronic device, so as to improve the pertinence of information crawling on the basis of reducing human participation as much as possible.

In order to achieve the purpose, the invention provides the following technical scheme:

an information processing method comprising:

receiving a first crawling target; the first crawling target comprises a target website and first key information;

based on the first crawling target, expanding the first crawling target by utilizing a pre-established target database to obtain a second crawling target containing the first crawling target; the second crawling target comprises a crawling website set and second key information;

and performing crawling operation based on the second crawling target.

Preferably, the expanding the first crawling target by using a pre-established target database based on the first crawling target to obtain a second crawling target including the first crawling target includes:

expanding the target website by utilizing a pre-established target database to obtain a crawling website set containing the target website;

the target database comprises at least one of a website set, website sets in different categories and a website set with an association relationship.

Preferably, the first key information includes at least one of a target website type, a keyword, and a target information type;

when the first key information includes a target website type, the expanding the first crawling target by using a pre-established target database to obtain a second crawling target containing the first crawling target comprises: determining, based on the target website type, websites in the target database that are of the same type as the target website, and generating a crawling website set comprising the target website and the determined websites, wherein the target database comprises website sets in different categories;

when the first key information includes a keyword, the expanding the first crawling target by using a pre-established target database to obtain a second crawling target containing the first crawling target comprises: determining, based on the keyword, other keywords in the target database that are associated with the keyword, and generating second key information containing the keyword and the determined other keywords;

when the first key information includes a target information type, the expanding the first crawling target by using a pre-established target database to obtain a second crawling target containing the first crawling target comprises: determining, based on the target information type, other information in the target database that is associated with the target information type, and generating second key information containing the target information type and the determined other information.

Preferably, the method further comprises the following steps:

accessing websites in the crawled website set, and determining a first position matched with the second key information in a corresponding webpage;

target information is extracted at the first location.

Preferably, the method further comprises the following steps:

determining whether content matched with the second key information exists in the corresponding webpage;

and if not, accessing the next website in the crawling website set.

Preferably, the determining whether there is content matching with the second crawling target in the corresponding web page includes:

determining whether the corresponding webpage comprises a keyword; and/or,

it is determined whether the corresponding web page matches the target information type.

An information processing apparatus comprising:

the first receiving unit is used for receiving a first crawling target; the first crawling target comprises a target website and first key information;

the first extension unit is used for extending the first crawling target by utilizing a pre-established target database based on the first crawling target to obtain a second crawling target containing the first crawling target; the second crawling target comprises a crawling website set and second key information;

and the first crawling unit is used for crawling operation based on the second crawling target.

Preferably, the first expansion unit is specifically configured to expand the target website by using a pre-established target database, and acquire a crawl website set including the target website;

and the target database comprises at least one of a website set, website sets under different categories and a website set with an association relationship.

An electronic device, comprising:

a memory for storing a target database;

a processor for receiving a first crawling target, expanding the first crawling target by using the pre-established target database to obtain a second crawling target containing the first crawling target, and performing a crawling operation based on the second crawling target;

the first crawling target comprises a target website and first key information, and the second crawling target comprises a crawling website set and second key information.

Preferably, the processor is specifically configured to expand the target website by using a pre-established target database, and acquire a crawl website set including the target website;

As can be seen from the foregoing technical solutions, compared with the prior art, an embodiment of the present invention provides an information processing method, including: receiving a first crawling target, the first crawling target comprising a target website and first key information; expanding the first crawling target by using a pre-established target database to obtain a second crawling target containing the first crawling target, the second crawling target comprising a crawling website set and second key information; and performing a crawling operation based on the second crawling target. The first crawling target can thus be expanded through the target database, and the crawling operation is performed on the expanded second crawling target. Because the expansion is performed automatically, no periodic manual maintenance and updating is needed. Moreover, because the second crawling target comprises a crawling website set, whereas the prior-art crawling method targets the whole network, crawling efficiency is also improved. The pertinence of information crawling can therefore be improved while reducing manual participation.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flow chart of an information crawling method according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of an information crawling method according to another embodiment of the present invention;

Fig. 3 is a schematic flow chart of an information crawling method according to another embodiment of the present invention;

Fig. 4 is a schematic flow chart of an information crawling method according to another embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an information crawling apparatus according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of an information crawling apparatus according to another embodiment of the present invention;

Fig. 7 is a schematic structural diagram of an information crawling apparatus according to another embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an information crawling apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the present invention discloses an information processing method, as shown in fig. 1, the method includes the following steps:

step 101: receiving a first crawling target;

the first crawling target comprises a target website and first key information;

The target website is a website address specified by the user on whose web pages information is to be crawled; there may be one or more target websites. It should be noted that the number of target websites specified by the user is generally small, and one object of the present application is to expand them into a large crawling website set similar to the target websites specified by the user.

The first key information is information used for crawling specified by a user and comprises at least one of a target website type, a keyword and a target information type.

The target website type is the type of the target website.

The information to be acquired by the crawling operation may be called target information, that is, information related to the crawling target, which may be an entity. The keywords assist the system in acquiring the target information; they can represent characteristics, attributes and the like of the crawling target and help the system understand the user's real crawling target and intention.

The target information type may include at least one of a domain and a domain ontology. The domain is the domain in which the crawling operation is performed, and the domain ontology is the attribute information of that domain; the domain ontology can be regarded as the crawling target.

For example, the target website is www.jd.com, the target website type is e-commerce website, the keyword is Lenovo (联想), the domain is transaction, and the domain ontology is price and configuration.
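As a non-limiting illustration, the first crawling target described above could be held in a small record type. The class name `FirstCrawlTarget` and its field names are hypothetical and chosen only for this sketch; the check in `__post_init__` reflects the statement that the first key information comprises at least one of a target website type, a keyword and a target information type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FirstCrawlTarget:
    """User-specified crawling target: target websites plus first key information."""
    target_websites: List[str]
    website_type: Optional[str] = None                         # target website type
    keywords: List[str] = field(default_factory=list)           # keywords
    domain: Optional[str] = None                                 # target information type: domain
    domain_ontology: List[str] = field(default_factory=list)    # target information type: domain ontology

    def __post_init__(self):
        if not self.target_websites:
            raise ValueError("at least one target website is required")
        has_key_info = bool(
            self.website_type or self.keywords or self.domain or self.domain_ontology
        )
        if not has_key_info:
            raise ValueError("first key information must contain at least one item")


# The example from the text, with the keyword 联想 written as "Lenovo".
example = FirstCrawlTarget(
    target_websites=["www.jd.com"],
    website_type="e-commerce website",
    keywords=["Lenovo"],
    domain="transaction",
    domain_ontology=["price", "configuration"],
)
```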

Step 102: based on the first crawling target, expanding the first crawling target by utilizing a pre-established target database to obtain a second crawling target containing the first crawling target;

and the second crawling target comprises a crawling website set and second key information.

Specifically, expanding the target web address can obtain a set of crawled web addresses that includes the target web address and other web addresses associated with the target web address. Optionally, the method for expanding the target website is as follows:

based on a first crawling target, utilizing a pre-established target database to expand the first crawling target, and obtaining a second crawling target containing the first crawling target, wherein the method comprises the following steps of: and expanding the target website by utilizing a pre-established target database to obtain a crawling website set containing the target website.

The target database comprises at least one of a website set, website sets in different categories and a website set with an association relationship. Then, other websites related to the target website can be searched in the target database and added to the crawl website set.

For example, other websites in the same website set as the target website are searched, and/or other websites having an association relationship with the target website are searched, and/or other websites in the same category of website set as the target website are searched, and are added to the crawled website set.
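As a non-limiting illustration, the three lookups described above (same website set, association relationship, same category) can be sketched as follows. The layout of `TARGET_DB` with the keys `website_sets`, `associations` and `categories`, the grouping of the example websites and the function name `expand_websites` are assumptions made for this sketch; the source does not specify how the target database is stored.

```python
# Hypothetical target database layout (assumption, not from the source).
TARGET_DB = {
    "website_sets": [["www.jd.com", "www.gome.com.cn"]],
    "associations": {"www.jd.com": ["www.suning.com"]},
    "categories": {
        "e-commerce": ["www.jd.com", "www.tmall.com", "www.amazon.com.cn"],
        "news": ["news.example"],
    },
}


def expand_websites(targets, database):
    """Return a crawling website set containing the targets and related websites."""
    crawl_set = list(targets)  # the target websites always stay in the set

    def add(url):
        if url not in crawl_set:
            crawl_set.append(url)

    for target in targets:
        # Other websites in the same website set as the target.
        for group in database["website_sets"]:
            if target in group:
                for url in group:
                    add(url)
        # Other websites having an association relationship with the target.
        for url in database["associations"].get(target, []):
            add(url)
        # Other websites in the same category as the target.
        for sites in database["categories"].values():
            if target in sites:
                for url in sites:
                    add(url)
    return crawl_set
```

Because the target database may contain only one of the three kinds of sets, a fuller version would treat each key as optional.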

It should be noted that the target database is established in advance, for example by technical means such as crawler crawling, big-data statistics and information collation. Specifically, web page information of as many types as possible can be collected from the internet, and the web pages can be classified and clustered by methods such as machine learning and natural language processing; the resulting database takes the form of various website sets.

Specifically, expanding the first key information can acquire second key information, where the second key information includes the first key information and other information associated with the first key information.

It should be noted that, in the present application, only the target website may be expanded, but not the first key information, in which case, the second key information is consistent with the first key information.

Optionally, in the present invention, the target database may be divided: the part used to expand the target website is called the crawling database, and the part used to expand the first key information is called the ontology database. That is, the crawling database comprises at least one of a website set, website sets under different categories and a website set with an association relationship.

Step 103: and performing crawling operation based on the second crawling target.

And the second crawling target is a target expanded based on the first crawling target specified by the user, so that crawling operation is performed based on the second crawling target.

Thus, in this embodiment, a first crawling target comprising a target website and first key information is received; the first crawling target is expanded by using a pre-established target database to obtain a second crawling target containing the first crawling target, the second crawling target comprising a crawling website set and second key information; and a crawling operation is performed based on the second crawling target. The present application can therefore expand the first crawling target through the target database and perform the crawling operation on the expanded second crawling target. Because the expansion is performed automatically, no periodic manual maintenance and updating is needed. Moreover, because the second crawling target comprises a crawling website set, whereas the prior-art crawling method targets the whole network, crawling efficiency is also improved. That is, the present application can improve the pertinence of information crawling while reducing manual participation.

Another embodiment of the present invention discloses an information processing method, as shown in fig. 2, the method includes the following steps:

step 201: receiving a first crawling target;

the first crawling target comprises a target website and first key information;

optionally, the first key information may include at least one of a target website type, a keyword, and a target information type.

The target website type is the type of the target website.

Of course, the target information type in the present application may also include other types of information, such as an ontology, a name, and the like of the crawl target.

Step 202: when the first key information comprises a target website type, determining websites which are the same as the target website type in a target database based on the target website type, and generating a crawling website set comprising the target website and the determined websites;

the target database comprises website sets under different categories, so that the websites of the same category can be searched in the target database according to the types of the target websites and added into the crawling website set.

For example, if the target website is www.jd.com, the target website type is e-commerce website, and the target database lists www.suning.com, www.tmall.com and www.amazon.com.cn under the e-commerce website type, then these websites can be found based on the target website type and added to the crawling website set.

Step 203: when the first key information comprises a keyword, determining other keywords related to the keyword in the target database based on the keyword, and generating second key information containing the keyword and the determined other keywords;

The target database may include keyword sets, each of which groups keywords that are associated with one another; other keywords associated with the keyword specified by the user can then be determined from these keyword sets.

For example, if the user wants to crawl information related to notebooks, and specifically to Lenovo notebooks, the keyword may be "联想" (Lenovo). Other keywords that expansion may determine to be associated with it include Lenovo, Thinkpad, Xiaoxin and Yangtian.
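As a non-limiting illustration, keyword expansion over keyword sets can be sketched as follows. The function name `expand_keywords` and the representation of each keyword set as a Python `set` are assumptions made for this sketch; the second keyword set in the usage below is invented purely to show that unrelated sets are ignored.

```python
def expand_keywords(keywords, keyword_sets):
    """Return the user keywords followed by every keyword associated with them."""
    expanded = list(keywords)  # the original keywords stay first
    for keyword in keywords:
        for group in keyword_sets:
            if keyword in group:
                # sorted() only makes the output order deterministic.
                for other in sorted(group):
                    if other not in expanded:
                        expanded.append(other)
    return expanded


# Hypothetical keyword sets; only the first one relates to the user's keyword.
keyword_sets = [
    {"Lenovo", "Thinkpad", "Xiaoxin"},
    {"camera", "lens"},
]
second_keywords = expand_keywords(["Lenovo"], keyword_sets)
```

The resulting list contains the original keyword, so the second key information contains the first key information, as the method requires.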

Step 204: when the first key information comprises a target information type, determining other information associated with the target information type in the target database based on the target information type, and generating second key information comprising the target information type and the determined other information.

The target database can comprise information such as ontology sets, domain ontology sets, data types and data formats, and can be generated in advance through various means such as crawler crawling and information collation. If the target database is divided as described in the previous embodiment, the ontology database obtained by the division may include information such as the ontology set, the domain ontology set, the data types and the data formats.

The ontology set refers to the various attributes that make up a crawling target; for example, a mobile phone or a notebook is made up of attributes such as system, GPS and operator. The domain ontology set refers to the various attributes that make up a crawling domain; for example, a transaction is made up of attributes such as price and configuration. The data type is the type to which the data of each attribute belongs: some attributes are of character type and some are of numeric type. The data format is the format of the data of each attribute; for example, some attributes are in a time format. Specifically, other information associated with the target information type may be determined from the ontology set, the domain ontology set, the data types and the data formats.
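As a non-limiting illustration, expanding a target information type with the domain ontology set, data types and data formats can be sketched as follows. The dictionary `ONTOLOGY_DB`, its keys, the attribute `on_sale_time`, the format string and the function name `expand_information_type` are all assumptions made for this sketch.

```python
# Hypothetical ontology database (assumption, not from the source).
ONTOLOGY_DB = {
    "domain_ontologies": {"transaction": ["price", "configuration", "on_sale_time"]},
    "data_types": {"price": "number", "configuration": "text", "on_sale_time": "text"},
    "data_formats": {"on_sale_time": "time"},
}


def expand_information_type(domain, requested_attributes, ontology_db):
    """Add the domain's other attributes, with their data type and data format."""
    attributes = list(requested_attributes)  # user-specified attributes stay first
    for attribute in ontology_db["domain_ontologies"].get(domain, []):
        if attribute not in attributes:
            attributes.append(attribute)
    return {
        "domain": domain,
        "attributes": {
            name: {
                "type": ontology_db["data_types"].get(name),
                "format": ontology_db["data_formats"].get(name),
            }
            for name in attributes
        },
    }


expanded_info = expand_information_type("transaction", ["price"], ONTOLOGY_DB)
```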

Step 205: and performing crawling operation based on the second crawling target.

For the convenience of understanding, the present invention is briefly described by way of an example, and specifically, the various information are represented by the following symbols:

Target(W,T,K,D,P)->Items[item1,item2,item3,…]->Item(p1,p2,p3,…)

Target: the crawling target.

Specifically, Target is defined by one or more of W, T, K, D and P. This information is input by the user as initial values and is gradually expanded while the system runs.

W: target website.

Usually a list of websites of similar type and domain. The user initially inputs only a few target websites, for example 1-2, and a large set of target websites similar to those specified by the user can be expanded by combining system analysis with the target database.

T: target website type.

The target website type is the type of the target website and can serve as auxiliary information for the target website, so that the crawling website set can be expanded more accurately and the domain of the target information can be better understood. In the target database, the website types include, but are not limited to, e-commerce, news, social, forum and the like.

K: keyword.

Used to assist the system in acquiring the target information and to help the system understand the user's real crawling target and intention.

D: domain.

Used to specify the domain of the crawl.

P: domain ontology.

The attribute information of the domain. A single domain generally contains multiple ontologies, and P can serve as the crawling target.

This information describes the structure and constituent fields of the target information, including the value type of each field and the list or range of its possible values.

Item: the crawling target as structured information. An Item is an instance of P, and multiple Items make up the final crawling target.
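As a non-limiting illustration, the relationship between P and Item can be sketched as a schema and its instances: P gives each field a value type and, optionally, a list or range of possible values, and an Item is a record that satisfies P. The class `FieldSpec`, the function `is_valid_item`, the allowed brand list and the price range are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One field of P: a value type plus an optional list or range of values."""
    value_type: type
    allowed: Optional[Sequence[Any]] = None
    value_range: Optional[Tuple[float, float]] = None

    def accepts(self, value):
        if not isinstance(value, self.value_type):
            return False
        if self.allowed is not None and value not in self.allowed:
            return False
        if self.value_range is not None:
            low, high = self.value_range
            if not (low <= value <= high):
                return False
        return True


# Hypothetical P for a notebook transaction; the concrete limits are illustrative only.
P = {
    "Name": FieldSpec(str),
    "Brand": FieldSpec(str, allowed=("Lenovo", "Thinkpad")),
    "Price": FieldSpec(float, value_range=(0.0, 1_000_000.0)),
}


def is_valid_item(item, schema):
    """An Item is valid when every field of the schema is present and accepted."""
    return all(
        name in item and spec.accepts(item[name]) for name, spec in schema.items()
    )
```

A crawler could use such a check to discard extracted records that do not fit P before adding them to Items.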

The user assigns Target its initial value, but this value is not fixed: the system analyzes the original inputs W, T, K, D and P, combines them with the target database to understand the user's crawling intention more deeply, expands the crawling range, locates the crawling target and extracts the final result.

Suppose the crawling task specified by the user is: crawl the configuration and price information of Lenovo-brand notebooks on domestic e-commerce websites. The user first enters the crawling task into the system through the UI. The initial values received by the system may be as follows:

W: www.jd.com

T: e-commerce website

K: Lenovo (联想)

D: transaction (notebook)

P: configuration and price (notebook)

After the system receives this information (W is the target website, and T, K, D and P are the first key information), it expands the information in combination with the target database as follows:

W: www.jd.com, www.gome.com.cn, www.suning.com, www.tmall.com, www.amazon.com.cn

T: e-commerce website

K: Lenovo, Thinkpad, 联想, Xiaoxin, Yangtian

D: transaction (notebook) [Name, movement, Price, Configuration, Comment]

P: Item[Name, Model, Brand, CPU, Memory, HardDisk, Graphics, Price, OnSaleTime, …]

As can be seen from the above example, the present application can implement expansion of the target website and the first key information.

In another embodiment of the present invention, the method may further include: and accessing the website in the crawling website set, and adding the external link information in the webpage into the crawling website set.

To further expand the crawling website set, the websites in the set can be accessed to obtain the out-link information in the corresponding web pages. Out-link information refers to website information contained in a web page through which the user can jump to other websites by a click operation.

When the set is expanded with the out-link information in a web page, the expansion can be limited based on a preset correspondence between the sources of out-link information and weights, or between the sources of out-link information and priorities, so that the expansion is not unbounded. That is, only out-link information whose source has a weight higher than a set weight value, or a priority higher than a set priority, is used for expansion.
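As a non-limiting illustration, limiting out-link expansion by source weight can be sketched as follows. The source labels (`navigation`, `friend_links`, `advertisement`), the weights, the threshold and the function name `select_out_links` are assumptions made for this sketch; the source does not say what the sources of out-link information are.

```python
# Hypothetical preset correspondence between out-link sources and weights.
SOURCE_WEIGHTS = {"navigation": 0.9, "friend_links": 0.6, "advertisement": 0.1}


def select_out_links(out_links, source_weights, min_weight):
    """Keep only out-links whose source has a weight higher than min_weight.

    out_links is a list of (url, source) pairs; unknown sources get weight 0.
    """
    return [
        url
        for url, source in out_links
        if source_weights.get(source, 0.0) > min_weight
    ]


found = [
    ("shop-b.example", "navigation"),
    ("ads.example", "advertisement"),
    ("shop-c.example", "friend_links"),
    ("misc.example", "footer"),
]
accepted = select_out_links(found, SOURCE_WEIGHTS, min_weight=0.5)
```

A priority-based variant would have the same shape, with the weight comparison replaced by a comparison of priorities.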

Another embodiment of the present invention discloses an information processing method, as shown in fig. 3, the method including the steps of:

step 301: receiving a first crawling target;

the first crawling target comprises a target website and first key information.

Step 302: based on the first crawling target, expanding the first crawling target by utilizing a pre-established target database to obtain a second crawling target containing the first crawling target;

Step 303: performing a crawling operation based on the second crawling target;

step 304: accessing websites in the crawled website set, and determining a first position matched with the second key information in the corresponding webpage;

For example, if the second key information includes a domain ontology, the websites in the crawling website set may be accessed to determine the first position of the domain ontology in the corresponding web page. Likewise, if the second key information includes a keyword, the websites in the crawling website set may be accessed to determine the first position in the corresponding web page at which the keyword appears.

It should be noted that the system analyzes the structures of a large number of websites in advance, so as to count the positions of different information in the web pages in advance, and when receiving the second key information, the system can determine the position of the second key information in the web page based on the pre-statistical result.

For example, price information generally appears near the product name and product picture in a web page, while configuration information generally appears in the middle of a web page in the form of a list or table and is usually titled "configuration", "product configuration" or the like. The system can then determine the positions of the price information and the configuration information in the web page based on the pre-computed statistics.

Step 305: target information is extracted at the first location.

Specifically, a display form of the crawling target may be determined based on a target database, and target information corresponding to the display form of the crawling target may be extracted at the first location.

The crawling target may be a target in the second key information, or may be a target obtained in another form; the invention is not limited in this respect.

The target database can comprise information such as an ontology set, a domain ontology set, data types and data formats. The data type defines the type to which the data of each attribute belongs, and the data format defines the format of that data. The display form of the crawling target can therefore be determined based on the data type and the data format, and the target information corresponding to that display form is extracted at the first position.

For example, price information is typically of a numeric type and is presented together with the word "price", "yuan" or the ￥ symbol; configuration information is typically of a text or numeric type and is presented together with labels such as "model", "memory", "hard disk" and "graphics card", or units such as GB and Hz.
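As a non-limiting illustration, extracting numeric target information by its display form can be sketched with regular expressions. The patterns, the function names `extract_prices` and `extract_sizes_gb`, and the sample text are assumptions made for this sketch; real pages would need patterns tuned to their markup.

```python
import re

# A number preceded by a ￥/¥ sign, or a number followed by 元 (yuan).
PRICE_PATTERN = re.compile(
    r"[￥¥]\s*(\d[\d,]*(?:\.\d+)?)|(\d[\d,]*(?:\.\d+)?)\s*元"
)
# An integer followed by the unit GB, as used for memory or hard disk sizes.
SIZE_PATTERN = re.compile(r"(\d+)\s*GB")


def extract_prices(text):
    """Return every price found in text as a float, ignoring thousands separators."""
    prices = []
    for match in PRICE_PATTERN.finditer(text):
        raw = match.group(1) or match.group(2)
        prices.append(float(raw.replace(",", "")))
    return prices


def extract_sizes_gb(text):
    """Return every quantity expressed in GB found in text as an int."""
    return [int(value) for value in SIZE_PATTERN.findall(text)]


sample = "Price: ￥4,999.00 Memory: 16GB HardDisk: 512 GB"
```

Here `sample` stands for the text found at the first position; the numbers attached to GB are not mistaken for prices because they carry neither a currency sign nor the word 元.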

Furthermore, the crawling operation can be assisted based on the second key information, and the efficiency of the crawling operation is improved.

Another embodiment of the present invention discloses an information processing method, as shown in fig. 4, the method includes the following steps:

step 401: receiving a first crawling target;

the first crawling target comprises a target website and first key information;

step 402: based on the first crawling target, expanding the first crawling target by utilizing a pre-established target database to obtain a second crawling target containing the first crawling target;

Step 403: accessing the websites in the crawled website set, determining whether contents matched with second key information exist in the corresponding webpages, if not, entering a step 404, and if so, entering a step 405;

the second key information comprises at least one of a keyword and a target information type. Correspondingly, determining whether the content matched with the second crawling target exists in the corresponding webpage comprises the following steps: determining whether the corresponding webpage comprises a keyword or not; and/or, determining whether the corresponding webpage is matched with the target information type.

Step 404: accessing a next website in the crawling website set;

If no content matching the second key information exists, the next website in the crawling website set is accessed directly, and it is determined whether content matching the second key information exists in the web page corresponding to that website; if so, the method proceeds to step 405, and if not, the next website in the crawling website set is accessed, until all websites in the crawling website set have been accessed.

Step 405: determining a first position matched with the second key information in the corresponding webpage;

and if the content matched with the second key information exists, determining a first position matched with the second crawling target in the corresponding webpage.

Step 406: target information is extracted at the first location.

And when target information is extracted from a corresponding webpage, accessing a next website in the crawled website set, determining whether content matched with second key information exists in the corresponding webpage, if so, entering step 405, and if not, accessing the next website in the crawled website set until the website in the crawled website set is accessed.

Further, the second key information can be used to assist the crawling operation and improve its efficiency. Specifically, it is first determined whether content matching the second key information exists in the corresponding webpage; if not, the next website in the crawling website set is accessed directly, and if so, the crawling operation is performed on the corresponding webpage, which further saves crawling time.
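The flow of steps 403 to 406 can be illustrated with a minimal sketch. It assumes that matching is done by plain keyword search only (matching against a target information type is not shown), that the "first position" is a character offset, and that a fixed-size text window stands in for the extracted target information. The names `find_first_position`, `crawl`, `fetch` and `window` are hypothetical and do not come from the specification.

```python
from typing import Callable, Dict, Iterable, List, Optional


def find_first_position(page_text: str, keywords: Iterable[str]) -> Optional[int]:
    """Return the earliest character offset at which any keyword occurs, or None."""
    hits = [page_text.find(k) for k in keywords]
    hits = [h for h in hits if h >= 0]
    return min(hits) if hits else None


def crawl(site_set: Iterable[str],
          keywords: List[str],
          fetch: Callable[[str], str],
          window: int = 40) -> Dict[str, str]:
    """Visit every website once; skip pages without a match, extract where one exists."""
    results: Dict[str, str] = {}
    for url in site_set:                       # step 404: advancing the loop = next website
        page = fetch(url)                      # step 403: access the website
        pos = find_first_position(page, keywords)  # steps 403 and 405 (assumed: keyword search)
        if pos is None:
            continue                           # no matching content: go to the next website
        results[url] = page[pos:pos + window]  # step 406: extract at the first position
    return results
```

Here `fetch` is any caller-supplied function that returns the text of a webpage, so the sketch can be exercised with an in-memory dictionary instead of real network access.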

In another embodiment of the present invention, the system may further include a feedback optimization function, that is, after the target information is extracted, the data of the target database may be optimized based on the feedback of the user on the target information, so as to improve the accuracy of system extraction, enrich the crawling target database, and improve the performance of the whole system.

Corresponding to the information processing method, an embodiment of the invention also discloses an information processing apparatus, which is described below through several embodiments.

An embodiment of the present invention discloses an information processing apparatus. As shown in fig. 5, the apparatus includes: a first receiving unit 501, a first extending unit 502 and a first crawling unit 503; specifically:

a first receiving unit 501, configured to receive a first crawling target;

the first crawling target comprises a target website and first key information;

The target website type is the type of the target website.

A first extension unit 502, configured to extend, based on the first crawling target, the first crawling target by using a pre-established target database to obtain a second crawling target that includes the first crawling target;

Specifically, expanding the target web address can obtain a set of crawled web addresses that includes the target web address and other web addresses associated with the target web address. Optionally, the first expansion unit is specifically configured to expand the target website by using a pre-established target database, and acquire a crawl website set including the target website.

The target database comprises at least one of a website set, website sets under different categories, and a website set with an association relationship. Other websites related to the target website can then be searched for in the target database and added to the crawling website set.

Specifically, the first expansion unit may search for other websites in the same website set as the target website, and/or search for other websites having an association relationship with the target website, and/or search for other websites in a website set of the same category as the target website, and add the other websites to the crawl website set.
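The website expansion described above can be sketched as follows. The layout of the target database is an assumption made only for illustration: `site_sets` is a list of website sets, `categories` maps a category name to its website set, and `associations` maps a website to the websites associated with it. The function name `expand_sites` is likewise hypothetical.

```python
from typing import Dict, List, Set


def expand_sites(target_sites: List[str],
                 site_sets: List[Set[str]],
                 categories: Dict[str, Set[str]],
                 associations: Dict[str, Set[str]]) -> Set[str]:
    """Build a crawling website set that contains the target websites and related ones."""
    crawl_set = set(target_sites)              # the result always contains the targets
    for site in target_sites:
        for group in site_sets:                # other websites in the same website set
            if site in group:
                crawl_set |= group
        for members in categories.values():    # other websites of the same category
            if site in members:
                crawl_set |= members
        crawl_set |= associations.get(site, set())  # websites with an association relationship
    return crawl_set
```

All three lookups are applied here, which corresponds to the "and" branch of the "and/or" wording; dropping any of the three loops gives one of the other variants.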

Specifically, the first extension unit can obtain second key information by extending the first key information, where the second key information includes the first key information and other information associated with the first key information.

A first crawling unit 503, configured to perform a crawling operation based on the second crawling target.

In another embodiment of the present invention, an information processing apparatus is disclosed, and in this embodiment, the first key information includes at least one of a target website type, a keyword, and a target information type.

The target website type is the type of the target website.

Of course, the target information type in the present application may also include other types of information, such as the name of the crawl target.

When the first key information comprises a target website type, the first expansion unit is specifically used for determining, based on the target website type, websites of the same type as the target website type in the target database, and generating a crawling website set comprising the target website and the determined websites; the target database comprises website sets in different categories.

When the first key information includes a keyword, the first extension unit is specifically configured to determine, based on the keyword, another keyword associated with the keyword in the target database, and generate second key information including the keyword and the determined another keyword.

When the first key information includes a target information type, the first extension unit is specifically configured to determine, based on the target information type, other information associated with the target information type in the target database, and generate second key information including the target information type and the determined other information.
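The expansion of the first key information into the second key information can be sketched in the same spirit. It is assumed here that the target database exposes two lookup tables, `related_keywords` and `related_info`, and that the second key information is represented as a dictionary of sets; these names and this representation are hypothetical.

```python
from typing import Dict, Set


def expand_key_info(keywords: Set[str],
                    info_types: Set[str],
                    related_keywords: Dict[str, Set[str]],
                    related_info: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return second key information that contains the first key information."""
    all_keywords = set(keywords)               # keep every original keyword
    for kw in keywords:
        all_keywords |= related_keywords.get(kw, set())   # add associated keywords
    associated: Set[str] = set()
    for info_type in info_types:
        associated |= related_info.get(info_type, set())  # other associated information
    return {
        "keywords": all_keywords,
        "info_types": set(info_types),
        "associated_info": associated,
    }
```

If neither table has an entry for the input, the result equals the input, which matches the case in which the second key information is consistent with the first key information.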

The target database may include information such as ontology set, domain ontology set, data types, data formats, and the like. If the target database can be divided as described in the previous embodiment, the ontology data divided by the target database may include information such as ontology set, domain ontology set, data type, data format, and the like.

In another embodiment of the present invention, the apparatus may further include a first adding unit, configured to access websites in the crawling website set and add out-link information in the webpages to the crawling website set.

In order to further expand the crawl website set, the first adding unit may obtain the out-link information in the corresponding webpage by accessing the websites in the crawl website set, where the out-link information refers to information contained in the corresponding webpage that can jump to other websites through a click operation.

When the out-link information in the web page is expanded, specifically, the first adding unit may be configured to limit expansion based on a preset correspondence between sources of different out-link information and different weights, or based on a preset correspondence between sources of different out-link information and different priorities, so that the expansion is not unlimited. That is, only the out-link information whose source has a weight higher than a certain set weight value or whose source has a priority higher than a certain set priority may be extended.
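The weight-based limit on out-link expansion can be sketched as below. The specification does not say how a "source" is identified, so the sketch takes each out-link as a `(source, url)` pair and leaves the choice of source label to the caller; unknown sources are assumed to have weight 0. The names `admit_out_links`, `source_weights` and `min_weight` are hypothetical. A priority-based variant would have the same shape with priorities in place of weights.

```python
from typing import Dict, Iterable, Set, Tuple


def admit_out_links(crawl_set: Set[str],
                    out_links: Iterable[Tuple[str, str]],
                    source_weights: Dict[str, float],
                    min_weight: float) -> Set[str]:
    """Add only those out-links whose source weight exceeds the configured threshold."""
    expanded = set(crawl_set)                  # do not mutate the caller's set
    for source, url in out_links:
        # Assumption: a source missing from the table is treated as weight 0.
        if source_weights.get(source, 0.0) > min_weight:
            expanded.add(url)
    return expanded
```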

Another embodiment of the present invention discloses an information processing apparatus. As shown in fig. 6, the apparatus includes: a first receiving unit 601, a first extending unit 602, a first crawling unit 603, a location determining unit 604 and an information extracting unit 605; specifically:

a first receiving unit 601, configured to receive a first crawling target;

the first crawling target comprises a target website and first key information.

A first extension unit 602, configured to extend, based on the first crawling target, the first crawling target by using a pre-established target database to obtain a second crawling target that includes the first crawling target;

The first expansion unit is specifically configured to expand the target website by using a pre-established target database, and acquire a crawl website set including the target website.

A first crawling unit 603, configured to perform a crawling operation based on the second crawling target;

a location determining unit 604, configured to access a website in the crawled website set, and determine a first location in a corresponding webpage, where the first location is matched with the second key information;

An information extracting unit 605, configured to extract the target information at the first position.

Still another embodiment of the present invention discloses an information processing apparatus. As shown in fig. 7, the apparatus includes: a first receiving unit 701, a first expanding unit 702, a first crawling unit 703, a first judging unit 704, a next accessing unit 705, a position determining unit 706 and an information extracting unit 707; specifically:

a first receiving unit 701, configured to receive a first crawling target;

the first crawling target comprises a target website and first key information.

A first extension unit 702, configured to extend, based on the first crawling target, the first crawling target by using a pre-established target database to obtain a second crawling target that includes the first crawling target;

A first crawling unit 703, configured to perform a crawling operation based on the second crawling target;

a first judging unit 704, configured to access a website in the crawled website set, and determine whether content matching the second key information exists in a corresponding webpage;

the second key information comprises at least one of a keyword and a target information type, and the first judgment unit is specifically used for accessing websites in the crawled website set and determining whether the corresponding webpage comprises the keyword; and/or, determining whether the corresponding webpage is matched with the target information type.

A next accessing unit 705, configured to access a next website in the crawled website set when it is determined that there is no content matching the second key information in the corresponding webpage.

When the next website in the crawled website set is accessed, a first judgment unit can be triggered to determine whether the corresponding webpage has content matched with the second key information.

A location determining unit 706, configured to determine a first location in the corresponding web page that matches the second key information when it is determined that content that matches the second key information exists in the corresponding web page;

an information extracting unit 707, configured to extract the target information at the first position.

In another embodiment of the present invention, the system may further include a feedback optimization unit, that is, the feedback optimization unit is configured to optimize data of the target database based on feedback of the user on the target information after the target information is extracted, so as to improve accuracy of system extraction, enrich the crawling target database, and improve performance of the whole system.

Corresponding to the information processing method, the invention also discloses an electronic device, which is described by several embodiments below.

An embodiment of the present invention discloses an electronic device, as shown in fig. 8, including: a memory 100 and a processor 200; wherein:

a memory 100 for storing a target database;

a processor 200, configured to receive a first crawling target, expand the first crawling target by utilizing the pre-established target database to obtain a second crawling target containing the first crawling target, and perform a crawling operation based on the second crawling target.

The target web address is a web address designated by the user and used for information crawling on a corresponding web page, and the target web addresses can be one or more.

The target website type is the type of the target website.

The processor is specifically configured to expand the target website by using a pre-established target database, and acquire a crawl website set including the target website.

Specifically, the processor may be configured to search for other websites in the same website set as the target website, and/or search for other websites having an association relationship with the target website, and/or search for other websites in a website set of the same category as the target website, and add the other websites to the crawl website set.

In particular, the processor may be configured to extend the first key information to obtain second key information, where the second key information includes the first key information and other information associated with the first key information.

It should be noted that, in the present application, the processor may also only expand the target website, but not expand the first key information, in which case, the second key information is consistent with the first key information.

In this embodiment, the first key information includes at least one of a target website type, a keyword, and a target information type.

The target website type is the type of the target website.

When the first key information comprises a target website type, the processor is specifically configured to determine, based on the target website type, websites of the same type as the target website type in the target database, and generate a crawling website set including the target website and the determined websites; the target database comprises website sets in different categories.

When the first key information includes a keyword, the processor is specifically configured to determine, based on the keyword, other keywords associated with the keyword in the target database, and generate second key information containing the keyword and the determined other keywords.

When the first key information includes a target information type, the processor is specifically configured to determine, based on the target information type, other information associated with the target information type in the target database, and generate second key information including the target information type and the determined other information.

In another embodiment of the present invention, the processor may be further configured to access a website in the crawled website set, and add the out-link information in the web page to the crawled website set.

To further extend the crawl website set, the processor may be configured to obtain the out-link information in the corresponding web page by accessing the websites in the crawl website set, where the out-link information refers to information contained in the corresponding web page that can jump to other websites through a click operation.

When the out-link information in the web page is expanded, the processor may be specifically configured to limit expansion based on a preset correspondence between sources of different out-link information and different weights, or based on a preset correspondence between sources of different out-link information and different priorities, so that the expansion is not unlimited. That is, only the out-link information whose source has a weight higher than a certain set weight value or whose source has a priority higher than a certain set priority may be extended.

In this embodiment, the processor is further configured to access a website in the crawled website set, determine a first location in a corresponding webpage that matches the second key information, and extract target information at the first location.

Specifically, the processor may be configured to determine a presentation form of a crawl target based on a target database, and extract target information corresponding to the presentation form of the crawl target at the first location.
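One way to picture extraction by presentation form is shown below. This is only a sketch under strong assumptions: the specification does not define how a presentation form is stored, so it is modeled here as a regular expression per target information type, and the first position is a character offset from which the search starts. The table `PRESENTATION_FORMS`, its two example patterns and the function `extract_at` are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed stand-in for presentation forms held in the target database.
PRESENTATION_FORMS: Dict[str, str] = {
    "phone": r"\d{3,4}-\d{7,8}",
    "email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
}


def extract_at(page: str,
               first_pos: int,
               info_type: str,
               forms: Dict[str, str] = PRESENTATION_FORMS) -> Optional[str]:
    """Return the first text at or after first_pos that fits the presentation form."""
    pattern = forms.get(info_type)
    if pattern is None:
        return None                            # no presentation form known for this type
    match = re.compile(pattern).search(page, first_pos)  # start searching at first_pos
    return match.group(0) if match else None
```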

Therefore, in the embodiment, the crawling operation can be assisted based on the second key information, and the efficiency of the crawling operation is improved.

In this embodiment, after accessing the websites in the crawled website set, the processor is further configured to determine whether content matching the second key information exists in the corresponding webpage, if not, access a next website in the crawled website set, if so, determine a first location matching the second key information in the corresponding webpage, and extract target information at the first location.

The second key information comprises at least one of a keyword and a target information type, and the processor is used for determining whether content matched with the second crawling target exists in the corresponding webpage, specifically: determining whether the corresponding webpage comprises a keyword or not; and/or, determining whether the corresponding webpage is matched with the target information type.

It can be seen that, in this embodiment, the second key information can be used to assist the crawling operation and improve its efficiency. Specifically, it is first determined whether content matching the second key information exists in the corresponding webpage; if not, the next website in the crawling website set is accessed directly, and if so, the crawling operation is performed on the corresponding webpage, which further saves crawling time.

In another embodiment of the present invention, the processor is further configured to optimize data of the target database based on feedback of the user on the target information after the target information is extracted, so as to improve accuracy of system extraction, enrich the crawl target database, and improve performance of the whole system.

The embodiments in the present description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An information processing method characterized by comprising:

performing a crawling operation based on the second crawling target;

the first key information comprises at least one of a target website type, a keyword and a target information type;

when the first key information includes a target website type, expanding the first crawling target by using a pre-established target database based on the first crawling target to obtain a second crawling target including the first crawling target, including: determining a website which is the same as the target website type in the target database based on the target website type, and generating a crawling website set comprising the target website and the determined website; the target database comprises website sets in different categories;

when the first key information comprises a target information type, expanding the first crawling target by utilizing a pre-established target database based on the first crawling target, and acquiring a second crawling target comprising the first crawling target comprises: and determining other information associated with the target information type in the target database based on the target information type, and generating second key information containing the target information type and the determined other information.

2. The method of claim 1, wherein the expanding the first crawling objective with a pre-established objective database based on the first crawling objective to obtain a second crawling objective containing the first crawling objective comprises:

3. The method of claim 1, further comprising:

target information is extracted at the first location.

4. The method of claim 3, further comprising:

and if not, accessing the next website in the crawling website set.

5. The method of claim 4, wherein the second key information comprises at least one of a keyword and a target information type, and the determining whether content matching the second crawling target exists in the corresponding webpage comprises:

6. An information processing apparatus characterized by comprising:

the first crawling unit is used for performing crawling operation based on the second crawling target;

7. The apparatus according to claim 6, wherein the first expanding unit is specifically configured to expand the target website by using a pre-established target database, and obtain a crawl website set including the target website;

8. An electronic device, comprising:

a memory for storing a target database;

the first crawling target comprises a target website and first key information, and the second crawling target comprises a crawling website set and second key information;

9. The electronic device according to claim 8, wherein the processor is specifically configured to expand the target website by using a pre-established target database, and obtain a crawl website set including the target website;