CN106528802A

CN106528802A - Data collecting method and device

Info

Publication number: CN106528802A
Application number: CN201610998106.4A
Authority: CN
Inventors: 陈桓; 蔡晓胜; 张良杰
Original assignee: Kingdee Software China Co Ltd
Current assignee: Kingdee Software China Co Ltd
Priority date: 2016-11-11
Filing date: 2016-11-11
Publication date: 2017-03-22

Abstract

The invention discloses a data collecting method and device. The data collecting method comprises the following steps: a target topic and a target collection website are determined; target webpage links corresponding to the target topic are determined among the plurality of webpage links included in the target collection website; the content of the webpage corresponding to each target webpage link is collected to obtain a plurality of pieces of collected data; and a result data set is determined according to the degree of matching between the target topic and each piece of collected data. Because the target webpage links corresponding to the target topic are determined in a targeted manner, less content is collected from the webpages corresponding to the target webpage links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.

Description

Data collecting method and device

Technical field

The present invention relates to the field of Internet technology, and more particularly to a data collecting method and device.

Background art

With the rapid development of Internet technology, big data is applied more and more widely. In big data scenarios, the demand for data collection is gradually increasing.

In the prior art, when data on a certain topic is needed, a large amount of data is usually obtained from the Internet by a non-directional crawler, and the data related to the topic is then filtered out of that large amount of data by a complex data matching algorithm.

This approach has certain drawbacks: the volume of base data is too large and the proportion of irrelevant data is high, so it is often difficult to correctly pick out the data closely related to the topic, and precision is low. In the big data era, the resulting data value density is low.

Summary of the invention

An object of the present invention is to provide a data collecting method and device, so as to improve the precision of data collection and the data value density.

To solve the above technical problem, the present invention provides the following technical solutions:

A data collecting method, comprising:

determining a target topic and a target collection website;

determining, among a plurality of webpage links included in the target collection website, target webpage links corresponding to the target topic;

collecting the content of the webpage corresponding to each target webpage link to obtain a plurality of pieces of collected data;

determining a result data set according to the degree of matching between the target topic and each piece of collected data.

In a specific embodiment of the present invention, after determining the target webpage links corresponding to the target topic and before collecting the content of the webpage corresponding to each target webpage link, the method further comprises:

filtering the determined target webpage links corresponding to the target topic.

In a specific embodiment of the present invention, determining the target topic and the target collection website comprises:

determining the target topic and the target collection website according to a keyword entered by a user.

In a specific embodiment of the present invention, determining the result data set according to the degree of matching between the target topic and each piece of collected data comprises:

determining the keywords of each piece of collected data;

determining the text similarity between the target topic and the keywords of each piece of collected data;

for each piece of collected data, if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold, adding that piece of collected data to the result data set.

In a specific embodiment of the present invention, determining the keywords of each piece of collected data comprises:

for each piece of collected data, performing word segmentation on the piece of collected data to obtain a set of base words of that piece of collected data;

determining the frequency with which each base word occurs in the piece of collected data;

determining the base words whose frequency is higher than a preset second threshold as the keywords of the piece of collected data.

A data collecting device, comprising:

a target determination module, configured to determine a target topic and a target collection website;

a link determination module, configured to determine, among a plurality of webpage links included in the target collection website, target webpage links corresponding to the target topic;

a collected data obtaining module, configured to collect the content of the webpage corresponding to each target webpage link to obtain a plurality of pieces of collected data;

a result data determination module, configured to determine a result data set according to the degree of matching between the target topic and each piece of collected data.

In a specific embodiment of the present invention, the device further comprises:

a link filtering module, configured to filter the determined target webpage links corresponding to the target topic, after the target webpage links corresponding to the target topic are determined and before the content of the webpage corresponding to each target webpage link is collected.

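In a specific embodiment of the present invention, the target determination module is specifically configured to:

determine the target topic and the target collection website according to a keyword entered by a user.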

In a specific embodiment of the present invention, the result data determination module comprises:

a keyword determination submodule, configured to determine the keywords of each piece of collected data;

a text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;

a result data determination submodule, configured to, for each piece of collected data, add that piece of collected data to the result data set if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold.

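In a specific embodiment of the present invention, the keyword determination submodule is specifically configured to:

for each piece of collected data, perform word segmentation on the piece of collected data to obtain a set of base words of that piece of collected data;

determine the frequency with which each base word occurs in the piece of collected data;

determine the base words whose frequency is higher than a preset second threshold as the keywords of the piece of collected data.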

With the technical solution provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target webpage links corresponding to the target topic are determined among the plurality of webpage links included in the target collection website, the content of the webpage corresponding to each target webpage link is collected to obtain a plurality of pieces of collected data, and a result data set can be determined according to the degree of matching between the target topic and each piece of collected data. Because the target webpage links corresponding to the target topic are determined in a directed manner, less content is collected from the webpages corresponding to the target webpage links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.

Brief description of the drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is an implementation flowchart of a data collecting method in an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a data collecting device in an embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

An embodiment of the present invention provides a data collecting method, which can be applied in scenarios where a search engine provides retrieval services to users. A search engine is a system that collects information from the Internet, organizes and processes it, provides a retrieval service to users, and presents the information relevant to a user's query to that user.

The technical solution provided by the embodiments of the present invention can collect data intelligently: according to the determined target topic, it uses the directed filtering capability of a search engine, combined with a secondary content filtering method, to accurately filter out the content in the target collection website that is closely related to the target topic.

Referring to Fig. 1, which is an implementation flowchart of a data collecting method provided by an embodiment of the present invention, the method may comprise the following steps:

S110: Determine a target topic and a target collection website.

When a user needs to collect data, the target topic and the target collection website of the data to be collected can be determined first.

In a specific embodiment of the present invention, the target topic and the target collection website can be determined according to a keyword entered by the user.

In the embodiments of the present invention, an input interface can be provided to the user, through which the user can enter a keyword according to his or her own needs. The keyword can be any one or more nouns such as an enterprise name, a person's name, an event, or a relationship. The keyword entered by the user can be used directly as the target topic.

The user can also enter the link address of the target collection website through the input interface, so that the target collection website can be determined from the link address entered by the user.

Alternatively, the target collection website can be determined automatically from the determined target topic. For example, a large number of correspondences between topics and websites are established in advance; once the target topic is determined, the target collection website corresponding to the target topic can be looked up in the pre-established correspondences.
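Step S110 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `determine_target`, the table `TOPIC_TO_SITE` and the example URLs are all hypothetical, and the table stands in for the pre-established correspondence between topics and websites.

```python
from typing import Dict, Optional, Tuple

# Hypothetical pre-established correspondence between topics and websites.
TOPIC_TO_SITE: Dict[str, str] = {
    "enterprise news": "https://news.example.com",
    "court rulings": "https://rulings.example.org",
}


def determine_target(keyword: str,
                     site_url: Optional[str] = None,
                     topic_to_site: Dict[str, str] = TOPIC_TO_SITE) -> Tuple[str, str]:
    """Return (target_topic, target_collection_website) for a user keyword."""
    # The keyword entered by the user is used directly as the target topic.
    topic = keyword.strip()
    if not topic:
        raise ValueError("a non-empty keyword is required")
    if site_url:
        # The user also entered the link address of the collection website.
        return topic, site_url.strip()
    try:
        # Otherwise look the website up in the pre-established correspondence.
        return topic, topic_to_site[topic]
    except KeyError:
        raise LookupError(f"no collection website registered for {topic!r}") from None
```

Passing `site_url` corresponds to the case where the user enters the link address; omitting it falls back to the lookup table.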

The embodiments of the present invention are applicable to data collection for any topic and any collection website, and are therefore highly general.

S120: Among the plurality of webpage links included in the target collection website, determine the target webpage links corresponding to the target topic.

The target topic and the target collection website are determined in step S110. Every website includes a plurality of webpage links, and the webpages corresponding to different webpage links contain different content. The target collection website likewise includes a plurality of webpage links.

Among the plurality of webpage links included in the target collection website, the target webpage links corresponding to the target topic can be determined. Specifically, taking the target collection website as the target, a series of target webpage links related to the target topic can be filtered out. There can be one or more target webpage links, and the content of each target webpage link is related to the target topic.
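One simple way to picture step S120 is shown below. The text relies on the directed filtering capability of a search engine and does not specify the matching rule, so this sketch substitutes a plain stand-in: it assumes the links of the site are already available as `(url, anchor_text)` pairs and keeps those on the target site whose anchor text mentions the topic. The function name `select_target_links` is hypothetical.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def select_target_links(site_url: str,
                        links: Iterable[Tuple[str, str]],
                        topic: str) -> List[str]:
    """Keep links on the target site whose anchor text mentions the topic."""
    site_host = urlparse(site_url).netloc.lower()
    needle = topic.casefold()
    selected = []
    for url, anchor_text in links:
        # Restrict the selection to the target collection website.
        same_site = urlparse(url).netloc.lower() == site_host
        # Assumed relevance rule: case-insensitive substring match.
        if same_site and needle in anchor_text.casefold():
            selected.append(url)
    return selected
```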

S130: Collect the content of the webpage corresponding to each target webpage link to obtain a plurality of pieces of collected data.

In the embodiments of the present invention, for each target webpage link, the entire content of the webpage corresponding to that target webpage link can be collected in a non-directional manner, yielding a plurality of pieces of collected data.

In practice, multiple threads can be started, each collecting the webpage content corresponding to a different target webpage link, which avoids resource contention and improves collection efficiency.
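The multithreaded collection described above might look like the following sketch, which uses a thread pool from the Python standard library so that each worker handles a different link. The names `fetch_page` and `collect_pages`, the worker count and the choice to skip pages that fail to download are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its body as text (default fetcher)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def collect_pages(urls: Iterable[str],
                  fetch: Callable[[str], str] = fetch_page,
                  max_workers: int = 8) -> Dict[str, str]:
    """Fetch every URL on a worker thread; pages that fail are skipped."""
    def safe_fetch(url: str):
        try:
            return url, fetch(url)
        except Exception:
            # Assumed policy: an unreachable page yields no collected data.
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_fetch, urls))
    return {url: body for url, body in results if body is not None}
```

Injecting `fetch` as a parameter keeps the threading logic separate from the network access, so the collection step can be exercised with a stub fetcher.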

Because the target webpage links corresponding to the target topic are determined first and only then is the content of the webpage corresponding to each target webpage link collected, the amount of collected content is smaller, which reduces the difficulty of subsequent processing.

In a specific embodiment of the present invention, the following step may also be included after step S120 and before step S130:

filtering the determined target webpage links corresponding to the target topic.

After the target webpage links corresponding to the target topic are determined, they can be filtered. Specifically, the correctness of the target webpage links can be analyzed so that correct webpage links are picked out and repeated webpage links, invalid webpage links and the like are deleted.

Further, in step S130, the content of the webpage corresponding to each webpage link that remains after filtering is collected, which improves the efficiency of data collection.
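The link filtering between S120 and S130 can be illustrated with a short sketch. The text only says that repeated and invalid links are deleted; the specific rules used here (only `http`/`https` links with a host are valid, and links differing only by a `#fragment` count as repeats) are assumptions, as is the function name `filter_links`.

```python
from typing import Iterable, List
from urllib.parse import urldefrag, urlparse


def filter_links(links: Iterable[str]) -> List[str]:
    """Drop invalid and repeated links, preserving first-seen order."""
    seen = set()
    kept = []
    for link in links:
        # Strip whitespace and any fragment so near-identical links compare equal.
        url, _fragment = urldefrag(link.strip())
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # invalid webpage link
        if url in seen:
            continue  # repeated webpage link
        seen.add(url)
        kept.append(url)
    return kept
```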

S140: Determine a result data set according to the degree of matching between the target topic and each piece of collected data.

The target topic is determined according to the user's needs, and the data finally obtained should be the data that best matches the target topic.

A plurality of pieces of collected data are obtained in step S130, and the degree of matching between the target topic and each piece of collected data can be calculated. The result data set can then be determined according to that degree of matching.

In a specific embodiment of the present invention, step S140 may comprise the following steps:

Step 1: determine the keywords of each piece of collected data;

Step 2: determine the text similarity between the target topic and the keywords of each piece of collected data;

Step 3: for each piece of collected data, if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold, add that piece of collected data to the result data set.

For ease of description, the three steps above are explained together.

Each piece of collected data can be regarded as being composed of a plurality of base words. For each piece of collected data, its keywords can be determined from the base words it contains.

In a specific embodiment of the present invention, Step 1 above may comprise the following steps:

First step: for each piece of collected data, perform word segmentation on the piece of collected data to obtain a set of base words of that piece of collected data;

Second step: determine the frequency with which each base word occurs in the piece of collected data;

Third step: determine the base words whose frequency is higher than a preset second threshold as the keywords of the piece of collected data.

For each piece of collected data, the set of base words of that piece can be obtained after word segmentation. In the embodiments of the present invention, a base word is a word with practical meaning, such as a person's name, a place name, an action or the object of an action; function words without practical meaning, such as Chinese structural particles like "得", can be excluded.

It can be understood that the more frequently a base word occurs in a piece of collected data, the better that base word represents the meaning the piece of collected data is intended to express. For a given base word of a piece of collected data, its frequency in that piece is the number of times the base word occurs in the piece divided by the cumulative sum of the numbers of occurrences of all base words of that piece.

For each piece of collected data, after the set of base words is obtained, the frequency with which each base word occurs in the piece can be determined, and the base words whose frequency is higher than the preset second threshold are determined as the keywords of that piece of collected data.
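The three keyword steps above can be sketched as follows. The regular-expression tokenizer and the small English `FUNCTION_WORDS` list are stand-ins chosen for illustration; real Chinese text would need a proper word segmentation tool, which the text does not name. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical function-word list; a real deployment would use a fuller one.
FUNCTION_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def base_words(text: str) -> List[str]:
    """First step: split text into lowercase tokens and drop function words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]


def relative_frequencies(words: List[str]) -> Dict[str, float]:
    """Second step: occurrences of each base word over the total for all base words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def extract_keywords(text: str, second_threshold: float) -> List[str]:
    """Third step: keep base words whose frequency exceeds the second threshold."""
    freqs = relative_frequencies(base_words(text))
    return sorted(w for w, f in freqs.items() if f > second_threshold)
```

Because the frequency is relative to the total number of base-word occurrences, the second threshold is a fraction between 0 and 1 and does not depend on the length of the page.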

Further, the text similarity between the target topic and the keywords of each piece of collected data can be determined. Specifically, a text similarity algorithm from the prior art can be used, which is not described further in the embodiments of the present invention.

For each piece of collected data, if the text similarity between the target topic and the keywords of that piece of collected data is higher than the preset first threshold, the piece of collected data is relatively close to the target topic and can be added to the result data set.
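Since the text leaves the similarity algorithm open, the sketch below picks Jaccard similarity between the set of topic words and the set of keywords purely as an example; any other prior-art measure could be substituted. The names `jaccard_similarity` and `build_result_set` and the dictionary layout are hypothetical.

```python
from typing import Dict, Iterable, List


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def build_result_set(topic_words: Iterable[str],
                     keywords_by_item: Dict[str, List[str]],
                     first_threshold: float) -> List[str]:
    """Return ids of collected items whose similarity exceeds the first threshold."""
    topic = {w.lower() for w in topic_words}
    result = []
    for item_id, keywords in keywords_by_item.items():
        score = jaccard_similarity(topic, (k.lower() for k in keywords))
        if score > first_threshold:
            result.append(item_id)
    return result
```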

It should be noted that the first threshold and the second threshold can be set and adjusted according to the actual situation, which is not limited by the embodiments of the present invention.

With the method provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target webpage links corresponding to the target topic are determined among the plurality of webpage links included in the target collection website, the content of the webpage corresponding to each target webpage link is collected to obtain a plurality of pieces of collected data, and a result data set can be determined according to the degree of matching between the target topic and each piece of collected data. Because the target webpage links corresponding to the target topic are determined in a directed manner, less content is collected from the webpages corresponding to the target webpage links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.

In addition, by relying on the millisecond-level search capability of a search engine, the embodiments of the present invention can complete a directed collection task within a few seconds.

Corresponding to the above method embodiments, an embodiment of the present invention further provides a data collecting device. The data collecting device described below and the data collecting method described above may be referred to in correspondence with each other.

Referring to Fig. 2, the device may comprise the following modules:

a target determination module 210, configured to determine a target topic and a target collection website;

a link determination module 220, configured to determine, among a plurality of webpage links included in the target collection website, target webpage links corresponding to the target topic;

a collected data obtaining module 230, configured to collect the content of the webpage corresponding to each target webpage link to obtain a plurality of pieces of collected data;

a result data determination module 240, configured to determine a result data set according to the degree of matching between the target topic and each piece of collected data.
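The way the four modules cooperate can be pictured with a small sketch in which each module is a pluggable callable and the device chains them in order. The class name `DataCollectingDevice`, the field names and the callable signatures are assumptions made for illustration, not part of the described device.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class DataCollectingDevice:
    # Module 210: keyword -> (target topic, target collection website)
    determine_target: Callable[[str], Tuple[str, str]]
    # Module 220: (website, topic) -> target webpage links
    determine_links: Callable[[str, str], List[str]]
    # Module 230: links -> {link: collected content}
    collect: Callable[[List[str]], Dict[str, str]]
    # Module 240: (topic, collected data) -> result data set
    determine_results: Callable[[str, Dict[str, str]], List[str]]

    def run(self, keyword: str) -> List[str]:
        """Chain the four modules in the order S110 to S140."""
        topic, site = self.determine_target(keyword)
        links = self.determine_links(site, topic)
        collected = self.collect(links)
        return self.determine_results(topic, collected)
```

Keeping each module behind a callable makes it straightforward to add an optional link filtering stage between modules 220 and 230 without changing the others.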

With the device provided by the embodiments of the present invention, after the target topic and the target collection website are determined, the target webpage links corresponding to the target topic are determined among the plurality of webpage links included in the target collection website, the content of the webpage corresponding to each target webpage link is collected to obtain a plurality of pieces of collected data, and a result data set can be determined according to the degree of matching between the target topic and each piece of collected data. Because the target webpage links corresponding to the target topic are determined in a directed manner, less content is collected from the webpages corresponding to the target webpage links and that content is more relevant to the target topic, which improves the precision of data collection and the data value density.

In a specific embodiment of the present invention, the device further comprises:

a link filtering module, configured to filter the determined target webpage links corresponding to the target topic, after the target webpage links corresponding to the target topic are determined and before the content of the webpage corresponding to each target webpage link is collected.

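In a specific embodiment of the present invention, the target determination module 210 is specifically configured to:

determine the target topic and the target collection website according to a keyword entered by a user.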

In a specific embodiment of the present invention, the result data determination module 240 comprises:

a keyword determination submodule, configured to determine the keywords of each piece of collected data;

a text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;

a result data determination submodule, configured to, for each piece of collected data, add that piece of collected data to the result data set if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold.

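In a specific embodiment of the present invention, the keyword determination submodule is specifically configured to:

for each piece of collected data, perform word segmentation on the piece of collected data to obtain a set of base words of that piece of collected data;

determine the frequency with which each base word occurs in the piece of collected data;

determine the base words whose frequency is higher than a preset second threshold as the keywords of the piece of collected data.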

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the technical solution of the present invention and its core idea. It should be noted that a person of ordinary skill in the art can make improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims

1. A data collecting method, characterized by comprising:

determining a target topic and a target collection website;

determining, among a plurality of webpage links included in the target collection website, target webpage links corresponding to the target topic;

2. The data collecting method according to claim 1, characterized in that, after determining the target webpage links corresponding to the target topic and before collecting the content of the webpage corresponding to each target webpage link, the method further comprises:

3. The data collecting method according to claim 1, characterized in that determining the target topic and the target collection website comprises:

4. The data collecting method according to any one of claims 1 to 3, characterized in that determining the result data set according to the degree of matching between the target topic and each piece of collected data comprises:

determining the keywords of each piece of collected data;

for each piece of collected data, if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold, adding that piece of collected data to the result data set.

5. The data collecting method according to claim 4, characterized in that determining the keywords of each piece of collected data comprises:

for each piece of collected data, performing word segmentation on the piece of collected data to obtain a set of base words of that piece of collected data;

determining the frequency with which each base word occurs in the piece of collected data;

6. A data collecting device, characterized by comprising:

a link determination module, configured to determine, among a plurality of webpage links included in the target collection website, target webpage links corresponding to the target topic;

a collected data obtaining module, configured to collect the content of the webpage corresponding to each target webpage link to obtain a plurality of pieces of collected data;

a result data determination module, configured to determine a result data set according to the degree of matching between the target topic and each piece of collected data.

7. The data collecting device according to claim 6, characterized by further comprising:

a link filtering module, configured to filter the determined target webpage links corresponding to the target topic, after the target webpage links corresponding to the target topic are determined and before the content of the webpage corresponding to each target webpage link is collected.

8. The data collecting device according to claim 6, characterized in that the target determination module is specifically configured to:

9. The data collecting device according to any one of claims 6 to 8, characterized in that the result data determination module comprises:

a text similarity determination submodule, configured to determine the text similarity between the target topic and the keywords of each piece of collected data;

a result data determination submodule, configured to, for each piece of collected data, add that piece of collected data to the result data set if the text similarity between the target topic and the keywords of that piece of collected data is higher than a preset first threshold.

10. The data collecting device according to claim 9, characterized in that the keyword determination submodule is specifically configured to:

determine the frequency with which each base word occurs in the piece of collected data;