CN104216928A - Site information acquiring method and device - Google Patents

Site information acquiring method and device

Info

Publication number
CN104216928A
Authority
CN
China
Prior art keywords
information
search results
site
works
information acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310222196.4A
Other languages
Chinese (zh)
Inventor
高健
牛小彬
章云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201310222196.4A priority Critical patent/CN104216928A/en
Publication of CN104216928A publication Critical patent/CN104216928A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

An embodiment of the invention discloses a site information acquiring method. The method comprises the following steps: searching with keywords of the elements in a basic data set and obtaining corresponding search results; obtaining a predetermined part of the information in a web page of the corresponding site according to the web page link information in the search results; and organizing the site identification corresponding to the web page link information and the predetermined part of the information into a corresponding data record, then generating a site information acquisition result from the data records. An embodiment of the invention further discloses an information mining device. With the method and the device, site information that meets the requirements can be found automatically with little manual labor.

Description

Site information acquisition method and device
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a site information acquisition method and device.
[Background Art]
To find websites that host a large number of novels, conventional technical solutions take one of the following two approaches:
1. Manually browsing a hub page (for example, http://www.hao123.com/) to look for such websites;
2. Manually searching on a search engine page (for example, http://www.baidu.com/) to find such websites.
In practice, the inventors found that the prior art has at least the following problems:
For the first approach, a hub page lists relatively few novels, so websites hosting a large number of novels cannot be found;
For the second approach, manual searching incurs too high a labor cost.
In summary, conventional solutions generally require manual searching to obtain information that meets the requirements and cannot find such information automatically.
Therefore, a new technical solution is needed to solve the above technical problems.
[Summary of the Invention]
The object of the present invention is to provide a site information acquisition method and device that can automatically find site information meeting the requirements without consuming much manpower.
To solve the above technical problems, the technical solutions of the embodiments of the present invention are as follows:
A site information acquisition method, comprising: searching with the keywords of the elements in a basic data set and obtaining corresponding search results; obtaining predetermined portion information in a page of the corresponding site according to the page link information in the search results; and organizing the site identifier corresponding to the page link information and the predetermined portion information into a corresponding data record, and generating a site information acquisition result according to the data records.
A site information acquisition device, comprising: a search module, configured to search with the keywords of the elements in a basic data set and to obtain corresponding search results; an acquisition module, configured to obtain predetermined portion information in a page of the corresponding site according to the page link information in the search results; and a sorting module, configured to organize the site identifier corresponding to the page link information and the predetermined portion information into a corresponding data record, and to generate a site information acquisition result according to the data records.
Compared with the prior art, the embodiments of the present invention use a combination of a search module, an acquisition module, and a sorting module to mine site information on the Internet, so site information can be mined automatically. An operator only needs to provide initial data (for example, information about several novels) to mine site information that meets the requirements (sites hosting a large number of novels), and the mining process does not consume much manpower.
To make the foregoing content of the present invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings:
[Brief Description of the Drawings]
Fig. 1 is a schematic diagram of the operating environment of the site information acquisition method and device of the embodiments of the present invention;
Fig. 2 is a block diagram of the first embodiment of the site information acquisition device of the present invention;
Fig. 3 is a block diagram of the third embodiment of the site information acquisition device of the present invention;
Fig. 4 is a block diagram of the fourth embodiment of the site information acquisition device of the present invention;
Fig. 5 is a block diagram of the fifth embodiment of the site information acquisition device of the present invention;
Fig. 6 is a block diagram of the sixth embodiment of the site information acquisition device of the present invention;
Fig. 7 is a flowchart of the first embodiment of the site information acquisition method of the present invention;
Fig. 8 is a flowchart of the second embodiment of the site information acquisition method of the present invention;
Fig. 9 is a flowchart of the third embodiment of the site information acquisition method of the present invention;
Fig. 10 is a flowchart of the fourth embodiment of the site information acquisition method of the present invention;
Fig. 11 is a flowchart of the fifth embodiment of the site information acquisition method of the present invention;
Fig. 12 is a flowchart of the sixth embodiment of the site information acquisition method of the present invention.
[Detailed Description]
The following description of the embodiments refers to the accompanying drawings to illustrate specific embodiments in which the present invention may be practiced.
In the following description, specific embodiments of the invention are described with reference to steps and symbolic representations of operations performed by one or more computers/mobile devices, unless otherwise stated. It will therefore be understood that these steps and operations, which are at times referred to as being executed by a computer/mobile device, include the manipulation, by the processing unit of the computer/mobile device, of electronic signals representing data in a structured form. This manipulation transforms the data or maintains it at locations in the memory system of the computer/mobile device, which reconfigures or otherwise alters the operation of the computer/mobile device in a manner well understood by those skilled in the art. The data structures in which the data is maintained are physical locations of the memory that have particular properties defined by the data format. However, although the principles of the invention are described in the above terms, this is not meant as a limitation, and those skilled in the art will appreciate that several of the steps and operations described below may also be implemented in hardware.
The principles of the present invention can operate with many other general-purpose or special-purpose computing or communication environments or configurations. Examples of well-known computing systems, environments, and configurations suitable for the embodiments of the present invention include, but are not limited to, tablet computers, mobile phones, personal computers, servers, multiprocessor systems, microcomputer-based systems, mainframe computers, and distributed computing environments that include any of the above systems or devices.
The term "module" or "unit" as used herein may refer to a software object or routine that executes on the computing system. The different components, modules, engines, and services described herein may be implemented as objects or processes that execute on the computing system. Although the systems and methods described herein are preferably implemented in software, implementations in hardware or in a combination of software and hardware are also possible and contemplated.
With reference to Fig. 1, the site information acquisition method and device of the embodiments of the present invention may run on a computer or a mobile device. The computer may be a system consisting of one or more of a PC, a server, and the like; the mobile device may be a system consisting of one or more of a tablet computer, a mobile phone, a PDA (Personal Digital Assistant), a notebook computer, and the like. The computer/mobile device may include any combination 100 of a processor 101, a memory 102, a sensor 105, a switching device 104, a power supply 103, a clock signal generator 106, an input/output device 107, and the like. This combination 100 of components is used to implement the steps of the site information acquisition method and the functions of the site information acquisition device of the embodiments of the present invention.
In this embodiment, the software program instructions corresponding to the site information acquisition device are stored in the memory 102 and executed by the processor 101, to implement process management in the operating system.
In addition, the memory 102 is a computer-readable storage medium, which may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Refer to Fig. 2, which is a block diagram of the first embodiment of the site information acquisition device 100 of the present invention.
The site information acquisition device 100 of this embodiment comprises a search module 201, an acquisition module 202, and a sorting module 203.
The search module 201 is configured to search with the keywords of the elements in a basic data set and to obtain corresponding search results. The search module 201 may search for the keywords in a database or on the Internet. In this embodiment, the basic data set comprises any one or a combination of a literary works set, an art works set, a musical works set, and a film and television works set. The basic data set contains at least N1 works of a first type and N2 works of a second type, where N1 and N2 are positive integers; that is, the basic data set contains at least one work of the first type and at least one work of the second type. The basic data set may also contain at least one work of a third type. The first type and the second type relate to any one of the popularity, sales volume, recommendation level, and the like of the works. For example, the first type and the second type are a first popularity level (most popular) and a second popularity level (fairly popular), respectively, or a first sales level (best-selling) and a second sales level (fairly well-selling), respectively. The works in the basic data set may be chosen manually or chosen automatically by the computer/mobile device. For example, the basic data set contains N1 of the most popular novels, N2 moderately popular novels, N3 unpopular novels, and so on, where N3 is also a positive integer. Each work in the basic data set is an element. A work may be any one or a combination of a literary work, an art work, a musical work, a film or television work, and the like.
The acquisition module 202 is configured to obtain predetermined portion information in a page of the corresponding site according to the page link information (for example, a URL) in the search results. Specifically, in this embodiment, the acquisition module 202 analyzes the page of the website corresponding to each record in the search results (it reads the code of the page and locates the predetermined portion) and then obtains the predetermined portion information, for example the title information; that is, the predetermined portion is the title portion of the page of the website.
The sorting module 203 is configured to organize the site identifier corresponding to the page link information and the predetermined portion information into a corresponding data record, and to generate a site information acquisition result according to the data records. The sorting module 203 is also configured to obtain the site identifier corresponding to the page link information from the page link information or from the predetermined portion information (for example, by obtaining the domain name from the page link information or the website name from the predetermined portion information). Specifically, the sorting module 203 integrates the site identifier (for example, the website name) corresponding to each piece of page link information and the predetermined portion information (for example, the title information) in the page into one data record, and arranges multiple such data records to generate the site information acquisition result.
The above is illustrated by the following example:
For each novel in a novel set chosen in advance (the basic data set), the search module 201 searches the Internet with a keyword related to that novel, obtains the relevant search results, and sends them to the acquisition module 202. For each record in the search results, the acquisition module 202 captures the page link information (for example, a URL), visits the corresponding website through that link, reads the page code, and locates the predetermined portion information (for example, the title information). It then sends the page link information and the predetermined portion information to the sorting module 203. For example, the acquisition module 202 crawls the home page of the website according to the page link information and extracts the content of the Title element in the HTML (Hyper Text Markup Language). The sorting module 203 integrates the page link information and the predetermined portion information into one data record, and then generates the site information acquisition result from multiple such data records. For example, it generates data such as {website 1, title 1}, {website 2, title 2}, {website 3, title 3}, and stores the data in the corresponding storage space.
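The title extraction and record building in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `TitleParser`, `extract_title`, and `make_record` are hypothetical, the page code is assumed to be already downloaded as a string, and only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first <title> is used.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with whitespace collapsed, or '' if absent."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def make_record(site, html):
    """Build one {site, title} data record (hypothetical record layout)."""
    return {"site": site, "title": extract_title(html)}
```

A list of such records, one per visited site, corresponds to the {website, title} data in the example.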
In this embodiment, a combination of the search module 201, the acquisition module 202, and the sorting module 203 is used to mine site information on the Internet, so site information can be mined automatically. An operator only needs to provide initial data (for example, information about several novels) to mine site information that meets the requirements (sites hosting a large number of novels), and the mining process does not consume much manpower.
Further, the sorting module 203 may also tidy up the page link information. For example, when a URL contains many characters, it extracts the domain name information from the URL and retains only that domain name information, which makes the mined site information more concise.
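As a hedged sketch of this simplification step, the host part of a URL can be taken with Python's standard `urllib.parse.urlsplit`. The function name `site_domain` is an assumption for illustration. Note that this returns the full host name (for example, `www.hao123.com`); reducing it to a registrable domain would need a public-suffix list, which is outside this sketch.

```python
from urllib.parse import urlsplit


def site_domain(url):
    """Return the lowercased host name of a URL, or '' if none is found."""
    # A URL without a scheme would otherwise be parsed as a bare path.
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname
    return host if host else ""
```

For instance, a long link such as `http://www.hao123.com/book/chapter?id=1` would be reduced to its host name only.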
The second embodiment of the site information acquisition device 100 of the present invention is similar to the first embodiment above, with the following differences:
The elements comprise at least a first element and a second element, and the search results comprise at least a first search result corresponding to the first element and a second search result corresponding to the second element. In this embodiment, the first element corresponds to a work of the first type, and the second element corresponds to a work of the second type.
The search module 201 is also configured to search separately with a first keyword of the first element and a second keyword of the second element, and to obtain the first search result and the second search result respectively.
In the site information acquisition device 100 of this embodiment, the first keyword comprises at least two pieces of attribute information of the first element, and the second keyword comprises at least two pieces of attribute information of the second element. The at least two pieces of attribute information of the first element comprise the work title and the author name, and the at least two pieces of attribute information of the second element likewise comprise the work title and the author name.
The above is illustrated by the following example:
The search module 201 searches with the keywords "novel 1 author 1", "novel 2 author 2", and "novel 3 author 3" respectively, and obtains a first search result corresponding to "novel 1 author 1", a second search result corresponding to "novel 2 author 2", and a third search result corresponding to "novel 3 author 3".
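The per-element keyword search in this example can be sketched as below. This is only an illustration: the dictionary layout of a work, the function names `build_keyword` and `search_each`, and the caller-supplied `search` function are all assumptions, since the text does not specify a search backend.

```python
def build_keyword(work):
    """Join the work title and author name into one search keyword."""
    return f"{work['title'].strip()} {work['author'].strip()}"


def search_each(works, search):
    """Run one search per work and keep the results separate.

    `search` is a caller-supplied function (hypothetical) that maps a
    keyword string to a list of result URLs.
    """
    return {work["title"]: search(build_keyword(work)) for work in works}
```

Keeping one result list per work, rather than merging them, matches the first, second, and third search results described above.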
In this embodiment, the search module 201 searches separately with the keywords of at least two elements and obtains the corresponding search results separately, so the search results are more diverse. On this more diverse data, the acquisition module 202 can obtain more numerous and more complete site information, which ensures that the site information acquisition device 100 of this embodiment can mine more site information that meets the requirements (sites hosting a large number of novels).
Refer to Fig. 3, which is a block diagram of the third embodiment of the site information acquisition device 100 of the present invention. This embodiment is similar to the first or second embodiment above, with the following differences:
The site information acquisition device 100 of this embodiment also comprises a first extraction module 301.
The first extraction module 301 is configured to extract the page link information in the first search result and the second search result. Specifically, the first extraction module 301 obtains from the search module 201 the code corresponding to each record in the first search result and the second search result, locates the part corresponding to the page link information, and extracts the corresponding page link information (for example, a URL) from it. Further, after extracting multiple pieces of page link information, the first extraction module 301 compares them to determine whether their domain names are the same. Of several pieces of page link information with the same domain name, only one is retained; pieces with different domain names are all retained.
The above is illustrated by the following example:
The first extraction module 301 extracts the website from every record in the search result pages and generates the data: {novel 1, website a, website b, website c, ...}, {novel 2, website a', website b', website c', ...}, {novel 3, website a'', website b'', website c'', ...}.
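The domain comparison performed by the first extraction module can be sketched as follows. This is a minimal sketch under assumptions: result links are taken to be already extracted as URL strings, the host name from the standard-library `urllib.parse.urlsplit` stands in for the domain name, and the names `unique_sites` and `sites_per_work` are hypothetical.

```python
from urllib.parse import urlsplit


def unique_sites(urls):
    """Keep the first URL seen for each host; drop URLs with no host."""
    seen = set()
    kept = []
    for url in urls:
        host = urlsplit(url).hostname
        if host is None or host in seen:
            continue
        seen.add(host)
        kept.append(url)
    return kept


def sites_per_work(results_by_work):
    """Map each work to its de-duplicated list of result URLs."""
    return {work: unique_sites(urls) for work, urls in results_by_work.items()}
```

The output of `sites_per_work` has the same shape as the {novel, website a, website b, ...} data in the example.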
In this embodiment, the first extraction module 301 extracts the page link information in the search results, which makes it easier for the acquisition module 202 to obtain information from the corresponding websites through that page link information.
Refer to Fig. 4, which is a block diagram of the fourth embodiment of the site information acquisition device 100 of the present invention. This embodiment is similar to the first, second, or third embodiment above, with the following differences:
The site information acquisition device 100 of this embodiment also comprises a statistical module 401 and a screening module 402.
The statistical module 401 is configured to count, for each website corresponding to the page link information, the number of elements and to generate a statistical result. Specifically, the statistical module 401 counts the number of elements for each website corresponding to the page link information extracted by the first extraction module 301.
The screening module 402 is configured to screen out, according to the statistical result, the websites for which the number of elements is greater than a predetermined value, and to generate a screening result. Specifically, the predetermined value is set to M, where M is a positive integer (for example, 6). After receiving the statistical result, the screening module 402 determines from it which websites have more than M novels, retains the site identifiers of those websites, generates the screening result from the retained site identifiers, and discards the others.
The acquisition module 202 is configured to obtain the predetermined portion information in the page of each corresponding site according to the screening result.
The above is illustrated by the following example:
The statistical module 401 counts the number of novels per website and obtains the statistical result: {website a, n1}, {website b, n2}, {website c, n3}, where n1, n2, and n3 are the numbers of novels for website a, website b, and website c respectively. The screening module 402 selects the values greater than M from n1, n2, and n3, for example n1 and n3, whose corresponding websites are website a and website c.
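The counting and screening in this example can be sketched with Python's standard `collections.Counter`. The input layout (a mapping from each novel to the sites found for it) and the function names are assumptions for illustration; the threshold `m` plays the role of the predetermined value M.

```python
from collections import Counter


def count_works_per_site(sites_by_work):
    """Count, for each site, how many distinct works it appeared for."""
    counts = Counter()
    for sites in sites_by_work.values():
        # set() ensures a site is counted once per work.
        counts.update(set(sites))
    return counts


def screen_sites(counts, m):
    """Return the sites whose count is strictly greater than m."""
    return sorted(site for site, n in counts.items() if n > m)
```

With three novels, counts of {a: 3, b: 1, c: 3}, and M = 2, the screening keeps website a and website c, as in the example.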
In this embodiment, the page link information extracted by the first extraction module 301 is counted and screened, so websites with more novels can be obtained.
Refer to Fig. 5, which is a block diagram of the fifth embodiment of the site information acquisition device 100 of the present invention. This embodiment is similar to any one of the first to fourth embodiments above, with the following differences:
The site information acquisition device 100 of this embodiment also comprises a judging module 501.
The judging module 501 is configured to judge whether the predetermined portion information contains predetermined content and to generate a judgment result. Specifically, the predetermined content may be words such as "novel", "reading", or "latest chapters".
The sorting module 203 is also configured to discard the data record when the judgment result is that the predetermined portion information does not contain the predetermined content, and to retain the data record when the judgment result is that the predetermined portion information contains the predetermined content.
The above is illustrated by the following example:
The judging module 501 judges whether the predetermined portion information corresponding to the website (for example, the title information) contains one of the words "novel", "reading", "latest chapters", and so on. If the title of the website contains none of these keywords, the website is discarded. The websites finally retained can be treated as newly discovered novel websites.
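A minimal sketch of this title check follows. The English terms are stand-ins for the translated examples in the text, and the record layout and the names `PREDETERMINED_CONTENT`, `is_novel_site`, and `keep_records` are assumptions for illustration.

```python
# English stand-ins for the predetermined content named in the text.
PREDETERMINED_CONTENT = ("novel", "reading", "latest chapters")


def is_novel_site(title, terms=PREDETERMINED_CONTENT):
    """Return True if the title contains at least one predetermined term."""
    lowered = title.lower()
    return any(term in lowered for term in terms)


def keep_records(records, terms=PREDETERMINED_CONTENT):
    """Retain records whose title passes the check; discard the rest."""
    return [r for r in records if is_novel_site(r["title"], terms)]
```

A plain substring test is the simplest reading of "contains"; a real deployment might need tokenization or language-specific matching.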
In this embodiment, judging whether the predetermined portion information of a website contains the predetermined content makes it possible to determine whether the website is a novel website. A novel website is retained; any other website is discarded. This helps to further screen out websites that meet the requirements.
Refer to Fig. 6, which is a block diagram of the sixth embodiment of the site information acquisition device 100 of the present invention. This embodiment is similar to any one of the first to fifth embodiments above, with the following differences:
The site information acquisition device 100 of this embodiment also comprises a second extraction module 601.
The second extraction module 601 is configured to extract the at least two pieces of attribute information from the files corresponding to the first element and the second element respectively. Specifically, the second extraction module 601 may extract the at least two pieces of attribute information from the files corresponding to the first element and the second element by means of text search and recognition techniques.
The above is illustrated by the following example:
The second extraction module 601 extracts the corresponding at least two pieces of attribute information, for example the document title and the author name, from files provided by an operator or randomly selected by the machine (for example, ".pdf" document files, ".doc" document files, ".mp3" audio files, or ".rmvb" video files).
In this embodiment, the second extraction module 601 extracts at least two pieces of attribute information of the elements in the basic data set for use as search keywords, which helps to mine the above site information automatically.
Refer to Fig. 7, which is a flowchart of the first embodiment of the site information acquisition method of the present invention.
The site information acquisition method of this embodiment comprises the following steps:
Step 701: the search module 201 searches with the keywords of the elements in a basic data set and obtains corresponding search results. The search module 201 may search for the keywords in a database or on the Internet. In this embodiment, the basic data set comprises any one or a combination of a literary works set, an art works set, a musical works set, and a film and television works set. The basic data set contains at least N1 works of a first type and N2 works of a second type, where N1 and N2 are positive integers; that is, the basic data set contains at least one work of the first type and at least one work of the second type. The basic data set may also contain at least one work of a third type. The first type and the second type relate to any one of the popularity, sales volume, recommendation level, and the like of the works. For example, the first type and the second type are a first popularity level (most popular) and a second popularity level (fairly popular), respectively, or a first sales level (best-selling) and a second sales level (fairly well-selling), respectively. The works in the basic data set may be chosen manually or chosen automatically by the computer/mobile device. For example, the basic data set contains N1 of the most popular novels, N2 moderately popular novels, N3 unpopular novels, and so on, where N3 is also a positive integer. A work may be any one or a combination of a literary work, an art work, a musical work, a film or television work, and the like.
Step 702: the acquisition module 202 obtains predetermined portion information in a page of the corresponding site according to the page link information (for example, a URL) in the search results. Specifically, in this embodiment, the acquisition module 202 analyzes the page of the website corresponding to each record in the search results (it reads the code of the page and locates the predetermined portion) and then obtains the predetermined portion information, for example the title information; that is, the predetermined portion is the title portion of the page of the website.
Step 703: the sorting module 203 organizes the site identifier corresponding to the page link information and the predetermined portion information into a corresponding data record, and generates a site information acquisition result according to the data records. The sorting module 203 also obtains the site identifier corresponding to the page link information from the page link information or from the predetermined portion information (for example, by obtaining the domain name from the page link information or the website name from the predetermined portion information). Specifically, the sorting module 203 integrates the site identifier (for example, the website name) corresponding to each piece of page link information and the predetermined portion information (for example, the title information) in the page into one data record, and arranges multiple such data records to generate the site information acquisition result.
The above is illustrated with the following example:
In step 701, for a pre-selected set of novels (the basic data set), the search module 201 searches the Internet for each novel using keywords related to that novel, obtains the relevant search results, and sends them to the acquisition module 202. In step 702, the acquisition module 202 captures the page link information (for example, a URL) of each record in the search results, visits the corresponding site through that page link information, reads the page code, locates the predetermined portion information (for example, title information), and then sends the page link information and the predetermined portion information to the sorting module 203. For example, the acquisition module 202 crawls the home page of the site according to the page link information and extracts the content of the Title element in the HTML (Hyper Text Markup Language). In step 703, the sorting module 203 integrates the page link information and the predetermined portion information into one data record and then generates the site information acquisition result from multiple such data records, for example, data of the form {site 1, title 1}, {site 2, title 2}, {site 3, title 3}, which is stored in the corresponding storage space.
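Steps 701 to 703 can be sketched roughly as follows. This is only an illustrative Python sketch, not the implementation of the embodiment: the names `TitleParser`, `extract_title`, and `acquire_site_info` are hypothetical, and the `search` and `fetch` callables stand in for the search module and for whatever page download mechanism is actually used.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._seen_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._seen_title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._seen_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Step 702 (assumed form): locate the title portion in the page code."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def acquire_site_info(
    keywords: Iterable[str],
    search: Callable[[str], List[str]],  # hypothetical: keyword -> list of URLs
    fetch: Callable[[str], str],         # hypothetical: URL -> page HTML
) -> List[Dict[str, str]]:
    records = []
    for keyword in keywords:                     # step 701: search per keyword
        for url in search(keyword):
            title = extract_title(fetch(url))    # step 702: get the title
            records.append({"site": url, "title": title})  # step 703: one record
    return records
```

Passing the search and fetch functions in as parameters keeps the sketch independent of any particular search engine or HTTP library, neither of which the text specifies.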
In this embodiment, because the combination of the search module 201, the acquisition module 202, and the sorting module 203 is used to mine site information on the Internet, the mining can be performed automatically. The operator only needs to provide initial data (for example, information about several novels) to mine site information that meets the requirements (sites carrying a large number of novels), and little manual effort is needed in the process.
Further, the sorting module 203 may also tidy the page link information. For example, when a URL contains many characters, the module extracts the domain name from the URL and retains only the domain name, which makes the mined site information more concise.
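As a rough illustration of reducing a long URL to its domain name, the following Python sketch uses the standard library's `urllib.parse.urlsplit`. The function name `site_identity` is hypothetical, and treating the host name as the site identifier is an assumption made for this example.

```python
from urllib.parse import urlsplit


def site_identity(url: str) -> str:
    """Keep only the host name of a URL (assumed to serve as the site identifier)."""
    # urlsplit only recognizes a host after "//", so add it for bare
    # inputs such as "example.org/list".
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""
```

For instance, `site_identity("http://www.example.com/book/123/chapter-5.html?from=search")` yields `"www.example.com"`.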
Refer to Fig. 8, which is a flowchart of the second embodiment of the site information acquisition method of the present invention. This embodiment is similar to the first embodiment described above, with the following differences:
The elements comprise at least a first element and a second element, and the search results comprise at least a first search result corresponding to the first element and a second search result corresponding to the second element. In this embodiment, the first element corresponds to a work of the first type, and the second element corresponds to a work of the second type.
The step of searching with the keywords of the elements in the basic data set and obtaining the corresponding search results (that is, step 701) comprises:
Step 7011: the search module 201 searches separately with the first keyword of the first element and the second keyword of the second element.
Step 7012: the search module 201 obtains the first search result and the second search result, respectively.
In the site information acquisition device 100 of this embodiment, the first keyword comprises at least two pieces of attribute information of the first element, and the second keyword comprises at least two pieces of attribute information of the second element. The at least two pieces of attribute information of the first element comprise the work title and the author name, and the at least two pieces of attribute information of the second element likewise comprise the work title and the author name.
The above is illustrated with the following example:
The search module 201 searches separately with "novel 1 author 1", "novel 2 author 2", and "novel 3 author 3" as keywords, and obtains a first search result corresponding to "novel 1 author 1", a second search result corresponding to "novel 2 author 2", and a third search result corresponding to "novel 3 author 3".
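The keyword construction in this example can be sketched as follows. The `Work` type and the `build_keyword` and `search_each` functions are hypothetical names introduced for illustration, and the `search` callable is a stand-in for the search module.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class Work(NamedTuple):
    """One element of the basic data set (assumed shape)."""
    title: str
    author: str


def build_keyword(work: Work) -> str:
    # Combine two attributes, work title and author name, into one keyword.
    return f"{work.title} {work.author}"


def search_each(
    works: Iterable[Work],
    search: Callable[[str], List[str]],  # hypothetical: keyword -> result URLs
) -> Dict[str, List[str]]:
    # One separate search per element, so each work keeps its own result list.
    return {work.title: search(build_keyword(work)) for work in works}
```

Keeping the results keyed by work, rather than merging them, preserves the information needed later to count how many works each site carries.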
In this embodiment, because the search module 201 searches separately with the keywords of at least two elements and obtains the corresponding search results separately, the obtained search results are more diverse. The acquisition module 202 can therefore obtain more, and more complete, site information from a more diverse data basis, so that the site information acquisition device 100 of this embodiment can mine more sites that meet the requirements (sites carrying a large number of novels).
Refer to Fig. 9, which is a flowchart of the third embodiment of the site information acquisition method of the present invention. This embodiment is similar to the first or second embodiment described above, with the following differences:
After the step of obtaining the first search result and the second search result respectively (that is, step 7012), and before the step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results (that is, step 702), the method further comprises the following step:
Step 901: the first extraction module 301 extracts the page link information from the first search result and the second search result. Specifically, the first extraction module 301 obtains from the search module 201 the code corresponding to each record in the first and second search results, locates the portion corresponding to the page link information, and extracts the page link information (for example, a URL) from it. Further, after extracting multiple pieces of page link information, the first extraction module 301 compares them and determines whether their domain names are identical. Of multiple pieces of page link information with the same domain name, only one is retained; those with different domain names are all retained.
The above is illustrated with the following example:
The first extraction module 301 extracts the sites from each record in the search result pages and generates data of the form: {novel 1, site a, site b, site c, ...}, {novel 2, site a', site b', site c', ...}, {novel 3, site a'', site b'', site c'', ...}.
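The link extraction and same-domain deduplication of step 901 might look like the Python sketch below. It assumes that each search result is available as HTML whose records are `<a href="...">` anchors, which the text does not state; `LinkCollector`, `extract_links`, and `dedupe_by_domain` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a search result page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(result_html: str) -> List[str]:
    collector = LinkCollector()
    collector.feed(result_html)
    collector.close()
    return collector.links


def dedupe_by_domain(urls: List[str]) -> List[str]:
    """Keep only the first link seen for each domain name."""
    seen = set()
    kept = []
    for url in urls:
        domain = urlsplit(url).hostname
        if domain is None or domain in seen:
            continue  # skip links without a host and repeated domains
        seen.add(domain)
        kept.append(url)
    return kept
```

Applying `dedupe_by_domain(extract_links(html))` to the result page of each novel gives one link per site for that novel, matching the {novel, site a, site b, ...} shape of the example data.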
In this embodiment, because the first extraction module 301 extracts the page link information from the search results, the acquisition module 202 can more readily obtain information from the corresponding sites through that page link information.
Refer to Fig. 10, which is a flowchart of the fourth embodiment of the site information acquisition method of the present invention. This embodiment is similar to the first, second, or third embodiment described above, with the following differences:
After the step of extracting the page link information from the first search result and the second search result (that is, step 901), and before the step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results (that is, step 702), the method further comprises:
Step 1001: for the sites corresponding to the page link information, the statistical module 401 counts the number of elements and generates a statistical result. Specifically, the statistical module 401 counts the number of elements for each site corresponding to the page link information extracted by the first extraction module 301.
Step 1002: according to the statistical result, the screening module 402 screens out the sites for which the number of elements is greater than a predetermined value and generates a screening result. Specifically, the predetermined value is set to M, where M is a positive integer (for example, 6). After receiving the statistical result, the screening module 402 determines which sites have a novel count greater than M, retains the site identifiers of those sites, generates the screening result from the retained site identifiers, and discards the others.
The step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results (that is, step 702) is:
The acquisition module 202 obtains, according to the screening result, the predetermined portion information from the page of each corresponding site.
The above is illustrated with the following example:
The statistical module 401 counts the number of novels per site and obtains the statistical result: {site a, n1}, {site b, n2}, {site c, n3}, where n1, n2, and n3 are the numbers of novels for site a, site b, and site c, respectively. The screening module 402 selects the values greater than M from n1, n2, and n3, for example, n1 and n3, whose corresponding sites are site a and site c.
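The counting and screening of steps 1001 and 1002 can be sketched as follows. This is an illustrative Python sketch under the assumption that the extracted links are grouped per work and that a site is identified by its host name; `count_works_per_site` and `screen_sites` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlsplit


def count_works_per_site(links_by_work: Dict[str, List[str]]) -> Counter:
    """Step 1001 (assumed form): count, per site, how many works link to it."""
    counts: Counter = Counter()
    for urls in links_by_work.values():
        # A set ensures that a site is counted at most once per work.
        sites = {urlsplit(url).hostname for url in urls}
        sites.discard(None)
        counts.update(sites)
    return counts


def screen_sites(counts: Counter, m: int) -> List[str]:
    """Step 1002: keep the sites whose count is strictly greater than M."""
    return sorted(site for site, n in counts.items() if n > m)
```

The comparison is strict (`n > m`), following the wording "greater than M" in the text.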
In this embodiment, because the page link information extracted by the first extraction module 301 is counted and screened, sites carrying a large number of novels can be obtained.
Refer to Fig. 11, which is a flowchart of the fifth embodiment of the site information acquisition method of the present invention. This embodiment is similar to any one of the first to fourth embodiments described above, with the following differences:
After the step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results (that is, step 702), and either before or after the step of organizing the obtained information into corresponding data records and generating the site information acquisition result, the method further comprises:
Step 1101: the judgment module 501 judges whether the predetermined portion information contains predetermined content and generates a judgment result. Specifically, the predetermined content may be words such as "novel", "reading", or "latest chapters".
Step 1102: the sorting module 203 discards the data record when the judgment result is that the predetermined portion information does not contain the predetermined content.
Step 1103: the sorting module 203 retains the data record when the judgment result is that the predetermined portion information contains the predetermined content.
The above is illustrated with the following example:
The judgment module 501 judges whether the predetermined portion information (for example, the title) corresponding to the site contains content such as one of "novel", "reading", and "latest chapters". If the title of the site contains none of these keywords, the site is discarded. The sites finally retained can be used as newly discovered novel sites.
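A minimal Python sketch of the title check in steps 1101 to 1103 follows. The English terms are taken from this translation and are only placeholders for whatever predetermined content is actually configured; the record shape and the names `keep_record` and `filter_records` are assumptions for illustration.

```python
from typing import Dict, List, Sequence

# Placeholder terms; the actual predetermined content is configurable.
PREDETERMINED_CONTENT = ("novel", "reading", "latest chapters")


def keep_record(record: Dict[str, str],
                terms: Sequence[str] = PREDETERMINED_CONTENT) -> bool:
    """Step 1101: judge whether the title contains any predetermined term."""
    title = record["title"].lower()
    return any(term in title for term in terms)


def filter_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # Steps 1102 and 1103: discard records that fail the check, keep the rest.
    return [record for record in records if keep_record(record)]
```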
In this embodiment, judging whether the predetermined portion information of a site contains the predetermined content makes it possible to determine whether the site is a novel site. A novel site is retained, and any other site is discarded, which helps to further screen out sites that meet the requirements.
Refer to Fig. 12, which is a flowchart of the sixth embodiment of the site information acquisition method of the present invention. This embodiment is similar to any one of the first to fifth embodiments described above, with the following differences:
Before the step of searching with the keywords of the elements in the basic data set and obtaining the corresponding search results (step 701), the method further comprises:
Step 1201: the second extraction module 601 extracts the at least two pieces of attribute information from the files corresponding to the first element and the second element, respectively. Specifically, the second extraction module 601 may extract the at least two pieces of attribute information from these files by means of text search and recognition techniques.
The above is illustrated with the following example:
The second extraction module 601 extracts at least two pieces of attribute information, for example, the document title and the author name, from files provided by an operator or randomly selected by a machine (for example, ".pdf" document files, ".doc" document files, ".mp3" audio files, or ".rmvb" video files).
In this embodiment, the second extraction module 601 extracts at least two pieces of attribute information of the elements in the basic data set for use as search keywords, which helps to mine the above site information automatically.
In summary, although the present invention has been disclosed above by way of preferred embodiments, these preferred embodiments are not intended to limit the present invention. A person of ordinary skill in the art may make various changes and modifications without departing from the spirit and scope of the present invention, and the protection scope of the present invention is therefore defined by the claims.

Claims (22)

1. A site information acquisition method, characterized in that the method comprises:
searching with the keywords of the elements in a basic data set, and obtaining the corresponding search results;
obtaining predetermined portion information from the page of the corresponding site according to the page link information in the search results;
organizing the site identifier corresponding to the page link information and the predetermined portion information into a corresponding data record, and generating a site information acquisition result from the data records.
2. The site information acquisition method according to claim 1, characterized in that the elements comprise at least a first element and a second element, and the search results comprise at least a first search result corresponding to the first element and a second search result corresponding to the second element;
the step of searching with the keywords of the elements in the basic data set and obtaining the corresponding search results is:
searching separately with a first keyword of the first element and a second keyword of the second element, and obtaining the first search result and the second search result, respectively.
3. The site information acquisition method according to claim 2, characterized in that, after the step of obtaining the first search result and the second search result respectively, and before the step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results, the method further comprises the following step:
extracting the page link information from the first search result and the second search result.
4. The site information acquisition method according to claim 3, characterized in that, after the step of extracting the page link information from the first search result and the second search result, and before the step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results, the method further comprises:
for the sites corresponding to the page link information, counting the number of the elements and generating a statistical result;
screening out, according to the statistical result, the sites for which the number of elements is greater than a predetermined value, and generating a screening result;
the step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results is:
obtaining, according to the screening result, the predetermined portion information from the page of each corresponding site.
5. The site information acquisition method according to claim 4, characterized in that, after the step of obtaining the predetermined portion information from the page of the corresponding site according to the page link information in the search results, the method further comprises:
judging whether the predetermined portion information contains predetermined content, and generating a judgment result;
discarding the data record when the judgment result is that the predetermined portion information does not contain the predetermined content;
retaining the data record when the judgment result is that the predetermined portion information contains the predetermined content.
6. The site information acquisition method according to claim 5, characterized in that the first keyword comprises at least two pieces of attribute information of the first element, and the second keyword comprises at least two pieces of attribute information of the second element;
before the step of searching with the keywords of the elements in the basic data set and obtaining the corresponding search results, the method further comprises:
extracting the at least two pieces of attribute information from the files corresponding to the first element and the second element, respectively.
7. The site information acquisition method according to any one of claims 2 to 6, characterized in that the basic data set includes at least one work of a first type and at least one work of a second type;
the first type and the second type relate to any one of the popularity, sales performance, and recommendation level of the works;
the elements are the individual works in the basic data set.
8. The site information acquisition method according to claim 7, characterized in that the first element corresponds to a work of the first type, and the second element corresponds to a work of the second type.
9. The site information acquisition method according to claim 7, characterized in that the works comprise any one or a combination of more than one of literary works, works of fine art, musical works, and film and television works.
10. The site information acquisition method according to claim 7, characterized in that the at least two pieces of attribute information of the first element comprise the work title and the author name, and the at least two pieces of attribute information of the second element comprise the work title and the author name.
11. The site information acquisition method according to any one of claims 1 to 6, characterized in that the predetermined portion information is the title information in the page of the site.
12. A site information acquisition device, characterized in that the device comprises:
a search module, configured to search with the keywords of the elements in a basic data set and obtain the corresponding search results;
an acquisition module, configured to obtain predetermined portion information from the page of the corresponding site according to the page link information in the search results;
a sorting module, configured to organize the site identifier corresponding to the page link information and the predetermined portion information into a corresponding data record, and to generate a site information acquisition result from the data records.
13. The site information acquisition device according to claim 12, characterized in that the elements comprise at least a first element and a second element, and the search results comprise at least a first search result corresponding to the first element and a second search result corresponding to the second element;
the search module is configured to search separately with a first keyword of the first element and a second keyword of the second element, and to obtain the first search result and the second search result, respectively.
14. The site information acquisition device according to claim 13, characterized in that the device further comprises:
a first extraction module, configured to extract the page link information from the first search result and the second search result.
15. The site information acquisition device according to claim 14, characterized in that the device further comprises:
a statistical module, configured to count, for the sites corresponding to the page link information, the number of the elements and to generate a statistical result;
a screening module, configured to screen out, according to the statistical result, the sites for which the number of elements is greater than a predetermined value, and to generate a screening result;
the acquisition module is configured to obtain, according to the screening result, the predetermined portion information from the page of each corresponding site.
16. The site information acquisition device according to claim 15, characterized in that the device further comprises:
a judgment module, configured to judge whether the predetermined portion information contains predetermined content and to generate a judgment result;
the sorting module is further configured to discard the data record when the judgment result is that the predetermined portion information does not contain the predetermined content, and to retain the data record when the judgment result is that the predetermined portion information contains the predetermined content.
17. The site information acquisition device according to claim 16, characterized in that the first keyword comprises at least two pieces of attribute information of the first element, and the second keyword comprises at least two pieces of attribute information of the second element;
the device further comprises:
a second extraction module, configured to extract the at least two pieces of attribute information from the files corresponding to the first element and the second element, respectively.
18. The site information acquisition device according to any one of claims 13 to 17, characterized in that the basic data set includes at least one work of a first type and at least one work of a second type;
the first type and the second type relate to any one of the popularity, sales performance, and recommendation level of the works;
the elements are the individual works in the basic data set.
19. The site information acquisition device according to claim 18, characterized in that the first element corresponds to a work of the first type, and the second element corresponds to a work of the second type.
20. The site information acquisition device according to claim 18, characterized in that the works comprise any one or a combination of more than one of literary works, works of fine art, musical works, and film and television works.
21. The site information acquisition device according to claim 18, characterized in that the at least two pieces of attribute information of the first element comprise the work title and the author name, and the at least two pieces of attribute information of the second element comprise the work title and the author name.
22. The site information acquisition device according to any one of claims 12 to 17, characterized in that the predetermined portion information is the title information in the page of the site.
CN201310222196.4A 2013-06-05 2013-06-05 Site information acquiring method and device Pending CN104216928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310222196.4A CN104216928A (en) 2013-06-05 2013-06-05 Site information acquiring method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310222196.4A CN104216928A (en) 2013-06-05 2013-06-05 Site information acquiring method and device

Publications (1)

Publication Number Publication Date
CN104216928A true CN104216928A (en) 2014-12-17

Family

ID=52098423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310222196.4A Pending CN104216928A (en) 2013-06-05 2013-06-05 Site information acquiring method and device

Country Status (1)

Country Link
CN (1) CN104216928A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649366A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Method and device for classifying keyword search results

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6625594B1 (en) * 2000-01-18 2003-09-23 With1Click, Inc. System and method for searching a global communication system using a sub-root domain name agent
CN101458713A (en) * 2008-12-29 2009-06-17 北京搜狗科技发展有限公司 Website classifying method and system
CN101944111A (en) * 2010-09-09 2011-01-12 中国科学技术大学 Method and device for searching news video
CN102646101A (en) * 2011-02-22 2012-08-22 阿里巴巴集团控股有限公司 Method and device for recommending product presentation information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王春红,张世民: "搜索引擎", 《大学计算机基础教程》 *

Similar Documents

Publication Publication Date Title
Vishwakarma et al. Detection and veracity analysis of fake news via scrapping and authenticating the web search
US7861151B2 (en) Web site structure analysis
CN106383887B (en) Method and system for collecting, recommending and displaying environment-friendly news data
US8185530B2 (en) Method and system for web document clustering
CN102171689B (en) Method and system for providing search results
CN102760172B (en) Network searching method and network searching system
CN100476830C (en) Network resource searching method and system
CN102473190B (en) Keyword assignment to a web page
CN106095979B (en) URL merging processing method and device
CN104216881A (en) Method and device for recommending individual labels
CN106021418B (en) The clustering method and device of media event
CN102663060B (en) Method and device for identifying tampered webpage
CN103744856A (en) Method, device and system for linkage extended search
CN111259220B (en) Data acquisition method and system based on big data
CN103455758A (en) Method and device for identifying malicious website
US20090259649A1 (en) System and method for detecting templates of a website using hyperlink analysis
CN110069693A (en) Method and apparatus for determining target pages
Sivakumar Effectual web content mining using noise removal from web pages
CN104809173A (en) Search result processing method and device
CN105677921A (en) Method and system for acquiring Internet public opinion data
Chen et al. Finding keywords in blogs: Efficient keyword extraction in blog mining via user behaviors
CN103631796A (en) Website sort management method and electronic device
KR100557874B1 (en) Method of scientific information analysis and media that can record computer program thereof
CN107622125B (en) Information crawling method and device and electronic equipment
Moumtzidou et al. Discovery of environmental nodes in the web

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141217
