CN107330004A - A data collection method based on URL strings - Google Patents

A data collection method based on URL strings

Info

Publication number
CN107330004A
Authority
CN
China
Prior art keywords
url
content
variable
core
configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710440457.8A
Other languages
Chinese (zh)
Inventor
马建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lian Yuan Mdt Infotech Ltd
Original Assignee
Shanghai Lian Yuan Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lian Yuan Mdt Infotech Ltd
Priority to CN201710440457.8A
Publication of CN107330004A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A data collection method based on URL strings, in which an acquisition system collects data on the internet that meets a user's requirements, comprising: a. the acquisition system generates at least one URL link and at least one URL link configuration content based on user input; b. the acquisition system generates at least one core field and at least one core field configuration content based on user input; c. the acquisition system generates a collection rule based on the URL link, the URL link configuration, the core field, and the core field configuration content, and collects data according to the collection rule. The invention obtains the URL link, the URL link configuration content, the core field, and the core field configuration content from user input, generates a collection rule from them, and collects data into a designated system according to that rule. The invention is simple to operate and, through flexible collection rules and thorough screening functions, achieves diversified data collection and has high commercial value.

Description

A data collection method based on URL strings
Technical field
The invention belongs to the technical field of data collection, and in particular relates to a data collection method based on URL strings.
Background art
With the sustained and rapid development of the internet and the information industry, users can obtain massive amounts of data on the internet. This data contains a large amount of valuable information, such as government notices, national economic data, financial information, social information, consumption information, military information, entertainment information, and news. Screening and integrating this information is what each user needs.
At present, the mining of public internet data is carried out by specialized companies. An ordinary user who needs to mine public data that meets specific conditions typically commissions such a company to provide the service.
How to provide ordinary users with an open-source, convenient data collection method is a technical problem that currently needs to be solved, and at present there is no data collection method based on URL strings.
Summary of the invention
In view of the technical deficiencies of the prior art, according to one aspect of the invention, a data collection method based on URL strings is provided, in which an acquisition system collects data on the internet that meets a user's requirements, comprising:
a. the acquisition system generates at least one URL link and at least one URL link configuration content based on user input;
b. the acquisition system generates at least one core field and at least one core field configuration content based on user input;
c. the acquisition system generates a collection rule based on the URL link, the URL link configuration, the core field, and the core field configuration content, and collects data according to the collection rule.
Preferably, multiple URL links are generated in step a as follows:
a1. the user inputs an original URL string;
a2. a wildcard replaces the variable in the original URL string to generate a format URL string, the wildcard corresponding to the variable;
a3. multiple URL links are generated from the format URL string.
Preferably, the URL link configuration content in step a is generated as follows:
a4. based on the user's input, the source code corresponding to the URL link is retrieved and at least one universal character string is obtained;
a5. the URL link configuration content is generated from the universal character string.
Preferably, the URL link configuration content in step a is generated as follows:
a6. the URL link configuration content is generated from a specific string input by the user.
Preferably, the URL link configuration content in step a is generated as follows:
a7. the URL link configuration content is generated from a user-defined script.
Preferably, the URL link configuration content is any one or more of the following:
- two strings that determine a search list, the search list being part of the source code corresponding to the URL link;
- a string that determines an identification variable, the identification variable being used to determine URL links of the same kind;
- a string that determines a necessary variable, the necessary variable being used to select the URL links that contain it;
- a string that determines a rejection variable, the rejection variable being used to select the URL links that do not contain it;
- a string that determines a filter variable, the filter variable being used to determine the part of the URL link that needs to be deleted;
- a string that determines a supplementary prefix, the supplementary prefix being added at the very front of the URL link;
- a string that determines a supplementary suffix, the supplementary suffix being added at the very end of the URL link.
Preferably, the core field in step b is generated as follows:
b1. based on the user's input, the source code corresponding to the URL link is retrieved and a core string is obtained, the core string being unique in the source code corresponding to the URL link;
b2. a wildcard replaces the variable in the core string to generate the core field, the wildcard corresponding to the variable.
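Steps b1 and b2 can be illustrated with a minimal sketch. It assumes that the core string is a fragment of page markup, that `*` marks the variable part, and that the core field contains exactly one wildcard; the function name `extract_core_field` and the sample markup are hypothetical and do not come from the patent.

```python
import re


def extract_core_field(source: str, core_field: str, wildcard: str = "*") -> list[str]:
    """Return the text that the wildcard matches for each occurrence of core_field."""
    if wildcard not in core_field:
        raise ValueError("core field contains no wildcard")
    # Split the core field into the literal text before and after the wildcard.
    prefix, _, suffix = core_field.partition(wildcard)
    # Escape the literal parts so markup characters are not read as regex syntax,
    # and capture the variable part non-greedily.
    pattern = re.escape(prefix) + r"(.*?)" + re.escape(suffix)
    return re.findall(pattern, source, flags=re.DOTALL)


# Hypothetical page source with a title that is unique within the page.
page = '<html><h1 class="title">Fleet exercise begins</h1><p>Body text</p></html>'
titles = extract_core_field(page, '<h1 class="title">*</h1>')
```

Because the core string is required to be unique in the page source, the returned list would normally hold a single value per page.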
Preferably, the core field configuration content in step b is generated as follows:
b3. based on the user's input, the source code corresponding to the URL link is retrieved and at least one universal character string is obtained;
b4. the core field configuration content is generated from the universal character string.
Preferably, the core field configuration content in step b is generated as follows:
b5. the core field configuration content is generated from a specific string input by the user.
Preferably, the core field configuration content in step b is generated as follows:
b6. the core field configuration content is generated from a user-defined script.
Preferably, the core field configuration content is any one or more of the following:
- a string that determines a necessary variable, the necessary variable being used to select the core fields that contain it;
- a string that determines a replacement variable, the replacement variable being used to replace part of the core field;
- a string that determines a rejection variable, the rejection variable being used to determine the part of the core field that needs to be deleted;
- a string that determines a filter variable, the filter variable being used to select the core fields that do not contain it.
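The four kinds of core field configuration content listed above can be sketched as a small cleaning routine. This is only an illustration under stated assumptions: the class `CoreFieldConfig`, the function `apply_core_field_config`, the order in which the four settings are applied, and the sample titles are all hypothetical, since the text does not fix an order or a data format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CoreFieldConfig:
    necessary: Optional[str] = None   # keep only values that contain this string
    replacements: dict = field(default_factory=dict)  # old text -> new text
    reject: Optional[str] = None      # substring deleted from each kept value
    filtered: Optional[str] = None    # drop values that contain this string


def apply_core_field_config(values: list[str], cfg: CoreFieldConfig) -> list[str]:
    """Screen and clean extracted core field values (assumed order: screen, replace, delete)."""
    result = []
    for value in values:
        # Necessary variable: skip values that do not contain it.
        if cfg.necessary is not None and cfg.necessary not in value:
            continue
        # Filter variable: skip values that contain it.
        if cfg.filtered is not None and cfg.filtered in value:
            continue
        # Replacement variable: swap part of the value for new text.
        for old, new in cfg.replacements.items():
            value = value.replace(old, new)
        # Rejection variable: delete the unwanted part of the value.
        if cfg.reject is not None:
            value = value.replace(cfg.reject, "")
        result.append(value.strip())
    return result


titles = [
    "Navy news: drill begins [ad]",
    "Sports: final score",
    "Navy news: sponsored post",
]
config = CoreFieldConfig(
    necessary="Navy",
    replacements={"drill": "exercise"},
    reject="[ad]",
    filtered="sponsored",
)
cleaned = apply_core_field_config(titles, config)
```

Note that the roles of the rejection and filter variables here follow the core field list, where they are the reverse of their roles in the URL link configuration content.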
Preferably, in step b at least one extended field and extended field configuration content are also generated based on user input,
and in step c the collection rule is generated based on the URL link, the URL link configuration, the core field, the core field configuration content, the extended field, and the extended field configuration content, and data is collected according to the collection rule.
Preferably, the extended field is generated as follows:
b7. based on the user's input, the source code corresponding to the URL link is retrieved and an extended string is obtained, the extended string being unique in the source code corresponding to the URL link;
b8. a wildcard replaces the variable in the extended string to generate the extended field, the wildcard corresponding to the variable.
Preferably, the extended field configuration content is generated as follows:
b9. based on the user's input, the source code corresponding to the URL link is retrieved and at least one universal character string is obtained;
b10. the extended field configuration content is generated from the universal character string.
Preferably, the extended field configuration content is generated as follows:
b11. the extended field configuration content is generated from a user-defined script.
The invention obtains the URL link, the URL link configuration content, the core field, and the core field configuration content from user input, generates a collection rule based on the URL link, the URL link configuration, the core field, and the core field configuration content, and collects data into a designated system according to the collection rule. Browsing, retrieval, trading, and similar uses of the data are then carried out through that system. The invention is simple to operate and practical and, through flexible collection rules and thorough screening functions, achieves diversified data collection and has high commercial value.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a specific embodiment of the invention, a flow diagram of a data collection method based on URL strings;
Fig. 2 shows a first embodiment of the invention, a flow diagram of the acquisition system generating at least one URL link and at least one URL link configuration content based on user input;
Fig. 3 shows a second embodiment of the invention, a flow diagram of the acquisition system generating at least one URL link and at least one URL link configuration content based on user input;
Fig. 4 shows a third embodiment of the invention, a flow diagram of the acquisition system generating at least one URL link and at least one URL link configuration content based on user input;
Fig. 5 shows a fourth embodiment of the invention, a flow diagram of the acquisition system generating at least one core field and at least one core field configuration content based on user input;
Fig. 6 shows a fifth embodiment of the invention, a flow diagram of the acquisition system generating at least one core field and at least one core field configuration content based on user input;
Fig. 7 shows a sixth embodiment of the invention, a flow diagram of the acquisition system generating at least one core field and at least one core field configuration content based on user input;
Fig. 8 shows a seventh embodiment of the invention, a flow diagram of generating at least one extended field and extended field configuration content based on user input; and
Fig. 9 shows an eighth embodiment of the invention, a flow diagram of generating at least one extended field and extended field configuration content based on user input.
Detailed description
To present the technical solution of the invention more clearly, the invention is further described below with reference to the accompanying drawings.
Fig. 1 shows a specific embodiment of the invention, a flow diagram of a data collection method based on URL strings, in which an acquisition system collects data on the internet that meets a user's requirements. Those skilled in the art will understand that, with the sustained and rapid development of the internet and the information industry, users can obtain massive amounts of data on the internet, including a large amount of valuable information such as government notices, national economic data, financial information, social information, consumption information, military information, entertainment information, and news. Screening and integrating this information is what each user needs, and all of this information has a URL on the internet. The invention is based on URL strings and obtains the information the user requires by mining URLs in depth. Specifically, it comprises the following steps:
First, in step S101, the acquisition system generates at least one URL link and at least one URL link configuration content based on user input. Those skilled in the art will understand that, before step S101, information such as the topic name, topic description, collection purpose, topic source, and topic category of this data collection is preferably set in the acquisition system. After collection is complete, this information is used for diversified display, for example as the listing title and summary.
For example, in a preferred embodiment, the user needs to collect the military news column of a certain website. The topic name is filled in as "military news", the topic description as "domestic military news", the collection purpose as either private or external, and the topic source as "X website". The topic category offers multiple choices, such as social networks, finance and economics, e-commerce shopping, automobile data, jobs and careers, real estate data, health care, information and news, entertainment and leisure, and games and competitions; in this embodiment, "information and news" is selected. After these operations are performed, the method proceeds to step S101.
Further, the URL link is the entry configuration for the data the user requires. Continuing the above embodiment, if the user needs to obtain the military news on a certain website, the military news corresponds to one entry configuration, that is, the URL through which a terminal can enter that part of the website. The URL link configuration content is the link configuration for all the information that appears after the military news is clicked. Those skilled in the art will understand that, after the user clicks the military news, a large number of military news headlines appear, and clicking a headline opens the body of that news item. In the invention, the URL link configuration content is the sub-link information of all the news items in the military news. How the URL link and the URL link configuration content are obtained is described further in the embodiments below and is not repeated here.
Then, in step S102, the acquisition system generates at least one core field and at least one core field configuration content based on user input. Those skilled in the art will understand that, continuing the embodiment of step S101, the core field can be understood as the title and content of a news item. In such an embodiment, when the user clicks a sub-link to view a particular military news item, there is a title and a body. The URL data links of the news are obtained in step S101, and the content behind those links is then collected. The core field is obtained by a core field recognition rule, and the core field configuration content is obtained by a core field configuration content recognition rule. These are described further in the embodiments below and are not repeated here.
Finally, in step S103, the acquisition system generates a collection rule based on the URL link, the URL link configuration, the core field, and the core field configuration content, and collects data according to the collection rule. After steps S101 and S102 are complete, the system has the URL link, the URL link configuration, the core field, and the core field configuration content, which together form a complete collection rule. Further, the URL link gives the news link to be collected, and the URL link configuration gives the data entries of all sub-links under that news link. The core field then gives the title and content of the news in each sub-link, and the core field configuration content drives operations on the title and content such as keyword and word replacement, cleaning and filtering, and formatting. Through these rules, all the information of the news the user requires is obtained.
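One way to picture the collection rule of step S103 is as a record that bundles the four inputs. The sketch below is an assumption about representation only: the class `CollectionRule`, the function `build_collection_rule`, the dictionary keys, and the sample path `/news/military` are hypothetical, and the patent does not say how the rule is stored.

```python
from dataclasses import dataclass


@dataclass
class CollectionRule:
    url_links: list          # entry links produced in step S101
    url_link_config: dict    # URL link configuration content, keyed by kind
    core_fields: dict        # field name -> core field pattern from step S102
    core_field_config: dict  # core field configuration content, keyed by kind


def build_collection_rule(url_links, url_link_config, core_fields, core_field_config):
    """Bundle the four inputs, requiring at least one entry of each as in steps a and b."""
    parts = {
        "URL link": url_links,
        "URL link configuration content": url_link_config,
        "core field": core_fields,
        "core field configuration content": core_field_config,
    }
    for name, value in parts.items():
        if not value:
            raise ValueError(f"at least one {name} is required")
    return CollectionRule(
        list(url_links), dict(url_link_config), dict(core_fields), dict(core_field_config)
    )


rule = build_collection_rule(
    ["www.XXX.com/news/military"],       # hypothetical entry link
    {"supplementary_prefix": "https://"},
    {"title": "<h1>*</h1>"},
    {"rejection_variable": "[ad]"},
)
```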
Fig. 2 shows a first embodiment of the invention, a flow diagram of the acquisition system generating at least one URL link and at least one URL link configuration content based on user input. Those skilled in the art will understand that this part describes in detail how to obtain the content data the user requires on a website quickly and conveniently. Specifically, it comprises the following steps:
First, in step S1011, the user inputs an original URL string. Those skilled in the art will understand that data collection is sometimes not limited to the current page: when the user needs to collect the URL data links of multiple pages, data must be collected from all of those pages. In such an embodiment, one original URL string is input first; it corresponds to the information of one page. The key position is then replaced in the later steps to obtain the page data links of everything that needs to be collected.
Then, in step S1012, a wildcard replaces the variable in the original URL string to generate a format URL string, the wildcard corresponding to the variable. In a preferred embodiment, suppose the user needs to collect the postings for a certain position on a recruitment website, and the postings span 30 pages of job information. In this embodiment the wildcard stands for the page number, and varying the page number collects the information of all pages. Preferably, the acquisition system configuration provides an interface with an auto-increment option, a page range selector, a step-length field, and so on; setting the wildcard there yields the format URL string that is used to collect all pages. Further, the variable is not limited to the page number and may also be a date or the like.
Next, in step S1013, multiple URL links are generated from the format URL string. In such an embodiment, the user can run a debug connection attempt to check the collection operation, ensuring that the data to be collected can be connected to effectively, and can check the accuracy of the data to be collected through the links shown by the debug output. The multiple URL links are the data collection information required.
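Steps S1011 to S1013 can be sketched as a simple expansion of the wildcard over a page range. This is a minimal illustration, assuming `*` as the wildcard and an integer page number as the variable; the function name `expand_format_url` and the path `/jobs/list?page=` are hypothetical, while the 30-page range follows the recruitment example in the text.

```python
def expand_format_url(format_url: str, start: int, stop: int,
                      step: int = 1, wildcard: str = "*") -> list[str]:
    """Generate one URL link per page number by substituting the wildcard."""
    if wildcard not in format_url:
        raise ValueError("format URL string contains no wildcard")
    # start, stop and step mirror the page range and step-length settings;
    # stop is inclusive so that a 1..30 range yields 30 links.
    return [format_url.replace(wildcard, str(page))
            for page in range(start, stop + 1, step)]


# Hypothetical format URL string derived from one original listing page.
links = expand_format_url("www.XXX.com/jobs/list?page=*", 1, 30)
```

A date variable could be handled the same way by iterating over formatted dates instead of integers.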
After step S1013 is complete, the method proceeds to step S1014: based on the user's input, the source code corresponding to the URL link is retrieved and at least one universal character string is obtained. Further, following steps S1011 to S1012, once the page information for all the data to be collected has been obtained, the URL link configuration content is preferably obtained. For example, the user clicks the URL link, obtains the corresponding source code using the browser's built-in inspect function, and finds in the source code the title, content, and other information of the URL links to be obtained. In such source code, because all the URL links the user requires are on the same page, each of them begins with a universal character string. For example, to find the universal character string, take <a class="position_link" href="www.XXX.com/123456.httm": the part www.XXX.com/123456.httm is replaced with [url], giving <a class="position_link" href="[url]".
Finally, step S1015 is performed: the URL link configuration content is generated from the universal character string. Those skilled in the art will understand that, continuing the embodiment of step S1014, multiple URL links are generated from the universal character string, for example <a class="position_link" href="www.XXX.com/123456.httm", <a class="position_link" href="www.XXX.com/74874.httm", <a class="position_link" href="www.XXX.com/12345641.httm", <a class="position_link" href="www.XXX.com/8414741.httm", and so on. These URL links are the URL configuration content the user requires.
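Steps S1014 and S1015 can be sketched as template matching against the page source. The sketch assumes that the universal character string is written with a `[url]` placeholder and that each link sits on one line of the source; the function name `extract_links` and the sample page are hypothetical, and the link values are taken from the example in the text.

```python
import re


def extract_links(source: str, template: str, placeholder: str = "[url]") -> list[str]:
    """Collect every value that appears where the template has its placeholder."""
    if placeholder not in template:
        raise ValueError("template contains no placeholder")
    # Literal text before and after the placeholder anchors the match.
    prefix, _, suffix = template.partition(placeholder)
    pattern = re.escape(prefix) + r"(.+?)" + re.escape(suffix)
    return re.findall(pattern, source)


# Hypothetical listing page: two job links and one unrelated link.
page_source = (
    '<a class="position_link" href="www.XXX.com/123456.httm">Job A</a>\n'
    '<a class="other" href="www.XXX.com/about.httm">About</a>\n'
    '<a class="position_link" href="www.XXX.com/74874.httm">Job B</a>\n'
)
sub_links = extract_links(page_source, '<a class="position_link" href="[url]"')
```

Only the anchors that share the universal character string are returned, so the unrelated link is skipped.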
Fig. 3 shows a second embodiment of the invention, a flow diagram of the acquisition system generating at least one URL link and at least one URL link configuration content based on user input. As the second embodiment, the invention gives another way of producing the URL link configuration content. For steps S1011 to S1013, refer to the preferred embodiment shown in Fig. 2; they are not described again here.
Next, in step S1016, the URL link configuration content is generated from a specific string input by the user. Those skilled in the art will understand that the embodiment of Fig. 2 is suited to collecting the information of all sub-links at a given address, whereas the second embodiment is suited to collecting the information of a specific sub-link. For example, as in the Fig. 2 embodiment, the user clicks the URL link, obtains the corresponding source code using the browser's built-in inspect function, and finds in the source code the title, content, and other information of the URL link to be obtained. The URL of a particular sub-link is found to be <a class="position_link" href="www.XXX.com/123456.httm". Here the specific string is 123456, and the corresponding URL, <a class="position_link" href="www.XXX.com/123456.httm", is generated by supplementing a prefix and a suffix.
Fig. 4 shows a third embodiment of the invention, a flow diagram of the acquisition system generating at least one URL link and at least one URL link configuration content based on user input. As the third embodiment, it gives a flow for generating the URL link configuration content from a user-defined script.
Those skilled in the art will understand that, for steps S1011 to S1013, the preferred embodiment shown in Fig. 2 may be referred to; they are not described again here.
Next, in step S1017, the URL link configuration content is generated from a user-defined script. In such an embodiment, the user can select the information needed from a large number of sub-links according to their own requirements. For example, the user may need to find all information published on the website on July 15, 2015, or all data links containing the number 7189, or, building on the first embodiment, all job information about nurses. All of these can be achieved with a user-defined script, and the acquisition system generates the URL link configuration content according to that script.
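The text does not say what form a user-defined script takes. As one possible reading, the sketch below treats the script as a function that receives a link and returns whether to keep it; the names `apply_custom_script` and `contains_7189` and the sample links are hypothetical, and the 7189 condition is the example from the text.

```python
from typing import Callable, Iterable


def apply_custom_script(links: Iterable[str], script: Callable[[str], bool]) -> list[str]:
    """Keep the links for which the user-supplied script returns True."""
    return [link for link in links if script(link)]


def contains_7189(link: str) -> bool:
    # User-defined condition: the data link must contain the number 7189.
    return "7189" in link


candidates = [
    "www.XXX.com/jobs/1718954.html",
    "www.XXX.com/jobs/165461.html",
    "www.XXX.com/jobs/7189.html",
]
selected = apply_custom_script(candidates, contains_7189)
```

Conditions on publication date or job title would need the script to see more than the link itself, for example the page source, which this sketch leaves out.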
Across the first to third embodiments of the invention, the URL link configuration content can be cleaned, filtered, and screened in several ways, specifically:
In one embodiment, the URL link configuration content is two strings that determine a search list, the search list being part of the source code corresponding to the URL link. Those skilled in the art will understand that this embodiment determines the URL link configuration content mainly by narrowing the search range. Specifically, the acquisition system preferably provides a list region recognition rule, which filters by locating the paginated list in the source code and thereby reducing the amount of source code that is searched.
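Narrowing the search range with two strings can be sketched as slicing the page source between a start marker and an end marker. This is an assumed reading of the list region recognition rule; the function name `extract_list_region` and the sample markup are hypothetical.

```python
def extract_list_region(source: str, start_marker: str, end_marker: str) -> str:
    """Return the source text between the first start marker and the next end marker."""
    start = source.find(start_marker)
    if start == -1:
        return ""  # start marker absent: no list region found
    start += len(start_marker)
    end = source.find(end_marker, start)
    if end == -1:
        return ""  # end marker absent after the start marker
    return source[start:end]


# Hypothetical page: navigation and footer links surround the list of interest.
page_source = (
    '<nav><a href="/home">Home</a></nav>'
    '<ul id="list"><a href="/jobs/1.html">A</a></ul>'
    '<footer><a href="/about">About</a></footer>'
)
region = extract_list_region(page_source, '<ul id="list">', "</ul>")
```

Later screening steps then only see links inside the region, so navigation and footer links never enter the candidate set.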
In another embodiment, the url links configuration content is a character string of determination identification variable, the knowledge Other variable is used to determine the congener url links, in such embodiments, will be by entering to congener data message The mode of row search carries out matching operation, if the result of matching is not clean accurate enough, you can to enter in other way Row filtering screening, obtains most accurate result, specifically, may be referred to the first embodiment of the present invention to 3rd embodiment, Data acquisition is carried out by the manner.
In another embodiment, the URL link configuration content is a string that determines a necessary variable, the necessary variable being used to select the URL links that contain it; that is, the collected information is screened by an inclusion rule. For example, suppose that a string determining an identification variable has produced a fairly broad set of URL link configuration content: www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html, and www.XXX.com/jobs/1544878.html. If 16 is filled in as the necessary variable string in the acquisition system, the system applies the inclusion rule and selects www.XXX.com/jobs/1654164.html and www.XXX.com/jobs/165461.html as the matching information.
In another embodiment, the URL link configuration content is a string that determines a rejection variable, the rejection variable being used to select the URL links that do not contain it; that is, the collected information is screened by a rejection rule. For example, suppose that a string determining an identification variable has produced a fairly broad set of URL link configuration content: www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html, www.XXX.com/jobs/1544878.html, and the redundant www.XXX.com/jobs/item.position.html. If position is filled in as the rejection variable string in the acquisition system, the redundant URL link is removed, leaving www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html, and www.XXX.com/jobs/1544878.html as the matching result the user requires.
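The inclusion and rejection rules of this and the preceding embodiment can be sketched as two substring screens. The function names `keep_containing` and `drop_containing` are hypothetical. Note that under plain substring matching the value 16 also occurs in 1216461, so this sketch uses the assumed value `/jobs/16` to reproduce the two links named in the text; the patent does not state the exact matching semantics.

```python
def keep_containing(links: list[str], necessary: str) -> list[str]:
    """Inclusion rule: keep the links that contain the necessary variable."""
    return [link for link in links if necessary in link]


def drop_containing(links: list[str], rejected: str) -> list[str]:
    """Rejection rule: keep the links that do not contain the rejection variable."""
    return [link for link in links if rejected not in link]


links = [
    "www.XXX.com/jobs/1216461.html",
    "www.XXX.com/jobs/1654164.html",
    "www.XXX.com/jobs/165461.html",
    "www.XXX.com/jobs/1544878.html",
    "www.XXX.com/jobs/item.position.html",
]

without_redundant = drop_containing(links, "position")
starting_with_16 = keep_containing(without_redundant, "/jobs/16")
plain_16 = keep_containing(without_redundant, "16")
```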
In another embodiment, the URL link configuration content is a string that determines a filter variable, the filter variable being used to determine the part of the URL link that needs to be deleted; that is, the collected information is screened by a filter rule. For example, suppose that a string determining an identification variable has produced a fairly broad set of URL link configuration content: //www.XXX.com/jobs/1216461.html, //www.XXX.com/jobs/1654164.html, //www.XXX.com/jobs/165461.html, and //www.XXX.com/jobs/1544878.html. Every one of these URLs has // in front of www. Filling in // as the filter variable string in the acquisition system removes it, and the final configuration content is www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html, and www.XXX.com/jobs/1544878.html.
In another embodiment, the url links configuration content is a character string of determination supplement prefix, the benefit Filling prefix is used to be embedded into the url links foremost, in such embodiments, if our final needs are https:The contents such as //www.XXX.com, with reference to above-described embodiment, then need to fill in https in supplement prefix one column://, The configuration content then finally given is https://www.XXX.com/jobs/1216461.html, https:// Www.XXX.com/jobs/1654164.html, https://www.XXX.com/jobs/165461.html, https:// www.XXX.com/jobs/1544878.html。
In another embodiment, the url link configuration content is a character string that specifies a supplement suffix, which is appended to the very end of each url link. For example, suppose the character string that specifies the identification variable yields a fairly broad set of url links: www.XXX.com/jobs/1216461, www.XXX.com/jobs/1654164, www.XXX.com/jobs/165461 and www.XXX.com/jobs/1544878. These links lack .html, so .html is entered in the supplement suffix field, and the final configuration content is www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html and www.XXX.com/jobs/1544878.html.
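The supplement prefix and supplement suffix embodiments amount to string concatenation at the two ends of each url link. A minimal sketch follows; the function name supplement and its keyword arguments are hypothetical.

```python
def supplement(urls, prefix="", suffix=""):
    """Insert the prefix at the front and append the suffix at the end of each url link."""
    return [f"{prefix}{url}{suffix}" for url in urls]


bare_links = [
    "www.XXX.com/jobs/1216461",
    "www.XXX.com/jobs/1654164",
]

# https:// and .html are the prefix and suffix used in the examples above.
completed = supplement(bare_links, prefix="https://", suffix=".html")
```

Leaving either argument at its empty default applies only the other one, which mirrors the two embodiments being configured independently.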
Those skilled in the art will understand that the acquisition system can be further refined with functions such as reverse-order data collection and page cookie verification, which are not described further here.
Fig. 5 shows a fourth embodiment of the present invention: a detailed flowchart of how the acquisition system generates at least one core field and at least one core field configuration content based on the user's input. The fourth embodiment gives one specific procedure for generating the core field and the core field configuration content. It corresponds to step S102 and includes the following steps:
First, in step S1021, the source code corresponding to the url link is retrieved based on the user's input to obtain a core character string, which is unique within that source code. Those skilled in the art will understand that step S102 is mainly used to obtain the title and content information of the core field, as well as the information of the extended fields described later. Specifically, the core field is mainly used to query specific words and phrases; when collecting data, the title and the content are preferably extracted separately as core fields. In other embodiments, the core field may also hold information such as a time or a period; this does not affect the embodiments of the present invention and is not described further here.
For example, in a preferred embodiment, job information is to be collected from a recruitment website. After entering the website, a sublink contains a "Senior PHP Development Engineer" entry that includes a job description. Here "Senior PHP Development Engineer" is the title content, and the text of the job description is the body content. The user clicks the "Senior PHP Development Engineer" title, views the source code with the browser's built-in function, and finds the source code fragment that contains "Senior PHP Development Engineer", for example <h1 title="Senior PHP Development Engineer">. This fragment is entered in the recognition rule and serves as the core character string; it is unique within the whole source code.
Then, in step S1022, a wildcard replaces the variable in the core character string to generate the core field; the wildcard corresponds to the variable. Continuing from step S1021, once <h1 title="Senior PHP Development Engineer"> has been obtained as the core character string, it is entered in the recognition rule and "Senior PHP Development Engineer" is replaced with subject, which is the wildcard. The result, <h1 title="subject">, is the core field.
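One way to picture how a core field with a wildcard could be applied to page source is to turn it into a regular expression with a single capture group. The sketch below is an assumption about the mechanism, not the patented implementation; the function name field_to_regex, the sample page and the well-formed h1 markup are all illustrative.

```python
import re


def field_to_regex(field, wildcard):
    """Compile a core field into a regex that captures the text at the wildcard.

    The literal text on either side of the wildcard is escaped so that it
    matches the page source character for character.
    """
    before, after = field.split(wildcard, 1)
    return re.compile(re.escape(before) + r"(.*?)" + re.escape(after), re.S)


# Hypothetical page source containing the unique core character string.
page = '<html><body><h1 title="Senior PHP Development Engineer">...</h1></body></html>'

core_field = '<h1 title="subject">'  # "subject" is the wildcard
match = field_to_regex(core_field, "subject").search(page)
title = match.group(1) if match else None
```

The non-greedy (.*?) keeps the capture from running past the first closing quote, which matters when the same tag pattern could appear again later in the page.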
Further, during data collection, if the generated core field needs filtering, the data can be filtered with data filtering rules, such as replacing words in the title, specifying words the title must contain, and specifying words the title must not contain, so as to obtain the data the user ultimately wants.
Next, in step S1023, the source code corresponding to the url link is retrieved based on the user's input to obtain at least one universal character string. Those skilled in the art will understand that steps S1023 to S1024 mainly address the core field configuration content the user requires. In this embodiment, continuing from steps S1021 to S1022, the text of the job description is the body content. The user clicks anywhere in the job description to obtain the corresponding source code; for example, by searching, the user finds <dd class="job_bt">XXX content</dd> and enters it in the data content recognition rule. Here <dd class="job_bt"></dd> is the universal character string.
Finally, in step S1024, the core field configuration content is generated based on the universal character string. Continuing from step S1023, with <dd class="job_bt">XXX content</dd> obtained, the XXX content is replaced with message to give <dd class="job_bt">message</dd>, which is the core field configuration content.
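The universal character string can be read as a pair of delimiters around the body text. A minimal sketch of that reading, using plain string searches, is shown below; the helper name extract_between and the sample page are hypothetical, and treating message as the split point is an assumption.

```python
def extract_between(source, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = source.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = source.find(end, begin)
    if finish == -1:
        return None
    return source[begin:finish].strip()


# Core field configuration content from the example; "message" marks the body.
template = '<dd class="job_bt">message</dd>'
start, end = template.split("message", 1)

# Hypothetical page source for a job description.
page = '<dl><dd class="job_bt">\n  Build and maintain PHP services.\n</dd></dl>'
body = extract_between(page, start, end)
```

Returning None when either delimiter is missing lets the caller tell "no body found" apart from an empty body.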
Fig. 6 shows a fifth embodiment of the present invention: a detailed flowchart of how the acquisition system generates at least one core field and at least one core field configuration content based on the user's input. Those skilled in the art will understand that, in the fifth embodiment, steps S1021 and S1022 are the same as in the fourth embodiment shown in Fig. 5.
Next, in step S1025, the core field configuration content is generated based on a specific character string entered by the user. Those skilled in the art will understand that the embodiment shown in Fig. 5 is suited to collecting the full content of the sublinks of a website, whereas the fifth embodiment is suited to collecting information from a specific sublink. For example, with reference to the embodiment in Fig. 4, the user clicks the url link, obtains the corresponding source code with the browser's built-in inspection function, and looks in the source code for the information to be collected from the url link, such as the title and the content. Suppose the title of a sublink is found to be marked up as <title>[subject]</title>, and the corresponding title text is "PHP Development Engineer Recruitment - 4399 Game Recruitment - XX Net". Data filtering then yields the key information, "PHP Development Engineer", which is the title required.
Fig. 7 shows a sixth embodiment of the present invention: a detailed flowchart of how the acquisition system generates at least one core field and at least one core field configuration content based on the user's input. Those skilled in the art will understand that, in the sixth embodiment, steps S1021 and S1022 are the same as in the embodiments shown in Fig. 5 or Fig. 6.
Finally, in step S1026, the core field configuration content is generated based on a user-defined script. In this embodiment the user does not need to look through the web page source code. Instead, a custom script performs the search in the acquisition system and obtains the text information the user needs; that text information corresponds to the core field configuration content. The user can thus pick out the required information from the content of a large number of sublinks according to his or her own needs. For example, the user may need to search for information published on the website on a particular day in April, or for all data links that contain "recruitment". Both can be done with a user-defined script, and the acquisition system generates the core field configuration content according to that script.
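The description does not say what form a user-defined script takes. As one hedged reading, it could be a predicate that the acquisition system applies to every collected record. In the sketch below the record layout, the function names and the sample data are all hypothetical.

```python
from datetime import date


def run_custom_script(records, script):
    """Keep the records for which the user-defined script returns True."""
    return [record for record in records if script(record)]


def published_in_april(record):
    """Example script: information published in April."""
    return record["published"].month == 4


def mentions_recruitment(record):
    """Example script: records whose title contains "recruitment"."""
    return "recruitment" in record["title"]


# Hypothetical collected records.
records = [
    {"title": "PHP engineer recruitment", "published": date(2017, 4, 12)},
    {"title": "Company news", "published": date(2017, 4, 20)},
    {"title": "Java engineer recruitment", "published": date(2017, 5, 2)},
]

april_items = run_custom_script(records, published_in_april)
recruitment_items = run_custom_script(records, mentions_recruitment)
```

Passing the script in as a callable keeps the selection logic entirely on the user's side, which matches the idea that the user never has to inspect the page source.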
In the preferred embodiments shown in Figs. 5 to 7, the core field configuration content can take several forms. In one embodiment, the core field configuration content is a character string that specifies a necessary variable. The necessary variable is used to select the core fields that contain it; that is, the collected information is screened through an inclusion rule. For example, after the acquisition system has collected the core field configuration content, a piece of text is entered in the inclusion rule. The core field configuration content must contain that text; any content that does not contain it is rejected.
In another embodiment, the core field configuration content is a character string that specifies a replacement variable, which is used to replace part of the core field. Continuing the preceding embodiment, suppose the system has collected information from a recruitment website and the text "recruitment website" occurs many times in the core field configuration content. Entering a rule in the replacement variable field that replaces "recruitment website" with "Hunting Cloud Net" replaces every occurrence of the text "recruitment website".
In another embodiment, the core field configuration content is a character string that specifies a rejection variable, which identifies the part of the core field that needs to be deleted. Continuing the preceding embodiment, suppose the system has collected information from a recruitment website and the text "recruitment website" occurs many times in the core field configuration content. Entering "recruitment website" in the rejection variable field removes every occurrence of that text.
In another embodiment, the core field configuration content is a character string that specifies a filter variable, which is used to select the core fields that do not contain it. This embodiment can be combined with the earlier one, in which the core field configuration content is a character string that specifies a necessary variable used to select the core fields containing that variable. In the present embodiment, the core fields that contain the filter variable are rejected, and those that do not contain it are kept as the final core fields.
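The four core field configuration options (necessary, replacement, rejection and filter variables) can be combined in one pass over the collected text. The sketch below is one possible arrangement under stated assumptions: the function name apply_core_rules, the parameter names and the order in which the rules run are hypothetical, and every rule is treated as a plain substring operation.

```python
def apply_core_rules(fields, necessary=None, replace=None, reject=None, filter_out=None):
    """Screen and rewrite collected core field text.

    necessary  -- keep only text that contains this string
    filter_out -- drop text that contains this string
    replace    -- (old, new) pair; every occurrence of old becomes new
    reject     -- string deleted from the text wherever it occurs
    """
    result = []
    for text in fields:
        if necessary and necessary not in text:
            continue
        if filter_out and filter_out in text:
            continue
        if replace:
            old, new = replace
            text = text.replace(old, new)
        if reject:
            text = text.replace(reject, "")
        result.append(text)
    return result


# Hypothetical collected text.
collected = [
    "PHP engineer wanted on recruitment website",
    "Intern wanted on recruitment website",
    "Site maintenance notice",
]

screened = apply_core_rules(
    collected,
    necessary="wanted",
    filter_out="Intern",
    replace=("recruitment website", "Hunting Cloud Net"),
)
trimmed = apply_core_rules(["PHP engineer, recruitment website"], reject=", recruitment website")
```

Running the two screening rules before the two rewriting rules means the necessary and filter variables are always tested against the text as collected, not against text that has already been modified.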
Fig. 8 shows a seventh embodiment of the present invention: a detailed flowchart of generating at least one extended field and extended field configuration content based on the user's input. Those skilled in the art will understand that in step S102 the acquisition system generates at least one core field and at least one core field configuration content based on the user's input. After the core field and its configuration content have been configured, the extended field and its configuration content are preferably obtained as well. In this embodiment, in step S103, a collection rule is generated from the url link, the url link configuration content, the core field, the core field configuration content, the extended field and the extended field configuration content, and data is collected based on that collection rule.
First, in step S10271, the source code corresponding to the url link is retrieved based on the user's input to obtain an extended character string, which is unique within that source code. Those skilled in the art will understand that every data collection must include the url link, the url link configuration content, the core field and the core field configuration content, whereas extended fields are an optional enrichment of the collected data rather than a requirement. Up to 48 extended fields, with their configuration content, can be selected; in other embodiments more extended fields may be available after a system upgrade, which does not affect the technical solution of the present invention. For example, in a job posting, the salary range, work location, required work experience, education requirement and so on are presented as extended fields. The user opens the source code corresponding to the url link of the job posting and obtains an extended character string, preferably the one for the salary range; the source code fragment corresponding to the salary range is unique.
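To make the relationship between the mandatory and optional parts concrete, a collection rule could be modeled as a small data structure with a validity check. This is a sketch only: the class names FieldRule and CollectionRule, their attributes and the configuration keys are hypothetical, and only the limit of 48 extended fields comes from the description.

```python
from dataclasses import dataclass, field

MAX_EXTENDED_FIELDS = 48  # limit stated in the description


@dataclass
class FieldRule:
    """A field template plus its configuration content (illustrative layout)."""
    template: str
    config: dict = field(default_factory=dict)


@dataclass
class CollectionRule:
    url_links: list
    url_link_config: dict
    core_fields: list
    extended_fields: list = field(default_factory=list)  # optional part

    def validate(self):
        """Raise ValueError if a mandatory part is empty or the limit is exceeded."""
        if not self.url_links:
            raise ValueError("at least one url link is required")
        if not self.core_fields:
            raise ValueError("at least one core field is required")
        if len(self.extended_fields) > MAX_EXTENDED_FIELDS:
            raise ValueError("too many extended fields")
        return self


rule = CollectionRule(
    url_links=["https://www.XXX.com/jobs/1216461.html"],
    url_link_config={"prefix": "https://", "reject": "position"},
    core_fields=[FieldRule('<h1 title="subject">')],
    extended_fields=[FieldRule('<span class="red">extfield1</span>')],
).validate()
```

Keeping extended_fields as a list with an empty default reflects that a rule without any extended field is still complete.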
Then, in step S10272, a wildcard replaces the variable in the extended character string to generate the extended field; the wildcard corresponds to the variable. In this embodiment, step S10271 yields the extended character string <dd class="job request"><span class="red">1.8k-2.5k</span>, in which <dd class="job request"><span class="red"> is the unique part and 1.8k-2.5k is the salary range. Replacing 1.8k-2.5k with extfield1 gives <dd class="job request"><span class="red">extfield1</span>, and the collected result is 1.8k-2.5k.
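Since several extended fields may be configured, each with its own numbered wildcard, one way to apply them is to loop over the templates and collect the captured values by wildcard name. The sketch below assumes one wildcard of the form extfieldN per template; the function name collect_extended, the sample page and the second template (a city field) are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"extfield\d+")


def collect_extended(source, templates):
    """Return {wildcard name: captured text} for every template that matches."""
    results = {}
    for template in templates:
        found = PLACEHOLDER.search(template)
        if found is None:
            continue  # no wildcard in this template, nothing to capture
        before = template[:found.start()]
        after = template[found.end():]
        pattern = re.escape(before) + "(.*?)" + re.escape(after)
        match = re.search(pattern, source, re.S)
        if match:
            results[found.group()] = match.group(1).strip()
    return results


# Hypothetical page source for a job posting.
page = (
    '<dd class="job request"><span class="red">1.8k-2.5k</span>'
    '<span class="city">Shanghai</span></dd>'
)
templates = [
    '<dd class="job request"><span class="red">extfield1</span>',
    '<span class="city">extfield2</span>',
]

extended = collect_extended(page, templates)
```

Templates that do not match the page are simply absent from the result, so a missing optional field does not stop the rest of the collection.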
Next, in step S10273, the source code corresponding to the url link is retrieved based on the user's input to obtain at least one universal character string. Finally, in step S10274, the extended field configuration content is generated based on the universal character string. The extended field configuration content can be obtained as in the preceding embodiments: a universal character string is obtained from the source code, and the extended field configuration content is obtained from the universal character string.
Fig. 9 shows an eighth embodiment of the present invention: a detailed flowchart of generating at least one extended field and extended field configuration content based on the user's input. Those skilled in the art will understand that steps S10271 and S10272 are the same as in the embodiment shown in Fig. 8.
Finally, in step S10275, the extended field configuration content is generated based on a user-defined script. In this embodiment the user does not need to look through the web page source code. Instead, a custom script performs the search in the acquisition system and obtains the text information the user needs; that text information corresponds to the extended field configuration content. The user can pick out the required information from the content of a large number of sublinks according to his or her own needs, and the acquisition system generates the extended field configuration content according to the custom script.
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to these particular embodiments; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (15)

1. A data collection method based on url character strings, in which an acquisition system collects data on the Internet that meets a user's needs, characterized by comprising:
a. the acquisition system generates at least one url link and at least one url link configuration content based on the user's input;
b. the acquisition system generates at least one core field and at least one core field configuration content based on the user's input;
c. the acquisition system generates a collection rule based on the url link, the url link configuration content, the core field and the core field configuration content, and collects data based on the collection rule.
2. The data collection method according to claim 1, characterized in that multiple url links are generated in step a as follows:
a1. the user enters an original url character string;
a2. a wildcard replaces the variable in the original url character string to generate a format url character string, the wildcard corresponding to the variable;
a3. the multiple url links are generated based on the format url character string.
3. The data collection method according to claim 1, characterized in that the url link configuration content in step a is generated as follows:
a4. the source code corresponding to the url link is retrieved based on the user's input to obtain at least one universal character string;
a5. the url link configuration content is generated based on the universal character string.
4. The data collection method according to claim 1, characterized in that the url link configuration content in step a is generated as follows:
a6. the url link configuration content is generated based on a specific character string entered by the user.
5. The data collection method according to claim 1, characterized in that the url link configuration content in step a is generated as follows:
a7. the url link configuration content is generated based on a user-defined script.
6. The data collection method according to claim 3, 4 or 5, characterized in that the url link configuration content is any one or more of the following:
- the url link configuration content is two character strings that specify a search list, the search list being part of the source code corresponding to the url link;
- the url link configuration content is a character string that specifies an identification variable, the identification variable being used to select url links of the same kind;
- the url link configuration content is a character string that specifies a necessary variable, the necessary variable being used to select the url links that contain the necessary variable;
- the url link configuration content is a character string that specifies a rejection variable, the rejection variable being used to select the url links that do not contain the rejection variable;
- the url link configuration content is a character string that specifies a filter variable, the filter variable being used to identify the part of the url link that needs to be deleted;
- the url link configuration content is a character string that specifies a supplement prefix, the supplement prefix being inserted at the very front of the url link;
- the url link configuration content is a character string that specifies a supplement suffix, the supplement suffix being appended to the very end of the url link.
7. The data collection method according to claim 1, characterized in that the core field in step b is generated as follows:
b1. the source code corresponding to the url link is retrieved based on the user's input to obtain a core character string, the core character string being unique within the source code corresponding to the url link;
b2. a wildcard replaces the variable in the core character string to generate the core field, the wildcard corresponding to the variable.
8. The data collection method according to claim 1, characterized in that the core field configuration content in step b is generated as follows:
b3. the source code corresponding to the url link is retrieved based on the user's input to obtain at least one universal character string;
b4. the core field configuration content is generated based on the universal character string.
9. The data collection method according to claim 1, characterized in that the core field configuration content in step b is generated as follows:
b5. the core field configuration content is generated based on a specific character string entered by the user.
10. The data collection method according to claim 1, characterized in that the core field configuration content in step b is generated as follows:
b6. the core field configuration content is generated based on a user-defined script.
11. The data collection method according to claim 8, 9 or 10, characterized in that the core field configuration content is any one or more of the following:
- the core field configuration content is a character string that specifies a necessary variable, the necessary variable being used to select the core fields that contain the necessary variable;
- the core field configuration content is a character string that specifies a replacement variable, the replacement variable being used to replace part of the core field;
- the core field configuration content is a character string that specifies a rejection variable, the rejection variable being used to identify the part of the core field that needs to be deleted;
- the core field configuration content is a character string that specifies a filter variable, the filter variable being used to select the core fields that do not contain the filter variable.
12. The data collection method according to any one of claims 1 to 11, characterized in that, in step b, at least one extended field and extended field configuration content are also generated based on the user's input,
and in step c, the collection rule is generated based on the url link, the url link configuration content, the core field, the core field configuration content, the extended field and the extended field configuration content, and data is collected based on the collection rule.
13. The data collection method according to claim 12, characterized in that the extended field is generated as follows:
b7. the source code corresponding to the url link is retrieved based on the user's input to obtain an extended character string, the extended character string being unique within the source code corresponding to the url link;
b8. a wildcard replaces the variable in the core character string to generate the extended field, the wildcard corresponding to the variable.
14. The data collection method according to claim 12, characterized in that the extended field configuration content is generated as follows:
b9. the source code corresponding to the url link is retrieved based on the user's input to obtain at least one universal character string;
b10. the extended field configuration content is generated based on the universal character string.
15. The data collection method according to claim 12, characterized in that the extended field configuration content is generated as follows:
b11. the extended field configuration content is generated based on a user-defined script.
CN201710440457.8A 2017-06-12 2017-06-12 A kind of collecting method based on url character strings Pending CN107330004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710440457.8A CN107330004A (en) 2017-06-12 2017-06-12 A kind of collecting method based on url character strings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710440457.8A CN107330004A (en) 2017-06-12 2017-06-12 A kind of collecting method based on url character strings

Publications (1)

Publication Number Publication Date
CN107330004A true CN107330004A (en) 2017-11-07

Family

ID=60195617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710440457.8A Pending CN107330004A (en) 2017-06-12 2017-06-12 A kind of collecting method based on url character strings

Country Status (1)

Country Link
CN (1) CN107330004A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019486A (en) * 2018-07-19 2019-07-16 平安科技(深圳)有限公司 Collecting method, device, equipment and storage medium
CN110019486B (en) * 2018-07-19 2023-04-11 平安科技(深圳)有限公司 Data acquisition method, device, equipment and storage medium
CN109558418A (en) * 2018-12-03 2019-04-02 上海熙菱信息技术有限公司 A kind of method of automatic identification information
CN109558418B (en) * 2018-12-03 2023-04-07 上海熙菱信息技术有限公司 Method for automatically identifying information
WO2021088350A1 (en) * 2019-11-07 2021-05-14 南京莱斯网信技术研究院有限公司 Script-based web service paging data acquisition system

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20171107