CN107330004A - A data collection method based on URL strings - Google Patents
- Publication number
- CN107330004A (application number CN201710440457.8A)
- Authority
- CN
- China
- Prior art keywords
- url
- content
- variable
- core
- configuration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
A data collection method based on URL strings, in which a collection system gathers data that meets a user's needs from the internet, comprising: a. the collection system generates at least one URL link and at least one URL link configuration content based on the user's input; b. the collection system generates at least one core field and at least one core field configuration content based on the user's input; c. the collection system generates a collection rule based on the URL link, the URL link configuration content, the core field and the core field configuration content, and collects data according to the collection rule. The invention obtains the URL link, URL link configuration content, core field and core field configuration content from user input, generates a collection rule from them, and collects data into a designated system according to that rule. The invention is simple to operate and, through flexible collection rules and thorough screening functions, achieves diversified data collection and has high commercial value.
Description
Technical field
The invention belongs to the field of data collection technology, and in particular relates to a data collection method based on URL strings.
Background
With the sustained and rapid development of the internet and the information industry, users can obtain massive amounts of data online, including a great deal of valuable information such as government announcements, national economic data, financial information, social information, consumer information, military information, entertainment information and news. Screening and integrating this information is what each user needs.
At present, the mining of public internet data is operated by specialized companies. An ordinary user who needs to mine public data that meets specific conditions typically commissions a specialized company to provide the corresponding service.
How to provide ordinary users with an open-source, convenient data collection method is a technical problem that currently needs to be solved, and at present there is no data collection method based on URL strings.
Summary of the invention
To address the technical deficiencies of the prior art, according to one aspect of the invention a data collection method based on URL strings is provided, in which a collection system gathers data that meets a user's needs from the internet, comprising:
a. the collection system generates at least one URL link and at least one URL link configuration content based on the user's input;
b. the collection system generates at least one core field and at least one core field configuration content based on the user's input;
c. the collection system generates a collection rule based on the URL link, the URL link configuration content, the core field and the core field configuration content, and collects data according to the collection rule.
Preferably, multiple URL links are generated in step a as follows:
a1. the user inputs an original URL string;
a2. a wildcard replaces the variable in the original URL string to produce a formatted URL string, the wildcard corresponding to the variable;
a3. multiple URL links are generated from the formatted URL string.
Preferably, the URL link configuration content in step a is generated as follows:
a4. based on the user's input, the source code corresponding to the URL link is retrieved and at least one common string is obtained;
a5. the URL link configuration content is generated from the common string.
Preferably, the URL link configuration content in step a is generated as follows:
a6. the URL link configuration content is generated from a specific string input by the user.
Preferably, the URL link configuration content in step a is generated as follows:
a7. the URL link configuration content is generated from a user-defined script.
Preferably, the URL link configuration content is any one or more of the following:
- two strings that delimit a search list, the search list being part of the source code corresponding to the URL link;
- a string that specifies an identification variable, which is used to determine URL links of the same kind;
- a string that specifies a required variable, which is used to select the URL links that contain it;
- a string that specifies a rejection variable, which is used to select the URL links that do not contain it;
- a string that specifies a filter variable, which marks the part of the URL link to be deleted;
- a string that specifies a supplementary prefix, which is added to the very front of the URL link;
- a string that specifies a supplementary suffix, which is added to the very end of the URL link.
Preferably, the core field in step b is generated as follows:
b1. based on the user's input, the source code corresponding to the URL link is retrieved and a core string is obtained, the core string being unique within that source code;
b2. a wildcard replaces the variable in the core string to produce the core field, the wildcard corresponding to the variable.
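As an illustration of steps b1 and b2, the following minimal sketch turns a core string containing one wildcard into a regular expression and extracts the matching text from page source. The function name `extract_core_field`, the `*` wildcard symbol and the sample HTML are assumptions made for this example; the source does not specify them.

```python
import re


def extract_core_field(source: str, core_string: str, wildcard: str = "*") -> list[str]:
    """Return the text found at the wildcard position of core_string.

    Assumes core_string contains exactly one wildcard.
    """
    # Escape the literal parts so HTML punctuation is matched verbatim,
    # then join them with a non-greedy capture group for the variable.
    literal_parts = [re.escape(part) for part in core_string.split(wildcard)]
    pattern = "(.*?)".join(literal_parts)
    return re.findall(pattern, source, flags=re.DOTALL)


# Hypothetical page source and core string.
page = '<html><h1 class="title">Senior PHP Developer</h1><p>Body</p></html>'
titles = extract_core_field(page, '<h1 class="title">*</h1>')
```

Because the literal text around the wildcard is assumed to be unique in the page, the pattern picks out a single title here.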
Preferably, the core field configuration content in step b is generated as follows:
b3. based on the user's input, the source code corresponding to the URL link is retrieved and at least one common string is obtained;
b4. the core field configuration content is generated from the common string.
Preferably, the core field configuration content in step b is generated as follows:
b5. the core field configuration content is generated from a specific string input by the user.
Preferably, the core field configuration content in step b is generated as follows:
b6. the core field configuration content is generated from a user-defined script.
Preferably, the core field configuration content is any one or more of the following:
- a string that specifies a required variable, which is used to select the core fields that contain it;
- a string that specifies a replacement variable, which is used to replace part of the core field;
- a string that specifies a rejection variable, which marks the part of the core field to be deleted;
- a string that specifies a filter variable, which is used to select the core fields that do not contain it.
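The four kinds of core field configuration content listed above could be applied to extracted values roughly as follows. This is a sketch under stated assumptions: the `CoreFieldConfig` class, its attribute names, the order in which the options are applied and the sample values are all hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field


@dataclass
class CoreFieldConfig:
    required: str = ""   # keep only values that contain this string
    replacements: dict[str, str] = field(default_factory=dict)  # old -> new
    rejected: str = ""   # substring to delete from each kept value
    filtered: str = ""   # drop values that contain this string


def apply_core_field_config(values: list[str], config: CoreFieldConfig) -> list[str]:
    result = []
    for value in values:
        # Selection first: required must be present, filtered must be absent.
        if config.required and config.required not in value:
            continue
        if config.filtered and config.filtered in value:
            continue
        # Then rewrite the value: replacements, followed by deletion.
        for old, new in config.replacements.items():
            value = value.replace(old, new)
        if config.rejected:
            value = value.replace(config.rejected, "")
        result.append(value)
    return result


values = ["PHP Engineer (urgent)", "Java Engineer (urgent)", "PHP Intern [ad]"]
config = CoreFieldConfig(
    required="PHP",
    replacements={"Engineer": "Developer"},
    rejected=" (urgent)",
    filtered="[ad]",
)
cleaned = apply_core_field_config(values, config)
```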
Preferably, in step b at least one extended field and extended field configuration content are also generated based on the user's input,
and in step c the collection rule is generated based on the URL link, the URL link configuration content, the core field, the core field configuration content, the extended field and the extended field configuration content, and data is collected according to the collection rule.
Preferably, the extended field is generated as follows:
b7. based on the user's input, the source code corresponding to the URL link is retrieved and an extended string is obtained, the extended string being unique within that source code;
b8. a wildcard replaces the variable in the core string to produce the extended field, the wildcard corresponding to the variable.
Preferably, the extended field configuration content is generated as follows:
b9. based on the user's input, the source code corresponding to the URL link is retrieved and at least one common string is obtained;
b10. the extended field configuration content is generated from the common string.
Preferably, the extended field configuration content is generated as follows:
b11. the extended field configuration content is generated from a user-defined script.
The invention obtains the URL link, URL link configuration content, core field and core field configuration content based on the user's input, generates a collection rule from them, and collects data into a designated system according to that rule; the system then handles browsing, retrieval, trading and similar uses of the data. The invention is simple to operate and practical and, through flexible collection rules and thorough screening functions, achieves diversified data collection and has high commercial value.
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:
Fig. 1 is a flow diagram of a data collection method based on URL strings according to an embodiment of the invention;
Fig. 2 is a flow diagram, according to the first embodiment of the invention, of the collection system generating at least one URL link and at least one URL link configuration content based on the user's input;
Fig. 3 is a flow diagram, according to the second embodiment of the invention, of the collection system generating at least one URL link and at least one URL link configuration content based on the user's input;
Fig. 4 is a flow diagram, according to the third embodiment of the invention, of the collection system generating at least one URL link and at least one URL link configuration content based on the user's input;
Fig. 5 is a flow diagram, according to the fourth embodiment of the invention, of the collection system generating at least one core field and at least one core field configuration content based on the user's input;
Fig. 6 is a flow diagram, according to the fifth embodiment of the invention, of the collection system generating at least one core field and at least one core field configuration content based on the user's input;
Fig. 7 is a flow diagram, according to the sixth embodiment of the invention, of the collection system generating at least one core field and at least one core field configuration content based on the user's input;
Fig. 8 is a flow diagram, according to the seventh embodiment of the invention, of generating at least one extended field and extended field configuration content based on the user's input; and
Fig. 9 is a flow diagram, according to the eighth embodiment of the invention, of generating at least one extended field and extended field configuration content based on the user's input.
Detailed description
To present the technical solution of the invention more clearly, the invention is further described below with reference to the drawings.
Fig. 1 shows a flow diagram of a data collection method based on URL strings according to an embodiment of the invention, in which a collection system gathers data that meets a user's needs from the internet. Those skilled in the art will understand that, with the sustained and rapid development of the internet and the information industry, users can obtain massive amounts of data online, including a great deal of valuable information such as government announcements, national economic data, financial information, social information, consumer information, military information, entertainment information and news. Screening and integrating this information is what each user needs, and all of this information has a URL on the internet. The invention is based on URL strings and obtains the information the user needs by mining URLs in depth. Specifically, the method comprises the following steps:
First, in step S101, the collection system generates at least one URL link and at least one URL link configuration content based on the user's input. Those skilled in the art will understand that, before step S101, the subject name, subject description, collection purpose, subject source, subject category and similar information for this collection task are preferably set in the collection system. After collection finishes, this information serves as the listing title and summary and supports varied presentation of the results.
For example, in a preferred embodiment, the user needs to collect the military news column of a certain website. The subject name is then set to military news and the subject description to domestic military news; the collection purpose can be set to private or external; the subject source is set to website X; and the subject category, which offers multiple choices such as social networks, finance and economics, e-commerce shopping, automotive data, jobs and careers, real estate data, health care, information and news, entertainment and leisure, and games and competitions, is set in this embodiment to information and news. After these operations, the method proceeds to step S101.
Further, the URL link is the entry configuration for the data the user requires. Continuing the above embodiment, if the user needs to obtain the military news on a certain website, the military news corresponds to one entry configuration, that is, the website can be entered from a terminal through the URL of that entry configuration. The URL link configuration content is the link configuration for all the information that appears after the military news is clicked. Those skilled in the art will understand that, after the user clicks the military news, a large number of military news headlines appear, and clicking a headline leads to the body of that news item. In the invention, the URL link configuration content is the sub-link information for all the news items in the military news. Obtaining the URL link and the URL link configuration content is described further in the embodiments below and is not repeated here.
Then, in step S102, the collection system generates at least one core field and at least one core field configuration content based on the user's input. Those skilled in the art will understand that, continuing the embodiment of step S101, the core field can be understood as the title and content of a news item. In such an embodiment, when the user clicks a sub-link to view a particular military news item, there is a title and a content body. The URL data links of the news are obtained in step S101, and the content within those links is then collected: the core field is obtained through a core field recognition rule, and the core field configuration content is obtained through a core field configuration content recognition rule. These are described further in the embodiments below and are not repeated here.
Finally, in step S103, the collection system generates a collection rule based on the URL link, the URL link configuration content, the core field and the core field configuration content, and collects data according to the collection rule. After steps S101 and S102 are complete, the system has obtained the URL link, the URL link configuration content, the core field and the core field configuration content, which together form a complete collection rule. The URL link gives the news link to be collected, and the URL link configuration content gives the entries for all the sub-links within that news link. The core field then gives the title and content of the news in each sub-link, and the core field configuration content drives operations on the title and content such as keyword and word replacement, cleaning and filtering, and formatting. Through these rules, all the news information the user needs is obtained.
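One way to picture the collection rule formed in step S103 is as a record that bundles the four inputs. The sketch below is only an illustration: the `CollectionRule` class, its field names, the dictionary keys and the sample values are assumptions, since the source does not describe how the rule is stored.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionRule:
    url_links: list[str] = field(default_factory=list)
    url_link_config: dict[str, str] = field(default_factory=dict)
    core_fields: dict[str, str] = field(default_factory=dict)
    core_field_config: dict[str, str] = field(default_factory=dict)

    def is_complete(self) -> bool:
        # The method calls for at least one of each of the four components.
        return all(
            [self.url_links, self.url_link_config, self.core_fields, self.core_field_config]
        )


# Hypothetical values for a news collection task.
rule = CollectionRule(
    url_links=["www.XXX.com/news?page=1"],
    url_link_config={"prefix": "https://"},
    core_fields={"title": "<h1>*</h1>", "content": '<div class="body">*</div>'},
    core_field_config={"rejected": "&nbsp;"},
)
```

A rule that lacks any of the four components would be reported as incomplete before collection starts.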
Fig. 2 shows a flow diagram, according to the first embodiment of the invention, of the collection system generating at least one URL link and at least one URL link configuration content based on the user's input. Those skilled in the art will understand that this step describes in detail how to obtain the content data a user needs from a website quickly and conveniently. Specifically, it comprises the following steps:
First, in step S1011, the user inputs an original URL string. Those skilled in the art will understand that a user's collection task is sometimes not limited to the data on the current page; when the user needs to collect URL data links across multiple pages, data must be collected from all of those pages. In such an embodiment, one original URL string is input first, corresponding to one page, and the key position in it is then replaced in the later steps to obtain the data links of all the pages to be collected.
Then, in step S1012, a wildcard replaces the variable in the original URL string to produce a formatted URL string, the wildcard corresponding to the variable. In a preferred embodiment, suppose the user needs to collect the postings for a certain position on a recruitment website, and the postings span 30 pages of job information. In such an embodiment, the wildcard stands for the page: by varying the page number, information is collected from all the pages. Preferably, the collection system configuration provides an auto-increment interface for the website information, a page range selector, a length field and so on, so that setting the wildcard, that is, the formatted URL string, covers all the pages. Further, the variable is not limited to the page number and can also be, for example, a date.
Next, in step S1013, multiple URL links are generated from the formatted URL string. In such an embodiment, the user can run a test connection to check the collection operation, confirming that the data to be collected can actually be reached, and can use the links displayed during the test to check that the data is accurate. The multiple URL links are the data collection information required.
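Steps S1011 to S1013 can be sketched as a simple template expansion. The wildcard token `[page]`, the function name `expand_url_template` and the URL path are assumptions chosen for illustration; the 30-page range follows the recruitment example in the text.

```python
def expand_url_template(
    template: str, start: int, stop: int, step: int = 1, wildcard: str = "[page]"
) -> list[str]:
    """Generate one URL per value of the variable, from start to stop inclusive."""
    return [template.replace(wildcard, str(n)) for n in range(start, stop + 1, step)]


# Formatted URL string: the page number of the original URL is replaced by a wildcard.
formatted = "www.XXX.com/jobs/list?page=[page]"
links = expand_url_template(formatted, start=1, stop=30)
```

The same function would cover other numeric variables; a date variable would need its own value generator.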
After step S1013, in step S1014, based on the user's input, the source code corresponding to the URL link is retrieved and at least one common string is obtained. Further, continuing from steps S1011 to S1012, once the pages containing all the data to be collected have been obtained, the URL link configuration content is preferably obtained. For example, the user clicks the URL link, uses the browser's built-in inspect function to view the corresponding source code, and finds in it the title, content and other information of the URL links to be obtained. Because all the URL links the user requires sit in the same page, each of them begins with a common string in the source code. For example, the common string is located in <a class="position_link" href="www.XXX.com/123456.httm", that is, at www.XXX.com/123456.httm; replacing www.XXX.com/123456.httm with url gives <a class="position_link" href="[url]".
Finally, in step S1015, the URL link configuration content is generated from the common string. Those skilled in the art will understand that, continuing the embodiment of step S1014, multiple URL links are produced from the common string, for example <a class="position_link" href="www.XXX.com/123456.httm", <a class="position_link" href="www.XXX.com/74874.httm", <a class="position_link" href="www.XXX.com/12345641.httm" and <a class="position_link" href="www.XXX.com/8414741.httm". These URL links are the URL configuration content the user requires.
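The extraction described in steps S1014 and S1015 might look like the sketch below, which treats the text around the `[url]` placeholder as the common string and collects every link that follows it in the page source. The function name `extract_links` and the sample markup (including the non-matching `other` anchor) are assumptions for illustration only.

```python
import re


def extract_links(source: str, pattern: str, placeholder: str = "[url]") -> list[str]:
    """Collect whatever appears at the placeholder position of pattern."""
    before, after = pattern.split(placeholder, 1)
    # Match the common string literally and capture the URL non-greedily.
    regex = re.escape(before) + "(.*?)" + re.escape(after)
    return re.findall(regex, source)


# Hypothetical page source containing two job links and one unrelated link.
source = (
    '<a class="position_link" href="www.XXX.com/123456.httm">A</a>'
    '<a class="other" href="www.XXX.com/about.httm">B</a>'
    '<a class="position_link" href="www.XXX.com/74874.httm">C</a>'
)
found = extract_links(source, '<a class="position_link" href="[url]"')
```

Only anchors that begin with the common string are returned, so the unrelated link is skipped.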
Fig. 3 shows a flow diagram, according to the second embodiment of the invention, of the collection system generating at least one URL link and at least one URL link configuration content based on the user's input. As the second embodiment of the invention, it gives another way of producing the URL link configuration content, in which steps S1011 to S1013 follow the preferred embodiment shown in Fig. 2 and are not repeated here.
Next, in step S1016, the URL link configuration content is generated from a specific string input by the user. Those skilled in the art will understand that the embodiment of Fig. 2 suits collecting the information of all the sub-links on a given web address, whereas the second embodiment suits collecting the information of one specific sub-link. For example, continuing the embodiment of Fig. 2, the user clicks the URL link, uses the browser's built-in inspect function to view the corresponding source code, and finds in it the title, content and other information of the URL link to be obtained. The URL of a particular sub-link is found to be <a class="position_link" href="www.XXX.com/123456.httm", where the specific string is 123456; adding the prefix and suffix then yields the corresponding URL <a class="position_link" href="www.XXX.com/123456.httm".
Fig. 4 shows a flow diagram, according to the third embodiment of the invention, of the collection system generating at least one URL link and at least one URL link configuration content based on the user's input. As the third embodiment of the invention, it gives a flow in which the URL link configuration content is generated from a user-defined script.
Those skilled in the art will understand that steps S1011 to S1013 follow the preferred embodiment shown in Fig. 2 and are not repeated here.
Next, in step S1017, the URL link configuration content is generated from a user-defined script. In such an embodiment, the user can pick out the information they need from a large number of sub-links according to their own requirements. For example, the user may need to search for all the information published on the website on July 15, 2015, or for all data links containing the number 7189, or, building on the first embodiment, for all job information about nurses. Each of these can be done with a user-defined script, and the collection system generates the URL link configuration content according to that script.
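A user-defined script can be modelled as a function that decides, link by link, what to keep. The sketch below assumes each sub-link is available as a small record with `url`, `title` and `date` keys; that record layout, the function name `select_links` and the sample data are hypothetical, and the three scripts mirror the three examples in the text.

```python
from typing import Callable, Iterable

LinkRecord = dict[str, str]  # assumed keys: "url", "title", "date"


def select_links(
    records: Iterable[LinkRecord], script: Callable[[LinkRecord], bool]
) -> list[str]:
    """Return the URLs of the records accepted by the user-defined script."""
    return [record["url"] for record in records if script(record)]


records = [
    {"url": "www.XXX.com/jobs/7189.html", "title": "Nurse", "date": "2015-07-15"},
    {"url": "www.XXX.com/jobs/1200.html", "title": "Driver", "date": "2015-07-15"},
    {"url": "www.XXX.com/jobs/3400.html", "title": "Head nurse", "date": "2015-07-16"},
]

published_on_day = select_links(records, lambda r: r["date"] == "2015-07-15")
containing_7189 = select_links(records, lambda r: "7189" in r["url"])
about_nurses = select_links(records, lambda r: "nurse" in r["title"].lower())
```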
Across the first to third embodiments, the URL link configuration content can be cleaned, filtered and screened in several ways, specifically:
In one embodiment, the URL link configuration content is two strings that delimit a search list, the search list being part of the source code corresponding to the URL link. Those skilled in the art will understand that this embodiment determines the URL link configuration content mainly by narrowing the search range. Specifically, the collection system preferably provides a list region recognition rule, which filters by locating the paginated list in the source code so that less of the full source code has to be searched.
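Narrowing the search range with two delimiting strings can be sketched as a plain substring slice. The function name `slice_search_list` and the sample markup are assumptions; the source only states that two strings determine the list region.

```python
def slice_search_list(source: str, start_marker: str, end_marker: str) -> str:
    """Return the part of source between the two markers, or "" if either is missing."""
    start = source.find(start_marker)
    if start == -1:
        return ""
    start += len(start_marker)
    end = source.find(end_marker, start)
    return source[start:end] if end != -1 else ""


# Hypothetical page: only the links inside the list region are of interest.
page = (
    '<nav><a href="/about">About</a></nav>'
    '<ul id="list"><a href="/jobs/1">Job</a></ul>'
    "<footer></footer>"
)
region = slice_search_list(page, '<ul id="list">', "</ul>")
```

Later matching steps then run on `region` instead of the full page, so navigation and footer links never enter the results.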
In another embodiment, the URL link configuration content is a string that specifies an identification variable, which is used to determine URL links of the same kind. In such an embodiment, matching is performed by searching for data of the same kind; if the matches are not clean or accurate enough, they can be filtered further in the other ways to obtain the most accurate result. For details, refer to the first to third embodiments of the invention, in which data is collected in this manner.
In another embodiment, the URL link configuration content is a string that specifies a required variable, which is used to select the URL links that contain it; in other words, the collected information is screened by an inclusion rule. For example, suppose the string specifying the identification variable yields a fairly broad set of URL link configuration content: www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html and www.XXX.com/jobs/1544878.html. If 16 is entered in the collection system as the string specifying the required variable, the system applies the inclusion rule and keeps www.XXX.com/jobs/1654164.html and www.XXX.com/jobs/165461.html as the matches.
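The inclusion rule can be sketched as a containment test. One caveat: under plain substring matching, 16 also occurs inside 1216461, so the sketch assumes the required string is anchored to the path as `/jobs/16`, which reproduces the two matches given in the text. How the described system anchors the match is not stated in the source, and the function name `keep_containing` is hypothetical.

```python
def keep_containing(links: list[str], required: str) -> list[str]:
    """Keep only the links that contain the required string."""
    return [link for link in links if required in link]


links = [
    "www.XXX.com/jobs/1216461.html",
    "www.XXX.com/jobs/1654164.html",
    "www.XXX.com/jobs/165461.html",
    "www.XXX.com/jobs/1544878.html",
]
# Assumption: anchor the required variable at the start of the job id.
matches = keep_containing(links, "/jobs/16")
```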
In another embodiment, the URL link configuration content is a string that specifies a rejection variable, which is used to select the URL links that do not contain it; in other words, the collected information is screened by a rejection rule. For example, suppose the string specifying the identification variable yields a fairly broad set of URL link configuration content: www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html and www.XXX.com/jobs/1544878.html, plus the redundant www.XXX.com/jobs/item.position.html. If position is entered in the collection system as the string specifying the rejection variable, the redundant URL link is removed, leaving www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html and www.XXX.com/jobs/1544878.html as the matches the user requires.
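The rejection rule is the mirror image of the inclusion rule and can be sketched in the same way, using the links and the `position` string from the example above. The function name `drop_containing` is an assumption.

```python
def drop_containing(links: list[str], rejected: str) -> list[str]:
    """Remove every link that contains the rejected string."""
    return [link for link in links if rejected not in link]


links = [
    "www.XXX.com/jobs/1216461.html",
    "www.XXX.com/jobs/1654164.html",
    "www.XXX.com/jobs/165461.html",
    "www.XXX.com/jobs/1544878.html",
    "www.XXX.com/jobs/item.position.html",  # the redundant link
]
kept = drop_containing(links, "position")
```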
In another embodiment, the URL link configuration content is a string that specifies a filter variable, which marks the part of the URL link to be deleted; in other words, the collected information is cleaned by a filter rule. For example, suppose the string specifying the identification variable yields a fairly broad set of URL link configuration content: //www.XXX.com/jobs/1216461.html, //www.XXX.com/jobs/1654164.html, //www.XXX.com/jobs/165461.html and //www.XXX.com/jobs/1544878.html, all of which have // in front of www. If // is entered in the collection system as the string specifying the filter variable, the // is filtered out, and the final configuration content is www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html and www.XXX.com/jobs/1544878.html.
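Filtering out part of each link amounts to deleting a substring. The sketch below removes every occurrence of the filter string, which is an assumption: the source does not say whether only the first occurrence or all occurrences are deleted. The function name `strip_filtered` is also hypothetical.

```python
def strip_filtered(links: list[str], filtered: str) -> list[str]:
    """Delete every occurrence of the filter string from each link."""
    return [link.replace(filtered, "") for link in links]


links = [
    "//www.XXX.com/jobs/1216461.html",
    "//www.XXX.com/jobs/1654164.html",
    "//www.XXX.com/jobs/165461.html",
    "//www.XXX.com/jobs/1544878.html",
]
cleaned = strip_filtered(links, "//")
```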
In another embodiment, the URL link configuration content is a string that specifies a supplementary prefix, which is added to the very front of the URL link. In such an embodiment, if what is ultimately needed is content of the form https://www.XXX.com, then, continuing the above embodiment, https:// is entered in the supplementary prefix field, and the final configuration content is https://www.XXX.com/jobs/1216461.html, https://www.XXX.com/jobs/1654164.html, https://www.XXX.com/jobs/165461.html and https://www.XXX.com/jobs/1544878.html.
In another embodiment, the url link configuration content is a character string that determines a supplementary suffix, the supplementary suffix being appended at the very end of the url link. For example, a broader set of url link configuration contents is obtained according to the character string that determines the identification variable: www.XXX.com/jobs/1216461, www.XXX.com/jobs/1654164, www.XXX.com/jobs/165461 and www.XXX.com/jobs/1544878. These lack ".html", so ".html" is entered in the supplementary suffix field, and the final configuration contents are www.XXX.com/jobs/1216461.html, www.XXX.com/jobs/1654164.html, www.XXX.com/jobs/165461.html and www.XXX.com/jobs/1544878.html.
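The filter variable, supplementary prefix and supplementary suffix described above can be illustrated with a minimal Python sketch. The function name `apply_link_config` and its parameters are hypothetical and do not come from the patent. The sketch also assumes that the filter string is removed wherever it occurs and that filtering is applied before the prefix and suffix are added.

```python
from typing import Iterable, List


def apply_link_config(
    links: Iterable[str],
    filter_str: str = "",
    prefix: str = "",
    suffix: str = "",
) -> List[str]:
    """Apply a filter string, a supplementary prefix and a supplementary suffix."""
    result = []
    for link in links:
        if filter_str:
            # Assumption: the filter variable is deleted wherever it appears.
            link = link.replace(filter_str, "")
        # The prefix goes at the very front, the suffix at the very end.
        result.append(f"{prefix}{link}{suffix}")
    return result


raw_links = ["//www.XXX.com/jobs/1216461", "//www.XXX.com/jobs/1654164"]
final_links = apply_link_config(
    raw_links, filter_str="//", prefix="https://", suffix=".html"
)
```

With the inputs shown, each link first loses its leading "//" and then gains "https://" and ".html", matching the form of the final configuration contents in the embodiments above.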
Those skilled in the art will understand that the acquisition system can be further refined with functions such as reverse-order data collection and page cookie verification, which are not described further here.
Fig. 5 shows a fourth embodiment of the present invention: a detailed flow diagram of the acquisition system generating at least one core field and at least one core field configuration content based on user input. The fourth embodiment gives a specific procedure for generating the core field and the core field configuration content, corresponds to step S102, and comprises the following steps:
First, in step S1021, the source code corresponding to the url link is searched based on user input to obtain a core character string, the core character string being unique within the source code corresponding to the url link. Those skilled in the art will understand that step S102 is mainly used to obtain the title and content information of the core field, as well as the information content of the extended fields discussed later. Specifically, the core field is mainly used for querying specific words and phrases; when collecting data, the title and the content are preferably separated out as core fields. In other embodiments, the core field may also hold information such as a time or a period. This does not affect the embodiments of the present invention and is not elaborated here.
For example, in a preferred embodiment, job information is to be collected from a certain recruitment website. After the website is entered, a sub-link contains an entry "Senior PHP Development Engineer" that includes a job description, where "Senior PHP Development Engineer" is the title and the text of the job description is the body content. The user clicks on the title "Senior PHP Development Engineer", views the source code with the browser's built-in function, and locates the source code that contains "Senior PHP Development Engineer", for example <h1_title "Senior PHP Development Engineer">. This <h1_title "Senior PHP Development Engineer"> is entered in the recognition rule. It is the core character string, and it is unique in the entire source code.
Then, in step S1022, the variable in the core character string is replaced with a wildcard to generate the core field, the wildcard corresponding to the variable. Continuing from step S1021, once <h1_title "Senior PHP Development Engineer"> has been obtained as the core character string, it is entered in the recognition rule and "Senior PHP Development Engineer" is replaced with subject, where subject is the wildcard. This yields <h1_title "subject">, that is, the core field.
Further, when collecting data, if the generated core field needs filtering, the data can be filtered through data rules such as the filtering principle, replacement of words in the data title, words that must be contained in the title, and words that must not be contained in the title, so as to obtain the data the user ultimately wants.
Next, in step S1023, the source code corresponding to the url link is searched based on the user's input to obtain at least one universal character string. Those skilled in the art will understand that steps S1023 to S1024 mainly concern the core field configuration content required by the user. In such an embodiment, continuing from steps S1021 to S1022, the text of the job description is the body content. The user clicks anywhere in the job description to obtain the corresponding source code. For example, by searching, the user finds <dd class="job_bt">XXX content</dd> and fills it into the data content recognition rule, where <dd class="job_bt"></dd> is the universal character string.
Finally, in step S1024, the core field configuration content is generated based on the universal character string. Continuing from step S1023, <dd class="job_bt">XXX content</dd> has been obtained; replacing the XXX content with message gives <dd class="job_bt">message</dd>, that is, the core field configuration content.
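Steps S1021 to S1024 can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `make_rule` and `extract` are hypothetical, the page source is invented for the example, and the sketch assumes that a rule is matched literally, with the wildcard standing for the shortest text between the fixed parts. The patent does not specify how matching is implemented.

```python
import re
from typing import Optional


def make_rule(unique_string: str, variable: str, wildcard: str) -> str:
    """Replace the variable part of a unique source-code string with a wildcard."""
    if variable not in unique_string:
        raise ValueError("variable does not occur in the unique string")
    return unique_string.replace(variable, wildcard, 1)


def extract(rule: str, wildcard: str, source: str) -> Optional[str]:
    """Return the text found where the wildcard stands in the rule, or None."""
    before, after = rule.split(wildcard, 1)
    # Fixed parts are matched literally; the wildcard captures lazily.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    match = re.search(pattern, source, re.DOTALL)
    return match.group(1) if match else None


# Hypothetical page source, loosely modelled on the example in the text.
page = (
    '<h1_title "Senior PHP Development Engineer">'
    '<dd class="job_bt">Build web services</dd>'
)

title_rule = make_rule(
    '<h1_title "Senior PHP Development Engineer">',
    "Senior PHP Development Engineer",
    "subject",
)
body_rule = make_rule('<dd class="job_bt">XXX content</dd>', "XXX content", "message")

title = extract(title_rule, "subject", page)
body = extract(body_rule, "message", page)
```

Here `title_rule` plays the role of the core field and `body_rule` the role of the core field configuration content; the same rules could then be applied to the source code of other sub-links.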
Fig. 6 shows a fifth embodiment of the present invention: a detailed flow diagram of the acquisition system generating at least one core field and at least one core field configuration content based on user input. Those skilled in the art will understand that, in the fifth embodiment, steps S1021 and S1022 may refer to the fourth embodiment shown in Fig. 5.
Further, in step S1025, the core field configuration content is generated based on a specific character string input by the user. Those skilled in the art will understand that the embodiment shown in Fig. 5 is suited to collecting the full content of the sub-links under a certain web address, whereas the fifth embodiment is suited to collecting information from specific sub-links. For example, with reference to the embodiment in Fig. 4, the user clicks the url link, obtains the corresponding source code using the browser's built-in inspection function, and looks in the source code for the title, content and other information of the url link to be obtained. The title of a certain sub-link is found to be <title>[subject]</title>, and the corresponding title information is: "PHP Development Engineer Recruitment - 4399 Games Recruitment - XX Net". Further, through data filtering, the key information "PHP Development Engineer" is obtained, which is the required title.
Fig. 7 shows a sixth embodiment of the present invention: a detailed flow diagram of the acquisition system generating at least one core field and at least one core field configuration content based on user input. Those skilled in the art will understand that, in the sixth embodiment, steps S1021 and S1022 may refer to the embodiments shown in Fig. 5 or Fig. 6.
Finally, in step S1026, the core field configuration content is generated based on a user-defined script. In such an embodiment, the user does not need to look through the web page source code. Instead, a search is performed in the acquisition system by means of the custom script to obtain the text information the user needs, and this text information corresponds to the core field configuration content. The user can select the required information from the content of a large number of sub-links according to his or her own needs. For example, the user may need to search for information published on this website on rainy days in April, or for all data links that contain "recruitment"; both can be implemented with a user-defined script. The acquisition system then generates the core field configuration content according to the custom script.
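As a hedged illustration of step S1026, a user-defined script can be modelled as a function that decides which collected pages are relevant. The patent does not specify the script language or interface, so the names `config_from_script` and `mentions_recruitment`, the `(url, text)` page representation and the sample pages are all assumptions made for this sketch.

```python
from typing import Callable, Iterable, List, Tuple

Page = Tuple[str, str]  # assumed representation: (url, page text)


def config_from_script(
    pages: Iterable[Page], script: Callable[[str, str], bool]
) -> List[str]:
    """Keep the text of every page that the user-defined script accepts."""
    return [text for url, text in pages if script(url, text)]


def mentions_recruitment(url: str, text: str) -> bool:
    """Example user script: select pages whose text contains 'recruitment'."""
    return "recruitment" in text.lower()


pages = [
    ("www.XXX.com/jobs/1216461.html", "PHP Development Engineer recruitment"),
    ("www.XXX.com/news/1.html", "Company picnic photos"),
]
selected = config_from_script(pages, mentions_recruitment)
```

Passing the selection logic in as a function keeps the acquisition loop unchanged while users swap in their own criteria, which is one possible reading of "user-defined script".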
With reference to the preferred embodiments shown in Figs. 5 to 7, the core field configuration can take various forms. For example, in one embodiment, the core field configuration content is a character string that determines a necessary variable, the necessary variable being used to determine the core fields that contain it. Core fields containing the necessary variable are determined by screening the collected information through an inclusion rule. For example, the core field configuration content is obtained by the acquisition system, and a piece of text is entered in the inclusion rule. The core field configuration content must contain this text, and any content that does not contain it is rejected.
In another embodiment, the core field configuration content is a character string that determines a replacement variable, the replacement variable being used to replace part of the core field. In such an embodiment, continuing the above example, suppose the system has collected information about recruitment websites and the text "recruitment website" occurs many times in the core field configuration content. By entering in the replacement variable field that "recruitment website" is to be replaced with "Hunting Cloud Net", every occurrence of the text "recruitment website" is replaced.
In another embodiment, the core field configuration content is a character string that determines a rejection variable, the rejection variable being used to determine the part of the core field that needs to be deleted. Continuing the above example, suppose the system has collected information about recruitment websites and the text "recruitment website" occurs many times in the core field configuration content. By entering the text "recruitment website" in the rejection variable field, every occurrence of the text "recruitment website" is removed.
In another embodiment, the core field configuration content is a character string that determines a filter variable, the filter variable being used to determine the core fields that do not contain the filter variable. This embodiment can be combined with the foregoing embodiment in which the core field configuration content is a character string that determines a necessary variable used to determine the core fields containing it. In the present embodiment, core fields containing the filter variable are rejected, and the core fields that do not contain the filter variable are taken as the final core fields.
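The four kinds of core field configuration described above (necessary variable, replacement variable, rejection variable and filter variable) can be combined in a short Python sketch. The function name `apply_core_field_config`, its parameter names, the order in which the rules are applied and the sample texts are assumptions made for illustration; the patent does not prescribe them.

```python
from typing import Dict, Iterable, List, Optional


def apply_core_field_config(
    texts: Iterable[str],
    necessary: Optional[str] = None,
    filtered: Optional[str] = None,
    replacements: Optional[Dict[str, str]] = None,
    rejected: Optional[str] = None,
) -> List[str]:
    """Screen and clean collected core field texts."""
    result = []
    for text in texts:
        # Necessary variable: keep only texts that contain it.
        if necessary and necessary not in text:
            continue
        # Filter variable: drop texts that contain it.
        if filtered and filtered in text:
            continue
        # Replacement variable: replace part of the text.
        for old, new in (replacements or {}).items():
            text = text.replace(old, new)
        # Rejection variable: delete that part of the text.
        if rejected:
            text = text.replace(rejected, "")
        result.append(text)
    return result


collected = [
    "recruitment website: PHP engineer wanted",
    "recruitment website: intern wanted",
    "weather report",
]
cleaned = apply_core_field_config(
    collected,
    necessary="recruitment website",
    filtered="intern",
    replacements={"recruitment website": "Hunting Cloud Net"},
)
```

In this sketch the necessary and filter variables act as whole-item screens, while the replacement and rejection variables edit the text of the items that remain.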
Fig. 8 shows a seventh embodiment of the present invention: a detailed flow diagram of generating at least one extended field and extended field configuration content based on user input. Those skilled in the art will understand that, in step S102, the acquisition system generates at least one core field and at least one core field configuration content based on user input. After the core field and the core field configuration content have been configured, the extended field and the extended field configuration content are preferably collected as well. In such an embodiment, in step S103, a collection rule is generated from the url link, the url link configuration, the core field, the core field configuration content, the extended field and the extended field configuration content, and data is collected based on the collection rule.
First, in step S10271, the source code corresponding to the url link is searched based on user input to obtain an extension character string, the extension character string being unique within the source code corresponding to the url link. Those skilled in the art will understand that every data collection must include the url link, the url link configuration, the core field and the core field configuration content, whereas extended fields are a sufficient but not necessary condition of data collection: at most 48 extended fields and their configuration contents can be chosen. In other embodiments, more extended fields may be chosen as the system is upgraded; this does not affect the technical solution. For example, in recruitment information, the salary range, work location, work experience requirements, educational requirements and the like are presented as extended fields. The user clicks the url link of the recruitment information and obtains an extension character string from the corresponding source code. The extension character string preferably targets the salary range, so the source code corresponding to the salary range is obtained, and this source code is unique.
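One way to picture how these pieces fit together is a small data structure that holds the mandatory parts of a collection rule and the optional extended fields. This is a sketch only: the class name `CollectionRule`, its attribute names and types, and the validation behaviour are assumptions, while the limit of 48 extended fields is taken from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

MAX_EXTENDED_FIELDS = 48  # limit stated in this embodiment


@dataclass
class CollectionRule:
    # Mandatory parts of every data collection.
    url_links: List[str]
    url_link_config: Dict[str, str]
    core_fields: List[str]
    core_field_config: List[str]
    # Optional parts: wildcard name -> rule string.
    extended_fields: Dict[str, str] = field(default_factory=dict)
    extended_field_config: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        mandatory = {
            "url_links": self.url_links,
            "url_link_config": self.url_link_config,
            "core_fields": self.core_fields,
            "core_field_config": self.core_field_config,
        }
        for name, value in mandatory.items():
            if not value:
                raise ValueError(f"{name} must not be empty")
        if len(self.extended_fields) > MAX_EXTENDED_FIELDS:
            raise ValueError("at most 48 extended fields are supported")
```

Validating in `__post_init__` means an incomplete rule is rejected when it is created rather than part-way through a collection run; whether the actual system behaves this way is not stated in the source.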
Then, in step S10272, the variable in the core character string is replaced with a wildcard to generate the extended field, the wildcard corresponding to the variable. In such an embodiment, following step S10271, the extension character string obtained is <dd class><"job request"><span class red>1.8k-2.5k</span>, where <dd class><"job request"><span class red> is the unique part and 1.8k-2.5k is the salary range. Replacing 1.8k-2.5k with extfield1 gives <dd class><"job request"><span class red>extfield1</span>, and the collection result is 1.8k-2.5k.
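Applying several extended field rules to one page can be sketched as follows. The function name `extract_extended_fields` is hypothetical, the second rule (`extfield2`) and the page source are invented for the example, and the sketch assumes plain substring matching on the fixed text before and after each wildcard.

```python
from typing import Dict


def extract_extended_fields(rules: Dict[str, str], source: str) -> Dict[str, str]:
    """Map each wildcard name (e.g. 'extfield1') to the value found in source."""
    values = {}
    for wildcard, rule in rules.items():
        before, sep, after = rule.partition(wildcard)
        if not sep or not after:
            continue  # the rule needs the wildcard followed by closing text
        start = source.find(before)
        if start == -1:
            continue  # the unique leading text is not on this page
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            continue
        values[wildcard] = source[start:end]
    return values


# Hypothetical page source containing the salary range from the example.
page = (
    '<li><dd class><"job request"><span class red>1.8k-2.5k</span></li>'
    "<li><em class city>Shanghai</em></li>"
)
rules = {
    "extfield1": '<dd class><"job request"><span class red>extfield1</span>',
    "extfield2": "<em class city>extfield2</em>",  # invented second rule
}
fields = extract_extended_fields(rules, page)
```

Rules whose leading text is absent from a page are skipped, so a page that lacks some extended fields still yields the ones it does contain.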
Subsequently, in step S10273, the source code corresponding to the url link is searched based on the user's input to obtain at least one universal character string. Finally, in step S10274, the extended field configuration content is generated based on the universal character string. The extended field configuration content can be obtained as in the foregoing embodiments: a universal character string is obtained from the source code, and the extended field configuration content is obtained from the universal character string.
Fig. 9 shows an eighth embodiment of the present invention: a detailed flow diagram of generating at least one extended field and extended field configuration content based on user input. Those skilled in the art will understand that steps S10271 and S10272 may refer to the embodiment shown in Fig. 8.
Finally, in step S10275, the extended field configuration content is generated based on a user-defined script. In such an embodiment, the user does not need to look through the web page source code. Instead, a search is performed in the acquisition system by means of the custom script to obtain the text information the user needs, and this text information corresponds to the extended field configuration content. The user can select the required information from the content of a large number of sub-links according to his or her own needs, and the acquisition system generates the extended field configuration content according to the custom script.
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.
Claims (15)
1. A data collection method based on url character strings, in which data meeting user requirements is collected on the Internet by an acquisition system, characterized by comprising:
a. the acquisition system generating at least one url link and at least one url link configuration content based on user input;
b. the acquisition system generating at least one core field and at least one core field configuration content based on user input;
c. the acquisition system generating a collection rule based on the url link, the url link configuration, the core field and the core field configuration content, and collecting data based on the collection rule.
2. The data collection method according to claim 1, characterized in that a plurality of url links are generated in step a as follows:
a1. the user inputs an original url character string;
a2. a variable in the original url character string is replaced with a wildcard to generate a format url character string, the wildcard corresponding to the variable;
a3. the plurality of url links are generated based on the format url character string.
3. The data collection method according to claim 1, characterized in that the url link configuration content in step a is generated as follows:
a4. the source code corresponding to the url link is searched based on the user's input to obtain at least one universal character string;
a5. the url link configuration content is generated based on the universal character string.
4. The data collection method according to claim 1, characterized in that the url link configuration content in step a is generated as follows:
a6. the url link configuration content is generated based on a specific character string input by the user.
5. The data collection method according to claim 1, characterized in that the url link configuration content in step a is generated as follows:
a7. the url link configuration content is generated based on a user-defined script.
6. The data collection method according to claim 3 or 4 or 5, characterized in that the url link configuration content is any one or more of the following:
- the url link configuration content is two character strings that determine a search list, the search list being part of the source code corresponding to the url link;
- the url link configuration content is a character string that determines an identification variable, the identification variable being used to determine url links of the same type;
- the url link configuration content is a character string that determines a necessary variable, the necessary variable being used to determine the url links that contain the necessary variable;
- the url link configuration content is a character string that determines a rejection variable, the rejection variable being used to determine the url links that do not contain the rejection variable;
- the url link configuration content is a character string that determines a filter variable, the filter variable being used to determine the part of the url link that needs to be deleted;
- the url link configuration content is a character string that determines a supplementary prefix, the supplementary prefix being inserted at the very front of the url link;
- the url link configuration content is a character string that determines a supplementary suffix, the supplementary suffix being appended at the very end of the url link.
7. The data collection method according to claim 1, characterized in that the core field in step b is generated as follows:
b1. the source code corresponding to the url link is searched based on user input to obtain a core character string, the core character string being unique within the source code corresponding to the url link;
b2. a variable in the core character string is replaced with a wildcard to generate the core field, the wildcard corresponding to the variable.
8. The data collection method according to claim 1, characterized in that the core field configuration content in step b is generated as follows:
b3. the source code corresponding to the url link is searched based on the user's input to obtain at least one universal character string;
b4. the core field configuration content is generated based on the universal character string.
9. The data collection method according to claim 1, characterized in that the core field configuration content in step b is generated as follows:
b5. the core field configuration content is generated based on a specific character string input by the user.
10. The data collection method according to claim 1, characterized in that the core field configuration content in step b is generated as follows:
b6. the core field configuration content is generated based on a user-defined script.
11. The data collection method according to claim 8 or 9 or 10, characterized in that the core field configuration content is any one or more of the following:
- the core field configuration content is a character string that determines a necessary variable, the necessary variable being used to determine the core fields that contain the necessary variable;
- the core field configuration content is a character string that determines a replacement variable, the replacement variable being used to replace part of the core field;
- the core field configuration content is a character string that determines a rejection variable, the rejection variable being used to determine the part of the core field that needs to be deleted;
- the core field configuration content is a character string that determines a filter variable, the filter variable being used to determine the core fields that do not contain the filter variable.
12. The data collection method according to any one of claims 1 to 11, characterized in that, in step b, at least one extended field and extended field configuration content are also generated based on user input, and
in step c, the collection rule is generated based on the url link, the url link configuration, the core field, the core field configuration content, the extended field and the extended field configuration content, and data is collected based on the collection rule.
13. The data collection method according to claim 12, characterized in that the extended field is generated as follows:
b7. the source code corresponding to the url link is searched based on user input to obtain an extension character string, the extension character string being unique within the source code corresponding to the url link;
b8. a variable in the core character string is replaced with a wildcard to generate the extended field, the wildcard corresponding to the variable.
14. The data collection method according to claim 12, characterized in that the extended field configuration content is generated as follows:
b9. the source code corresponding to the url link is searched based on the user's input to obtain at least one universal character string;
b10. the extended field configuration content is generated based on the universal character string.
15. The data collection method according to claim 12, characterized in that the extended field configuration content is generated as follows:
b11. the extended field configuration content is generated based on a user-defined script.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710440457.8A CN107330004A (en) | 2017-06-12 | 2017-06-12 | A kind of collecting method based on url character strings |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107330004A true CN107330004A (en) | 2017-11-07 |
Family
ID=60195617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710440457.8A Pending CN107330004A (en) | 2017-06-12 | 2017-06-12 | A kind of collecting method based on url character strings |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107330004A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109558418A (en) * | 2018-12-03 | 2019-04-02 | 上海熙菱信息技术有限公司 | A kind of method of automatic identification information |
CN110019486A (en) * | 2018-07-19 | 2019-07-16 | 平安科技(深圳)有限公司 | Collecting method, device, equipment and storage medium |
WO2021088350A1 (en) * | 2019-11-07 | 2021-05-14 | 南京莱斯网信技术研究院有限公司 | Script-based web service paging data acquisition system |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019486A (en) * | 2018-07-19 | 2019-07-16 | 平安科技(深圳)有限公司 | Collecting method, device, equipment and storage medium |
CN110019486B (en) * | 2018-07-19 | 2023-04-11 | 平安科技(深圳)有限公司 | Data acquisition method, device, equipment and storage medium |
CN109558418A (en) * | 2018-12-03 | 2019-04-02 | 上海熙菱信息技术有限公司 | A kind of method of automatic identification information |
CN109558418B (en) * | 2018-12-03 | 2023-04-07 | 上海熙菱信息技术有限公司 | Method for automatically identifying information |
WO2021088350A1 (en) * | 2019-11-07 | 2021-05-14 | 南京莱斯网信技术研究院有限公司 | Script-based web service paging data acquisition system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Singrodia et al. | A review on web scrapping and its applications | |
CN102597993B (en) | Managing application state information by means of uniform resource identifier (URI) | |
CN102855313B (en) | The method that web page browsing equipment, the generation method of web-page summarization and webpage are opened | |
US8825706B1 (en) | System for and method of processing business personnel information | |
CN107330004A (en) | A kind of collecting method based on url character strings | |
CN104133820B (en) | Content recommendation method and content recommendation device | |
CN102981746A (en) | Handheld electronic device and method for calibrating input of webpage address | |
CN103116635B (en) | Field-oriented method and system for collecting invisible web resources | |
CN107341399A (en) | Assess the method and device of code file security | |
CN108574669A (en) | User behavior tree constructing method and device | |
CN105631007A (en) | Industry technical information collecting method and system | |
CN109684616A (en) | Dynamic statement formula assembles the method and system made a report on | |
CN107943893A (en) | A kind of search processing method and device based on internet | |
CN105095175A (en) | Method and device for obtaining truncated web title | |
CN105117434A (en) | Webpage classification method and webpage classification system | |
CN106547749A (en) | The method and apparatus of collecting webpage data | |
CN106649557A (en) | Semantic association mining method for defect report and mail list | |
CN104268282A (en) | Web banner advertisement displaying method and system | |
CN105808623B (en) | A kind of page access event correlation methodology and device based on search | |
CN112417165A (en) | Method and system for constructing and inquiring lifetime planning knowledge graph | |
CN106445950A (en) | Personalized distributed data mining system | |
CN106227661A (en) | Data processing method and device | |
CN106021304A (en) | Webpage address correcting method and system | |
CN106951540B (en) | Generation method, device, server and the computer-readable storage medium of file directory | |
CN106611022A (en) | Method and device for increasing website search efficiency |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20171107 |