CN110413859A - Web page information search method, apparatus, computer device and storage medium
- Publication number: CN110413859A
- Application number: CN201910568616.1A
- Authority: CN (China)
- Prior art keywords: search, information, target browser, browser, input information
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G: PHYSICS
- G06: COMPUTING; CALCULATING OR COUNTING
- G06F: ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90: Details of database functions independent of the retrieved data types
- G06F16/95: Retrieval from the web
- G06F16/951: Indexing; Web crawling techniques
- G06F16/953: Querying, e.g. by the use of web search engines
- G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
This application relates to the field of data collection, specifically to data crawling by means of a web crawler, and discloses a crawler-based web page information search method, apparatus, computer device and storage medium. The method starts a search container that has a crawler function, invokes a predefined target API through the search container, loads the corresponding target browser driver according to the target API, obtains the user's input information, and runs the target browser in simulation according to the target browser driver in order to perform a web page information search on the input information. In this way, the application can load a virtual target browser driver based on the user's input information and run the target browser to perform a specialized information search, ensuring the accuracy of the search results, keeping information quality in line with user needs, improving working efficiency, promoting the development of big data collection technology, and following the trend toward intelligent technology.
Description
Technical field
This application relates to the field of network information acquisition, and in particular to a crawler-based web page information search method, apparatus, computer device and storage medium.
Background art
In the technical field of network information acquisition, a web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches World Wide Web information according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator and worm.
With the rapid development of networks, the World Wide Web has become a carrier of vast amounts of information, and extracting and using this information efficiently has become a major challenge. Search engines, such as the traditional general-purpose search engines Baidu and Google, serve as tools that help people retrieve information and have become the entry point and guide for users accessing the World Wide Web. However, these general-purpose search engines also have certain limitations, for example:
(1) Users from different fields and backgrounds often have different retrieval purposes and needs, and the results returned by a general-purpose search engine include a large number of web pages that the user does not care about.
(2) The goal of a general-purpose search engine is the largest possible network coverage, so the conflict between limited search engine server resources and unlimited network data resources will deepen further.
(3) As data forms on the World Wide Web grow richer and network technology keeps developing, large amounts of different data such as pictures, databases, audio and video multimedia appear. General-purpose search engines are often powerless against such information-dense and structured data and cannot discover and obtain it well.
In particular, some existing picture crawling tools in this field are written as code in a programming language and crawl one specific website (URL). This requires analyzing in advance whether the website contains the required pictures, and the content of the pictures cannot be controlled. The prior art therefore can no longer meet users' needs, cannot overcome these technical deficiencies, and cannot use a specialized search engine to obtain reliable, specialized search results. This causes considerable interference to big data collection, affects subsequent intelligent recognition results, and in turn affects the trend toward intelligent technology.
Summary of the invention
This application provides a crawler-based web page information search method, apparatus, device and storage medium that can load a virtual target browser driver based on the user's input information and thereby run the target browser to perform a specialized information search, ensuring the accuracy of the search results, keeping information quality in line with user needs, improving working efficiency, promoting the development of big data collection technology, and following the trend toward intelligent technology.
In a first aspect, this application provides a crawler-based web page information search method, comprising:
starting a search container that has a crawler function;
invoking a predefined target API through the search container;
loading a corresponding target browser driver according to the target API;
obtaining input information of a user, and running the target browser in simulation according to the target browser driver to perform a web page information search on the input information.
In a second aspect, this application provides a web page information search apparatus, comprising:
a starting module, configured to start a search container that has a crawler function;
an invoking module, configured to invoke a predefined target API through the search container;
a driving module, configured to load a corresponding target browser driver according to the target API;
a running module, configured to obtain input information of a user and to run the target browser in simulation according to the target browser driver to perform a web page information search on the input information.
In a third aspect, this application further provides a computer device comprising a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing it, to implement the web page information search method described above.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to implement the web page information search method described above.
This application discloses a crawler-based web page information search method, apparatus, computer device and storage medium. A search container that has a crawler function is started, a predefined target API is invoked through the search container, the corresponding target browser driver is loaded according to the target API, the user's input information is obtained, and the target browser is run in simulation according to the target browser driver to perform a web page information search on the input information. In this way, the application can load a virtual target browser driver based on the user's input information and run the target browser to perform a specialized information search, ensuring the accuracy of the search results, keeping information quality in line with user needs, improving working efficiency, promoting the development of big data collection technology, and following the trend toward intelligent technology.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the steps of the crawler-based web page information search method provided by an embodiment of this application.
Fig. 2 is a schematic flowchart of the further steps that follow the step, shown in Fig. 1, of obtaining the user's input information and running the target browser in simulation according to the target browser driver to perform a web page information search on the input information.
Fig. 3 is a schematic flowchart of the specific sub-steps of the step, shown in Fig. 1, of obtaining the user's input information and running the target browser in simulation according to the target browser driver to perform a web page information search on the input information.
Fig. 4A is a schematic structural block diagram of a web page information search apparatus provided by an embodiment of this application.
Fig. 4B is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.
The flowcharts shown in the drawings are only illustrative. They need not include all the content and operations/steps, nor must the steps be executed in the order described. For example, some operations/steps may be decomposed, combined or partially merged, so the actual execution order may change according to the actual situation.
The embodiments of this application provide a crawler-based web page information search method, apparatus, computer device and storage medium that can load a virtual target browser driver based on the user's input information and thereby run the target browser to perform a specialized information search, ensuring the accuracy of the search results, keeping information quality in line with user needs, improving working efficiency, promoting the development of big data collection technology, and following the trend toward intelligent technology.
Some embodiments of this application are described in detail below with reference to the drawings. Where there is no conflict, the following embodiments and the features in them may be combined with one another.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the steps of the crawler-based web page information search method provided by an embodiment of this application. It should be noted that the method of this embodiment can be applied to a user terminal, which may be a desktop computer, laptop, tablet computer, mobile phone or other artificial intelligence device.
As shown in Fig. 1, the crawler-based web page information search method of this application may include, but is not limited to, steps S101 to S104.
S101, starting have the search container of crawler function.
It should be noted that described search container can crawl tool for crawler etc..
It is noted that the search container of the present embodiment starting, can be and start on user terminal backstage, without occurring
The mode of user interface, that is to say, that be not necessarily required to that the visualization interfaces such as browser, APP are operated.Certainly, in order to
Facilitate user to understand, visual user interface also can be set, be not limited thereto.
For example, the present embodiment described search container is no interface browser, be specifically as follows benv, browser,
launcher、Browserjet、CasperJS、DalekJSGhostbuster、HeadlessBrowser、HtmlUnit、
Jasmine-Headless-Webkit、Jaunt、jBrowserDriver、jedi-crawler、Lotte、Nightmare、
PhantomJS, Selenium, SlimerJS, trifleJS or Zombie.js etc. are without interface browser.
S102: Invoke a predefined target API through the search container.
In this embodiment, the target API refers to an application programming interface (API).
S103: Load the corresponding target browser driver according to the target API.
Specifically, the target browser of this embodiment may be a specialized, dedicated browser for a user's particular needs, or a browser with an advanced search function.
S104: Obtain the user's input information, and run the target browser in simulation according to the target browser driver to perform a web page information search on the input information.
It should be noted that the input information of this embodiment may be a text file in JSON format. For example, if the search container uses Selenium, it works together with ChromeDriver and the Java language API, and operates the browser through the automation support configured for the specified browser.
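The sequence S101 to S104 can be sketched in a few lines. The following is a minimal illustration only, written in Python rather than the Java environment the text mentions, and it replaces the real browser driver with a stub so that it runs without a browser. The names `SearchContainer`, `TARGET_APIS` and `FakeDriver`, as well as the JSON field names `api` and `keyword`, are assumptions made for this sketch and do not come from the application.

```python
import json
from dataclasses import dataclass, field

# Every name here is hypothetical; a real setup would wrap a driver such as
# ChromeDriver instead of the FakeDriver stub below.

@dataclass
class FakeDriver:
    """Stand-in for a target browser driver."""
    browser: str
    visited: list = field(default_factory=list)

    def search(self, keyword: str) -> str:
        # A real driver would open the browser and submit the query.
        self.visited.append(keyword)
        return f"{self.browser} results for '{keyword}'"

# Predefined target APIs (S102), each naming the browser it drives.
TARGET_APIS = {"image_search": "chrome", "advanced_search": "firefox"}

class SearchContainer:
    """S101: a search container with crawler capability (a plain object here)."""

    def invoke_api(self, name: str) -> str:
        # S102: look up the predefined target API.
        return TARGET_APIS[name]

    def load_driver(self, browser: str) -> FakeDriver:
        # S103: load the driver that corresponds to the target API.
        return FakeDriver(browser)

def run_search(input_json: str) -> str:
    request = json.loads(input_json)                 # S104: input as JSON text
    container = SearchContainer()                    # S101
    browser = container.invoke_api(request["api"])   # S102
    driver = container.load_driver(browser)          # S103
    return driver.search(request["keyword"])         # S104: simulated run

result = run_search('{"api": "image_search", "keyword": "red panda"}')
print(result)
```

Keeping the API lookup and the driver loading as separate methods mirrors the separation between S102 and S103 in the text.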
As shown in Fig. 2, in this embodiment the step S104 of obtaining the user's input information and running the target browser in simulation according to the target browser driver to perform a web page information search on the input information is followed by steps S105 and S106.
S105: Obtain the expected search results returned when the simulated target browser searches on the input information.
S106: Crawl the expected search results to obtain the target image and text information.
Specifically, in this embodiment, crawling the expected search results in S106 to obtain the target image and text information may include: obtaining the returned expected search results; and performing simulated operations on the expected search results according to a predefined simulated operation set.
Further, performing simulated operations on the expected search results according to the predefined simulated operation set includes: according to the simulated operations, browsing, turning pages, saving, capturing content, scrolling, filtering and/or entering permission verification input.
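One way to read the "predefined simulated operation set" is as an ordered list of named operations that is replayed against a result page. The sketch below illustrates that reading with three of the listed operations (browse, turn page, save). The `ResultPage` model and the operation names are assumptions made for illustration, not part of the application.

```python
from typing import Callable, Dict, List

class ResultPage:
    """Hypothetical model of a paged list of search results."""

    def __init__(self, items: List[str], page_size: int = 2):
        self.items = items
        self.page_size = page_size
        self.page = 0
        self.saved: List[str] = []

    def visible(self) -> List[str]:
        start = self.page * self.page_size
        return self.items[start:start + self.page_size]

def browse(page: ResultPage) -> None:
    page.visible()  # reading the visible items has no side effect here

def turn_page(page: ResultPage) -> None:
    # Advance only while another page of results exists.
    if (page.page + 1) * page.page_size < len(page.items):
        page.page += 1

def save(page: ResultPage) -> None:
    page.saved.extend(page.visible())

# The operation set maps operation names to the functions that simulate them.
OPERATIONS: Dict[str, Callable[[ResultPage], None]] = {
    "browse": browse,
    "turn_page": turn_page,
    "save": save,
}

def replay(page: ResultPage, operation_set: List[str]) -> List[str]:
    """Apply each named operation in order and return what was saved."""
    for name in operation_set:
        OPERATIONS[name](page)
    return page.saved

page = ResultPage(["a.png", "b.png", "c.png"])
saved = replay(page, ["browse", "save", "turn_page", "save"])
```

Further operations such as scrolling or content capture would be added as extra entries in `OPERATIONS` without changing `replay`.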
It is easy to understand that, by using simulated operations, this implementation is convenient for the user, avoids restrictions on IP address use and various unreasonable verifications, and protects the privacy and stability of the crawling action as far as possible.
For the permission verification input, if the user needs to use multiple threads with multiple target browsers, verification may be encountered. In that case this embodiment may use the following process: keep using the target browser to access the web pages that the crawler needs to crawl until an image verification code (CAPTCHA) appears; obtain the region of the image verification code; after the image verification code is verified successfully, select other page elements that have not yet been crawled and continue accessing the web page with the target browser; and, once access from the IP address is restricted, switch to another target browser to continue searching and crawling.
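The control flow of this process can be sketched as a loop over pending pages with a pool of target browsers. The sketch below is a simplified illustration: `respond` and `solve_captcha` are hypothetical callables standing in for the real page access and verification steps, and treating a failed verification like an IP restriction is an assumption of this sketch, not something the text specifies.

```python
from collections import deque

def crawl(urls, browsers, respond, solve_captcha):
    """Visit each URL, verify in place when asked, rotate browsers on IP blocks.

    respond(browser, url) returns "ok", "captcha" or "ip_blocked".
    solve_captcha(url) returns True when verification succeeds.
    """
    pool = deque(browsers)
    pending = deque(urls)
    log = []
    while pending and pool:
        url = pending[0]
        status = respond(pool[0], url)
        if status == "captcha":
            # Assumption: a failed verification is handled like an IP block.
            status = "ok" if solve_captcha(url) else "ip_blocked"
        if status == "ip_blocked":
            pool.popleft()      # switch to the next target browser and retry
            continue
        log.append((pool[0], url))
        pending.popleft()       # move on to a page not yet crawled
    return log

def scripted_site(browser, url):
    # Scripted behaviour used only to exercise the loop.
    if browser == "chrome" and url == "p2":
        return "captcha"
    if browser == "chrome" and url == "p3":
        return "ip_blocked"
    return "ok"

log = crawl(["p1", "p2", "p3"], ["chrome", "firefox"], scripted_site, lambda url: True)
```

The loop ends either when every page has been crawled or when no target browser is left to switch to.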
It is worth mentioning that this embodiment may also handle verification through intelligent recognition, which may include the following process: when feedback is received that the target website requires a verification code to be entered, obtain the target verification code image; feed the target verification code image into a pre-trained machine learning model for recognition, and obtain the verification code answer output by the model; perform the verification operation required by the target website using the output answer; and, after passing the target website's verification, crawl data from the target website.
For example, this embodiment may specifically include: obtaining multiple verification code images; for each verification code image, cutting the image into picture blocks that each contain a single verification code character; binarizing each picture block; labeling each binarized picture block with its corresponding verification code answer; feeding each binarized picture block as input to the machine learning model and obtaining the training answers output by the model; and adjusting the model parameters of the machine learning model with the goal of minimizing the error between each training answer obtained and the corresponding labeled verification code answer. If the error rate between the output training answers and the labeled verification code answers is less than a preset threshold, training of the machine learning model is determined to be complete.
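The preprocessing and the stopping criterion described above can be illustrated without any machine learning library. The sketch below cuts a tiny grayscale image into equal-width blocks, binarizes each block, and computes an error rate to compare against a threshold. The equal-width cut, the pixel threshold of 128 and the error threshold of 0.3 are assumptions chosen for the example; the model itself is left out.

```python
def split_blocks(image, n_chars):
    """Cut the image into n_chars equal-width blocks, one per character."""
    width = len(image[0]) // n_chars
    return [[row[i * width:(i + 1) * width] for row in image]
            for i in range(n_chars)]

def binarize(block, threshold=128):
    """Map each grayscale pixel (0-255) to 1 if darker than threshold, else 0."""
    return [[1 if px < threshold else 0 for px in row] for row in block]

def error_rate(predicted, labels):
    """Fraction of training answers that differ from the labeled answers."""
    wrong = sum(p != t for p, t in zip(predicted, labels))
    return wrong / len(labels)

# A 2x4 grayscale "image" holding two characters side by side.
captcha = [
    [0, 255, 255, 0],
    [0, 255, 0, 255],
]
blocks = [binarize(b) for b in split_blocks(captcha, 2)]

# Training is considered complete once the error rate drops below a threshold.
rate = error_rate(["a", "b", "c", "d"], ["a", "b", "x", "d"])
training_done = rate < 0.3
```

Real verification code images rarely split evenly, so a production pipeline would need a more careful segmentation step than the fixed-width cut used here.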
It is worth mentioning that, before starting the search container with a crawler function in S101, this embodiment may further include: building the search container and obtaining the user's search purpose; and predefining, according to the search purpose, target browsers that match at least one function.
In particular, predefining target browsers that match at least one function according to the search purpose specifically includes: predefining, according to the multiple functions of multiple search purposes, a queue of multiple matching target browsers.
It is easy to understand that, as shown in Fig. 3, if a target browser queue has been set up, the step S104 of obtaining the user's input information and running the target browser in simulation according to the target browser driver to perform a web page information search on the input information may further include steps S1041 to S1044.
S1041: Obtain the user's input information.
S1042: Identify the user's current purpose function from the input information.
S1043: According to the identified current purpose function, run the matching target browser from the predefined target browser queue in simulation.
S1044: Perform a web page information search on the input information using the target browser run in simulation.
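Steps S1041 to S1044 amount to a lookup from the identified purpose function to a queue of matching target browsers. The sketch below shows one possible shape of that lookup. The queue contents, the `type` hint in the input and the default of `"text"` are assumptions made for illustration; the text does not say how the purpose function is identified.

```python
# Hypothetical queues: purpose function -> ordered list of target browsers.
BROWSER_QUEUES = {
    "image": ["google", "baidu"],
    "text": ["baidu"],
}

def identify_function(user_input: dict) -> str:
    # S1042: a trivial rule that reads a "type" hint and defaults to "text".
    return user_input.get("type", "text")

def select_browser(user_input: dict) -> str:
    # S1043: pick the first matching browser from the predefined queue.
    return BROWSER_QUEUES[identify_function(user_input)][0]

def search(user_input: dict) -> str:
    # S1044: the simulated search is reduced to a formatted string here.
    return f"{select_browser(user_input)}:{user_input['keyword']}"

image_result = search({"keyword": "cat", "type": "image"})  # S1041: input
text_result = search({"keyword": "news"})
```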
It is worth mentioning that predefining a queue of multiple matching target browsers according to the multiple functions of multiple search purposes may further include: determining whether, for the same function of the same search purpose, there are multiple matching target browsers; if so, assigning priorities to the multiple target browsers and setting a trigger condition for switching the target browser in use, where the trigger condition is that the first target browser, which currently has the highest priority, is unavailable or abnormal and the second target browser, which has the next-highest priority, is available.
For example, if the target browsers for searching pictures include Google and Baidu, Google can be set to be used first and Baidu second.
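The priority rule and its trigger condition can be expressed as a short selection function. This is a minimal sketch: the numeric priorities and the `is_available` check are assumptions standing in for whatever availability test a real system would use.

```python
def pick_browser(candidates, is_available):
    """Return the highest-priority target browser that is currently usable.

    candidates: list of (name, priority) pairs; a larger priority wins.
    is_available: callable taking a name and returning True if usable.
    """
    for name, _priority in sorted(candidates, key=lambda c: c[1], reverse=True):
        if is_available(name):
            return name
    return None  # no target browser can be used

candidates = [("baidu", 1), ("google", 2)]

normal = pick_browser(candidates, lambda name: True)
fallback = pick_browser(candidates, lambda name: name != "google")
```

With both browsers available the call returns `google`; when `google` is reported unavailable the same call falls back to `baidu`, which matches the example in the text.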
Specifically, the target browser of this implementation may be the Google browser, Baidu browser, Lunascape browser, Wiseie browser, IE browser or Firefox browser.
It should be added that the web page information search method may further include: setting a proxy server for the target browser in advance; or, when loading the corresponding target browser driver according to the target API, determining whether the proxy server is restricted by country, region or user group and, if so, obtaining pre-stored proxy server setting parameters and configuring the proxy server in real time according to those parameters.
For example, when the Google Chrome browser is used, the search container handles three situations differently: a network request timeout, an IP access permission that needs confirmation, and the appearance of an image verification code. When a web page that the crawler needs to crawl is accessed, the content of the page is downloaded and saved, and requirements such as the number of items saved, the time period and the size limit of each image can be set.
For example, Google Chrome can be configured to use proxy IPs to keep the search container working normally. This embodiment may use the following process: load the search container's input file into the memory of the user terminal; if a previous run of the program produced a file of entries that need to be accessed through a specific proxy IP, load the content of that file; take one line of the input file content and obtain a proxy IP; initiate a network request for that line's content using the proxy IP; determine whether the page load time exceeds the set threshold and, if so, determine whether the proxy IP is allowed access; if it is not allowed, switch to another proxy IP and initiate the network request again, until a proxy that is properly authorized and meets the requirements is obtained.
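The proxy rotation at the end of this process can be sketched as a loop over candidate proxy IPs. In the sketch below, `request` is a hypothetical callable that returns the page load time and whether the proxy was allowed access, and the 5-second threshold is an assumed value. The text does not say what happens when a proxy is slow but still authorized; this sketch keeps such a proxy, which is an assumption.

```python
def fetch_with_proxies(query, proxies, request, timeout_s=5.0):
    """Try proxy IPs in order until one is usable for the given query.

    request(proxy, query) is a hypothetical callable returning a tuple
    (load_seconds, allowed).
    """
    for proxy in proxies:
        load_seconds, allowed = request(proxy, query)
        if load_seconds <= timeout_s:
            return proxy    # page loaded within the threshold
        if allowed:
            return proxy    # slow but authorized (assumed to be acceptable)
        # Timed out and not allowed: switch to the next proxy IP and retry.
    return None

def fake_request(proxy, query):
    # Scripted responses used only to exercise the loop.
    if proxy == "192.0.2.1:8080":
        return (9.0, False)
    return (1.0, True)

chosen = fetch_with_proxies(
    "keyword", ["192.0.2.1:8080", "192.0.2.2:8080"], fake_request
)
```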
In addition, this embodiment may also define the order in which the crawler crawls the target browser queue, define how results are handled according to a threshold on the relevance between the search results and the user's search purpose, and set how results are filtered when the number of search results exceeds a certain amount.
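A relevance threshold combined with a cap on the number of results could look like the following sketch. The relevance scores, the threshold of 0.5, the cap of 3 and the choice to keep the top-scoring results are all assumptions made for illustration; the text leaves the handling and filtering rules open.

```python
def screen_results(results, min_relevance=0.5, max_results=3):
    """Drop low-relevance results, then keep the best ones if too many remain.

    results: list of (item, relevance) pairs with relevance in [0, 1].
    """
    kept = [r for r in results if r[1] >= min_relevance]
    if len(kept) > max_results:
        kept = sorted(kept, key=lambda r: r[1], reverse=True)[:max_results]
    return kept

results = [("a", 0.9), ("b", 0.2), ("c", 0.6), ("d", 0.7), ("e", 0.8)]
screened = screen_results(results)
```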
It should be added that the invention first needs to classify the target APIs that can be called openly, so that different target APIs are called for different search purposes. For example, Google Chrome can provide multiple kinds of target API for different uses, including the following.
Table 1: Stable APIs (table content not included in this text)
Table 2: Beta APIs (table content not included in this text)
Table 3: Dev APIs (table content not included in this text)
This preferred embodiment uses the Google Chrome browser driver to simulate the Google Chrome browser and call Google search. Once input information such as a keyword is passed in, the pictures corresponding to the keyword are crawled automatically through Google search. Because the picture content and quality are backed by the Google search engine, accuracy and quality are high.
Referring to Fig. 4A in combination with Figs. 1 to 3, an embodiment of this application provides a web page information search apparatus, which may include a starting module 31, an invoking module 32, a driving module 33 and a running module 34.
Specifically, the starting module 31 is configured to start a search container that has a crawler function.
It should be noted that the search container may be a crawling tool such as a crawler.
It is worth mentioning that the search container started in this embodiment may be started in the background of the user terminal without presenting a user interface; that is, it does not necessarily need a visual interface such as a browser or app in order to operate. A visual user interface may also be provided to help the user understand, and no limitation is imposed here.
For example, the search container of this embodiment is a headless browser, which may specifically be benv, browser-launcher, Browserjet, CasperJS, DalekJS, Ghostbuster, HeadlessBrowser, HtmlUnit, Jasmine-Headless-Webkit, Jaunt, jBrowserDriver, jedi-crawler, Lotte, Nightmare, PhantomJS, Selenium, SlimerJS, trifleJS, Zombie.js or a similar headless browser.
The invoking module 32 is configured to invoke a predefined target API through the search container.
In this embodiment, the target API refers to an application programming interface.
The driving module 33 is configured to load the corresponding target browser driver according to the target API.
Specifically, the target browser of this embodiment may be a specialized, dedicated browser for a user's particular needs, or a browser with an advanced search function.
The running module 34 is configured to obtain the user's input information and to run the target browser in simulation according to the target browser driver to perform a web page information search on the input information.
It should be noted that the input information of this embodiment may be a text file in JSON format. For example, if the search container uses Selenium, it works together with ChromeDriver and the Java language API, and operates the browser through the automation support configured for the specified browser.
In this embodiment, after obtaining the user's input information and running the target browser in simulation according to the target browser driver to perform a web page information search on the input information, the running module 34 is further configured to obtain the expected search results returned when the simulated target browser searches on the input information, and to crawl the expected search results to obtain the target image and text information.
Specifically, the running module 34 crawling the expected search results to obtain the target image and text information may include: the running module 34 obtaining the returned expected search results, and performing simulated operations on the expected search results according to a predefined simulated operation set.
Further, performing simulated operations on the expected search results according to the predefined simulated operation set includes: according to the simulated operations, browsing, turning pages, saving, capturing content, scrolling, filtering and/or entering permission verification input.
It is easy to understand that, by using simulated operations, this implementation is convenient for the user, avoids restrictions on IP address use and various unreasonable verifications, and protects the privacy and stability of the crawling action as far as possible.
For the permission verification input, if the user needs to use multiple threads with multiple target browsers, verification may be encountered. In that case this embodiment may use the following process: keep using the target browser to access the web pages that the crawler needs to crawl until an image verification code (CAPTCHA) appears; obtain the region of the image verification code; after the image verification code is verified successfully, select other page elements that have not yet been crawled and continue accessing the web page with the target browser; and, once access from the IP address is restricted, switch to another target browser to continue searching and crawling.
It is worth mentioning that this embodiment may also handle verification through intelligent recognition, which may include the following process: when feedback is received that the target website requires a verification code to be entered, obtain the target verification code image; feed the target verification code image into a pre-trained machine learning model for recognition, and obtain the verification code answer output by the model; perform the verification operation required by the target website using the output answer; and, after passing the target website's verification, crawl data from the target website.
For example, this embodiment may specifically include: obtaining multiple verification code pictures; for each verification code picture, cutting the picture into picture blocks that each contain a single verification code character; binarizing each picture block; labeling each binarized picture block with the corresponding verification code answer; feeding each binarized picture block into the machine learning model as input and obtaining the training answer output by the model; and, with the training answers as the target, adjusting the model parameters of the machine learning model so as to minimize the error between the training answers and the labeled verification code answers. If the error rate between the output training answers and the labeled verification code answers is below a preset threshold, training of the machine learning model is determined to be complete.
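The cutting, binarization and labeling steps above can be sketched as follows. This is only an illustration under stated assumptions: the picture is taken to be a grayscale grid of values from 0 to 255, the characters are assumed to be of equal width, and the threshold of 128 and the names `cut_blocks`, `binarize` and `prepare_samples` are hypothetical and do not come from the original disclosure.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows, 0 (black) to 255 (white)


def cut_blocks(image: Image, n_chars: int) -> List[Image]:
    """Cut the picture into n_chars equal-width blocks, one per character."""
    if n_chars <= 0 or not image:
        raise ValueError("need a non-empty image and a positive n_chars")
    step = len(image[0]) // n_chars
    return [[row[i * step:(i + 1) * step] for row in image]
            for i in range(n_chars)]


def binarize(block: Image, threshold: int = 128) -> Image:
    """Map each pixel to 1 (ink) if darker than the threshold, else 0."""
    return [[1 if px < threshold else 0 for px in row] for row in block]


def prepare_samples(image: Image, answer: str) -> List[Tuple[Image, str]]:
    """Pair every binarized block with its labeled character."""
    blocks = cut_blocks(image, len(answer))
    return [(binarize(block), ch) for block, ch in zip(blocks, answer)]
```

Each returned pair is one training sample: the binarized block is the model input and the character is its labeled answer.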
It is noted that starting module 31 described in the present embodiment is used to start the search container for having crawler function, it
Before can also include: building module (not shown), for construct search container, obtain the search purpose of user;Definition module (figure
Do not show), for predefining the objective browser that at least one function matches according to described search purpose.
It should be strongly noted that definition module described in the present embodiment predefines at least one function according to described search purpose
The objective browser that can be matched, specifically includes: the definition module according to it is a variety of search purposes multiple functions, predefine pair
Answer multiple objective browser queues to match.
It is easy to understand that, if target browser queues are provided, the operation module 34 of this embodiment obtains the user's input information and simulates running of the target browser according to the target browser driver, so as to search web page information for the input information, as follows: the operation module 34 obtains the user's input information; identifies the user's current purpose function from the input information; according to the identified current purpose function, simulates running the matching target browser in the predefined target browser queue; and uses the simulated target browser to search web page information for the input information.
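A minimal sketch of this routing step is given below. The source does not say how the purpose function is identified, so the explicit `purpose` field, the keyword hints and the contents of `BROWSER_QUEUES` are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical queues: purpose function -> ordered target browsers.
BROWSER_QUEUES: Dict[str, List[str]] = {
    "image": ["chrome", "baidu"],
    "text": ["firefox", "chrome"],
}

# Hypothetical hints used when the input carries no explicit purpose.
IMAGE_HINTS = ("picture", "image", "photo")


def identify_purpose(input_info: Dict[str, str]) -> str:
    """Return the current purpose function for the user's input."""
    explicit = input_info.get("purpose")
    if explicit in BROWSER_QUEUES:
        return explicit
    keyword = input_info.get("keyword", "").lower()
    return "image" if any(hint in keyword for hint in IMAGE_HINTS) else "text"


def match_browser(input_info: Dict[str, str]) -> str:
    """Pick the first browser in the queue that matches the purpose."""
    return BROWSER_QUEUES[identify_purpose(input_info)][0]
```

With these assumed queues, a keyword such as "cat pictures" would be routed to the image queue and its first browser.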
It is noted that according to multiple functions of a variety of search purposes, the predefined multiple phases of correspondence described in the present embodiment
Matched objective browser queue can also include: that the operation module 34 is used to judge the same of same search purpose
In function, if there are the multiple objective browsers to match;If it exists, then priority is carried out to multiple objective browsers to set
It sets, and the trigger condition that switching uses effective target browser is set, wherein the trigger condition includes current priority highest
First object browser it is unavailable or abnormal and preferential the second high objective browser of level is available.
For example, if the target browsers for searching pictures include Google and Baidu, Google can be set to be used first and Baidu second.
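The priority and switching rule can be illustrated with a short sketch. The `TargetBrowser` class, its `is_available` probe and the convention that a lower number means a higher priority are assumptions for the example, not details given in the source.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class TargetBrowser:
    name: str
    priority: int                     # lower number = higher priority (assumed)
    is_available: Callable[[], bool]  # hypothetical health probe


def select_browser(candidates: Iterable[TargetBrowser]) -> Optional[TargetBrowser]:
    """Return the highest-priority browser that is currently available."""
    for browser in sorted(candidates, key=lambda b: b.priority):
        if browser.is_available():
            return browser
    return None  # no effective target browser at the moment
```

If the browser with the highest priority reports itself unavailable, the loop falls through to the next one, which is the trigger condition described above.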
Specifically, the target browser in this embodiment may be the Google browser, Baidu browser, Lunascape browser, Wiseie browser, IE browser or Firefox browser.
It should be added that the operation module 34 is used to set the proxy server of the target browser in advance; or, when the driver of the corresponding target browser is loaded according to the target API, the operation module 34 determines whether the proxy server is restricted by country or region or by user group, and if so, obtains pre-stored proxy server setting parameters and sets the proxy server in real time according to those parameters.
For example, when the Google Chrome browser is used, the search container handles network request timeouts, IP access permission confirmations and picture verification codes differently. When a web page to be crawled is accessed, the content of the page is downloaded and saved, and requirements such as the number of items saved, the saving period and the size limit of each image can be set.
For example, Google Chrome can be set to use a proxy IP to ensure normal use of the search container. The operation module 34 of this embodiment loads the input file of the search container into the memory of the user terminal; if a file generated by a previous run of the program exists that records entries requiring access through a specific proxy IP, the content of that file is loaded; one line of the input file content is taken and a proxy IP is obtained; a network request for that line's content is initiated using the proxy IP; it is determined whether the page loading time exceeds a set threshold, and if so, whether the proxy IP is allowed access; if it is not allowed, another proxy IP is substituted and the network request is initiated again, until a proxy is obtained that is properly authorized and meets the requirements.
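The proxy replacement loop might look like the sketch below. The source leaves open what happens when a page loads slowly through a proxy that is nevertheless allowed; here such a proxy is kept, which is one possible reading. The callables `load_seconds` and `is_allowed` are hypothetical stand-ins for the real network request and permission check.

```python
from typing import Callable, Iterable, Optional


def first_usable_proxy(
    proxies: Iterable[str],
    load_seconds: Callable[[str], float],
    is_allowed: Callable[[str], bool],
    max_seconds: float = 10.0,
) -> Optional[str]:
    """Try proxies in order and return the first one that can be used."""
    for proxy in proxies:
        if load_seconds(proxy) <= max_seconds:
            return proxy  # the page loaded within the threshold
        if is_allowed(proxy):
            return proxy  # slow, but the proxy itself is permitted (assumed)
        # Slow and not permitted: move on to the next proxy IP.
    return None
```

Passing the two checks in as functions keeps the sketch free of any real network access, so the rotation logic can be exercised on its own.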
In addition, this embodiment can define the order in which the crawler crawls the target browser queue, define how search results are handled relative to a threshold of relevance to the user's search purpose, and set the screening method applied when the number of search results exceeds a certain amount.
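As an illustration of the relevance threshold and the cap on the number of results, consider the sketch below. The `SearchResult` type, the relevance score in the range 0 to 1 and both default limits are assumptions; the source does not define how relevance is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    url: str
    relevance: float  # hypothetical score in [0, 1]


def screen_results(
    results: List[SearchResult],
    min_relevance: float = 0.5,
    max_results: int = 20,
) -> List[SearchResult]:
    """Drop results below the threshold, then keep the most relevant ones."""
    relevant = [r for r in results if r.relevance >= min_relevance]
    relevant.sort(key=lambda r: r.relevance, reverse=True)
    return relevant[:max_results]
```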
It should be added that the present invention also first classifies the target APIs that can be openly called, so that different target APIs are called for different search purposes; for example, Google Chrome can provide multiple target APIs for different uses.
This preferred embodiment uses the Google Chrome browser driver to simulate the Google Chrome browser and call Google search. Once input information such as a keyword is passed in, the pictures corresponding to the keyword are crawled automatically through Google search. Because the picture content and quality are backed by the Google search engine, accuracy and quality are high.
Specifically, with continued reference to FIG. 1 to FIG. 3 and referring to FIG. 4B, an embodiment of the present application provides a computer device. The computer device may include a memory 40 and a processor 41; the memory 40 stores a computer program, and the processor 41 executes the computer program and, when executing it, implements the web page information search method described with reference to FIG. 1 to FIG. 3 and the embodiments.
Specifically, the processor 41 is used to start a search container with a crawler function.
It should be noted that the search container may be a crawler crawling tool or the like.
It is noted that the search container of the present embodiment starting, can be and start on user terminal backstage, without occurring
The mode of user interface, that is to say, that be not necessarily required to that the visualization interfaces such as browser, APP are operated.Certainly, in order to
Facilitate user to understand, visual user interface also can be set, be not limited thereto.
For example, the search container in this embodiment is a headless browser, which may specifically be benv, browser-launcher, Browserjet, CasperJS, DalekJS, Ghostbuster, HeadlessBrowser, HtmlUnit, Jasmine-Headless-Webkit, Jaunt, jBrowserDriver, jedi-crawler, Lotte, Nightmare, PhantomJS, Selenium, SlimerJS, trifleJS, Zombie.js or another headless browser.
The processor 41 is used to retrieve a predefined target API through the search container.
In this embodiment, the target API refers to an application programming interface.
The processor 41 is used to load the corresponding target browser driver according to the target API.
Specifically, the target browser of this embodiment may be a specialized browser dedicated to a particular user need, or a browser with advanced search functions.
The processor 41 is used to obtain the user's input information and to simulate running the target browser according to the target browser driver, so as to search web page information for the input information.
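The four steps performed by the processor (start the container, retrieve the target API, load the driver, run the search) can be sketched as a small class. The name `SearchContainer`, its method names and the idea of modeling a driver as a plain function from keyword to result URLs are all hypothetical simplifications, not the structure disclosed in the source.

```python
from typing import Callable, Dict, List

Driver = Callable[[str], List[str]]  # keyword -> list of result URLs


class SearchContainer:
    """Hypothetical container mapping purposes to APIs and APIs to drivers."""

    def __init__(self, apis: Dict[str, str], drivers: Dict[str, Driver]) -> None:
        self._apis = apis        # search purpose -> predefined target API name
        self._drivers = drivers  # target API name -> target browser driver

    def retrieve_api(self, purpose: str) -> str:
        """Retrieve the predefined target API for a search purpose."""
        return self._apis[purpose]

    def load_driver(self, api_name: str) -> Driver:
        """Load the target browser driver that corresponds to the API."""
        return self._drivers[api_name]

    def search(self, purpose: str, keyword: str) -> List[str]:
        """Run the full chain for one piece of input information."""
        api_name = self.retrieve_api(purpose)
        driver = self.load_driver(api_name)
        return driver(keyword)
```

In a real deployment the driver would wrap a browser automation tool; here it is a function so that the control flow can be followed without any browser installed.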
It should be noted that the input information of this embodiment may be a text file in JSON format. For example, if the search container uses Selenium, it works with ChromeDriver and the Java language API and operates the browser through the automation support of the specified browser.
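Since the source only states that the input information may be a JSON text file, the field names in the following sketch (`keyword`, `purpose`, `max_results`) and their defaults are assumptions chosen for illustration.

```python
import json
from typing import Any, Dict


def parse_input_info(text: str) -> Dict[str, Any]:
    """Parse JSON input information and apply assumed defaults."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("input information must be a JSON object")
    keyword = data.get("keyword")
    if not isinstance(keyword, str) or not keyword.strip():
        raise ValueError("input information needs a non-empty 'keyword'")
    return {
        "keyword": keyword.strip(),
        "purpose": data.get("purpose", "text"),
        "max_results": int(data.get("max_results", 10)),
    }
```

Validating the keyword up front means a malformed input file fails before any browser is started.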
After the processor 41 of this embodiment obtains the user's input information and simulates running the target browser according to the target browser driver to search web page information for the input information, the processor 41 obtains the expected search result returned by the simulated target browser for the input information, and crawls the expected search result to obtain the target image and text information.
Specifically, the processor 41 crawling the expected search result to obtain the target image and text information may include: the processor 41 obtaining the returned expected search result, and performing simulated operations on the expected search result according to a predefined simulated operation set.
Further, in this embodiment, performing simulated operations on the expected search result according to the predefined simulated operation set includes: browsing, page turning, saving, content capture, scrolling, screening and/or authority verification input, carried out according to the simulated operations.
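One way to picture a predefined simulated operation set is as a table from operation names to handlers. In the sketch below the operation names, the `PageSession` stand-in and the recording handlers are hypothetical; a real handler would drive the browser rather than append to a log.

```python
from typing import Callable, Dict, Iterable, List


class PageSession:
    """Stand-in for a driven browser page; it only records what was done."""

    def __init__(self) -> None:
        self.log: List[str] = []


def _recorder(name: str) -> Callable[[PageSession], None]:
    """Build a placeholder handler that notes the operation in the log."""
    def operation(page: PageSession) -> None:
        page.log.append(name)
    return operation


# Hypothetical names for the operations listed in the text.
OPERATION_NAMES = ("browse", "turn_page", "save", "capture",
                   "scroll", "screen", "verify")
OPERATIONS: Dict[str, Callable[[PageSession], None]] = {
    name: _recorder(name) for name in OPERATION_NAMES
}


def run_operations(page: PageSession, names: Iterable[str]) -> None:
    """Apply the requested simulated operations in order."""
    for name in names:
        if name not in OPERATIONS:
            raise KeyError(f"unknown simulated operation: {name}")
        OPERATIONS[name](page)
```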
It is easy to see that, by using simulated operations, this embodiment is convenient for the user, avoids IP address restrictions and assorted unreasonable verification steps, and preserves the privacy and stability of the crawling action as far as possible.
For the authority verification input, if the user needs to run multiple target browsers in multiple threads, verification challenges may be encountered. In that case this embodiment may proceed as follows: keep accessing the web pages to be crawled with the target browser until a picture verification code appears; obtain the region of the picture verification code; after the picture verification code has been verified successfully, select other page locator elements that have not yet been crawled and continue accessing web pages with the target browser; when access is restricted by IP address, switch to another target browser and continue searching and crawling.
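The handling above can be sketched as a loop over pages with a fallback queue of browsers. The status strings `"ok"`, `"captcha"` and `"ip_blocked"`, the `visit` and `solve_captcha` callables, and the choice to switch browsers when a verification code cannot be solved are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List


def crawl_with_fallback(
    urls: Iterable[str],
    browsers: List[str],
    visit: Callable[[str, str], str],
    solve_captcha: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Crawl each URL, switching target browsers when one gets blocked.

    visit(browser, url) is assumed to return "ok", "captcha" or "ip_blocked".
    The result maps each crawled URL to the browser that fetched it.
    """
    crawled: Dict[str, str] = {}
    queue = list(browsers)
    for url in urls:
        while queue:
            browser = queue[0]
            status = visit(browser, url)
            if status == "captcha" and solve_captcha(browser, url):
                status = "ok"  # verification passed, keep using this browser
            if status == "ok":
                crawled[url] = browser
                break
            # IP restricted (or verification failed): switch to the next browser.
            queue.pop(0)
    return crawled
```

Once every browser in the queue has been blocked, the remaining pages are simply left uncrawled in this sketch.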
It is noted that the present embodiment can also carry out Weigh sensor verification processing to verification mode, can wrap
It includes following process: when receiving the feedback information of targeted website requirement input identifying code, obtaining target verification code picture;By institute
Stating target verification code picture investment pre-training, good machine learning model is identified, obtains the machine learning model output
Identifying code answer;The verification operation that the targeted website requires input identifying code is executed according to the identifying code answer of the output;
Behind the verifying by the targeted website, swash evidence of fetching from the targeted website.
For example, this embodiment may specifically include: obtaining multiple verification code pictures; for each verification code picture, cutting the picture into picture blocks that each contain a single verification code character; binarizing each picture block; labeling each binarized picture block with the corresponding verification code answer; feeding each binarized picture block into the machine learning model as input and obtaining the training answer output by the model; and, with the training answers as the target, adjusting the model parameters of the machine learning model so as to minimize the error between the training answers and the labeled verification code answers. If the error rate between the output training answers and the labeled verification code answers is below a preset threshold, training of the machine learning model is determined to be complete.
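The stopping rule at the end of this procedure can be written down directly. The sketch assumes the error rate is the fraction of picture blocks whose training answer differs from the labeled answer, and the default threshold of 0.05 is a placeholder; the source specifies neither.

```python
from typing import Sequence


def error_rate(predicted: Sequence[str], labeled: Sequence[str]) -> float:
    """Fraction of blocks whose training answer differs from the label."""
    if len(predicted) != len(labeled) or not labeled:
        raise ValueError("need two non-empty sequences of equal length")
    wrong = sum(1 for p, t in zip(predicted, labeled) if p != t)
    return wrong / len(labeled)


def training_complete(
    predicted: Sequence[str],
    labeled: Sequence[str],
    threshold: float = 0.05,
) -> bool:
    """Training is considered complete once the error rate is below threshold."""
    return error_rate(predicted, labeled) < threshold
```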
It is noted that processor 41 described in the present embodiment is used to start the search container for having crawler function, before
It can also include: building search container, obtain the search purpose of user;At least one function is predefined according to described search purpose
The objective browser to match.
It should be noted in particular that, in this embodiment, predefining at least one function-matched target browser according to the search purpose specifically includes: predefining, according to the multiple functions of multiple search purposes, multiple corresponding matched target browser queues.
It is easy to understand that, if target browser queues are provided, the processor 41 of this embodiment obtains the user's input information and simulates running of the target browser according to the target browser driver, so as to search web page information for the input information, as follows: the processor 41 obtains the user's input information; identifies the user's current purpose function from the input information; according to the identified current purpose function, simulates running the matching target browser in the predefined target browser queue; and uses the simulated target browser to search web page information for the input information.
It is noted that according to multiple functions of a variety of search purposes, the predefined multiple phases of correspondence described in the present embodiment
Matched objective browser queue can also include: the same function that the processor 41 is used to judge same search purpose
In energy, if there are the multiple objective browsers to match;If it exists, then priority setting is carried out to multiple objective browsers,
And the trigger condition that switching uses effective target browser is set, wherein the trigger condition includes that current priority is highest
First object browser is unavailable or abnormal and preferentially the second high objective browser of level is available.
For example, if the target browsers for searching pictures include Google and Baidu, Google can be set to be used first and Baidu second.
Specifically, the target browser in this embodiment may be the Google browser, Baidu browser, Lunascape browser, Wiseie browser, IE browser or Firefox browser.
It should be added that the processor 41 is used to set the proxy server of the target browser in advance; or, when the driver of the corresponding target browser is loaded according to the target API, the processor 41 determines whether the proxy server is restricted by country or region or by user group, and if so, obtains pre-stored proxy server setting parameters and sets the proxy server in real time according to those parameters.
For example, when the Google Chrome browser is used, the search container handles network request timeouts, IP access permission confirmations and picture verification codes differently. When a web page to be crawled is accessed, the content of the page is downloaded and saved, and requirements such as the number of items saved, the saving period and the size limit of each image can be set.
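The configurable saving requirements could be modeled as a small settings object checked before each save. The field names and defaults below are hypothetical, and "saving period" is interpreted here as a time window measured from the start of the crawl, which is only one possible reading of the source.

```python
from dataclasses import dataclass


@dataclass
class SaveLimits:
    max_images: int = 100                  # number of items that may be saved
    max_bytes_per_image: int = 2_000_000   # size limit of each image
    max_seconds: float = 60.0              # saving period (assumed time window)


def may_save(limits: SaveLimits, saved_count: int,
             image_bytes: int, elapsed_seconds: float) -> bool:
    """Return True if one more image may be saved under the limits."""
    return (
        saved_count < limits.max_images
        and image_bytes <= limits.max_bytes_per_image
        and elapsed_seconds <= limits.max_seconds
    )
```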
For example, Google Chrome can be set to use a proxy IP to ensure normal use of the search container. The processor 41 of this embodiment loads the input file of the search container into the memory of the user terminal; if a file generated by a previous run of the program exists that records entries requiring access through a specific proxy IP, the content of that file is loaded; one line of the input file content is taken and a proxy IP is obtained; a network request for that line's content is initiated using the proxy IP; it is determined whether the page loading time exceeds a set threshold, and if so, whether the proxy IP is allowed access; if it is not allowed, another proxy IP is substituted and the network request is initiated again, until a proxy is obtained that is properly authorized and meets the requirements.
In addition, this embodiment can define the order in which the crawler crawls the target browser queue, define how search results are handled relative to a threshold of relevance to the user's search purpose, and set the screening method applied when the number of search results exceeds a certain amount.
It should be added that the present invention also first classifies the target APIs that can be openly called, so that different target APIs are called for different search purposes; for example, Google Chrome can provide multiple target APIs for different uses.
This preferred embodiment uses the Google Chrome browser driver to simulate the Google Chrome browser and call Google search. Once input information such as a keyword is passed in, the pictures corresponding to the keyword are crawled automatically through Google search. Because the picture content and quality are backed by the Google search engine, accuracy and quality are high.
In combination with one or more of the above embodiments, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the web page information search method described with reference to FIG. 1 to FIG. 3 and the embodiments.
It should be understood that the processor of this embodiment may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card fitted to the computer device.
The above is only a specific implementation of the present application, but the scope of protection of the application is not limited to it. Any person skilled in the art can readily conceive of equivalent modifications or substitutions within the technical scope disclosed in the application, and such modifications or substitutions shall fall within the scope of protection of the application. The scope of protection of the application shall therefore be determined by the scope of protection of the claims.
Claims (10)
1. A crawler-based web page information search method, characterized by comprising:
starting a search container having a crawler function;
retrieving a predefined target API through the search container;
loading a corresponding target browser driver according to the target API;
obtaining input information of a user, and simulating running of a target browser according to the target browser driver so as to perform a web page information search on the input information.
2. The web page information search method according to claim 1, characterized in that, after obtaining the input information of the user and simulating running of the target browser according to the target browser driver so as to perform a web page information search on the input information, the method further comprises:
obtaining an expected search result returned by the simulated target browser searching on the input information;
crawling the expected search result to obtain target image and text information.
3. The web page information search method according to claim 2, characterized in that crawling the expected search result to obtain the target image and text information comprises:
obtaining the returned expected search result;
performing simulated operations on the expected search result according to a predefined simulated operation set.
4. The web page information search method according to claim 3, characterized in that performing simulated operations on the expected search result according to the predefined simulated operation set comprises:
performing browsing, page turning, saving, content capture, scrolling, screening and/or authority verification input according to the simulated operations.
5. The web page information search method according to claim 1, characterized in that, before starting the search container having the crawler function, the method comprises:
building the search container and obtaining a search purpose of the user;
predefining, according to the search purpose, at least one target browser whose function matches.
6. The web page information search method according to claim 5, characterized in that predefining, according to the search purpose, at least one target browser whose function matches specifically comprises:
predefining, according to multiple functions of multiple search purposes, multiple corresponding matched target browser queues.
7. The web page information search method according to claim 6, characterized in that obtaining the input information of the user and simulating running of the target browser according to the target browser driver so as to perform a web page information search on the input information further comprises:
obtaining the input information of the user;
identifying a current purpose function of the user according to the input information;
simulating running, according to the identified current purpose function, of the matching target browser in the predefined target browser queue;
performing a web page information search on the input information using the simulated target browser.
8. A web page information search device, characterized in that the web page information search device comprises:
a starting module for starting a search container having a crawler function;
a retrieval module for retrieving a predefined target API through the search container;
a driving module for loading a corresponding target browser driver according to the target API;
an operation module for obtaining input information of a user, and simulating running of a target browser according to the target browser driver so as to perform a web page information search on the input information.
9. A computer device, characterized in that the computer device comprises a memory and a processor;
the memory is used to store a computer program;
the processor is used to execute the computer program and, when executing the computer program, to implement the web page information search method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the web page information search method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910568616.1A CN110413859A (en) | 2019-06-27 | 2019-06-27 | Webpage information search method, apparatus, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110413859A true CN110413859A (en) | 2019-11-05 |
Family
ID=68358351
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910568616.1A Pending CN110413859A (en) | 2019-06-27 | 2019-06-27 | Webpage information search method, apparatus, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110413859A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090094249A1 (en) * | 2007-10-05 | 2009-04-09 | Microsoft Corporation | Creating search enabled web pages |
CN101751428A (en) * | 2008-12-12 | 2010-06-23 | 汉王科技股份有限公司 | Information search method and device |
CN102792244A (en) * | 2010-01-13 | 2012-11-21 | 洛克迈特公司 | Preview functionality for increased browsing speed |
CN105893622A (en) * | 2016-04-29 | 2016-08-24 | 深圳市中润四方信息技术有限公司 | Polymerization search method and polymerization search system |
CN109815380A (en) * | 2018-12-20 | 2019-05-28 | 山东中创软件工程股份有限公司 | A kind of information crawler method, apparatus, equipment and computer readable storage medium |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111368167A (en) * | 2020-03-06 | 2020-07-03 | 北京师范大学 | Chinese literature data automatic acquisition method based on web crawler technology |
CN112818212A (en) * | 2020-04-23 | 2021-05-18 | 腾讯科技(深圳)有限公司 | Corpus data acquisition method and device, computer equipment and storage medium |
CN112818212B (en) * | 2020-04-23 | 2023-10-13 | 腾讯科技(深圳)有限公司 | Corpus data acquisition method, corpus data acquisition device, computer equipment and storage medium |
CN111597421A (en) * | 2020-04-30 | 2020-08-28 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN111597421B (en) * | 2020-04-30 | 2022-08-30 | 武汉思普崚技术有限公司 | Method, device, equipment and storage medium for realizing website picture crawler |
CN111625690A (en) * | 2020-05-13 | 2020-09-04 | 北京达佳互联信息技术有限公司 | Object recommendation method, device, equipment and medium |
CN111625690B (en) * | 2020-05-13 | 2024-03-08 | 北京达佳互联信息技术有限公司 | Object recommendation method, device, equipment and medium |
CN112347326A (en) * | 2020-09-29 | 2021-02-09 | 武汉虹旭信息技术有限责任公司 | Crawler detection method and device based on browser end |
CN114928532A (en) * | 2022-05-17 | 2022-08-19 | 北京达佳互联信息技术有限公司 | Method, device, equipment and storage medium for generating alarm message |
CN114928532B (en) * | 2022-05-17 | 2023-12-12 | 北京达佳互联信息技术有限公司 | Alarm message generation method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110413859A (en) | Webpage information search method, apparatus, computer equipment and storage medium | |
CN108595583B (en) | Dynamic graph page data crawling method, device, terminal and storage medium | |
US9363310B2 (en) | Standard commands for native commands | |
CN103530114B (en) | Picture managing method and device | |
CN109600385B (en) | Access control method and device | |
CN110909229A (en) | Webpage data acquisition and storage system based on simulated browser access | |
CN107609150A (en) | A kind of interactive network reptile creation method chosen based on page elements and system | |
CN111324440A (en) | Method, device and equipment for executing automation process and readable storage medium | |
CN108920146A (en) | Page control assembly and visual Simulation operating system | |
CN110378749A (en) | Appraisal procedure, device, terminal device and the storage medium of user data similitude | |
CN108491420A (en) | Configuration method, application server and the computer readable storage medium of web page crawl | |
CN110162682A (en) | A kind of crawling method of network data, device, storage medium and terminal device | |
CN108959619A (en) | Content screen method, user equipment, storage medium and device | |
CN107203470A (en) | Page adjustment method and device | |
CN112988185A (en) | Cloud application updating method, device and system, electronic equipment and storage medium | |
CN108427639B (en) | Automated testing method, application server and computer readable storage medium | |
CN105930487A (en) | Topic search method and apparatus applied to mobile terminal | |
CN109829821A (en) | A kind of abnormal processing method of digital asset address transfer, apparatus and system | |
CN107547944A (en) | Interface realizing method and device, set top box | |
CN111859069B (en) | Network malicious crawler identification method, system, terminal and storage medium | |
CN104298716B (en) | A kind of network crawler system and implementation method for supporting artificial conversation grafting | |
CN107741980A (en) | Topic searching method, topic searcher and electric terminal | |
CN112307464A (en) | Fraud identification method and device and electronic equipment | |
WO2023115968A1 (en) | Method and device for identifying violation data at user end, medium, and program product | |
CN112036843A (en) | Flow element positioning method, device, equipment and medium based on RPA and AI |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||