CN110413859A - Webpage information search method, apparatus, computer equipment and storage medium - Google Patents

Publication number
CN110413859A
Authority
CN
China
Prior art keywords
search
information
target browser
browser
input information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910568616.1A
Other languages
Chinese (zh)
Inventor
王涛
朱葛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910568616.1A priority Critical patent/CN110413859A/en
Publication of CN110413859A publication Critical patent/CN110413859A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

This application relates to the field of data collection, specifically to data crawling technology in which crawling is performed by means of a web crawler, and discloses a crawler-based webpage information search method, apparatus, computer equipment and storage medium. The method comprises: starting a search container that has a crawler function; calling a predefined target API through the search container; loading a corresponding target browser driver according to the target API; and obtaining input information of a user and simulating the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information. In this way, the application can load a virtual target browser driver based on the user's input information and run the target browser to perform a specialized information search, which helps ensure the accuracy of the search results and an information quality that meets the user's needs, improves working efficiency, promotes the development of big data collection technology, and is in line with the trend toward intelligent technology.

Description

Webpage information search method, apparatus, computer equipment and storage medium
Technical field
This application relates to the field of network information acquisition, and in particular to a crawler-based webpage information search method, apparatus, computer equipment and storage medium.
Background art
In the technical field of network information acquisition, a web crawler (also known as a web spider or web robot, and more often called a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm.
With the rapid development of the network, the World Wide Web has become a carrier of vast amounts of information, and extracting and using this information effectively has become a major challenge. Search engines, such as the traditional general-purpose search engines Baidu and Google, serve as tools that help people retrieve information and have become the entrance and guide through which users access the World Wide Web. However, these general-purpose search engines also have certain limitations, such as:
(1) Users in different fields and with different backgrounds often have different retrieval purposes and needs, and the results returned by a general-purpose search engine include a large number of webpages the user does not care about.
(2) The goal of a general-purpose search engine is to achieve the greatest possible network coverage, so the contradiction between limited search engine server resources and unlimited network data resources will deepen further.
(3) As the forms of data on the World Wide Web grow richer and network technology continues to develop, different kinds of data such as pictures, databases, audio, and video multimedia appear in large quantities. General-purpose search engines are often powerless against such information-dense data with a certain structure and cannot discover and obtain it well.
In particular, in this field, some picture crawling tools work by writing code in a programming language to crawl a specific website (URL). However, it must be analyzed in advance whether the website contains the required pictures, and the content of the pictures cannot be controlled. The prior art therefore can no longer meet users' needs and cannot overcome this technical deficiency, let alone use a specialized search engine to obtain professional and reliable search results. This considerably interferes with big data collection, affects subsequent intelligent recognition processing results, and in turn affects the trend toward intelligent technology.
Summary of the invention
This application provides a crawler-based webpage information search method, apparatus, device and storage medium that can load a virtual target browser driver based on the user's input information and run the target browser to perform a specialized information search. This helps ensure the accuracy of the search results and an information quality that meets the user's needs, improves working efficiency, promotes the development of big data collection technology, and is in line with the trend toward intelligent technology.
In a first aspect, this application provides a crawler-based webpage information search method, comprising:
starting a search container that has a crawler function;
calling a predefined target API through the search container;
loading a corresponding target browser driver according to the target API;
obtaining input information of a user, and simulating the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information.
In a second aspect, this application provides a webpage information search apparatus, the webpage information search apparatus comprising:
a starting module, configured to start a search container that has a crawler function;
a calling module, configured to call a predefined target API through the search container;
a driver module, configured to load a corresponding target browser driver according to the target API;
a running module, configured to obtain input information of a user and to simulate the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information.
In a third aspect, this application further provides a computer device, the computer device comprising a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to implement the webpage information search method described above.
In a fourth aspect, this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the webpage information search method described above.
This application discloses a crawler-based webpage information search method, apparatus, computer equipment and storage medium: a search container that has a crawler function is started; a predefined target API is called through the search container; a corresponding target browser driver is loaded according to the target API; and input information of a user is obtained and the running of the target browser is simulated according to the target browser driver so as to perform a webpage information search on the input information. In this way, the application can load a virtual target browser driver based on the user's input information and run the target browser to perform a specialized information search, which helps ensure the accuracy of the search results and an information quality that meets the user's needs, improves working efficiency, promotes the development of big data collection technology, and is in line with the trend toward intelligent technology.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Evidently, the drawings described below show some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the steps of the crawler-based webpage information search method provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of the further steps that follow the step, shown in Fig. 1, of obtaining the user's input information and simulating the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information.
Fig. 3 is a schematic flowchart of the specific sub-steps of the step, shown in Fig. 1, of obtaining the user's input information and simulating the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information.
Fig. 4A is a schematic structural block diagram of a webpage information search apparatus provided by an embodiment of this application.
Fig. 4B is a schematic structural block diagram of a computer device provided by an embodiment of this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are some, rather than all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The flowcharts shown in the drawings are only illustrative. They need not include all of the content and operations/steps, nor must the steps be executed in the order described. For example, some operations/steps may be decomposed, combined, or partially merged, so the actual order of execution may change according to the actual situation.
The embodiments of this application provide a crawler-based webpage information search method, apparatus, computer equipment and storage medium that can load a virtual target browser driver based on the user's input information and run the target browser to perform a specialized information search. This helps ensure the accuracy of the search results and an information quality that meets the user's needs, improves working efficiency, promotes the development of big data collection technology, and is in line with the trend toward intelligent technology.
Some embodiments of this application are described in detail below with reference to the drawings. Provided there is no conflict, the following embodiments and the features in them may be combined with one another.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of the steps of the crawler-based webpage information search method provided by an embodiment of this application. It should be noted that the method of this embodiment can be applied to a user terminal, which may be a desktop computer, a laptop, a tablet computer, a mobile phone, or another artificial intelligence device.
As shown in Fig. 1, the crawler-based webpage information search method described in this application may include, but is not limited to, steps S101 to S104.
S101, starting have the search container of crawler function.
It should be noted that described search container can crawl tool for crawler etc..
It is noted that the search container of the present embodiment starting, can be and start on user terminal backstage, without occurring The mode of user interface, that is to say, that be not necessarily required to that the visualization interfaces such as browser, APP are operated.Certainly, in order to Facilitate user to understand, visual user interface also can be set, be not limited thereto.
For example, the present embodiment described search container is no interface browser, be specifically as follows benv, browser, launcher、Browserjet、CasperJS、DalekJSGhostbuster、HeadlessBrowser、HtmlUnit、 Jasmine-Headless-Webkit、Jaunt、jBrowserDriver、jedi-crawler、Lotte、Nightmare、 PhantomJS, Selenium, SlimerJS, trifleJS or Zombie.js etc. are without interface browser.
S102 transfers predefined target API by described search container.
In the present embodiment, target API refers to application programming interface (application programming Interface, API).
S103 loads corresponding objective browser according to the target API and drives.
Specifically, the objective browser of the present embodiment, can be the professional dedicated classes browsing for user's particular demands Device, or the browser of Advanced Search function.
S104 obtains the input information of user, and the target browsing according to objective browser driving dry run Device is to carry out webpage information search to the input information.
It should be noted that the input information of the present embodiment, can use the text file of json format, for example, Search container uses Selenium, then cooperates ChromeDriver and java language environment API, by specified browser Operation browser is supported in the automation set.
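The flow of steps S101 to S104 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `SearchContainer`, `TARGET_APIS`, `FakeDriver`, and the JSON field `keyword` are hypothetical, and a stub driver stands in for a real browser driver such as ChromeDriver so that the example runs without a browser installed.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FakeDriver:
    """Stand-in for a real browser driver (hypothetical; no browser is launched)."""
    browser: str
    visited: list = field(default_factory=list)

    def search(self, keyword: str) -> str:
        # A real driver would type the keyword into the browser's search box.
        url = f"{self.browser}://search?q={keyword}"
        self.visited.append(url)
        return url


# Predefined target APIs: each name maps to a factory for the matching driver.
TARGET_APIS = {
    "chrome_search": lambda: FakeDriver("chrome"),
    "firefox_search": lambda: FakeDriver("firefox"),
}


class SearchContainer:
    """S101: a search container with a crawler function (sketch only)."""

    def __init__(self, apis):
        self.apis = apis

    def call_target_api(self, name):
        # S102: look up the predefined target API by name.
        return self.apis[name]

    def load_driver(self, name):
        # S103: load the target browser driver that corresponds to the API.
        return self.call_target_api(name)()

    def run(self, api_name, input_json):
        # S104: read the user's JSON input and drive the simulated browser.
        keyword = json.loads(input_json)["keyword"]
        driver = self.load_driver(api_name)
        return driver.search(keyword)


container = SearchContainer(TARGET_APIS)
result = container.run("chrome_search", '{"keyword": "cats"}')
print(result)
```

Keeping the API-to-driver mapping in a plain dictionary means that a further target browser could be supported by adding one entry, without touching the container logic.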
As shown in Fig. 2, in this embodiment, step S104 (obtaining the user's input information and simulating the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information) is followed by steps S105 and S106.
S105: Obtain the expected search result returned when the simulated target browser searches on the input information.
S106: Perform information crawling on the expected search result to crawl the target image and text information.
Specifically, in this embodiment, performing information crawling on the expected search result in S106 to crawl the target image and text information may include: obtaining the returned expected search result; and performing simulated operations on the expected search result according to a predefined set of simulated operations.
Furthermore, performing simulated operations on the expected search result according to the predefined set of simulated operations includes: performing browsing, page turning, saving, content capture, scrolling, filtering, and/or permission verification input according to the simulated operations.
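One way to picture a predefined set of simulated operations is a table that maps operation names to functions applied in sequence to a result page. The sketch below is only an illustration: `ResultPage`, `OPERATIONS`, and the operation names are hypothetical, and an in-memory list stands in for a real page of search results.

```python
class ResultPage:
    """Hypothetical in-memory stand-in for a page of search results."""

    def __init__(self, items, page_size=2):
        self.items = items
        self.page_size = page_size
        self.page = 0      # index of the page currently shown
        self.saved = []    # items captured so far

    def visible(self):
        start = self.page * self.page_size
        return self.items[start:start + self.page_size]


def turn_page(page):
    page.page += 1


def save(page):
    page.saved.extend(page.visible())


def keep_only(predicate):
    # Build a filtering operation from a predicate on items.
    def operation(page):
        page.items = [item for item in page.items if predicate(item)]
    return operation


# The predefined set of simulated operations, keyed by name.
OPERATIONS = {
    "turn_page": turn_page,
    "save": save,
    "filter_jpg": keep_only(lambda name: name.endswith(".jpg")),
}


def run_operations(page, names):
    # Apply the named operations in order and return what was saved.
    for name in names:
        OPERATIONS[name](page)
    return page.saved


page = ResultPage(["a.jpg", "b.png", "c.jpg", "d.jpg"])
saved = run_operations(page, ["filter_jpg", "save", "turn_page", "save"])
print(saved)
```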
It is easy to understand that, by using simulated operations, this embodiment is convenient for the user, avoids IP address restrictions and various unreasonable verifications, and helps preserve the privacy and stability of the crawling activity as far as possible.
For the permission verification input, if the user needs to use multiple threads and multiple target browsers, verification may be encountered. In that case this embodiment may proceed as follows: keep using the target browser to access the webpages to be crawled until an image CAPTCHA appears; obtain the CAPTCHA region; after the image CAPTCHA is verified successfully, select other page elements that have not yet been crawled and continue to access the webpage with the target browser; and, once access is restricted by IP address, switch to another target browser to search and crawl.
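The handling described above (continue after a solved CAPTCHA, switch target browser once access is restricted by IP address) can be sketched as a small control loop. This is a sketch under assumptions: `fetch` and `solve_captcha` are hypothetical callbacks supplied by the caller, and the status strings `"ok"`, `"captcha"`, and `"ip_blocked"` are invented for illustration.

```python
def crawl_with_fallback(pages, browsers, fetch, solve_captcha):
    """Crawl pages in order, switching target browser when access is blocked.

    fetch(browser, page) returns "ok", "captcha", or "ip_blocked".
    solve_captcha(browser, page) returns True if verification succeeded.
    """
    remaining = list(browsers)
    current = remaining.pop(0)
    collected = []
    for page in pages:
        while True:
            status = fetch(current, page)
            if status == "captcha" and solve_captcha(current, page):
                status = "ok"  # verification succeeded; the page is accessible
            if status == "ok":
                collected.append((current, page))
                break
            # IP-restricted (or CAPTCHA failed): switch to the next browser.
            if not remaining:
                return collected
            current = remaining.pop(0)
    return collected


# Simulated responses: chrome hits a CAPTCHA on p2 and an IP block on p3.
responses = {("chrome", "p1"): "ok", ("chrome", "p2"): "captcha",
             ("chrome", "p3"): "ip_blocked"}
crawled = crawl_with_fallback(
    pages=["p1", "p2", "p3"],
    browsers=["chrome", "firefox"],
    fetch=lambda b, p: responses.get((b, p), "ok"),
    solve_captcha=lambda b, p: True,
)
print(crawled)
```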
It is worth noting that this embodiment may also handle verification by intelligent recognition, which may include the following process: when feedback is received that the target website requires a verification code to be entered, obtain the target verification code picture; feed the target verification code picture into a pre-trained machine learning model for recognition and obtain the verification code answer output by the model; perform the verification operation required by the target website using the output answer; and, after passing the target website's verification, crawl data from the target website.
For example, this embodiment may specifically include: obtaining multiple verification code pictures; for each verification code picture, cutting the picture into picture blocks that each contain a single verification character; binarizing each picture block; labeling each binarized picture block with its corresponding verification code answer; feeding each binarized picture block as input to the machine learning model to obtain the training answers output by the model; adjusting the model parameters of the machine learning model, with the goal of minimizing the error between each training answer obtained and the corresponding labeled answer; and, if the error rate between the output training answers and the labeled answers is less than a preset threshold, determining that training of the machine learning model is complete.
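The preprocessing and the stopping criterion in this training procedure can be illustrated without a real model. The sketch below assumes grayscale pictures given as nested lists, equal-width cutting (the source does not say how blocks are cut), and illustrative values for the binarization threshold and the error-rate threshold; the machine learning model itself is not shown.

```python
def split_blocks(image, n_chars):
    """Cut a picture into n_chars equal-width blocks, one per character."""
    width = len(image[0]) // n_chars
    return [[row[i * width:(i + 1) * width] for row in image]
            for i in range(n_chars)]


def binarize(block, threshold=128):
    """Map each grayscale pixel to 1 (dark, ink) or 0 (light, background)."""
    return [[1 if px < threshold else 0 for px in row] for row in block]


def error_rate(predicted, labels):
    """Fraction of training answers that differ from the labeled answers."""
    wrong = sum(p != t for p, t in zip(predicted, labels))
    return wrong / len(labels)


def training_done(predicted, labels, max_error=0.05):
    # Training is considered complete once the error rate is below the threshold.
    return error_rate(predicted, labels) < max_error


# A tiny 2x4 "picture" holding two characters, each 2 pixels wide.
picture = [[0, 255, 200, 10],
           [30, 240, 90, 250]]
blocks = [binarize(b) for b in split_blocks(picture, n_chars=2)]
print(blocks)
```

In practice the binarized blocks and their labels would be passed to whatever classifier is chosen, and `training_done` would be checked after each round of parameter adjustment.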
It is worth noting that, before S101 (starting the search container that has a crawler function), this embodiment may further include: building the search container and obtaining the user's search purpose; and predefining, according to the search purpose, at least one target browser whose function matches it.
It should be particularly noted that predefining at least one function-matched target browser according to the search purpose specifically includes: predefining, according to the multiple functions of the various search purposes, multiple corresponding matched target browser queues.
It is easy to understand that, as shown in Fig. 3, if a target browser queue has been set up, S104 (obtaining the user's input information and simulating the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information) may further include steps S1041 to S1044.
S1041: Obtain the user's input information.
S1042: Identify the user's current purpose function from the input information.
S1043: According to the identified current purpose function, simulate the running of the matching target browser in the predefined target browser queue.
S1044: Perform a webpage information search on the input information using the simulated target browser.
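Steps S1041 to S1044 amount to a lookup from an identified purpose to a queue of matching browsers. A minimal sketch follows, assuming that the purpose is carried in a hypothetical `type` field of the JSON input and that `BROWSER_QUEUES` is a predefined mapping; the source does not specify how the purpose is identified.

```python
import json

# Predefined target browser queues, one per purpose function (illustrative).
BROWSER_QUEUES = {
    "image": ["chrome", "baidu"],
    "text": ["firefox"],
}


def identify_purpose(info):
    # S1042: hypothetical rule; read an explicit "type" field, default to text.
    return info.get("type", "text")


def select_browser(input_json, queues):
    info = json.loads(input_json)      # S1041: obtain the user's input
    purpose = identify_purpose(info)   # S1042: identify the purpose function
    browser = queues[purpose][0]       # S1043: first matching browser in queue
    return browser, info["keyword"]    # S1044 would run the search with these


choice = select_browser('{"keyword": "sunset", "type": "image"}', BROWSER_QUEUES)
print(choice)
```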
It is worth noting that predefining multiple corresponding matched target browser queues according to the multiple functions of the various search purposes may further include: determining whether, for the same function of the same search purpose, there are multiple matching target browsers; and, if so, setting priorities for the multiple target browsers and setting a trigger condition for switching to an effective target browser, where the trigger condition includes that the first target browser, which currently has the highest priority, is unavailable or abnormal and that the second target browser, which has the next-highest priority, is available.
For example, if the target browsers for searching pictures include Google and Baidu, Google may be set to be used first, with Baidu second.
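The priority rule with its switching condition can be sketched as below. The function name `pick_browser`, the numeric priorities (lower number means higher priority), and the `is_available` callback are assumptions made for illustration.

```python
def pick_browser(candidates, is_available):
    """Return the highest-priority available browser, or None if none is usable.

    candidates: list of (name, priority) pairs; a lower number means a
    higher priority. is_available(name) reports whether the browser works.
    """
    for name, _priority in sorted(candidates, key=lambda c: c[1]):
        if is_available(name):
            return name
    return None


image_browsers = [("baidu", 2), ("google", 1)]

# Normal case: the highest-priority browser is available.
first_choice = pick_browser(image_browsers, lambda name: True)
# Trigger condition: google is unavailable, so the next priority is used.
fallback = pick_browser(image_browsers, lambda name: name != "google")
print(first_choice, fallback)
```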
Specifically, the target browser of this embodiment may be the Google Chrome browser, Baidu browser, Lunascape browser, Wiseie browser, IE browser, or Firefox browser.
It should be added that the webpage information search method may further include: setting a proxy server for the target browser in advance; or, when loading the corresponding target browser driver according to the target API, determining whether the proxy server is restricted by country, region, or user group and, if so, obtaining pre-stored proxy server setting parameters and configuring the proxy server in real time according to those parameters.
For example, when the Google Chrome browser is used, the search container handles network request timeouts, IP access permission confirmations, and image CAPTCHAs differently. When a webpage to be crawled is accessed, the content of the webpage is downloaded and saved, and requirements such as the number of items saved, the time period, and the size limit of each image may be set.
For example, Google Chrome may be set to use proxy IPs so that the search container keeps working normally. This embodiment may proceed as follows: load the input file of the search container into the memory of the user terminal; if a file of proxy IPs required for access was generated by a previous run of the program, load its content; take one line of the input file content and obtain a proxy IP; initiate a network request for that line using the proxy IP; determine whether the page load time exceeds the set threshold and, if so, determine whether the proxy IP is allowed access; and, if it is not allowed, switch to another proxy IP and initiate the network request again, until a proxy that is properly authorized for use is obtained.
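The proxy rotation described above can be sketched as a loop over input lines and a pool of proxy IPs. The sketch makes several assumptions: `request` and `is_allowed` are hypothetical callbacks, the proxy addresses are placeholders, and a proxy that is slow but still permitted is kept, since the source does not say what happens in that case.

```python
def fetch_with_proxies(lines, proxies, request, max_load_s, is_allowed):
    """Request each input line through a proxy, replacing proxies that fail.

    request(proxy, line) returns the page load time in seconds.
    is_allowed(proxy) reports whether the proxy IP is permitted to access.
    """
    pool = list(proxies)
    results = []
    for line in lines:
        while pool:
            proxy = pool[0]
            load_s = request(proxy, line)
            # Accept the proxy if the page loaded in time, or if it was slow
            # but is still permitted (assumption; see the lead-in).
            if load_s <= max_load_s or is_allowed(proxy):
                results.append((line, proxy))
                break
            pool.pop(0)  # slow and not permitted: replace the proxy and retry
        else:
            break  # no usable proxy left
    return results


# Simulated load times per proxy (placeholder addresses).
load_times = {"10.0.0.1:8080": 5.0, "10.0.0.2:8080": 0.5}
results = fetch_with_proxies(
    lines=["cats", "dogs"],
    proxies=list(load_times),
    request=lambda proxy, line: load_times[proxy],
    max_load_s=2.0,
    is_allowed=lambda proxy: False,
)
print(results)
```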
In addition, this embodiment may also define the order in which the crawler crawls the target browser queue, define how results are handled according to a relevance threshold between the search results and the user's search purpose, and set a filtering method for when the number of search results exceeds a certain amount.
It should be added that this application also first classifies the target APIs that are open for calling, so that different target APIs can be called for different search purposes. For example, Google Chrome provides a variety of target APIs for different uses, including the following.
Table 1: Stable API
Table 2: Beta API
Table 3: Dev API
This preferred embodiment uses the Google Chrome browser driver to simulate the Google Chrome browser and call Google search. Once input information such as a keyword is passed in, the pictures corresponding to the Google search keyword are crawled automatically. Because picture content and quality are backed by the Google search engine, accuracy and quality are high.
Referring to Fig. 4A in conjunction with Figs. 1 to 3, an embodiment of this application provides a webpage information search apparatus, which may include a starting module 31, a calling module 32, a driver module 33, and a running module 34.
Specifically, the starting module 31 is configured to start a search container that has a crawler function.
It should be noted that the search container may be a crawler, a crawling tool, or the like.
It is worth noting that the search container of this embodiment may be started in the background of the user terminal without displaying a user interface; that is, it does not necessarily have to be operated through a visual interface such as a browser or an app. Of course, a visual user interface may also be provided for the user's convenience; no limitation is imposed here.
For example, the search container of this embodiment is a headless browser, which may specifically be benv, browser-launcher, Browserjet, CasperJS, DalekJS, Ghostbuster, HeadlessBrowser, HtmlUnit, Jasmine-Headless-Webkit, Jaunt, jBrowserDriver, jedi-crawler, Lotte, Nightmare, PhantomJS, Selenium, SlimerJS, trifleJS, Zombie.js, or another headless browser.
The calling module 32 is configured to call a predefined target API through the search container.
In this embodiment, the target API is an application programming interface.
The driver module 33 is configured to load a corresponding target browser driver according to the target API.
Specifically, the target browser of this embodiment may be a specialized, dedicated browser for a user's particular needs, or a browser with an advanced search function.
The running module 34 is configured to obtain the user's input information and to simulate the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information.
It should be noted that the input information of this embodiment may be a text file in JSON format. For example, if the search container uses Selenium, it works together with ChromeDriver and the Java language API to operate the browser through the automation support set up for the specified browser.
In this embodiment, after the running module 34 obtains the user's input information and simulates the running of the target browser according to the target browser driver so as to perform a webpage information search on the input information, the running module 34 is further configured to obtain the expected search result returned when the simulated target browser searches on the input information, and to perform information crawling on the expected search result to crawl the target image and text information.
Specifically, the running module 34 being configured to perform information crawling on the expected search result to crawl the target image and text information may include: the running module 34 being configured to obtain the returned expected search result, and to perform simulated operations on the expected search result according to a predefined set of simulated operations.
Furthermore, performing simulated operations on the expected search result according to the predefined set of simulated operations includes: performing browsing, page turning, saving, content capture, scrolling, filtering, and/or permission verification input according to the simulated operations.
It is easy to understand that, by using simulated operations, this embodiment is convenient for the user, avoids IP address restrictions and various unreasonable verifications, and helps preserve the privacy and stability of the crawling activity as far as possible.
For the permission verification input, if the user needs to use multiple threads and multiple target browsers, verification may be encountered. In that case this embodiment may proceed as follows: keep using the target browser to access the webpages to be crawled until an image CAPTCHA appears; obtain the CAPTCHA region; after the image CAPTCHA is verified successfully, select other page elements that have not yet been crawled and continue to access the webpage with the target browser; and, once access is restricted by IP address, switch to another target browser to search and crawl.
It is worth noting that this embodiment may also handle verification by intelligent recognition, which may include the following process: when feedback is received that the target website requires a verification code to be entered, obtain the target verification code picture; feed the target verification code picture into a pre-trained machine learning model for recognition and obtain the verification code answer output by the model; perform the verification operation required by the target website using the output answer; and, after passing the target website's verification, crawl data from the target website.
For example, this embodiment may specifically include: obtaining multiple verification code pictures; for each verification code picture, cutting the picture into picture blocks that each contain a single verification character; binarizing each picture block; labeling each binarized picture block with its corresponding verification code answer; feeding each binarized picture block as input to the machine learning model to obtain the training answers output by the model; adjusting the model parameters of the machine learning model, with the goal of minimizing the error between each training answer obtained and the corresponding labeled answer; and, if the error rate between the output training answers and the labeled answers is less than a preset threshold, determining that training of the machine learning model is complete.
It is worth noting that, before the starting module 31 of this embodiment starts the search container with the crawler function, the apparatus may further include: a building module (not shown) for building the search container and obtaining the search purpose of the user; and a definition module (not shown) for predefining, according to the search purpose, at least one target browser whose function matches.
It should be particularly noted that, in this embodiment, the definition module predefining at least one function-matched target browser according to the search purpose specifically includes: the definition module predefines, according to the multiple functions of the various search purposes, multiple corresponding matching target browser queues.
It is easy to understand that, if target browser queues are provided, the running module 34 of this embodiment obtains the input information of the user and simulates running the target browser according to the target browser driver so as to perform a webpage information search on the input information, as follows: the running module 34 obtains the input information of the user; the running module 34 identifies the current purpose function of the user from the input information; the running module 34 simulates running, according to the identified current purpose function, the matching target browser in the predefined target browser queue; and the running module 34 performs the webpage information search on the input information using the simulated target browser.
It is worth noting that, in this embodiment, predefining the multiple corresponding matching target browser queues according to the multiple functions of the various search purposes may further include: the running module 34 determines whether multiple matching target browsers exist for the same function of the same search purpose; if so, it assigns priorities to the multiple target browsers and sets a trigger condition for switching to an effective target browser, where the trigger condition includes the first target browser, which currently has the highest priority, being unavailable or abnormal while the second target browser, which has the next-highest priority, is available.
For example, if the target browsers for searching pictures include Google and Baidu, Google may be set as the preferred choice, with Baidu next.
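As a non-limiting illustration, the priority setting and switching condition described above might be sketched as follows. The `TargetBrowser` record, the convention that a lower number means a higher priority, and the `is_available` health check are hypothetical names introduced for this sketch only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TargetBrowser:
    name: str
    priority: int                     # lower number = higher priority (assumption)
    is_available: Callable[[], bool]  # hypothetical health check


def pick_browser(queue: List[TargetBrowser]) -> Optional[TargetBrowser]:
    """Return the highest-priority browser that is currently usable.

    A browser whose check returns False or raises is treated as
    unavailable or abnormal, which triggers the switch to the next one.
    """
    for browser in sorted(queue, key=lambda b: b.priority):
        try:
            if browser.is_available():
                return browser
        except Exception:
            continue  # abnormal browser: fall through to the next priority
    return None
```

With Google at priority 1 and Baidu at priority 2, `pick_browser` returns Google while its check succeeds and falls back to Baidu otherwise.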
Specifically, the target browser of this embodiment may be the Google Chrome browser, the Baidu browser, the Lunascape browser, the Wiseie browser, the IE browser or the Firefox browser.
It should be added that the running module 34 is configured to set the proxy server of the target browser in advance; or, when the corresponding target browser driver is loaded according to the target API, the running module 34 is configured to determine whether the proxy server is restricted to a country, region or user group and, if so, to obtain pre-stored proxy server setting parameters and set the proxy server in real time according to the setting parameters.
For example, when the Google Chrome browser is used, the search container handles network request timeouts, IP access permissions that need to be confirmed, and picture verification codes in different ways. When a webpage that the crawler needs to crawl is accessed, the content of the webpage is downloaded and saved, and requirements such as the number of items saved, the time period, and the size limit of each image can be set.
For example, Google Chrome may be set to use a proxy IP so as to ensure normal use of the search container. The running module 34 of this embodiment is configured to: load the input file of the search container into the memory of the user terminal; when a file exists that was generated by a previous run of the program and records content that needs to be accessed through a specific proxy IP, load the content of that file; take one line of the input file content and obtain a proxy IP; initiate a network request for that line of content using the proxy IP; determine whether the webpage loading time exceeds a set threshold and, if so, determine whether the proxy IP is allowed access; and, if it is not allowed, switch to another proxy IP and initiate the network request again, until a proxy that is normally authorized and meets the requirements is obtained.
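As a non-limiting illustration, the proxy replacement loop described above might be sketched as follows. The `fetch` callable is a hypothetical stand-in for a real network request; it is assumed to report the page loading time in seconds and whether the proxy was allowed access. The 10-second default is likewise an assumption.

```python
from typing import Callable, Iterable, Optional, Tuple

# fetch(url, proxy) -> (load_seconds, access_allowed); hypothetical stand-in
Fetch = Callable[[str, str], Tuple[float, bool]]


def first_usable_proxy(
    url: str,
    proxies: Iterable[str],
    fetch: Fetch,
    max_load_seconds: float = 10.0,
) -> Optional[str]:
    """Try proxies in order and return the first one that can be used."""
    for proxy in proxies:
        load_seconds, allowed = fetch(url, proxy)
        if load_seconds <= max_load_seconds:
            return proxy  # page loaded within the threshold
        if allowed:
            return proxy  # slow, but access is authorized, so keep it
        # Slow and not allowed: replace the proxy and request again.
    return None
```

Passing the request function in as a parameter keeps the sketch independent of any particular HTTP client or browser driver.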
In addition, this embodiment may also define the order in which the crawler crawls the target browser queue, define a relevance threshold between the search results and the search purpose of the user together with the corresponding processing method, and set the screening method to be applied when the number of search results exceeds a certain amount.
It should be added that the present invention also first needs to classify the target APIs that can be openly called, so that different target APIs are called for different search purposes; for example, Google Chrome may provide multiple target APIs for different uses.
This preferred embodiment uses the Google Chrome browser driver to simulate the Google Chrome browser and call Google Search. Once input information such as a keyword is passed in, the pictures corresponding to the keyword are crawled automatically through Google Search. Because picture content and quality are backed by the Google search engine, the accuracy and quality of the crawled pictures are high.
Specifically, with continued reference to Fig. 1 to Fig. 3 and referring to Fig. 4B, an embodiment of the application provides a computer device. The computer device may include a memory 40 and a processor 41; the memory 40 is configured to store a computer program, and the processor 41 is configured to execute the computer program and, when executing the computer program, to implement the webpage information search method described with reference to Fig. 1 to Fig. 3 and the embodiments.
Specifically, the processor 41 is configured to start a search container with a crawler function.
It should be noted that the search container may be a crawler tool or the like.
It is worth noting that the search container of this embodiment may be started in the background of the user terminal without presenting a user interface; that is, it does not necessarily have to be operated through a visual interface such as a browser or an app. Of course, to make things easier for the user to follow, a visual user interface may also be provided, and no limitation is imposed here.
For example, the search container of this embodiment is a headless browser, which may specifically be benv, browser-launcher, Browserjet, CasperJS, DalekJS, Ghostbuster, HeadlessBrowser, HtmlUnit, Jasmine-Headless-Webkit, Jaunt, jBrowserDriver, jedi-crawler, Lotte, Nightmare, PhantomJS, Selenium, SlimerJS, trifleJS, Zombie.js or another headless browser.
The processor 41 is configured to retrieve a predefined target API through the search container.
In this embodiment, the target API refers to an application programming interface.
The processor 41 is configured to load the corresponding target browser driver according to the target API.
Specifically, the target browser of this embodiment may be a specialized browser dedicated to a particular user need, or a browser with an advanced search function.
The processor 41 is configured to obtain the input information of the user and to simulate running the target browser according to the target browser driver, so as to perform a webpage information search on the input information.
It should be noted that the input information of this embodiment may take the form of a text file in JSON format. For example, if the search container uses Selenium, it works together with ChromeDriver and the Java-language API to operate the specified browser through the automation support that the browser provides.
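As a non-limiting illustration, reading input information from a JSON text file might be sketched as follows. The application does not disclose the file layout, so the `searches`, `keyword` and `max_images` fields, the default of 10, and the `q` query parameter are all hypothetical. The sketch uses Python rather than the Java environment mentioned above and stops short of driving a browser.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def load_search_tasks(text: str) -> List[Dict[str, object]]:
    """Parse the JSON input text into a list of search tasks."""
    data = json.loads(text)
    tasks = []
    for item in data["searches"]:  # hypothetical top-level field
        tasks.append({
            "keyword": item["keyword"],
            "max_images": int(item.get("max_images", 10)),  # assumed default
        })
    return tasks


def build_search_url(base: str, keyword: str) -> str:
    """Append the keyword to a search endpoint as a query string."""
    return base + "?" + urlencode({"q": keyword})
```

Each resulting URL would then be handed to whichever browser driver the search container has loaded.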
In this embodiment, after the processor 41 obtains the input information of the user and simulates running the target browser according to the target browser driver so as to perform a webpage information search on the input information, the processor 41 is further configured to obtain the expected search result returned when the simulated target browser searches for the input information, and the processor 41 is configured to crawl information from the expected search result so as to capture the target image and text information.
Specifically, in this embodiment, the processor 41 crawling information from the expected search result so as to capture the target image and text information may include: the processor 41 obtains the returned expected search result, and performs simulated operations on the expected search result according to a predefined simulated operation set.
Furthermore, in this embodiment, performing simulated operations on the expected search result according to the predefined simulated operation set includes: performing, according to the simulated operations, browsing, page turning, saving, content capture, scrolling, screening and/or permission-verification input.
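As a non-limiting illustration, a predefined simulated operation set might be sketched as a table that maps operation names to functions. The `FakePage` class is a hypothetical stand-in for a rendered results page (a real run would wrap a browser driver), and only four of the listed operations are shown; the `.jpg` screening rule is an assumption.

```python
from typing import Callable, Dict, Iterable, List


class FakePage:
    """Stand-in for a rendered results page."""

    def __init__(self, items: Iterable[str]) -> None:
        self.items: List[str] = list(items)
        self.page_no = 1
        self.saved: List[str] = []
        self.log: List[str] = []


def browse(page: FakePage) -> None:
    page.log.append(f"browse page {page.page_no}")


def turn_page(page: FakePage) -> None:
    page.page_no += 1


def screen(page: FakePage) -> None:
    # Keep only image links; the .jpg rule is illustrative only.
    page.items = [item for item in page.items if item.endswith(".jpg")]


def save(page: FakePage) -> None:
    page.saved.extend(page.items)


# The predefined simulated operation set, keyed by operation name.
OPERATIONS: Dict[str, Callable[[FakePage], None]] = {
    "browse": browse,
    "page_turn": turn_page,
    "screen": screen,
    "save": save,
}


def run_operations(page: FakePage, names: Iterable[str]) -> FakePage:
    """Apply the named operations to the page in the given order."""
    for name in names:
        OPERATIONS[name](page)
    return page
```

Keeping the operations in a table means a new operation, such as scrolling or content capture, can be added without changing `run_operations`.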
It is easy to understand that, by using simulated operations, this embodiment is convenient for the user, avoids restricted IP addresses and various unreasonable verifications, and ensures the privacy and stability of the crawling action to the greatest extent.
For the permission-verification input case, if the user needs to use multiple threads with multiple target browsers, verification may be encountered, and this embodiment may handle it as follows: keep using the target browser to access the webpages that the crawler needs to crawl until a picture verification code appears; obtain the region of the picture verification code; after the picture verification code is verified successfully, select other webpage elements that have not yet been crawled and continue to access webpages with the target browser; and, once access from the IP address is restricted, switch to another target browser to continue searching and crawling.
It is worth noting that this embodiment may also apply intelligent recognition to the verification step, which may include the following process: when feedback is received that the target website requires a verification code to be entered, obtaining the target verification code picture; feeding the target verification code picture into a pre-trained machine learning model for recognition, and obtaining the verification code answer output by the machine learning model; performing the verification operation required by the target website using the output verification code answer; and, after passing the verification of the target website, crawling data from the target website.
For example, this embodiment may specifically include: obtaining multiple verification code pictures; for each verification code picture, cutting the picture into picture blocks that each contain a single verification code character; binarizing each picture block; labelling each binarized picture block with its corresponding verification code answer; feeding each binarized picture block into the machine learning model as input, and obtaining the training answer output by the machine learning model; using the training answers as the optimization target, adjusting the model parameters of the machine learning model so as to minimize the error between each training answer and the corresponding labelled verification code answer; and, if the error rate between the output training answers and the labelled verification code answers is below a preset threshold, determining that training of the machine learning model is complete.
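As a non-limiting illustration, the check that ends training might be sketched as follows. The application does not name a model, so this sketch swaps in a simple nearest-template classifier (one averaged template per character) in place of iterative parameter adjustment. The flattened-tuple block format and the 5% threshold are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Block = Tuple[int, ...]     # flattened binarized pixels of one character block
Sample = Tuple[Block, str]  # (block, labelled verification code answer)


def fit_templates(samples: Sequence[Sample]) -> Dict[str, List[float]]:
    """Average the labelled blocks of each character into one template."""
    grouped: Dict[str, List[Block]] = defaultdict(list)
    for block, label in samples:
        grouped[label].append(block)
    return {
        label: [sum(column) / len(blocks) for column in zip(*blocks)]
        for label, blocks in grouped.items()
    }


def predict(templates: Dict[str, List[float]], block: Block) -> str:
    """Return the label whose template is closest in squared distance."""
    def distance(label: str) -> float:
        return sum((p - q) ** 2 for p, q in zip(block, templates[label]))
    return min(templates, key=distance)


def error_rate(templates: Dict[str, List[float]], samples: Sequence[Sample]) -> float:
    """Fraction of samples whose predicted answer differs from the label."""
    wrong = sum(1 for block, label in samples if predict(templates, block) != label)
    return wrong / len(samples)


def training_complete(
    templates: Dict[str, List[float]],
    samples: Sequence[Sample],
    threshold: float = 0.05,
) -> bool:
    """Training is treated as complete once the error rate is below the threshold."""
    return error_rate(templates, samples) < threshold
```

A model trained by gradient descent would call the same `error_rate` check after each round of parameter updates and stop once it falls below the threshold.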
It is worth noting that, before the processor 41 of this embodiment starts the search container with the crawler function, the method may further include: building the search container and obtaining the search purpose of the user; and predefining, according to the search purpose, at least one target browser whose function matches.
It should be particularly noted that, in this embodiment, predefining at least one function-matched target browser according to the search purpose specifically includes: predefining, according to the multiple functions of the various search purposes, multiple corresponding matching target browser queues.
It is easy to understand that, if target browser queues are provided, the processor 41 of this embodiment obtains the input information of the user and simulates running the target browser according to the target browser driver so as to perform a webpage information search on the input information, as follows: the processor 41 obtains the input information of the user; the processor 41 identifies the current purpose function of the user from the input information; the processor 41 simulates running, according to the identified current purpose function, the matching target browser in the predefined target browser queue; and the processor 41 performs the webpage information search on the input information using the simulated target browser.
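As a non-limiting illustration, identifying the current purpose function from the input information and selecting the matching queue might be sketched as follows. The application does not say how the purpose is identified; the keyword hints, the purpose names and the queue contents below are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from purpose function to an ordered target browser queue.
BROWSER_QUEUES: Dict[str, List[str]] = {
    "image": ["chrome", "baidu"],
    "news": ["firefox"],
    "general": ["chrome"],
}

# Hypothetical keyword hints used to recognize each purpose function.
PURPOSE_HINTS: Dict[str, Tuple[str, ...]] = {
    "image": ("picture", "photo", "image"),
    "news": ("news", "headline"),
}


def identify_purpose(user_input: str) -> str:
    """Return the first purpose whose hint appears in the input, else 'general'."""
    text = user_input.lower()
    for purpose, hints in PURPOSE_HINTS.items():
        if any(hint in text for hint in hints):
            return purpose
    return "general"


def select_queue(user_input: str) -> List[str]:
    """Look up the predefined browser queue for the identified purpose."""
    return BROWSER_QUEUES[identify_purpose(user_input)]
```

A production system would more plausibly use a classifier or an explicit field in the input file than substring matching.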
It is worth noting that, in this embodiment, predefining the multiple corresponding matching target browser queues according to the multiple functions of the various search purposes may further include: the processor 41 determines whether multiple matching target browsers exist for the same function of the same search purpose; if so, it assigns priorities to the multiple target browsers and sets a trigger condition for switching to an effective target browser, where the trigger condition includes the first target browser, which currently has the highest priority, being unavailable or abnormal while the second target browser, which has the next-highest priority, is available.
For example, if the target browsers for searching pictures include Google and Baidu, Google may be set as the preferred choice, with Baidu next.
Specifically, the target browser of this embodiment may be the Google Chrome browser, the Baidu browser, the Lunascape browser, the Wiseie browser, the IE browser or the Firefox browser.
It should be added that the processor 41 is configured to set the proxy server of the target browser in advance; or, when the corresponding target browser driver is loaded according to the target API, the processor 41 is configured to determine whether the proxy server is restricted to a country, region or user group and, if so, to obtain pre-stored proxy server setting parameters and set the proxy server in real time according to the setting parameters.
For example, when the Google Chrome browser is used, the search container handles network request timeouts, IP access permissions that need to be confirmed, and picture verification codes in different ways. When a webpage that the crawler needs to crawl is accessed, the content of the webpage is downloaded and saved, and requirements such as the number of items saved, the time period, and the size limit of each image can be set.
For example, Google Chrome may be set to use a proxy IP so as to ensure normal use of the search container. The processor 41 of this embodiment is configured to: load the input file of the search container into the memory of the user terminal; when a file exists that was generated by a previous run of the program and records content that needs to be accessed through a specific proxy IP, load the content of that file; take one line of the input file content and obtain a proxy IP; initiate a network request for that line of content using the proxy IP; determine whether the webpage loading time exceeds a set threshold and, if so, determine whether the proxy IP is allowed access; and, if it is not allowed, switch to another proxy IP and initiate the network request again, until a proxy that is normally authorized and meets the requirements is obtained.
In addition, this embodiment may also define the order in which the crawler crawls the target browser queue, define a relevance threshold between the search results and the search purpose of the user together with the corresponding processing method, and set the screening method to be applied when the number of search results exceeds a certain amount.
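As a non-limiting illustration, applying a relevance threshold and a cap on the number of results might be sketched as follows. The `(url, score)` pair format, the 0.5 threshold, the cap of 100, and the choice to keep the highest-scoring results are assumptions; the application leaves the scoring and screening methods open.

```python
from typing import List, Tuple

Result = Tuple[str, float]  # (url, relevance score in [0, 1]); assumed format


def screen_results(
    results: List[Result],
    min_relevance: float = 0.5,
    max_results: int = 100,
) -> List[Result]:
    """Drop results below the relevance threshold, then keep the top-scoring ones."""
    kept = [result for result in results if result[1] >= min_relevance]
    kept.sort(key=lambda result: result[1], reverse=True)
    return kept[:max_results]
```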
It should be added that the present invention also first needs to classify the target APIs that can be openly called, so that different target APIs are called for different search purposes; for example, Google Chrome may provide multiple target APIs for different uses.
This preferred embodiment uses the Google Chrome browser driver to simulate the Google Chrome browser and call Google Search. Once input information such as a keyword is passed in, the pictures corresponding to the keyword are crawled automatically through Google Search. Because picture content and quality are backed by the Google search engine, the accuracy and quality of the crawled pictures are high.
With reference to one or more of the above embodiments, the application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the webpage information search method described with reference to Fig. 1 to Fig. 3 and the embodiments.
It should be understood that the processor of this embodiment may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) equipped on the computer device.
The above is only a specific embodiment of the application, but the protection scope of the application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the application, and these modifications or replacements shall all fall within the protection scope of the application. Therefore, the protection scope of the application shall be subject to the protection scope of the claims.

Claims (10)

1. A crawler-based webpage information search method, characterized by comprising:
starting a search container with a crawler function;
retrieving a predefined target API through the search container;
loading a corresponding target browser driver according to the target API;
obtaining input information of a user, and simulating running the target browser according to the target browser driver so as to perform a webpage information search on the input information.
2. The webpage information search method according to claim 1, characterized in that, after the obtaining input information of a user and simulating running the target browser according to the target browser driver so as to perform a webpage information search on the input information, the method further comprises:
obtaining the expected search result returned when the simulated target browser searches for the input information;
crawling information from the expected search result so as to capture target image and text information.
3. The webpage information search method according to claim 2, characterized in that the crawling information from the expected search result so as to capture target image and text information comprises:
obtaining the returned expected search result;
performing simulated operations on the expected search result according to a predefined simulated operation set.
4. The webpage information search method according to claim 3, characterized in that the performing simulated operations on the expected search result according to a predefined simulated operation set comprises:
performing, according to the simulated operations, browsing, page turning, saving, content capture, scrolling, screening and/or permission-verification input.
5. The webpage information search method according to claim 1, characterized in that, before the starting a search container with a crawler function, the method comprises:
building the search container, and obtaining a search purpose of the user;
predefining, according to the search purpose, at least one target browser whose function matches.
6. The webpage information search method according to claim 5, characterized in that the predefining, according to the search purpose, at least one target browser whose function matches specifically comprises:
predefining, according to multiple functions of various search purposes, multiple corresponding matching target browser queues.
7. The webpage information search method according to claim 6, characterized in that the obtaining input information of a user and simulating running the target browser according to the target browser driver so as to perform a webpage information search on the input information further comprises:
obtaining the input information of the user;
identifying a current purpose function of the user from the input information;
simulating running, according to the identified current purpose function, the matching target browser in the predefined target browser queue;
performing the webpage information search on the input information using the simulated target browser.
8. A webpage information search apparatus, characterized in that the webpage information search apparatus comprises:
a starting module, configured to start a search container with a crawler function;
a retrieval module, configured to retrieve a predefined target API through the search container;
a driver module, configured to load a corresponding target browser driver according to the target API;
a running module, configured to obtain input information of a user, and to simulate running the target browser according to the target browser driver so as to perform a webpage information search on the input information.
9. A computer device, characterized in that the computer device comprises a memory and a processor;
the memory is configured to store a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to implement the webpage information search method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the webpage information search method according to any one of claims 1 to 7.
CN201910568616.1A 2019-06-27 2019-06-27 Webpage information search method, apparatus, computer equipment and storage medium Pending CN110413859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910568616.1A CN110413859A (en) 2019-06-27 2019-06-27 Webpage information search method, apparatus, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910568616.1A CN110413859A (en) 2019-06-27 2019-06-27 Webpage information search method, apparatus, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110413859A true CN110413859A (en) 2019-11-05

Family

ID=68358351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910568616.1A Pending CN110413859A (en) 2019-06-27 2019-06-27 Webpage information search method, apparatus, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110413859A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368167A (en) * 2020-03-06 2020-07-03 北京师范大学 Chinese literature data automatic acquisition method based on web crawler technology
CN111597421A (en) * 2020-04-30 2020-08-28 武汉思普崚技术有限公司 Method, device, equipment and storage medium for realizing website picture crawler
CN111625690A (en) * 2020-05-13 2020-09-04 北京达佳互联信息技术有限公司 Object recommendation method, device, equipment and medium
CN112347326A (en) * 2020-09-29 2021-02-09 武汉虹旭信息技术有限责任公司 Crawler detection method and device based on browser end
CN112818212A (en) * 2020-04-23 2021-05-18 腾讯科技(深圳)有限公司 Corpus data acquisition method and device, computer equipment and storage medium
CN114928532A (en) * 2022-05-17 2022-08-19 北京达佳互联信息技术有限公司 Method, device, equipment and storage medium for generating alarm message

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090094249A1 (en) * 2007-10-05 2009-04-09 Microsoft Corporation Creating search enabled web pages
CN101751428A (en) * 2008-12-12 2010-06-23 汉王科技股份有限公司 Information search method and device
CN102792244A (en) * 2010-01-13 2012-11-21 洛克迈特公司 Preview functionality for increased browsing speed
CN105893622A (en) * 2016-04-29 2016-08-24 深圳市中润四方信息技术有限公司 Polymerization search method and polymerization search system
CN109815380A (en) * 2018-12-20 2019-05-28 山东中创软件工程股份有限公司 A kind of information crawler method, apparatus, equipment and computer readable storage medium

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368167A (en) * 2020-03-06 2020-07-03 北京师范大学 Chinese literature data automatic acquisition method based on web crawler technology
CN112818212A (en) * 2020-04-23 2021-05-18 腾讯科技(深圳)有限公司 Corpus data acquisition method and device, computer equipment and storage medium
CN112818212B (en) * 2020-04-23 2023-10-13 腾讯科技(深圳)有限公司 Corpus data acquisition method, corpus data acquisition device, computer equipment and storage medium
CN111597421A (en) * 2020-04-30 2020-08-28 武汉思普崚技术有限公司 Method, device, equipment and storage medium for realizing website picture crawler
CN111597421B (en) * 2020-04-30 2022-08-30 武汉思普崚技术有限公司 Method, device, equipment and storage medium for realizing website picture crawler
CN111625690A (en) * 2020-05-13 2020-09-04 北京达佳互联信息技术有限公司 Object recommendation method, device, equipment and medium
CN111625690B (en) * 2020-05-13 2024-03-08 北京达佳互联信息技术有限公司 Object recommendation method, device, equipment and medium
CN112347326A (en) * 2020-09-29 2021-02-09 武汉虹旭信息技术有限责任公司 Crawler detection method and device based on browser end
CN114928532A (en) * 2022-05-17 2022-08-19 北京达佳互联信息技术有限公司 Method, device, equipment and storage medium for generating alarm message
CN114928532B (en) * 2022-05-17 2023-12-12 北京达佳互联信息技术有限公司 Alarm message generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination