CN108595583A - Dynamic chart class page data crawling method, device, terminal and storage medium - Google Patents

Dynamic chart class page data crawling method, device, terminal and storage medium

Info

Publication number
CN108595583A
Authority
CN
China
Prior art keywords
sectional drawing
page
picture
data
crawled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810349975.3A
Other languages
Chinese (zh)
Other versions
CN108595583B (en
Inventor
阮晓雯
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201810349975.3A priority Critical patent/CN108595583B/en
Priority to PCT/CN2018/100159 priority patent/WO2019200783A1/en
Publication of CN108595583A publication Critical patent/CN108595583A/en
Application granted granted Critical
Publication of CN108595583B publication Critical patent/CN108595583B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions

Abstract

A dynamic chart page data crawling method, comprising: starting a browser with an automated testing tool and entering the link of the website whose data is to be crawled; crawling, from that website, page information related to a crawl keyword entered by the user; rendering and parsing the crawled page; taking a screenshot of the parsed page with the automated testing tool and saving the screenshot; recognizing the screenshot with a pre-trained picture recognition model to obtain the content of the screenshot; and determining whether all pages of the website that correspond to the crawl keyword have been traversed. If all of them have been traversed, the process ends; otherwise, the above process continues. The invention also provides a dynamic chart page data crawling device, a terminal and a storage medium. The invention can automatically crawl dynamically loaded chart data and recognize the content of the captured pictures.

Description

Dynamic chart class page data crawling method, device, terminal and storage medium
Technical field
The present invention relates to the technical field of web crawlers, and in particular to a dynamic chart page data crawling method, device, terminal and storage medium.
Background art
With the spread of modern web technologies such as Ajax (Asynchronous JavaScript and XML), a popular way of creating interactive web applications without sacrificing browser compatibility, the form of web data has changed profoundly. More and more page content on the internet is generated dynamically with Ajax, and users often encounter pages that prompt "click to load more" or that load more content automatically as the mouse scrolls. These new kinds of pages need user interaction to trigger the generation and display of content. This improves the browsing experience to some extent, but it poses a serious challenge to traditional data collection methods based on fetching HTML files.
This is especially true for dynamically loaded chart data in web pages, which is usually displayed only after asynchronous loading and which traditional crawlers find hard to crawl. Some text data is encrypted and then displayed in the form of a chart, and the chart cannot be downloaded directly. Prompts that require input are frequently encountered while crawling. In addition, interference information may be added to a chart, which makes the real data in the chart hard to obtain. At present, a large amount of manual effort is generally needed to obtain dynamic chart data.
Summary of the invention
In view of the foregoing, it is necessary to propose a dynamic chart page data crawling method, device, terminal and storage medium that can automatically crawl dynamically loaded chart data, take screenshots of the crawled chart data, and feed the screenshots into a pre-trained picture recognition model that recognizes the content of the pictures. Compared with traditional web crawler products, the approach offers good compatibility, high speed and accurate data capture.
A first aspect of the present invention provides a dynamic chart page data crawling method, the method comprising:
a) starting a browser with an automated testing tool and entering the link of the website whose data is to be crawled;
b) crawling, from the website, page information related to a crawl keyword entered by the user;
c) rendering and parsing the crawled page;
d) taking a screenshot of the parsed page with the automated testing tool and saving the screenshot;
e) recognizing the screenshot with a pre-trained picture recognition model to obtain the content of the screenshot;
f) determining whether all pages of the website that correspond to the crawl keyword have been traversed; and
when it is determined that all pages of the website that correspond to the crawl keyword have been traversed, ending the process; or
when it is determined that not all pages of the website that correspond to the crawl keyword have been traversed, continuing to execute b) to f) above.
In a preferred embodiment, taking a screenshot of the parsed page with the automated testing tool and saving the screenshot comprises:
determining, with the automated testing tool, whether the parsed page contains a chart;
when it is determined that the parsed page contains no chart, crawling the information in the parsed page and saving the crawled information in a preset data format; and
when it is determined that the parsed page contains a chart, taking a screenshot of the chart in the parsed page.
In a preferred embodiment, taking a screenshot of the parsed page with the automated testing tool and saving the screenshot comprises:
calculating the perceptual hash value of the screenshot;
determining whether the similarity between the perceptual hash value of the screenshot and the perceptual hash value of a previously captured screenshot is greater than a preset similarity threshold;
when it is determined that the similarity between the perceptual hash value of the screenshot and the perceptual hash value of the previously captured screenshot is greater than the preset similarity threshold, deleting the screenshot.
In a preferred embodiment, taking a screenshot of the parsed page with the automated testing tool and saving the screenshot further comprises:
when it is determined that the similarity between the perceptual hash value of the screenshot and the perceptual hash value of the previously captured screenshot is less than the preset similarity threshold, associating the screenshot with the corresponding parsed page and storing both in a preset specific location.
In a preferred embodiment, training the picture recognition model in advance comprises:
obtaining a plurality of pictures;
preprocessing the pictures to obtain the data set used to train the picture recognition model;
dividing the data set into a training set and a test set using cross-validation;
randomly selecting a first preset quantity of training samples from the training set and training the picture recognition model on them;
testing the accuracy of the trained picture recognition model on the test set;
if the accuracy is greater than or equal to a preset accuracy threshold, ending the training;
if the accuracy is less than the preset accuracy threshold, retraining the picture recognition model.
In a preferred embodiment, retraining the picture recognition model comprises:
adding a second preset quantity of training samples, taken from the part of the training set outside the first preset quantity, to the first preset quantity of training samples, and repeating this until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
In a preferred embodiment, the second preset quantity is a preset fixed value, a preset ratio, or a preset proportion of the first preset quantity.
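The incremental retraining described above can be sketched as follows. This is only an illustration under stated assumptions: `train` and `evaluate` are hypothetical callables standing in for whatever model and accuracy measure are actually used, and the stop condition for an exhausted training set is an addition that the text does not specify.

```python
import random
from typing import Callable, List, Sequence, Tuple


def train_incrementally(
    training_set: Sequence,
    test_set: Sequence,
    train: Callable[[List], object],             # hypothetical: fits a model on a subset
    evaluate: Callable[[object, Sequence], float],  # hypothetical: accuracy on the test set
    first_quantity: int,
    second_quantity: int,
    accuracy_threshold: float,
    seed: int = 0,
) -> Tuple[object, int]:
    """Train on a growing random subset until the accuracy threshold is met.

    Returns the last trained model and the number of samples it was trained on.
    """
    if second_quantity <= 0:
        raise ValueError("second_quantity must be positive")
    pool = list(training_set)
    random.Random(seed).shuffle(pool)  # random selection of the initial subset
    size = min(first_quantity, len(pool))
    while True:
        model = train(pool[:size])
        if evaluate(model, test_set) >= accuracy_threshold:
            return model, size
        if size == len(pool):
            # Assumption: stop when no unused samples remain.
            return model, size
        size = min(size + second_quantity, len(pool))
```

Starting small and growing the subset only when the accuracy falls short is what lets the procedure settle on a comparatively small number of training samples.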
A second aspect of the present invention provides a dynamic chart page data crawling device, the device comprising:
a starting module, configured to start a browser with an automated testing tool and enter the link of the website whose data is to be crawled;
a crawling module, configured to crawl, from the website, page information related to a crawl keyword entered by the user;
a parsing module, configured to render and parse the crawled page;
a screenshot module, configured to take a screenshot of the parsed page with the automated testing tool and save the screenshot;
a recognition module, configured to recognize the screenshot with a pre-trained picture recognition model to obtain the content of the screenshot.
A third aspect of the present invention provides a terminal comprising a processor and a memory, wherein the processor implements the dynamic chart page data crawling method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the dynamic chart page data crawling method when executed by a processor.
The dynamic chart page data crawling method, device, terminal and storage medium of the present invention use Selenium to simulate user operations such as logging in through a browser, triggering dynamic loading, and capturing and downloading screenshots. Combined with web crawler technology, this makes it possible to crawl dynamically loaded chart data automatically, and the crawled information is fully consistent with the text and graphics a real user sees. Screenshots of the crawled chart data are fed into a pre-trained picture recognition model, which recognizes the content of the pictures. Compared with traditional web crawler products, the approach offers good compatibility, high speed and accurate data capture.
In addition, when the picture recognition model is trained, the number of training samples is increased step by step, so that the model is trained on fewer samples while its recognition rate is still guaranteed. This shortens the training time as far as possible and improves training efficiency; in other words, it finds the best training set size in the trade-off between the accuracy and the efficiency of the picture recognition model.
Description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below show only embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the dynamic chart page data crawling method provided by Embodiment one of the present invention.
Fig. 2 is a flow chart of the method, provided by Embodiment two of the present invention, for taking a screenshot of the parsed page and saving the screenshot.
Fig. 3 is a flow chart of the training method of the picture recognition model provided by Embodiment three of the present invention.
Fig. 4 is a structure diagram of the dynamic chart page data crawling device provided by Embodiment four of the present invention.
Fig. 5 is a diagram of the sub-function modules of the deduplication module provided by Embodiment five of the present invention.
Fig. 6 is a diagram of the sub-function modules of the training module provided by Embodiment six of the present invention.
Fig. 7 is a structure diagram of the terminal provided by Embodiment seven of the present invention.
The following detailed description further illustrates the present invention with reference to the above drawings.
Detailed description
So that the objects, features and advantages of the present invention can be understood more clearly, the present invention is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present invention and the features of the embodiments can be combined with one another.
Many specific details are set forth in the following description to give a thorough understanding of the present invention. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the meaning commonly understood by a person skilled in the technical field of the present invention. The terms used in this description serve only to describe specific embodiments and are not intended to limit the present invention.
The dynamic chart page data crawling method of the embodiments of the present invention is applied in one or more terminals. The method can also be applied in a hardware environment consisting of a terminal and a server connected to the terminal over a network. The network includes but is not limited to a wide area network, a metropolitan area network or a local area network. The method can be executed by the server, by the terminal, or jointly by the server and the terminal.
A terminal that needs to crawl dynamic chart page data can have the crawling function provided by the method of the present invention integrated directly on it, or can have a client that implements the method installed. Alternatively, the method can run on a device such as a server in the form of a software development kit (SDK) that exposes an interface to the crawling function; a terminal or other device can then crawl dynamic chart page data through the provided interface.
Embodiment one
Fig. 1 is a flow chart of the dynamic chart page data crawling method provided by Embodiment one of the present invention. Depending on requirements, the order of the steps in the flow chart can be changed and some steps can be omitted.
S11, a browser is started with an automated testing tool, and the link of the website whose data is to be crawled is entered.
Selenium WebDriver (hereinafter Selenium), an automated software testing technology, has strong visual and automated interaction capabilities. It simulates a person's interaction with a web page programmatically in order to trigger dynamic data loading and obtain the dynamically generated data. Selenium can faithfully simulate the operations a user performs on websites and web pages, such as clicking "view more", logging in automatically, clicking links, filling in forms, scrolling the mouse wheel, dragging with the mouse, scrolling down after the page has loaded, clicking to turn pages, and saving screenshots.
In this embodiment, the browser is opened with Selenium, the link (Uniform Resource Locator, URL) of the website to be crawled is entered in the browser, and Selenium calls the get() method to open the web page of that website.
For example, if the user needs to crawl "face recognition books" data from the Dangdang website, Selenium opens a browser (for example, the Google browser) and the URL of the Dangdang website, "www.dangdang.com", is entered; the Dangdang website then opens and its web page is displayed.
In this embodiment, if the user needs to crawl data from several websites, the links of those websites can be placed together in the link queue of the browser opened by Selenium, and the crawler program then crawls the data of the websites one after another.
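The link queue can be illustrated with a short sketch. The function name `open_sites_in_order` is hypothetical; the only call made on `driver` is `get(url)`, which is the method a Selenium WebDriver uses to load a page, so any object with that method works here.

```python
from collections import deque
from typing import Iterable, List


def open_sites_in_order(driver, urls: Iterable[str]) -> List[str]:
    """Visit each queued link in first-in, first-out order.

    `driver` is assumed to expose get(url), as a Selenium WebDriver does.
    """
    queue = deque(urls)
    visited = []
    while queue:
        url = queue.popleft()
        driver.get(url)  # load the page in the automated browser
        visited.append(url)
    return visited
```

With Selenium installed, `driver` could for instance be created with `selenium.webdriver.Chrome()`; in that case the links should normally be full URLs including the scheme, such as `https://www.dangdang.com`.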
S12, page information related to a crawl keyword entered by the user is crawled from the website.
When the website to be crawled has been opened with Selenium and the user enters a crawl keyword, for example "face recognition", Selenium simulates a user browsing the page information of all "face recognition" web pages on the website.
S13, the crawled page is rendered and parsed.
While crawling a page, Selenium can trigger Ajax to request data asynchronously from the server. After the raw response data is received, it is formatted and assembled into new HTML nodes, which are inserted into the initial HTML file; the rendering engine of the browser kernel then displays the dynamic content. The Selenium service sends a page retrieval request over the wire protocol, which then drives the browser API to obtain the page loaded by the browser. The page is returned to the Selenium service over the wire protocol, and once the Selenium service has received the page, it hands the page to the parsing module for page parsing.
S14, a screenshot of the parsed page is taken with the automated testing tool, and the screenshot is saved.
Selenium's driver instructs the browser to execute the command, and the browser finally performs the screenshot and save operation in its kernel. The result is the same as when a user captures and saves a picture of the page with the mouse.
Preferably, taking a screenshot of the parsed page with the automated testing tool and saving the screenshot may further include removing duplicate charts in the parsed page according to their perceptual hash values.
For a more detailed breakdown of step S14, see Fig. 2 and its description.
S15, the screenshot is recognized with a pre-trained picture recognition model to obtain the content of the screenshot.
In this embodiment, for the training method of the picture recognition model, see Fig. 3 and its description.
S16, it is determined whether all pages of the website that correspond to the crawl keyword have been traversed.
When it is determined that all pages of the website that correspond to the crawl keyword have been traversed, the process ends; otherwise, when it is determined that not all of these pages have been traversed, S12 to S15 above continue to be executed.
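The control flow of S11 to S16 can be sketched as nested loops over websites, keywords and pages. All four callables below (`list_pages`, `render_and_parse`, `take_screenshot`, `recognize`) are hypothetical placeholders for the steps described above; the sketch shows only how the traversal ends, not how each step is implemented.

```python
from typing import Callable, Dict, Iterable, Tuple


def crawl_all(
    sites: Iterable[str],
    keywords: Iterable[str],
    list_pages: Callable[[str, str], Iterable[str]],  # S12: pages matching a keyword
    render_and_parse: Callable[[str], str],           # S13
    take_screenshot: Callable[[str], bytes],          # S14
    recognize: Callable[[bytes], str],                # S15
) -> Dict[Tuple[str, str, str], str]:
    """Run S12 to S15 for every page of every site and keyword."""
    keyword_list = list(keywords)  # reused for each site
    results = {}
    for site in sites:
        for keyword in keyword_list:
            for page in list_pages(site, keyword):
                parsed = render_and_parse(page)
                image = take_screenshot(parsed)
                results[(site, keyword, page)] = recognize(image)
    # S16: the loops finish once every site, keyword and page is traversed.
    return results
```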
In summary, the dynamic chart page data crawling method of the present invention uses Selenium to simulate user operations such as logging in through a browser, triggering dynamic loading, and capturing and downloading screenshots. Combined with web crawler technology, this makes it possible to crawl dynamically loaded chart data automatically, and the crawled information is fully consistent with the text and graphics a real user sees. Screenshots of the crawled chart data are fed into a pre-trained picture recognition model, which recognizes the content of the pictures. Compared with traditional web crawler products, the approach offers good compatibility, high speed and accurate data capture.
Embodiment two
Fig. 2 is a flow chart of the method, provided by Embodiment two of the present invention, for taking a screenshot of the parsed page and saving the screenshot. Depending on requirements, the order of the steps in the flow chart can be changed and some steps can be omitted.
S21, the automated testing tool determines whether the parsed page contains a chart.
In this embodiment, the automated testing tool determines whether the parsed page contains a chart by checking whether the page contains tags related to the display and control of charts.
When the automated testing tool finds tags related to chart display and control in the parsed page, it determines that the parsed page contains a chart; when it finds no such tags, it determines that the parsed page contains no chart.
The tags related to chart display and control include img, table, tr and td, together with related attributes such as colspan.
Charts in a web page are written in HTML, so the page contains DIV and CSS constructs that control the display format as well as chart-related HTML tags. Whether the parsed page contains a chart can therefore be determined by checking for chart-related tags and attributes: if any are found, the parsed page is determined to contain a chart; if none are found, the parsed page is determined to contain no chart.
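A minimal sketch of this tag check, using only Python's standard `html.parser` module, might look as follows. The names `ChartDetector` and `page_has_chart` are illustrative, and treating `colspan` as an attribute rather than a tag is an assumption based on how HTML defines it.

```python
from html.parser import HTMLParser

# Tags and attributes treated as chart-related, following the list in the text.
CHART_TAGS = {"img", "table", "tr", "td"}
CHART_ATTRS = {"colspan"}


class ChartDetector(HTMLParser):
    """Sets `found` to True when a chart-related tag or attribute appears."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in CHART_TAGS or any(name in CHART_ATTRS for name, _ in attrs):
            self.found = True


def page_has_chart(html: str) -> bool:
    detector = ChartDetector()
    detector.feed(html)
    detector.close()
    return detector.found
```

Such a check is deliberately coarse: any page with an image or a table counts as containing a chart, which matches the rule described above but may need narrowing in practice.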
When it is determined that the parsed page contains no chart, step S22 is executed; otherwise, when it is determined that the parsed page contains a chart, step S23 is executed.
S22, the information in the parsed page is crawled, and the crawled information is saved in a preset data format.
When it is determined that the parsed page contains no chart, no screenshot of the parsed page is taken; the crawler program crawls the information in the parsed page directly and stores it in the preset data format.
In this embodiment, different operations are executed depending on whether the parsed page contains a chart. When the parsed page contains a chart, a screenshot of the chart in the page is taken; when it contains no chart, no screenshot is taken. This saves network resources, because it avoids taking screenshots of every parsed page. Skipping the screenshot for pages without charts also simplifies the workflow and helps improve crawling efficiency.
S23, a screenshot of the chart in the parsed page is taken.
In this embodiment, taking a screenshot of the chart in the parsed page by having Selenium simulate a user also includes downloading the chart in the parsed page.
S24, the perceptual hash value of the screenshot is calculated.
In this embodiment, the perceptual hash value of the screenshot is calculated with a perceptual hash algorithm. The process is as follows:
1) convert the screenshot to grayscale;
2) calculate the average gray value of the grayscale screenshot;
3) compare the gray value of each pixel of the grayscale screenshot with the average gray value;
4) record a 1 for each pixel whose gray value is greater than or equal to the average gray value, and a 0 for each pixel whose gray value is less than the average gray value;
5) concatenate the comparison results of the pixels obtained in 4) according to a preset concatenation rule to obtain the perceptual hash value of the screenshot.
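Steps 1) to 5) can be sketched in plain Python on a small in-memory image. Two points are assumptions rather than part of the text: the luminance weights used for the grayscale conversion, and the concatenation rule, which here reads pixels left to right and top to bottom.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255


def to_gray(pixels: Sequence[Sequence[Pixel]]) -> List[List[int]]:
    """Step 1: convert RGB pixels to gray values (assumed luminance weights)."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in pixels]


def average_hash(gray: Sequence[Sequence[int]]) -> str:
    """Steps 2 to 5: compare each pixel with the mean and join the bits."""
    values = [v for row in gray for v in row]  # left to right, top to bottom
    mean = sum(values) / len(values)
    return "".join("1" if v >= mean else "0" for v in values)
```

For an 8*8 grayscale image this yields a string of 64 bits. In practice the screenshot would first be scaled down to that size with an image library, a step this sketch leaves out.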
S25, it is determined whether the similarity between the perceptual hash value of the screenshot and the perceptual hash value of a previously captured screenshot is greater than a preset similarity threshold.
In this embodiment, this determination specifically includes: counting the number of bit positions at which the perceptual hash value of the screenshot and the perceptual hash value of the previously captured screenshot hold the same value; and determining whether that number is greater than the preset similarity threshold.
For example, suppose the grayscale screenshot is 8*8 pixels and its average gray value is 45. If the gray value of the pixel in the first row and first column is greater than or equal to 45, the comparison result is recorded as 1, otherwise as 0; the same comparison is made for the pixel in the first row and second column, for the pixel in the first row and third column, and so on. The comparison results are then combined from left to right and from top to bottom into a 64-bit number, which is the perceptual hash value of the screenshot. When the number of bit positions holding the same value in the perceptual hash value of the screenshot and in that of a previously captured screenshot (for example, 61) is greater than the preset similarity threshold (for example, 60), the screenshot is considered identical to the previously captured screenshot.
When it is determined that the similarity between the perceptual hash value of the screenshot and the perceptual hash value of the previously captured screenshot is greater than the preset similarity threshold, step S26 is executed; otherwise, when the similarity is less than or equal to the preset similarity threshold, step S27 is executed.
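The comparison in S25 amounts to counting matching bit positions. The sketch below assumes hashes are represented as equal-length strings of "0" and "1"; the helper names are illustrative, and the default threshold of 60 is taken from the example above.

```python
from typing import Iterable


def matching_bits(hash_a: str, hash_b: str) -> int:
    """Count the positions at which two equal-length bit strings agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a == b for a, b in zip(hash_a, hash_b))


def is_duplicate(new_hash: str, stored_hashes: Iterable[str], threshold: int = 60) -> bool:
    """True if the new screenshot matches any stored one closely enough (go to S26)."""
    return any(matching_bits(new_hash, h) > threshold for h in stored_hashes)
```

A screenshot for which `is_duplicate` returns True would be deleted (S26); otherwise it would be stored together with its parsed page (S27).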
S26, the screenshot is deleted.
S27, the screenshot is associated with the corresponding parsed page, and both are stored in a preset specific location.
In this embodiment, the preset specific location is dedicated to storing screenshots and their corresponding parsed pages. The location can be a specific folder or a folder with a specific name. Storing each screenshot together with its corresponding parsed page makes it possible to find the page containing a chart quickly later on, and then to analyze the content of the chart further with context-based semantic analysis, using information such as the position of the chart in the page.
In summary, the screenshot deduplication method provided by the present invention uses perceptual hash values to determine whether a screenshot is identical to a previously captured screenshot. Perceptual hashing gives accurate results, and deleting or deduplicating downloads with identical content removes redundant screenshots and saves storage space. In addition, storing each screenshot together with its corresponding parsed page simplifies later management and analysis.
Embodiment three
Fig. 3 is a flowchart of the training method for the picture recognition model provided by Embodiment three of the present invention. Depending on different requirements, the order of execution in the flowchart may be changed, and certain steps may be omitted.
S31, obtaining a plurality of pictures.
In the present embodiment, a plurality of pictures may be obtained automatically from websites on the Internet by other small crawlers, or downloaded manually from search engines (for example, Baidu, Google, 360), to form a picture data set that is saved in a local database. The content of the pictures may include, but is not limited to, numbers, characters, letters, images, and tables; letters may also be distinguished as upper case or lower case.
S32, preprocessing the plurality of pictures to obtain a data set for training the picture recognition model.
In the present embodiment, each picture in the picture data set is preprocessed separately. The preprocessing includes background removal, segmentation, scaling, cropping, flipping and/or warping, etc. The picture recognition model is trained only after the training pictures have been given the same size and the same viewing angle, which effectively improves the authenticity and accuracy of the picture recognition model.
In the present embodiment, a binarization method may be used for background removal: a pixel whose value is greater than a preset threshold is set to white, and otherwise to black. The original picture is thereby converted into a picture containing only black and white, which effectively removes interfering elements in the picture background.
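The binarization rule described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the picture is assumed to be already available as a two-dimensional list of gray values in the range 0 to 255, and the function name `binarize` and the default threshold of 128 are hypothetical choices that do not come from the specification.

```python
def binarize(gray, threshold=128):
    """Return a black-and-white copy of a grayscale picture.

    gray: 2D list of gray values (assumed range 0..255).
    A pixel strictly greater than the threshold becomes white (255),
    every other pixel becomes black (0).
    """
    return [[255 if px > threshold else 0 for px in row] for row in gray]


# Hypothetical 2x3 picture used only for demonstration.
image = [
    [200, 30, 90],
    [129, 128, 255],
]
binary = binarize(image)
```

A pixel exactly equal to the threshold is treated as black here, which follows the wording "greater than the preset threshold".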
In the present embodiment, a segmentation function may be used to segment each picture in the picture data set, dividing each number or each character in the picture into an individual number or character.
S33, dividing the data set into a training set and a test set using a cross-validation method.
The training set is used to train the picture recognition model, and the test set is used to test the performance of the trained picture recognition model. The higher the test accuracy, the better the performance of the trained picture recognition model; the lower the test accuracy, the poorer its performance.
The data set may be divided according to a suitable ratio (for example, 3 to 2) to obtain the training set and the test set.
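A ratio-based division such as 3 to 2 might look like the sketch below. It is an assumption-laden illustration: the text names cross-validation, whereas this sketch shows only a single shuffled split, and the function name `split_dataset`, its parameters, and the fixed seed are hypothetical.

```python
import random


def split_dataset(samples, train_parts=3, test_parts=2, seed=0):
    """Shuffle the samples and split them train_parts : test_parts."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder sample identifiers split 3 to 2.
train_set, test_set = split_dataset(range(100))
```

With 100 samples, the split yields 60 training samples and 40 test samples, and no sample appears in both sets.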
S34, randomly selecting a first preset quantity of samples from the training set to train the picture recognition model.
In the present embodiment, it is not necessary to use all pictures in the original training set to train the picture recognition model. Instead, a first preset quantity of samples is selected from the original training set to participate in training, which reduces the number of training samples and saves training time for the picture recognition model.
In addition, selecting the samples with a random number generation algorithm increases the randomness of the samples participating in training and can improve the robustness of the picture recognition model.
In a first embodiment, the first preset quantity may be a preset fixed value, for example 60; that is, 60 samples are picked at random from the original training set to participate in the training of the picture recognition model.
In a second embodiment, the first preset quantity may be a preset ratio, for example 1/10; that is, 1/10 of the samples in the original training set are picked at random to participate in the training of the picture recognition model.
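The two ways of specifying the first preset quantity (a fixed count or a ratio) can be illustrated with the sketch below. The function name `initial_subset` and the convention of passing an `int` for a fixed count and a `Fraction` for a ratio are assumptions made for this example only.

```python
import random
from fractions import Fraction


def initial_subset(training_set, preset, seed=0):
    """Randomly pick the first preset quantity of samples.

    preset: an int is read as a fixed count (e.g. 60),
            a Fraction is read as a ratio of the training set (e.g. 1/10).
    """
    pool = list(training_set)
    if isinstance(preset, int):
        count = preset
    else:
        count = int(len(pool) * preset)
    count = min(count, len(pool))      # never ask for more than exists
    return random.Random(seed).sample(pool, count)


fixed_pick = initial_subset(range(600), 60)
ratio_pick = initial_subset(range(600), Fraction(1, 10))
```

For a training set of 600 samples, both calls select 60 samples, without repetition.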
S35, testing the accuracy of the trained picture recognition model using the test set; if the accuracy is greater than or equal to a preset accuracy threshold, the training ends; if the accuracy is less than the preset accuracy threshold, the picture recognition model is retrained.
Preferably, retraining the picture recognition model includes: adding a second preset quantity of samples, taken from the part of the training set outside the first preset quantity of samples, to the first preset quantity of samples, and re-executing steps S32 to S35 above until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
In a first embodiment, the second preset quantity may be a preset fixed value, for example 20; that is, 20 pictures are picked at random from the part of the training set outside the first preset quantity of samples to participate in the training of the picture recognition model.
In a second embodiment, the second preset quantity may be a preset ratio, for example 1/20; that is, 1/20 of the pictures in the part of the training set outside the first preset quantity of samples are picked at random to participate in the training of the picture recognition model.
In a third embodiment, the second preset quantity may be a preset ratio of the first preset quantity, for example 1/5; that is, a number of pictures equal to 1/5 of the first preset quantity is picked at random from the part of the training set outside the first preset quantity of samples to participate in the training of the picture recognition model.
The picture recognition model training method provided by the present invention gradually increases the number of samples participating in training. While ensuring the recognition rate of the picture recognition model, it trains with fewer samples, which shortens the training time as far as possible and improves training efficiency; that is, it finds the best training-set size between the accuracy and the efficiency of the picture recognition model.
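The incremental procedure of steps S34 and S35 can be sketched as a loop. This is a minimal illustration rather than the actual implementation: `train_fn` and `eval_fn` are hypothetical callbacks standing in for model training and test-set evaluation, and the toy callbacks at the bottom only mimic a model whose accuracy grows with the number of samples.

```python
import random


def train_incrementally(training_set, first_n, second_n, threshold,
                        train_fn, eval_fn, seed=0):
    """Train on a growing random subset until the accuracy threshold is met.

    Returns (model, accuracy, number_of_samples_used).
    """
    remaining = list(training_set)
    random.Random(seed).shuffle(remaining)
    # First preset quantity of samples.
    selected, remaining = remaining[:first_n], remaining[first_n:]
    while True:
        model = train_fn(selected)
        accuracy = eval_fn(model)
        # Stop when accurate enough, or when no unused samples are left.
        if accuracy >= threshold or not remaining:
            return model, accuracy, len(selected)
        # Add the second preset quantity from the unused samples.
        selected += remaining[:second_n]
        remaining = remaining[second_n:]


# Toy stand-ins: the "model" is just the sample count,
# and its "accuracy" is count / 100, capped at 1.0.
model, accuracy, used = train_incrementally(
    range(200), first_n=60, second_n=20, threshold=0.9,
    train_fn=lambda samples: len(samples),
    eval_fn=lambda m: min(1.0, m / 100),
)
```

In this toy run the loop trains with 60, then 80, then 100 samples and stops at 100, so the remaining 100 samples are never used.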
The above are only specific implementations of the present invention, but the scope of protection of the present invention is not limited thereto. Those skilled in the art may make improvements without departing from the inventive concept, and these improvements all fall within the scope of protection of the present invention.
With reference to Figs. 4 to 7, the functional modules and hardware structure of the terminal implementing the above dynamic chart class page data crawling method are introduced below.
Embodiment four
Fig. 4 is a functional block diagram of the dynamic chart class page data crawling device provided by Embodiment four of the present invention.
In some embodiments, the dynamic chart class page data crawling device 40 runs in a terminal. The dynamic chart class page data crawling device 40 may include a plurality of functional modules composed of program code segments. The program code of each program segment in the dynamic chart class page data crawling device 40 may be stored in a memory and executed by at least one processor to perform the crawling of dynamic chart class page data (see Fig. 1 and its related description).
In the present embodiment, the dynamic chart class page data crawling device 40 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a starting module 401, a crawling module 402, a parsing module 403, a screenshot module 404, a deduplication module 405, a training module 406, an identification module 407, and a judgment module 408. A module as referred to in the present invention is a series of computer program segments that can be executed by at least one processor, can complete a fixed function, and are stored in the memory. The functions of each module are described in detail in the following embodiments.
Starting module 401, for starting a browser using an automated test tool and inputting the link of the website whose data is to be crawled.
The computer software automated testing technology Selenium WebDriver (hereinafter referred to as Selenium) has strong visual, automated interaction capabilities. It simulates the interaction between a person and a web page programmatically in order to trigger dynamic data loading and obtain dynamically generated data. Selenium can realistically simulate the operations a user performs on websites and web pages, such as clicking "view more", "automatic login", "clicking a link", "filling in a form", "scrolling the mouse wheel", "dragging with the mouse", "scrolling down after the page has finished loading", "clicking to turn the page", and "saving a screenshot".
In the present embodiment, a browser is opened by the Selenium tool, and the link (Uniform Resource Locator, URL) of the website whose data is to be crawled is entered in the browser; the Selenium tool calls the get() method to open the web page of the website entered by the user.
For example, if the user needs to crawl "face recognition books" data on the "Dangdang" website, a browser (for example, the Google browser) is opened through the Selenium tool and the URL "www.dangdang.com" of the Dangdang website is entered, whereupon the Dangdang website is opened and its web page is displayed.
In the present embodiment, if the user needs to crawl the data of multiple websites, the links of the multiple websites to be crawled may be entered together into a link queue of the browser opened through the Selenium tool, and the crawler program crawls the data of these websites one after another.
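Processing a link queue one website at a time can be sketched as below. The sketch is an assumption-based illustration: `crawl_queue` is a hypothetical helper, and `FakeDriver` is a stand-in object that exposes only the `get()` method and `page_source` attribute of a Selenium WebDriver, so that the example runs without a browser.

```python
from collections import deque


def crawl_queue(driver, urls):
    """Visit each queued URL in order and return {url: page source}."""
    queue = deque(urls)
    pages = {}
    while queue:
        url = queue.popleft()
        driver.get(url)                   # load the page, as WebDriver.get() does
        pages[url] = driver.page_source   # HTML of the currently loaded page
    return pages


class FakeDriver:
    """Minimal stand-in for a WebDriver (demonstration only)."""

    def __init__(self):
        self.page_source = ""

    def get(self, url):
        self.page_source = f"<html><body>{url}</body></html>"


pages = crawl_queue(FakeDriver(), ["http://www.dangdang.com", "http://example.com"])
```

With Selenium installed and a browser driver available, a real driver created by `webdriver.Chrome()` (after `from selenium import webdriver`) could be passed in place of `FakeDriver()`, followed by `driver.quit()` when crawling is finished.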
Crawling module 402, for crawling, from the website whose data is to be crawled, page information related to a crawl keyword input by the user.
When the website whose data is to be crawled has been opened through the Selenium tool, the user inputs a crawl keyword, for example "face recognition"; the Selenium tool then simulates the user browsing the page information of all web pages about "face recognition" on the website.
Parsing module 403, for rendering and parsing the crawled pages.
When crawling a page, the Selenium tool triggers Ajax to request data asynchronously from the server. After the raw data of the reply is received, it is formatted and assembled into new HTML nodes, which are inserted into the initial HTML document, and the dynamic content is finally displayed by the rendering engine of the browser kernel. The Selenium service sends a page-retrieval request through the wire protocol and then operates the browser API to obtain the parent page loaded by the browser. The page is returned to the Selenium service through the wire protocol, and once the Selenium service has obtained the page, it hands the page to the parsing module for parsing.
Screenshot module 404, for taking a screenshot of the parsed page through the automated test tool to obtain a screenshot picture, and saving the screenshot picture.
The driver of the Selenium tool instructs the browser to execute the command, and the screenshot is finally taken and saved by the browser kernel. The end result is the same as when a user captures and saves a picture on the page using the mouse.
Deduplication module 405, for deduplicating the charts in the parsed page according to perceptual hash values.
Training module 406, for training the picture recognition model.
Identification module 407, for identifying the screenshot picture according to the pre-trained picture recognition model to obtain the content of the screenshot picture.
Judgment module 408, for judging whether the websites whose data is to be crawled and the pages corresponding to the crawl keyword have all been traversed. When the judgment module 408 determines that they have not all been traversed, the above modules 401, 402, 403, 404, 405 and 407 are executed again.
In summary, the dynamic chart class page data crawling device of the present invention uses Selenium to simulate user operations such as logging in through the browser, dynamic loading, and screenshot downloading, combined with web crawler technology, so that dynamically loaded chart class data can be crawled automatically, and the crawled information is fully consistent with the text and graphics a real user sees. The crawled chart class data is captured as screenshots and input to the pre-trained picture recognition model, which identifies the content of the pictures. Compared with traditional web crawler products, the device offers good compatibility, high speed, and accurate data capture.
Embodiment five
Fig. 5 is a sub-functional module diagram of the deduplication module provided by Embodiment five of the present invention. The deduplication module 405 includes: a first judging submodule 4051, a saving submodule 4052, a screenshot submodule 4053, a computing submodule 4054, a second judging submodule 4055, a deleting submodule 4056, and an association submodule 4057.
First judging submodule 4051, for judging, through the automated test tool, whether a chart exists in the parsed page.
In the present embodiment, the automated test tool judges whether a chart exists in the parsed page by identifying whether the parsed page contains tags related to chart display and control.
When the automated test tool identifies tags related to chart display and control in the parsed page, it determines that a chart exists in the parsed page; when it identifies no such tags in the parsed page, it determines that no chart exists in the parsed page.
The tags related to chart display and control include tags such as img, table, tr, td, and colspan.
Because charts in web pages are written in HTML, which contains many DIV and CSS constructs that control the page display format as well as chart-related HTML tags, whether a chart exists in the parsed page can be determined by checking for chart-related tag attributes: when chart-related tag attributes are recognized, it is determined that a chart exists in the parsed page; when none are recognized, it is determined that no chart exists in the parsed page.
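The tag-based check described above can be sketched with Python's standard `html.parser` module. The class name `ChartDetector` and the function `has_chart` are hypothetical, and treating `colspan` as an attribute rather than a tag is an assumption made here because HTML defines it as an attribute of table cells.

```python
from html.parser import HTMLParser

# Tags listed in the description as related to chart display and control.
CHART_TAGS = {"img", "table", "tr", "td"}


class ChartDetector(HTMLParser):
    """Sets `found` when a chart-related tag or a colspan attribute appears."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in CHART_TAGS or any(name == "colspan" for name, _ in attrs):
            self.found = True


def has_chart(html):
    """Return True if the parsed page appears to contain a chart."""
    detector = ChartDetector()
    detector.feed(html)
    return detector.found


page_with_table = "<div><table><tr><td>1</td></tr></table></div>"
page_without_chart = "<div><p>plain text only</p></div>"
```

Here `has_chart(page_with_table)` is true and `has_chart(page_without_chart)` is false, so only the first page would proceed to the screenshot step.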
Saving submodule 4052, for crawling the information in the parsed page when the first judging submodule 4051 determines that no chart exists in the parsed page, and saving the crawled information according to a preset data format.
When it is determined that no chart exists in the parsed page, no screenshot of the parsed page is taken; the crawler program directly crawls the information in the parsed page and stores it according to the preset data format.
In the present embodiment, different operations are performed depending on whether a chart exists in the parsed page. When a chart exists, a screenshot is taken of the parsed page, capturing the chart in the page; when no chart exists, no screenshot operation is performed. This saves network resources by avoiding screenshots of every parsed page. In addition, skipping the screenshot operation when no chart exists simplifies the operating process and helps improve crawling efficiency.
Screenshot submodule 4053, for taking a screenshot of the chart in the parsed page to obtain a screenshot picture when the first judging submodule 4051 determines that a chart exists in the parsed page.
In the present embodiment, taking a screenshot of the chart in the parsed page by having the Selenium tool simulate the user further includes downloading the chart in the parsed page.
Computing submodule 4054, for calculating the perceptual hash value of the screenshot picture.
In the present embodiment, the specific process performed by the computing submodule 4054 includes:
1) converting the screenshot picture to grayscale;
2) calculating the average gray value of the grayscale screenshot picture;
3) comparing the gray value of each pixel of the grayscale screenshot picture with the average gray value;
4) recording a 1 for each pixel of the grayscale screenshot picture whose gray value is greater than or equal to the average gray value, and a 0 for each pixel whose gray value is less than the average gray value;
5) concatenating the comparison results of the pixels obtained in 4) according to a preset concatenation rule to obtain the perceptual hash value of the screenshot picture.
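Steps 2) to 5) above can be sketched as follows. This is a minimal illustration under stated assumptions: step 1) is assumed to have been done already, so the input is a two-dimensional list of gray values; the concatenation rule is assumed to be left to right, top to bottom; and the function name `perceptual_hash` is hypothetical.

```python
def perceptual_hash(gray):
    """Compute a mean-based perceptual hash as a string of 0s and 1s.

    gray: 2D list of gray values for an already grayscaled picture.
    """
    pixels = [px for row in gray for px in row]   # left to right, top to bottom
    mean = sum(pixels) / len(pixels)              # step 2: average gray value
    # Steps 3 and 4: 1 if the pixel is >= the average, otherwise 0.
    # Step 5: concatenate the comparison results in reading order.
    return "".join("1" if px >= mean else "0" for px in pixels)


# Hypothetical 2x2 picture; its average gray value is 76.25.
tiny = [
    [10, 200],
    [45, 50],
]
tiny_hash = perceptual_hash(tiny)
```

For the 2x2 example only the pixel with gray value 200 is at or above the average, so the hash is `0100`. A picture of 8*8 pixels would give a 64-digit hash in the same way.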
Second judging submodule 4055, for judging whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold.
In the present embodiment, this judgment specifically includes: comparing the perceptual hash value of the screenshot picture with the perceptual hash value of the previously captured screenshot picture to obtain the number of digits with identical values; and judging whether that number of digits is greater than the preset similarity threshold.
For example, suppose the grayscale screenshot picture is 8*8 pixels and the average gray value is 45. If the gray value of the pixel in the first row, first column is greater than 45, the comparison result is recorded as 1, otherwise as 0; the same applies to the pixel in the first row, second column, the pixel in the first row, third column, and so on. The comparison results are then combined from left to right and from top to bottom into a 64-digit number, which is the perceptual hash value of the screenshot picture. When the number of digits with identical values (for example, 61) between the perceptual hash value of the screenshot picture and that of a previously captured screenshot picture is greater than the preset similarity threshold (for example, 60), the screenshot picture is considered identical to the previously captured screenshot picture.
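The digit-by-digit comparison in this example can be sketched as below. The function names `matching_digits` and `is_duplicate` are hypothetical, and the hash values are assumed to be strings of 0s and 1s of equal length.

```python
def matching_digits(hash_a, hash_b):
    """Count the positions at which two equal-length hashes agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hash values must have the same length")
    return sum(a == b for a, b in zip(hash_a, hash_b))


def is_duplicate(hash_a, hash_b, threshold=60):
    """True if the number of identical digits exceeds the threshold."""
    return matching_digits(hash_a, hash_b) > threshold


# Two hypothetical 64-digit hashes that agree in 61 positions.
new_hash = "1" * 64
old_hash = "1" * 61 + "0" * 3
duplicate = is_duplicate(new_hash, old_hash)
```

With 61 matching digits and a threshold of 60 the new picture counts as a duplicate and would be deleted; with exactly 60 matching digits it would be kept, because the comparison is strictly "greater than".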
Deleting submodule 4056, for deleting the screenshot picture when the second judging submodule 4055 determines that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than the preset similarity threshold.
Association submodule 4057, for associating the screenshot picture with the corresponding parsed page and storing them in a preset specific location when the second judging submodule 4055 determines that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is less than or equal to the preset similarity threshold.
In the present embodiment, the preset specific location is dedicated to storing the screenshot pictures and the corresponding parsed pages. The specific location may be a specific folder or a folder with a specific name. Storing each screenshot picture in association with its corresponding parsed page makes it possible to quickly locate the page containing a chart later and, using information such as the position of the chart within the page, to further parse the content of the chart by a method based on contextual semantic analysis.
Embodiment six
Fig. 6 is a sub-functional module diagram of the training module provided by Embodiment six of the present invention. The training module 406 includes: an acquisition submodule 4061, a preprocessing module 4062, a dividing submodule 4063, a selection submodule 4064, and a test submodule 4065.
Acquisition submodule 4061, for obtaining a plurality of pictures.
In the present embodiment, a plurality of pictures may be obtained automatically from websites on the Internet by other small crawlers, or downloaded manually from search engines (for example, Baidu, Google, 360), to form a picture data set that is saved in a local database. The content of the pictures may include, but is not limited to, numbers, characters, letters, images, and tables; letters may also be distinguished as upper case or lower case.
Preprocessing module 4062, for preprocessing the plurality of pictures to obtain a data set for training the picture recognition model.
In the present embodiment, each picture in the picture data set is preprocessed separately. The preprocessing includes background removal, segmentation, scaling, cropping, flipping and/or warping, etc. The picture recognition model is trained only after the training pictures have been given the same size and the same viewing angle, which effectively improves the authenticity and accuracy of the picture recognition model.
In the present embodiment, a binarization method may be used for background removal: a pixel whose value is greater than a preset threshold is set to white, and otherwise to black. The original picture is thereby converted into a picture containing only black and white, which effectively removes interfering elements in the picture background.
In the present embodiment, a segmentation function may be used to segment each picture in the picture data set, dividing each number or each character in the picture into an individual number or character.
Dividing submodule 4063, for dividing the data set into a training set and a test set using a cross-validation method.
The training set is used to train the picture recognition model, and the test set is used to test the performance of the trained picture recognition model. The higher the test accuracy, the better the performance of the trained picture recognition model; the lower the test accuracy, the poorer its performance.
The data set may be divided according to a suitable ratio (for example, 3 to 2) to obtain the training set and the test set.
Selection submodule 4064, for randomly selecting a first preset quantity of samples from the training set to train the picture recognition model.
In the present embodiment, it is not necessary to use all pictures in the original training set to train the picture recognition model. Instead, a first preset quantity of samples is selected from the original training set to participate in training, which reduces the number of training samples and saves training time for the picture recognition model.
In addition, selecting the samples with a random number generation algorithm increases the randomness of the samples participating in training and can improve the robustness of the picture recognition model.
In a first embodiment, the first preset quantity may be a preset fixed value, for example 60; that is, 60 samples are picked at random from the original training set to participate in the training of the picture recognition model.
In a second embodiment, the first preset quantity may be a preset ratio, for example 1/10; that is, 1/10 of the samples in the original training set are picked at random to participate in the training of the picture recognition model.
Test submodule 4065, for testing the accuracy of the trained picture recognition model using the test set. If the accuracy is greater than or equal to the preset accuracy threshold, the training ends; if the accuracy is less than the preset accuracy threshold, the selection submodule 4064 adds a second preset quantity of samples, taken from the part of the training set outside the first preset quantity of samples, to the first preset quantity of samples, and the test submodule 4065 is executed again, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
In a first embodiment, the second preset quantity may be a preset fixed value, for example 20; that is, 20 pictures are picked at random from the part of the training set outside the first preset quantity of samples to participate in the training of the picture recognition model.
In a second embodiment, the second preset quantity may be a preset ratio, for example 1/20; that is, 1/20 of the pictures in the part of the training set outside the first preset quantity of samples are picked at random to participate in the training of the picture recognition model.
In a third embodiment, the second preset quantity may be a preset ratio of the first preset quantity, for example 1/5; that is, a number of pictures equal to 1/5 of the first preset quantity is picked at random from the part of the training set outside the first preset quantity of samples to participate in the training of the picture recognition model.
The picture recognition model training method provided by the present invention gradually increases the number of samples participating in training. While ensuring the recognition rate of the picture recognition model, it trains with fewer samples, which shortens the training time as far as possible and improves training efficiency; that is, it finds the best training-set size between the accuracy and the efficiency of the picture recognition model.
The above integrated unit implemented in the form of software functional modules may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a dual-screen device, a network device, etc.) or a processor to execute part of the methods of the embodiments of the present invention.
Embodiment seven
Fig. 7 is a schematic diagram of the terminal provided by Embodiment seven of the present invention.
The terminal 7 includes: a memory 71, at least one processor 72, a computer program 73 stored in the memory 71 and executable on the at least one processor 72, and at least one communication bus 74.
When executing the computer program 73, the at least one processor 72 implements the steps of the above dynamic chart class page data crawling method embodiments, or implements the functions of the modules/units in the above device embodiments.
Illustratively, the computer program 73 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the at least one processor 72 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 73 in the terminal 7.
The terminal 7 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that the schematic diagram of Fig. 7 is only an example of the terminal 7 and does not constitute a limitation on the terminal 7, which may include more or fewer components than shown, combine certain components, or use different components; for example, the terminal 7 may also include input/output devices, network access devices, buses, etc.
The at least one processor 72 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor 72 may be a microprocessor or any conventional processor. The processor 72 is the control center of the terminal 7 and connects the various parts of the entire terminal 7 using various interfaces and lines.
The memory 71 may be used to store the computer program 73 and/or the modules/units. The processor 72 implements the various functions of the terminal 7 by running or executing the computer program and/or modules/units stored in the memory 71 and by calling the data stored in the memory 71. The memory 71 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the terminal 7 (such as audio data or a phone book). In addition, the memory 71 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state memory device.
If the integrated module/unit of the terminal 7 is realized in the form of SFU software functional unit and as independent product Sale in use, can be stored in a computer read/write memory medium.Based on this understanding, in present invention realization All or part of flow in embodiment method is stated, relevant hardware can also be instructed to complete by computer program, institute The computer program stated can be stored in a computer readable storage medium, which, can when being executed by processor The step of realizing above-mentioned each embodiment of the method.Wherein, the computer program includes computer program code, the computer Program code can be source code form, object identification code form, executable file or certain intermediate forms etc..The computer can Reading medium may include:Any entity or device, recording medium, USB flash disk, mobile hard of the computer program code can be carried Disk, magnetic disc, CD, computer storage, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electric carrier signal, telecommunication signal and software distribution medium etc..It needs to illustrate It is that the content that the computer-readable medium includes can be fitted according to legislation in jurisdiction and the requirement of patent practice When increase and decrease, such as in certain jurisdictions, according to legislation and patent practice, computer-readable medium does not include that electric carrier wave is believed Number and telecommunication signal.
In several embodiments provided by the present invention, it should be understood that disclosed terminal and method can pass through it Its mode is realized.For example, terminal embodiment described above is only schematical, for example, the division of the unit, only Only a kind of division of logic function, formula that in actual implementation, there may be another division manner.
In addition, each functional unit in each embodiment of the present invention can be integrated in same treatment unit, it can also That each unit physically exists alone, can also two or more units be integrated in same unit.Above-mentioned integrated list The form that hardware had both may be used in member is realized, can also be realized in the form of hardware adds software function module.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, from every point of view, the embodiments are to be regarded as illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all changes that fall within the meaning and range of equivalency of the claims are therefore intended to be embraced within the present invention. Any reference sign in a claim should not be construed as limiting the claim concerned. In addition, it is clear that the word "comprising" does not exclude other units, and the singular does not exclude the plural. Multiple units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solution of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A dynamic chart page data crawling method, characterized in that the method comprises:
a) starting a browser using an automated testing tool and inputting a link to the website whose data is to be crawled;
b) crawling, from the website whose data is to be crawled, page information related to a crawl keyword input by a user;
c) rendering and parsing the crawled page;
d) taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture, and saving the screenshot picture;
e) recognizing the screenshot picture according to a pre-trained picture recognition model to obtain the content of the screenshot picture;
f) determining whether the website whose data is to be crawled and the pages corresponding to the crawl keyword have all been traversed; and
when it is determined that the website whose data is to be crawled and the pages corresponding to the crawl keyword have all been traversed, ending the flow; or
when it is determined that the website whose data is to be crawled and the pages corresponding to the crawl keyword have not all been traversed, continuing to execute b) to f) above.
2. The method of claim 1, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
determining, by means of the automated testing tool, whether a chart exists in the parsed page;
when it is determined that no chart exists in the parsed page, crawling the information in the parsed page and saving the crawled information in a preset data format; and
when it is determined that a chart exists in the parsed page, taking a screenshot of the chart in the parsed page to obtain the screenshot picture.
3. The method of claim 1 or 2, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
calculating a perceptual hash value of the screenshot picture;
determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of an already captured screenshot picture is greater than a preset similarity threshold;
when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the already captured screenshot picture is greater than the preset similarity threshold, deleting the screenshot picture.
4. The method of claim 3, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the already captured screenshot picture is less than or equal to the preset similarity threshold, associating the screenshot picture with the corresponding parsed page and saving them in a preset specific location.
5. The method of claim 1, characterized in that the training process of the pre-trained picture recognition model comprises:
obtaining a plurality of pictures;
preprocessing the plurality of pictures to obtain a data set for training the picture recognition model;
dividing the data set into a training set and a test set using a cross-validation method;
randomly selecting a first preset quantity of training data from the training set to train the picture recognition model;
testing the accuracy of the trained picture recognition model using the test set;
if the accuracy is greater than or equal to a preset accuracy threshold, ending the training;
if the accuracy is less than the preset accuracy threshold, retraining the picture recognition model.
6. The method of claim 5, characterized in that retraining the picture recognition model comprises:
adding a second preset quantity of training data, taken from the part of the training set other than the first preset quantity of training data, to the first preset quantity of training data, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
7. The method of claim 5, characterized in that the second preset quantity is a preset fixed value, a preset ratio value, or a preset proportion of the first preset quantity.
8. A dynamic chart page data crawling device, characterized in that the device comprises:
a starting module, configured to start a browser using an automated testing tool and input a link to the website whose data is to be crawled;
a crawling module, configured to crawl, from the website whose data is to be crawled, page information related to a crawl keyword input by a user;
a parsing module, configured to render and parse the crawled page;
a screenshot module, configured to take a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture, and to save the screenshot picture;
a recognition module, configured to recognize the screenshot picture according to a pre-trained picture recognition model to obtain the content of the screenshot picture.
9. A terminal, characterized in that the terminal comprises a processor and a memory, the processor being configured to implement the dynamic chart page data crawling method of any one of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the dynamic chart page data crawling method of any one of claims 1 to 7.
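The duplicate-screenshot check recited in claims 3 and 4 can be illustrated with a minimal sketch. The claims do not specify how the perceptual hash or the similarity is computed; the version below assumes a common DCT-based 64-bit hash over a grayscale matrix that has already been resized (for example to 32x32) and a similarity defined as the fraction of matching bits. All function names and the 0.9 default threshold are hypothetical.

```python
import math
from statistics import median


def dct_1d(values):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def perceptual_hash(gray, low=8):
    """Hash a square grayscale matrix (assumed already resized, e.g. 32x32).

    Keeps the low x low lowest-frequency DCT coefficients and sets one bit
    per coefficient: 1 if it is above the median, else 0.
    """
    rows = [dct_1d(row) for row in gray]             # DCT along each row
    cols = [dct_1d([row[c] for row in rows])[:low]   # then down the first `low` columns
            for c in range(low)]
    coeffs = [cols[c][r] for r in range(low) for c in range(low)]
    med = median(coeffs[1:])                         # leave the DC term out of the median
    bits = 0
    for value in coeffs:
        bits = (bits << 1) | (1 if value > med else 0)
    return bits


def similarity(hash_a, hash_b, n_bits=64):
    """Fraction of matching bits (assumes 64-bit hashes, i.e. low=8)."""
    return 1.0 - bin(hash_a ^ hash_b).count("1") / n_bits


def is_duplicate(new_hash, saved_hashes, threshold=0.9):
    """True if the new screenshot exceeds the similarity threshold against
    any saved screenshot (delete it); False means keep and save it."""
    return any(similarity(new_hash, h) > threshold for h in saved_hashes)
```

In use, a crawler would hash each new screenshot, call `is_duplicate` against the hashes of the screenshots saved so far, delete the file when it returns `True`, and otherwise store the picture together with its page and append the hash to the saved list. The strict `>` comparison mirrors the claims, where a similarity equal to the threshold leads to saving rather than deletion.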
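The training procedure of claims 5 and 6 amounts to a loop that grows the training subset until the test accuracy reaches the threshold. The sketch below is only an illustration under stated assumptions: the data set is taken as already split into training and test sets, `train_fn` and `accuracy_fn` are hypothetical callables standing in for whatever model and metric are used, and stopping when no unused training data remains is an added safeguard that the claims do not mention.

```python
import random


def train_until_accurate(train_set, test_set, train_fn, accuracy_fn,
                         first_n, step_n, threshold, seed=0):
    """Train on a random subset and enlarge it until accuracy >= threshold.

    first_n : size of the initial random subset (first preset quantity)
    step_n  : number of samples added per retraining (second preset quantity)
    Returns (model, accuracy, number_of_samples_used).
    """
    rng = random.Random(seed)
    pool = list(train_set)
    rng.shuffle(pool)                      # random selection from the training set
    selected, remaining = pool[:first_n], pool[first_n:]
    while True:
        model = train_fn(selected)
        accuracy = accuracy_fn(model, test_set)
        # Assumption: also stop once the training set is exhausted.
        if accuracy >= threshold or not remaining:
            return model, accuracy, len(selected)
        selected = selected + remaining[:step_n]   # add data not yet used
        remaining = remaining[step_n:]
```

Per claim 7, `step_n` could be a fixed value or derived as a proportion of `first_n`; either choice fits this loop unchanged.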

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810349975.3A CN108595583B (en) 2018-04-18 2018-04-18 Dynamic graph page data crawling method, device, terminal and storage medium
PCT/CN2018/100159 WO2019200783A1 (en) 2018-04-18 2018-08-13 Method for data crawling in page containing dynamic image or table, device, terminal, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810349975.3A CN108595583B (en) 2018-04-18 2018-04-18 Dynamic graph page data crawling method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN108595583A true CN108595583A (en) 2018-09-28
CN108595583B CN108595583B (en) 2022-12-02

Family

ID=63611109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810349975.3A Active CN108595583B (en) 2018-04-18 2018-04-18 Dynamic graph page data crawling method, device, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN108595583B (en)
WO (1) WO2019200783A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026392B (en) * 2019-11-14 2023-08-22 北京金山安全软件有限公司 Method and device for generating guide page and electronic equipment
CN111428162A (en) * 2020-03-20 2020-07-17 支付宝(杭州)信息技术有限公司 Page screenshot method and device
CN111538887B (en) * 2020-04-30 2023-11-10 贵阳杰汇数字创新中心有限公司 Big data graph and text recognition system and method based on artificial intelligence
CN111694588B (en) * 2020-06-11 2022-05-20 杭州安恒信息安全技术有限公司 Engine upgrade detection method and device, computer equipment and readable storage medium
CN112363919B (en) * 2020-11-02 2024-02-13 北京云测信息技术有限公司 User interface AI automatic test method, device, equipment and storage medium
CN112712021B (en) * 2020-12-29 2022-06-17 华信咨询设计研究院有限公司 Grain surface abnormal state identification method based on perceptual hash and connected domain analysis algorithm
CN113821747A (en) * 2021-08-31 2021-12-21 挂号网(杭州)科技有限公司 Data display method and device, storage medium and electronic equipment
CN115396237A (en) * 2022-10-27 2022-11-25 浙江鹏信信息科技股份有限公司 Webpage malicious tampering identification method and system and readable storage medium
CN117149552A (en) * 2023-10-31 2023-12-01 联通在线信息科技有限公司 Automatic interface detection method and device, electronic equipment and storage medium

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060184410A1 (en) * 2003-12-30 2006-08-17 Shankar Ramamurthy System and method for capture of user actions and use of capture data in business processes
US20090217153A1 (en) * 2004-08-02 2009-08-27 Clairvoyance Corporation Document processing and management approach to editing a document in a mark up language environment using undoable commands
CN102346736A (en) * 2010-07-28 2012-02-08 阿里巴巴集团控股有限公司 Protection method of webpage digital information and system
CN103902889A (en) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 Malicious message cloud detection method and server
CN104933138A (en) * 2015-06-16 2015-09-23 携程计算机技术(上海)有限公司 Webpage crawler system and webpage crawling method
US20160042021A1 (en) * 2014-08-08 2016-02-11 Halogen Software Inc. System and method for rendering of hierarchical data structures
CN105528159A (en) * 2016-01-28 2016-04-27 深圳市创想天空科技股份有限公司 Picture operation method and operation device
CN105630780A (en) * 2014-10-27 2016-06-01 小米科技有限责任公司 Webpage information processing method and apparatus
CN106453351A (en) * 2016-10-31 2017-02-22 重庆邮电大学 Financial fishing webpage detection method based on Web page characteristics
CN106599242A (en) * 2016-12-20 2017-04-26 福建六壬网安股份有限公司 Webpage change monitoring method and system based on similarity calculation
US20170193569A1 (en) * 2015-12-07 2017-07-06 Brandon Nedelman Three dimensional web crawler
CN106960062A (en) * 2017-04-12 2017-07-18 四川九鼎瑞信软件开发有限公司 Webpage capture method and system
CN107203778A (en) * 2017-05-05 2017-09-26 平安科技(深圳)有限公司 PVR intensity grade detecting system and method
CN107332805A (en) * 2016-04-29 2017-11-07 阿里巴巴集团控股有限公司 Detect the methods, devices and systems of leak
CN107480176A (en) * 2017-07-01 2017-12-15 珠海格力电器股份有限公司 A kind of management method of picture, device and terminal device
CN107871128A (en) * 2017-12-11 2018-04-03 广州市标准化研究院(广州市组织机构代码管理中心) A kind of high robust image-recognizing method based on SVG dynamic charts

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103401835A (en) * 2013-07-01 2013-11-20 北京奇虎科技有限公司 Method and device for presenting safety detection results of microblog page
CA2863124A1 (en) * 2014-01-03 2015-07-03 Investel Capital Corporation User content sharing system and method with automated external content integration
CN104376114B (en) * 2014-12-01 2018-01-30 百度在线网络技术(北京)有限公司 A kind of search result methods of exhibiting and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
PEITONG DUAN: "Beagle: Automated Extraction and Interpretation of Visualizations from the Web", 《MASSACHUSETTS INSTITUTE OF TECHNOLOGY 2017》 *
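刘可: "移动通信中的金融类钓鱼网页检测方法研究" [Research on detection methods for financial phishing web pages in mobile communication], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Excellent Master's Theses Full-text Database, Information Science and Technology] *
栗迎结等: "基于Selenium的SQL注入漏洞检测系统的研究" [Research on a Selenium-based SQL injection vulnerability detection system], 《现代计算机(专业版)》 [Modern Computer (Professional Edition)] *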

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582850B (en) * 2018-12-03 2021-07-02 金瓜子科技发展(北京)有限公司 Webpage crawling method and device, storage medium and electronic equipment
CN109582850A (en) * 2018-12-03 2019-04-05 金瓜子科技发展(北京)有限公司 A kind of method, apparatus of web page crawl, storage medium and electronic equipment
CN109948020A (en) * 2019-01-14 2019-06-28 北京三快在线科技有限公司 Data capture method, device, system and readable storage medium storing program for executing
CN109901968A (en) * 2019-01-31 2019-06-18 阿里巴巴集团控股有限公司 A kind of automation page data method of calibration and device
CN110324360A (en) * 2019-08-02 2019-10-11 联永智能科技(上海)有限公司 Offline cryptogram setting, management method, device, system, server and medium
CN110807007A (en) * 2019-09-30 2020-02-18 支付宝(杭州)信息技术有限公司 Target detection model training method, device and system and storage medium
CN110807007B (en) * 2019-09-30 2022-06-24 支付宝(杭州)信息技术有限公司 Target detection model training method, device and system and storage medium
CN111475699A (en) * 2020-03-07 2020-07-31 咪咕文化科技有限公司 Website data crawling method and device, electronic equipment and readable storage medium
CN111475699B (en) * 2020-03-07 2023-09-08 咪咕文化科技有限公司 Website data crawling method and device, electronic equipment and readable storage medium
CN113660535A (en) * 2021-08-18 2021-11-16 海看网络科技(山东)股份有限公司 System and method for monitoring content change of EPG column of IPTV service
CN114595391A (en) * 2022-03-17 2022-06-07 北京百度网讯科技有限公司 Data processing method and device based on information search and electronic equipment
CN114691962A (en) * 2022-04-25 2022-07-01 清华大学 Mobile terminal page crawler method and device and electronic equipment
CN114691962B (en) * 2022-04-25 2024-04-19 清华大学 Mobile terminal page crawler method and device and electronic equipment

Also Published As

Publication number Publication date
CN108595583B (en) 2022-12-02
WO2019200783A1 (en) 2019-10-24

Similar Documents

Publication Publication Date Title
CN108595583A (en) Dynamic chart class page data crawling method, device, terminal and storage medium
CN107729319B (en) Method and apparatus for outputting information
CN108021651B (en) Network public opinion risk assessment method and device
CN106844685B (en) Method, device and server for identifying website
CN110909229A (en) Webpage data acquisition and storage system based on simulated browser access
CN110427549A (en) A kind of network public opinion Source Tracing method, apparatus, terminal and storage medium
EP3617910A1 (en) Method and apparatus for displaying textual information
CN107742128A (en) Method and apparatus for output information
CN110795697A (en) Logic expression obtaining method and device, storage medium and electronic device
CN110717801A (en) Commodity information pushing method and device
CN112667802A (en) Service information input method, device, server and storage medium
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN114238573A (en) Information pushing method and device based on text countermeasure sample
CN110069686A (en) User behavior analysis method, apparatus, computer installation and storage medium
WO2018208412A1 (en) Detection of caption elements in documents
CN111858686A (en) Data display method and device, terminal equipment and storage medium
JP6508327B2 (en) Text visualization system, text visualization method, and program
CN108268488A (en) The recognition methods of webpage master map and device
CN110413307A (en) Correlating method, device and the electronic equipment of code function
CN115439156A (en) Method and device for recommending advertisement space, computer equipment and storage medium
CN113792232B (en) Page feature calculation method, page feature calculation device, electronic equipment, page feature calculation medium and page feature calculation program product
CN112328812B (en) Domain knowledge extraction method and system based on self-adjusting parameters and electronic equipment
CN114201376A (en) Log analysis method and device based on artificial intelligence, terminal equipment and medium
CN112597760A (en) Method and device for extracting domain words in document
CN110851517A (en) Source data extraction method, device and equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant