WO2019200783A1 - Method for data crawling in page containing dynamic image or table, device, terminal, and storage medium - Google Patents

Info

Publication number
WO2019200783A1
Authority
WO
WIPO (PCT)
Prior art keywords
screenshot
page
picture
training
crawled
Prior art date
Application number
PCT/CN2018/100159
Other languages
French (fr)
Chinese (zh)
Inventor
阮晓雯
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Information Transfer Between Computers (AREA)
  • Image Analysis (AREA)

Abstract

A method for data crawling in a page containing a dynamic image or table, the method comprising: launching a browser by means of an automatic testing tool, and inputting a link of a given website; crawling the given website for page information related to a crawling keyword input by a user; rendering and parsing a crawled page; capturing a screenshot of the parsed page by means of the automatic testing tool and storing the screenshot image; performing identification on the screenshot image according to a pre-trained image identification model, and obtaining the content of the screenshot image; determining whether traversal of the given website and pages corresponding to the crawling keyword is completed; if so, terminating the procedure; and if not, continuing the above procedure. The present application further provides a device for data crawling in a page containing a dynamic image or table, a terminal, and a storage medium. The present application enables automatic crawling of dynamically loaded data in an image or table and identifies content in an image.

Description

Dynamic chart page data crawling method, device, terminal and storage medium
This application claims priority to Chinese Patent Application No. 201810349975.3, filed with the Chinese Patent Office on April 18, 2018 and entitled "Dynamic chart page data crawling method, device, terminal and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of web crawler technology, and in particular to a dynamic chart page data crawling method, device, terminal and storage medium.
Background
With the spread of modern web technologies such as Ajax (Asynchronous JavaScript and XML), a popular approach to building interactive web applications without sacrificing browser compatibility, the form of web page data has changed profoundly. More and more page content on the Internet is generated dynamically with Ajax, and users often encounter pages that prompt "click to load more" or that load more content automatically as the mouse scrolls. These newer pages need user interaction to trigger the generation and display of content. This improves the browsing experience to some extent, but it poses a serious challenge to traditional data collection methods based on fetching HTML files.
This is especially true of chart data that is loaded dynamically in a web page. Such data is generally displayed only after asynchronous loading, so a traditional crawler has difficulty reaching it. Some text data is encrypted and then displayed in the form of a chart, and the chart cannot be downloaded directly. Crawling also frequently runs into steps that require input. In addition, interference information is sometimes added to a chart, which makes the real data in the chart hard to obtain. At present, obtaining dynamic chart data generally requires a large amount of manual effort.
Summary of the Invention
In view of the above, it is necessary to provide a dynamic chart page data crawling method, device, terminal and storage medium that can automatically crawl dynamically loaded chart data. The crawled chart data is captured as a screenshot and input into a pre-trained picture recognition model, which recognizes the content of the picture. Compared with traditional web crawler products, this offers good compatibility, high speed and accurate data capture.
A first aspect of the present application provides a dynamic chart page data crawling method, the method comprising:
a) launching a browser with an automated testing tool and entering the link of the website whose data is to be crawled;
b) crawling, from the website to be crawled, page information related to a crawl keyword input by a user;
c) rendering and parsing the crawled page;
d) taking a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture, and saving the screenshot picture;
e) recognizing the screenshot picture with a pre-trained picture recognition model to obtain the content of the screenshot picture;
f) determining whether the website to be crawled and the pages corresponding to the crawl keyword have all been traversed; and
when it is determined that the website to be crawled and the pages corresponding to the crawl keyword have all been traversed, ending the process; or
when it is determined that the website to be crawled and the pages corresponding to the crawl keyword have not all been traversed, continuing to perform b) to f) above.
A second aspect of the present application provides a dynamic chart page data crawling device, the device comprising:
a launch module, configured to launch a browser with an automated testing tool and enter the link of the website whose data is to be crawled;
a crawling module, configured to crawl, from the website to be crawled, page information related to a crawl keyword input by a user;
a parsing module, configured to render and parse the crawled page;
a screenshot module, configured to take a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture, and to save the screenshot picture;
a recognition module, configured to recognize the screenshot picture with a pre-trained picture recognition model to obtain the content of the screenshot picture.
A third aspect of the present application provides a terminal comprising a processor and a memory, the processor implementing the dynamic chart page data crawling method when executing computer-readable instructions stored in the memory.
A fourth aspect of the present application provides a non-volatile readable storage medium storing computer-readable instructions which, when executed by a processor, implement the dynamic chart page data crawling method.
The dynamic chart page data crawling method, device, terminal and storage medium of the present application use Selenium to simulate user operations such as logging in through a browser, dynamic loading, and screenshot download, combined with web crawler technology. Dynamically loaded chart data can therefore be crawled automatically, and the crawled information is fully consistent with the text and graphics a real user sees. The crawled chart data is captured as a screenshot and input into a pre-trained picture recognition model, which recognizes the content of the picture. Compared with traditional web crawler products, this offers good compatibility, high speed and accurate data capture.
Second, during training of the picture recognition model, the number of training sets taking part in training is increased step by step. Provided that the recognition rate of the model is guaranteed, fewer samples take part in training, which shortens the training time as far as possible and improves training efficiency. In other words, the number of training sets is chosen to balance the accuracy and the efficiency of the picture recognition model.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the dynamic chart page data crawling method provided in Embodiment 1 of the present application.
FIG. 2 is a flowchart of the method, provided in Embodiment 2 of the present application, for taking a screenshot of a parsed page to obtain a screenshot picture and saving the screenshot picture.
FIG. 3 is a flowchart of the training method of the picture recognition model provided in Embodiment 3 of the present application.
FIG. 4 is a structural diagram of the dynamic chart page data crawling device provided in Embodiment 4 of the present application.
FIG. 5 is a diagram of the sub-function modules of the deduplication module provided in Embodiment 5 of the present application.
FIG. 6 is a diagram of the sub-function modules of the training module provided in Embodiment 6 of the present application.
FIG. 7 is a structural diagram of the terminal provided in Embodiment 7 of the present application.
The following detailed description further explains the present application with reference to the above drawings.
Detailed Description
So that the above objects, features and advantages of the present application can be understood more clearly, the present application is described in detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of the present application. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by a person skilled in the technical field of the present application. The terms used in the specification of the present application are only for the purpose of describing specific embodiments and are not intended to limit the present application.
The dynamic chart page data crawling method of the embodiments of the present application is applied in one or more terminals. The method can also be applied in a hardware environment consisting of a terminal and a server connected to the terminal through a network. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network. The method may be executed by the server, by the terminal, or by the server and the terminal together.
For a terminal that needs to perform dynamic chart page data crawling, the crawling function provided by the method of the present application may be integrated directly on the terminal, or a client implementing the method may be installed. Alternatively, the method may run on a device such as a server in the form of a software development kit (SDK) that exposes an interface to the crawling function; a terminal or other device can then perform dynamic chart page data crawling through the provided interface.
Embodiment 1
FIG. 1 is a flowchart of the dynamic chart page data crawling method provided in Embodiment 1 of the present application. Depending on requirements, the execution order in the flowchart may be changed and some steps may be omitted.
S11. Launch a browser with an automated testing tool and enter the link of the website to be crawled.
Selenium WebDriver (hereinafter Selenium), an automated software testing technology, provides strong visual, automated interaction. It simulates a person's interaction with a web page programmatically, which triggers dynamic data loading and makes the dynamically generated data available. Selenium can faithfully simulate the operations a user performs on a web page, such as clicking "view more", logging in automatically, clicking a link, filling in a form, scrolling the mouse, dragging with the mouse, scrolling down after the page has loaded, clicking to turn the page, and saving a screenshot.
In this embodiment, a browser is opened through Selenium and the link (Uniform Resource Locator, URL) of the website to be crawled is entered in the browser. Selenium calls the get() method to open the web page of that website.
For example, if the user needs to crawl "face recognition books" data on the Dangdang website, a browser (for example, Google Chrome) is opened through Selenium and the Dangdang URL "www.dangdang.com" is entered. This opens the Dangdang website and displays its web page.
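A minimal Python sketch of step S11 follows, assuming the Selenium Python bindings and a Chrome driver are installed. The helper names `with_scheme` and `open_site` and the default `http` scheme are illustrative assumptions, not part of the application. The scheme helper is included because WebDriver's `get()` generally expects a fully qualified URL, whereas the example above gives only "www.dangdang.com".

```python
from urllib.parse import urlparse

try:
    # Third-party package; assumed to be installed together with a Chrome driver.
    from selenium import webdriver
except ImportError:
    # Allows the URL helper below to be used even when Selenium is absent.
    webdriver = None


def with_scheme(url: str, default_scheme: str = "http") -> str:
    """Return the URL unchanged if it has a scheme, else prepend default_scheme."""
    return url if urlparse(url).scheme else f"{default_scheme}://{url}"


def open_site(url: str):
    """Hypothetical helper for S11: start Chrome through Selenium and load the site."""
    if webdriver is None:
        raise RuntimeError("Selenium is not installed")
    driver = webdriver.Chrome()   # launches a browser window
    driver.get(with_scheme(url))  # get() loads the page at the given URL
    return driver
```

Calling `open_site("www.dangdang.com")` would then open the site in a new Chrome window; the caller is responsible for calling `driver.quit()` when crawling is finished.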
In this embodiment, if the user needs to crawl data from several websites, the links of all the websites to be crawled can be placed at once in the queue of the browser opened through Selenium, and the crawler then crawls the data in those websites one after another.
S12. Crawl, from the website to be crawled, page information related to the crawl keyword input by the user.
When the website to be crawled has been opened through Selenium and the user inputs a crawl keyword, for example "face recognition", Selenium simulates a user browsing the page information of all the pages on that website related to "face recognition".
S13. Render and parse the crawled page.
When Selenium crawls a page, Ajax is triggered to request data from the server asynchronously. Once the raw response data arrives, it is formatted and assembled into new HTML nodes, which are inserted into the initial HTML file, and the rendering engine of the browser kernel finally displays the dynamic content. The Selenium service sends the page request over the wire protocol and then drives the browser API to obtain the original page loaded by the browser. The page is returned to the Selenium service over the wire protocol, and the Selenium service hands it to the parsing module for page parsing.
S14. Take a screenshot of the parsed page with the automated testing tool to obtain a screenshot picture, and save the screenshot picture.
The Selenium driver instructs the browser to execute the command, and the browser performs the screenshot-and-save operation in its kernel. The result is identical to a user capturing and saving a picture of the page with the mouse.
Preferably, taking a screenshot of the parsed page with the automated testing tool to obtain and save a screenshot picture may further include: deduplicating the tables in the parsed page according to perceptual hash values.
For a more detailed description of step S14, see FIG. 2 and the corresponding description.
S15. Recognize the screenshot picture with a pre-trained picture recognition model to obtain the content of the screenshot picture.
In this embodiment, for the training method of the picture recognition model, see FIG. 3 and the corresponding description.
S16. Determine whether the website to be crawled and the pages corresponding to the crawl keyword have all been traversed.
When it is determined that the website to be crawled and the pages corresponding to the crawl keyword have all been traversed, the process ends. Otherwise, when they have not all been traversed, S12 to S16 are performed again.
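The control flow of S12 to S16 can be sketched in Python as a loop over the pages that match the keyword. This is only an illustration under stated assumptions: the function name `crawl_site` and its three callable parameters are hypothetical stand-ins for the parsing, screenshot and recognition modules, and the "all pages traversed" check of S16 is modelled simply as the page iterator running out.

```python
from typing import Callable, Iterable, List


def crawl_site(
    pages: Iterable[str],
    render_and_parse: Callable[[str], str],
    take_screenshot: Callable[[str], bytes],
    recognize: Callable[[bytes], str],
) -> List[str]:
    """Apply S13 to S15 to every page produced by S12 and collect the results."""
    results = []
    for page in pages:                      # S12: next page matching the keyword
        parsed = render_and_parse(page)     # S13: render and parse the page
        picture = take_screenshot(parsed)   # S14: capture the screenshot picture
        results.append(recognize(picture))  # S15: recognize the picture content
    return results                          # S16: iterator exhausted, so stop
```

Passing the steps in as callables keeps the loop independent of Selenium and of the recognition model, so the flow can be exercised with simple stand-in functions.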
In summary, the dynamic chart page data crawling method of the present application uses Selenium to simulate user operations such as logging in through a browser, dynamic loading, and screenshot download, combined with web crawler technology. Dynamically loaded chart data can therefore be crawled automatically, and the crawled information is fully consistent with the text and graphics a real user sees. The crawled chart data is captured as a screenshot and input into a pre-trained picture recognition model, which recognizes the content of the picture. Compared with traditional web crawler products, this offers good compatibility, high speed and accurate data capture.
Embodiment 2
FIG. 2 is a flowchart of the method, provided in Embodiment 2 of the present application, for taking a screenshot of a parsed page to obtain a screenshot picture and saving the screenshot picture. Depending on requirements, the execution order in the flowchart may be changed and some steps may be omitted.
S21. Determine, with the automated testing tool, whether a chart exists in the parsed page.
In this embodiment, the automated testing tool determines whether a chart exists in the parsed page by checking whether the parsed page contains tags related to chart display and control.
When the automated testing tool finds tags related to chart display and control in the parsed page, it determines that a chart exists in the parsed page. When it finds no such tags, it determines that no chart exists in the parsed page.
The tags related to chart display and control include img, table, tr and td, together with attributes such as colspan.
Charts in a web page are written in HTML, which contains many DIV and CSS constructs that control the page display format as well as HTML tags related to charts. Whether a chart exists in the parsed page can therefore be determined by checking for chart-related tags and attributes: if any are found, a chart exists in the parsed page; if none are found, no chart exists.
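The tag check described above can be illustrated with Python's standard `html.parser` module. This is a sketch rather than the application's implementation: the names `CHART_TAGS`, `CHART_ATTRS`, `ChartTagFinder` and `has_chart` are assumptions, and colspan is treated as an attribute because that is what it is in HTML.

```python
from html.parser import HTMLParser

CHART_TAGS = {"img", "table", "tr", "td"}  # tags named in the description
CHART_ATTRS = {"colspan"}                  # colspan is an attribute, not a tag


class ChartTagFinder(HTMLParser):
    """Sets `found` when any chart-related tag or attribute is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag in CHART_TAGS or any(name in CHART_ATTRS for name, _ in attrs):
            self.found = True


def has_chart(html: str) -> bool:
    """Return True if the parsed page source contains chart-related markup (S21)."""
    finder = ChartTagFinder()
    finder.feed(html)
    return finder.found
```

With Selenium, the string passed to `has_chart` would typically be the rendered page source, so that markup inserted by Ajax is included in the check.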
When it is determined that no chart exists in the parsed page, step S22 is performed; otherwise, when a chart exists in the parsed page, step S23 is performed.
S22. Crawl the information in the parsed page and save the crawled information in a preset data format.
When it is determined that no chart exists in the parsed page, no screenshot of the parsed page is taken. The crawler directly crawls the information in the parsed page and stores it in the preset data format.
In this embodiment, different operations are performed depending on whether a chart exists in the parsed page. When the parsed page contains a chart, a screenshot is taken of the parsed page and of the chart in it. When it does not, no screenshot is taken. This saves the network resources that would be wasted by taking screenshots of every parsed page. Skipping the screenshot for pages without charts also simplifies the process and helps improve crawling efficiency.
S23. Take a screenshot of the chart in the parsed page to obtain a screenshot picture.
In this embodiment, using Selenium to simulate a user taking a screenshot of the chart in the parsed page further includes downloading the chart in the parsed page.
S24. Calculate the perceptual hash value of the screenshot picture.
In this embodiment, a perceptual hash algorithm is used to calculate the perceptual hash value of the screenshot picture. The process includes:
1) converting the screenshot picture to grayscale;
2) calculating the gray average of the grayscaled screenshot picture;
3) comparing the gray value of each pixel of the grayscaled screenshot picture with the gray average;
4) recording 1 for each pixel whose gray value is greater than or equal to the gray average, and 0 for each pixel whose gray value is less than the gray average;
5) concatenating the per-pixel comparison results obtained in 4) according to a preset concatenation rule to obtain the perceptual hash value of the screenshot picture.
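Steps 2) to 5) above correspond to what is commonly called an average hash. The Python sketch below assumes the picture has already been converted to grayscale and reduced to a small grid of integer gray values (step 1), and that the concatenation rule is left to right, top to bottom. The function name `average_hash` is illustrative.

```python
from typing import List, Sequence


def average_hash(gray: Sequence[Sequence[int]]) -> List[int]:
    """Return one bit per pixel: 1 if the gray value is >= the mean, else 0."""
    # Flatten row by row, i.e. left to right and top to bottom.
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)                        # step 2
    return [1 if value >= mean else 0 for value in pixels]  # steps 3 to 5


# A 2*2 grid whose mean is 86.25 gives the bits [0, 1, 0, 1].
bits = average_hash([[10, 200], [45, 90]])
```

For an 8*8 grid the result has 64 bits. Returning a list of bits rather than a packed integer keeps the later bit-by-bit comparison straightforward.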
S25. Determine whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of an already captured picture is greater than a preset similarity threshold.
In this embodiment, this determination specifically includes: counting the number of bit positions at which the perceptual hash value of the screenshot picture and that of an already captured picture hold the same value; and determining whether that number is greater than the preset similarity threshold.
For example, suppose the grayscaled screenshot picture is 8*8 pixels and its gray average is 45. If the gray value of the pixel in row 1, column 1 is greater than or equal to 45, the comparison result is recorded as 1, otherwise as 0; the same applies to the pixel in row 1, column 2, the pixel in row 1, column 3, and so on. The comparison results are then combined from left to right and from top to bottom into a 64-bit number, which is the perceptual hash value of the screenshot picture. When the number of matching bit positions between the perceptual hash value of the screenshot picture and that of an already captured picture (for example, 61) is greater than the preset similarity threshold (for example, 60), the screenshot picture is regarded as the same as the already captured picture.
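The comparison in S25 can be sketched in Python as a count of matching bit positions between two equal-length hashes. The function names `matching_bits` and `is_duplicate` are assumptions; the default threshold of 60 is taken from the example above.

```python
from typing import Sequence


def matching_bits(hash_a: Sequence[int], hash_b: Sequence[int]) -> int:
    """Count the positions at which the two hashes hold the same bit."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(1 for a, b in zip(hash_a, hash_b) if a == b)


def is_duplicate(hash_a: Sequence[int], hash_b: Sequence[int], threshold: int = 60) -> bool:
    """Treat two pictures as the same when more than `threshold` bits match."""
    return matching_bits(hash_a, hash_b) > threshold
```

The comparison is strict, following the text: with a threshold of 60, two 64-bit hashes that agree in 61 positions count as duplicates, while exactly 60 matching positions do not.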
When it is determined that the similarity between the perceptual hash value of the screenshot picture and that of an already captured picture is greater than the preset similarity threshold, step S26 is performed; otherwise, when the similarity is less than or equal to the preset similarity threshold, step S27 is performed.
S26. Delete the screenshot picture.
S27. Store the screenshot picture in association with the corresponding parsed page at a preset specific location.
In this embodiment, the preset specific location is dedicated to storing screenshot pictures and their corresponding parsed pages. It may be a specific folder or a folder with a specific name. Storing each screenshot picture in association with its parsed page makes it quick to find the page containing a chart afterwards, and allows the content of the chart to be analyzed further with context-based semantic analysis, using information such as the position of the chart in the page.
In summary, the screenshot deduplication method provided by the present application uses perceptual hash values to determine whether a screenshot picture is the same as an already captured picture. The perceptual hash calculation is accurate, and deleting or deduplicating downloads with the same content removes redundant screenshot pictures and effectively saves storage space. In addition, storing each screenshot picture in association with its parsed page facilitates later management and analysis.
Embodiment 3
FIG. 3 is a flowchart of the method for training the picture recognition model provided in Embodiment 3 of the present application. Depending on requirements, the order of execution in the flowchart may be changed and some steps may be omitted.
S31. Acquire a plurality of pictures.
In this embodiment, a separate small crawler may automatically collect pictures from websites on the Internet, or pictures may be downloaded manually from search engines (for example, Baidu, Google, or 360), to form a picture data set that is saved in a local database. The content of the pictures may include, but is not limited to, digits, characters, letters, images, and tables; letters may also be case-sensitive.
S32. Preprocess the pictures to obtain the data set to be used for training the picture recognition model.
In this embodiment, each picture in the picture data set is preprocessed separately. The preprocessing includes background removal, segmentation, scaling, cropping, flipping, and/or warping, so that the training pictures have the same size and the same viewing angle before the picture recognition model is trained, which effectively improves the authenticity and accuracy of the picture recognition model.
In this embodiment, a binarization method may be used for background removal: a pixel whose value is greater than a preset threshold is set to white, and otherwise to black. That is, the original picture is converted into a picture containing only black and white, which effectively removes interfering elements from the picture background.
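The binarization rule can be illustrated with a short sketch. It assumes the picture has already been converted to rows of gray values in the range 0 to 255; the function name `binarize` and the default threshold of 128 are illustrative choices, not values given in the application.

```python
def binarize(gray, threshold=128):
    """Map each gray value to white (255) if above the threshold, else black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in gray]
```

A value exactly equal to the threshold becomes black, matching the "greater than" wording of the paragraph above.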
In this embodiment, a segmentation function may be used to segment each picture in the picture data set, splitting the digits or characters in the picture into individual digits or characters.
S33. Divide the data set into a training set and a test set by using a cross-validation method.
The training set is used to train the picture recognition model, and the test set is used to test the performance of the trained model. The higher the test accuracy, the better the performance of the trained picture recognition model; a low test accuracy indicates that the trained model performs poorly.
The data set may be divided in a suitable ratio (for example, 3 to 2) to obtain the training set and the test set.
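As an illustration of the 3-to-2 division, the sketch below performs a single shuffled hold-out split. This is a simplification: the application refers to a cross-validation method without giving details, and the function name `split_dataset` and the fixed seed are assumptions made for the example.

```python
import random


def split_dataset(samples, train_parts=3, test_parts=2, seed=0):
    """Shuffle the samples and split them train:test = train_parts:test_parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]
```

With 100 samples this yields 60 training samples and 40 test samples, and every sample lands in exactly one of the two sets.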
S34. Randomly select a first preset number of training samples from the training set to train the picture recognition model.
In this embodiment, it is not necessary to train the picture recognition model on all the pictures in the original training set. Instead, a first preset number of training samples is selected from the original training set to take part in training, which reduces the number of samples used and saves training time.
In addition, using a random number generation algorithm for the selection increases the randomness of the samples taking part in training and can improve the robustness of the picture recognition model.
In a first embodiment, the first preset number may be a preset fixed value, for example 60; that is, 60 samples are randomly selected from the original training set to take part in training the picture recognition model.
In a second embodiment, the first preset number may be a preset ratio, for example 1/10; that is, 1/10 of the samples in the original training set are randomly selected to take part in training the picture recognition model.
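Both variants of the first preset number, a fixed count or a ratio, can be handled by one helper. The sketch below is only an illustration; the name `select_initial` and the convention that a float between 0 and 1 denotes a ratio are assumptions.

```python
import random


def select_initial(train_set, preset, seed=None):
    """Randomly pick the first batch of training samples.

    `preset` is either a fixed count (int) or a ratio (float in (0, 1)).
    """
    pool = list(train_set)
    if isinstance(preset, float):
        if not 0 < preset < 1:
            raise ValueError("ratio must be between 0 and 1")
        count = int(len(pool) * preset)
    else:
        count = preset
    count = min(count, len(pool))            # never ask for more than exists
    return random.Random(seed).sample(pool, count)
```

For a training set of 600 pictures, `select_initial(train_set, 60)` and `select_initial(train_set, 0.1)` both return 60 distinct samples.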
S35. Test the accuracy of the trained picture recognition model using the test set. If the accuracy is greater than or equal to a preset accuracy threshold, training ends; if the accuracy is less than the preset accuracy threshold, the picture recognition model is retrained.
Preferably, retraining the picture recognition model includes: taking a second preset number of training samples from the part of the training set outside the first preset number of training samples, adding them to the first preset number of training samples, and re-executing steps S32 to S35 until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
In a first embodiment, the second preset number may be a preset fixed value, for example 20; that is, 20 pictures are randomly selected from the part of the training set outside the first preset number of training samples to take part in training the picture recognition model.
In a second embodiment, the second preset number may be a preset ratio, for example 1/20; that is, 1/20 of the pictures in the part of the training set outside the first preset number of training samples are randomly selected to take part in training the picture recognition model.
In a third embodiment, the second preset number may be a preset ratio of the first preset number, for example 1/5; that is, a number of pictures equal to 1/5 of the first preset number is randomly selected from the part of the training set outside the first preset number of training samples to take part in training the picture recognition model.
The picture recognition model training method provided by the present application gradually increases the number of training samples taking part in training. While maintaining the recognition rate of the picture recognition model, it trains with fewer samples, which shortens the training time as far as possible and improves training efficiency; in other words, it finds the best number of training samples as a trade-off between the accuracy and the efficiency of the picture recognition model.
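The incremental scheme of steps S34 and S35 can be sketched as a loop. The sketch makes several assumptions that are not in the application: `train_fn` and `eval_fn` are hypothetical callables supplied by the caller (one trains a model on a list of samples, the other returns its accuracy on the test set), both preset numbers are fixed counts, and only training and testing are repeated rather than the full sequence S32 to S35.

```python
import random


def train_incrementally(train_set, first_n, second_n, train_fn, eval_fn,
                        accuracy_threshold, seed=0):
    """Train on a growing random subset until the accuracy threshold is met."""
    pool = list(train_set)
    random.Random(seed).shuffle(pool)         # random selection order
    selected, remaining = pool[:first_n], pool[first_n:]
    while True:
        model = train_fn(selected)            # S34: train on current subset
        accuracy = eval_fn(model)             # S35: test on the test set
        if accuracy >= accuracy_threshold or not remaining:
            return model, accuracy, len(selected)
        selected = selected + remaining[:second_n]   # add the next batch
        remaining = remaining[second_n:]
```

The loop also stops when no unused samples remain, a safeguard added here so that the sketch always terminates.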
The above is merely a specific implementation of the present application, but the scope of protection of the present application is not limited thereto. A person of ordinary skill in the art may make improvements without departing from the inventive concept of the present application, and all such improvements fall within the scope of protection of the present application.
The functional modules and hardware structure of a terminal that implements the above method for crawling data from dynamic chart pages are described below with reference to FIGS. 4 to 7.
Embodiment 4
FIG. 4 is a functional module diagram of the dynamic chart page data crawling device provided in Embodiment 4 of the present application.
In some embodiments, the dynamic chart page data crawling device 40 runs in a terminal. The device 40 may include a plurality of functional modules composed of program code segments. The program code of each segment in the device 40 may be stored in a memory and executed by at least one processor to crawl data from dynamic chart pages (see FIG. 1 and its related description for details).
In this embodiment, the dynamic chart page data crawling device 40 of the terminal may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include a startup module 401, a crawling module 402, a parsing module 403, a screenshot module 404, a deduplication module 405, a training module 406, a recognition module 407, and a determining module 408. A module as referred to in the present application is a series of computer-readable instruction segments that can be executed by at least one processor, can perform a fixed function, and are stored in the memory. The functions of the modules are described in detail in the following embodiments.
The startup module 401 is configured to start a browser using an automated testing tool and to input the link of the website whose data is to be crawled.
Selenium WebDriver (hereinafter Selenium), an automated software testing technology, provides strong visual, automated interaction capabilities. It simulates human interaction with web pages programmatically, thereby triggering dynamic data loading and obtaining dynamically generated data. Selenium can realistically simulate the operations a user performs on a web page, such as clicking "view more", logging in automatically, clicking links, filling in forms, scrolling the mouse wheel, dragging with the mouse, scrolling down after the page has loaded, clicking to turn pages, and saving screenshots.
In this embodiment, the browser is opened through the Selenium tool, the link (Uniform Resource Locator, URL) of the website to be crawled is entered in the browser, and the Selenium tool calls the get() method to open the web page of the website entered by the user.
For example, if the user needs to crawl "face recognition books" data from the Dangdang website, the browser (for example, Google Chrome) is opened through the Selenium tool and the URL of the Dangdang website, "www.dangdang.com", is entered. This launches the Dangdang website and displays its web page.
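A minimal sketch of this step with Selenium's Python bindings is shown below. It assumes that the `selenium` package, Google Chrome, and a matching driver are installed; the helper names `normalize_url` and `open_site`, the `https://` scheme, and the screenshot file name are illustrative assumptions, not details from the application.

```python
def normalize_url(url: str) -> str:
    """Add a scheme if it is missing; get() expects a fully qualified URL."""
    return url if "://" in url else "https://" + url


def open_site(url: str, screenshot_path: str = "page.png") -> str:
    """Open the site in Chrome, save a screenshot, and return the page title."""
    # Imported lazily so normalize_url() works even without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()              # assumes Chrome and its driver exist
    try:
        driver.get(normalize_url(url))       # load the page to be crawled
        driver.save_screenshot(screenshot_path)
        return driver.title
    finally:
        driver.quit()                        # always close the browser
```

Calling `open_site("www.dangdang.com")` would correspond to the example in the paragraph above.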
In this embodiment, if the user needs to crawl data from multiple websites, the links of those websites may be entered together into the queue of the browser opened through the Selenium tool, and the crawler program then crawls the data of the websites one after another.
The crawling module 402 is configured to crawl, from the website to be crawled, page information related to a crawl keyword entered by the user.
When the website to be crawled has been opened through the Selenium tool and the user enters a crawl keyword, for example "face recognition", the Selenium tool simulates a user browsing the page information of all web pages on the website that relate to "face recognition".
The parsing module 403 is configured to render and parse the crawled pages.
When crawling a page, the Selenium tool triggers Ajax to request data asynchronously from the server. After the raw data of the response is received, it is formatted and assembled into new HTML nodes and inserted into the initial HTML document, and the dynamic content is finally displayed by the rendering engine of the browser kernel. The Selenium service sends a page request over the wire protocol, and the browser API is then operated to obtain the original page loaded by the browser. The page is returned to the Selenium service over the wire protocol, and once the Selenium service has the page it hands it to the parsing module for page parsing.
The screenshot module 404 is configured to take a screenshot of the parsed page through the automated testing tool and to save the screenshot.
The Selenium driver instructs the browser to execute the command, and the browser finally performs the screenshot-and-save operation in its kernel. The result is exactly the same as when a user captures and saves a picture of the page with the mouse.
The deduplication module 405 is configured to deduplicate the tables in the parsed page according to perceptual hash values.
The training module 406 is configured to train the picture recognition model.
The recognition module 407 is configured to recognize the screenshot according to the pre-trained picture recognition model and to obtain the content of the screenshot.
The determining module 408 is configured to determine whether the website to be crawled and the pages corresponding to the crawl keyword have all been traversed. When the determining module 408 determines that the traversal is not complete, modules 401, 402, 403, 404, 405, and 407 are executed again.
In summary, the dynamic chart page data crawling device described in the present application uses Selenium to simulate user operations such as logging in through the browser, dynamic loading, and downloading screenshots, and combines this with web crawler technology so that dynamically loaded chart data can be crawled automatically. The crawled information is fully consistent with the text and graphics a real user sees. Screenshots of the crawled chart data are fed into the pre-trained picture recognition model, which recognizes the content of the pictures. Compared with traditional web crawler products, the device offers good compatibility, high speed, and accurate data capture.
Embodiment 5
FIG. 5 is a diagram of the sub-functional modules of the deduplication module provided in Embodiment 5 of the present application. The deduplication module 405 includes a first determining submodule 4051, a saving submodule 4052, a screenshot submodule 4053, a calculation submodule 4054, a second determining submodule 4055, a deletion submodule 4056, and an association submodule 4057.
The first determining submodule 4051 is configured to determine, through the automated testing tool, whether a chart exists in the parsed page.
In this embodiment, the automated testing tool determines whether a chart exists in the parsed page by identifying whether the parsed page contains tags related to chart display and control.
When the automated testing tool identifies tags related to chart display and control in the parsed page, it determines that a chart exists in the parsed page; when it identifies no such tags, it determines that no chart exists in the parsed page.
The tags related to chart display and control include tags such as img, table, tr, td, and colspan.
Because charts in web pages are written in HTML, a page contains DIV and CSS constructs that control its display format as well as chart-related HTML tags. Whether a chart exists in the parsed page can therefore be determined by checking for chart-related tags and attributes: when such tags or attributes are identified, it is determined that a chart exists in the parsed page; when none are identified, it is determined that no chart exists.
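The tag check can be sketched with Python's standard `html.parser` module. The class and function names are illustrative. Note that in HTML `colspan` is an attribute of table cells rather than an element, so the sketch looks for it among the attributes; this handling is an assumption about what the text intends.

```python
from html.parser import HTMLParser

CHART_TAGS = {"img", "table", "tr", "td"}


class ChartTagFinder(HTMLParser):
    """Collect chart-related tags and attributes seen in a page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in CHART_TAGS:
            self.found.add(tag)
        if any(name == "colspan" for name, _ in attrs):
            self.found.add("colspan")


def page_has_chart(html: str) -> bool:
    """Return True if the page contains any chart-related tag or attribute."""
    finder = ChartTagFinder()
    finder.feed(html)
    return bool(finder.found)
```

When `page_has_chart` returns False, the page would be crawled directly without a screenshot, as described for the saving submodule.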
The saving submodule 4052 is configured to crawl the information in the parsed page when the first determining submodule 4051 determines that no chart exists in the parsed page, and to save the crawled information in a preset data format.
When it is determined that no chart exists in the parsed page, no screenshot of the parsed page is taken; the crawler program crawls the information in the parsed page directly and stores it in the preset data format.
In this embodiment, different operations are performed depending on whether a chart exists in the parsed page. When the parsed page contains a chart, a screenshot is taken of the parsed page and of the chart in it; when it does not, no screenshot is taken. This saves network resources that would otherwise be wasted on capturing every parsed page. In addition, skipping the screenshot operation for pages without charts simplifies the workflow and helps improve crawling efficiency.
The screenshot submodule 4053 is configured to take a screenshot of the chart in the parsed page when the first determining submodule 4051 determines that a chart exists in the parsed page.
In this embodiment, taking a screenshot of the chart in the parsed page by simulating a user through the Selenium tool further includes downloading the chart in the parsed page.
The calculation submodule 4054 is configured to calculate the perceptual hash value of the screenshot.
In this embodiment, the procedure carried out by the calculation submodule 4054 includes:
1) converting the screenshot to grayscale;
2) calculating the average gray value of the grayscaled screenshot;
3) comparing the gray value of each pixel of the grayscaled screenshot with the average gray value;
4) recording a 1 for each pixel whose gray value is greater than or equal to the average gray value, and a 0 for each pixel whose gray value is less than the average gray value;
5) concatenating the per-pixel comparison results obtained in 4) according to a preset concatenation rule to obtain the perceptual hash value of the screenshot.
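Steps 1) to 5) can be sketched in plain Python as follows. The sketch assumes the screenshot has already been reduced to a small grid of pixels (for example 8*8); the luminance weights in `to_gray` are the common ITU-R BT.601 coefficients, chosen here as an assumption because the application does not specify a grayscale formula, and the left-to-right, top-to-bottom order is used as the concatenation rule.

```python
def to_gray(rgb):
    """Step 1: convert one (R, G, B) pixel to a gray value (assumed weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def perceptual_hash(gray):
    """Steps 2-5: compare each gray value with the mean and join the bits."""
    pixels = [value for row in gray for value in row]   # row-major order
    mean = sum(pixels) / len(pixels)                    # step 2
    # Steps 3-4: 1 if the pixel is >= the mean, otherwise 0; step 5: join.
    return "".join("1" if value >= mean else "0" for value in pixels)
```

For an 8*8 input the result is a 64-character bit string, which corresponds to the 64-bit hash described in this embodiment.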
The second determining submodule 4055 is configured to determine whether the similarity between the perceptual hash value of the screenshot and the perceptual hash value of a previously captured screenshot is greater than a preset similarity threshold.
In this embodiment, determining whether the similarity between the two perceptual hash values is greater than the preset similarity threshold specifically includes: counting the number of bit positions at which the perceptual hash value of the screenshot and that of the previously captured screenshot have the same value; and determining whether that number is greater than the preset similarity threshold.
For example, suppose the grayscaled screenshot is 8*8 pixels and its average gray value is 45. If the gray value of the pixel in the first row, first column is greater than 45, the comparison result is recorded as 1, and otherwise as 0; the pixels in the first row, second column and the first row, third column are treated in the same way, and so on. The comparison results are then combined from left to right and from top to bottom into a 64-bit number, which is the perceptual hash value of the screenshot. When the number of bit positions at which the perceptual hash value of the screenshot and that of the previously captured screenshot agree (for example, 61) is greater than the preset similarity threshold (for example, 60), the screenshot is considered the same as the previously captured screenshot.
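When the 64-bit hash is held as an integer, the number of agreeing bit positions can be obtained from the Hamming distance, computed with XOR. This is an equivalent formulation offered as a sketch; the function names and default values are illustrative.

```python
def matching_bits(hash_a: int, hash_b: int, bits: int = 64) -> int:
    """Number of bit positions at which the two hashes agree."""
    differing = bin(hash_a ^ hash_b).count("1")   # Hamming distance
    return bits - differing


def is_duplicate(hash_a: int, hash_b: int, threshold: int = 60,
                 bits: int = 64) -> bool:
    """True when the number of agreeing bits exceeds the threshold."""
    return matching_bits(hash_a, hash_b, bits) > threshold
```

With the numbers from the example above, two hashes that differ in 3 of 64 positions agree in 61 and are treated as the same picture, while hashes that differ in 4 positions agree in exactly 60 and are not.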
The deletion submodule 4056 is configured to delete the screenshot when the second determining submodule 4055 determines that the similarity between the perceptual hash value of the screenshot and that of a previously captured screenshot is greater than the preset similarity threshold.
The association submodule 4057 is configured to store the screenshot in association with the corresponding parsed page at a preset specific location when the second determining submodule 4055 determines that the similarity between the perceptual hash value of the screenshot and that of a previously captured screenshot is less than or equal to the preset similarity threshold.
In this embodiment, the preset specific location is dedicated to storing screenshots and their corresponding parsed pages. It may be a specific folder, or a folder with a specific name. Storing each screenshot in association with its parsed page makes it possible to quickly locate the page containing a chart afterwards and, using information such as the position of the chart in the page, to further parse the content of the chart by means of contextual semantic analysis.
Embodiment 6
FIG. 6 is a diagram of the sub-functional modules of the training module provided in Embodiment 6 of the present application. The training module 406 includes an acquisition submodule 4061, a preprocessing module 4062, a division submodule 4063, a selection submodule 4064, and a test submodule 4065.
The acquisition submodule 4061 is configured to acquire a plurality of pictures.
In this embodiment, a separate small crawler may automatically collect pictures from websites on the Internet, or pictures may be downloaded manually from search engines (for example, Baidu, Google, or 360), to form a picture data set that is saved in a local database. The content of the pictures may include, but is not limited to, digits, characters, letters, images, and tables; letters may also be case-sensitive.
The preprocessing module 4062 is configured to preprocess the pictures to obtain the data set to be used for training the picture recognition model.
In this embodiment, each picture in the picture data set is preprocessed separately. The preprocessing includes background removal, segmentation, scaling, cropping, flipping, and/or warping, so that the training pictures have the same size and the same viewing angle before the picture recognition model is trained, which effectively improves the authenticity and accuracy of the picture recognition model.
In this embodiment, a binarization method may be used for background removal: a pixel whose value is greater than a preset threshold is set to white, and otherwise to black. That is, the original picture is converted into a picture containing only black and white, which effectively removes interfering elements from the picture background.
In this embodiment, a segmentation function may be used to segment each picture in the picture data set, splitting the digits or characters in the picture into individual digits or characters.
The division submodule 4063 is configured to divide the data set into a training set and a test set by using a cross-validation method.
The training set is used to train the picture recognition model, and the test set is used to test the performance of the trained model. The higher the test accuracy, the better the performance of the trained picture recognition model; a low test accuracy indicates that the trained model performs poorly.
The data set may be divided in a suitable ratio (for example, 3 to 2) to obtain the training set and the test set.
The selection submodule 4064 is configured to randomly select a first preset number of training samples from the training set to train the picture recognition model.
In this embodiment, it is not necessary to train the picture recognition model on all the pictures in the original training set. Instead, a first preset number of training samples is selected from the original training set to take part in training, which reduces the number of samples used and saves training time.
In addition, using a random number generation algorithm for the selection increases the randomness of the samples taking part in training and can improve the robustness of the picture recognition model.
In a first embodiment, the first preset number may be a preset fixed value, for example 60; that is, 60 samples are randomly selected from the original training set to take part in training the picture recognition model.
In a second embodiment, the first preset number may be a preset ratio, for example 1/10; that is, 1/10 of the samples in the original training set are randomly selected to take part in training the picture recognition model.
测试子模块4065,用于利用所述测试集测试所训练的图片识别模型的准确率,若准确率大于或者等于预设准确率阈值,则训练结束;若准确率小于预设准确率阈值,则所述选择子模块4064从所述训练集中除所述第一预设数量的训练集之外的训练集中,增加第二预设数量的训练集至所述第一预设数量的训练集中,并重新执行测试子模块4065,直至所训练的图片识别模型的准确率大于或者等于预设准确率阈值。The test sub-module 4065 is configured to test the accuracy of the trained picture recognition model by using the test set, and if the accuracy rate is greater than or equal to the preset accuracy rate threshold, the training ends; if the accuracy rate is less than the preset accuracy rate threshold, The selection sub-module 4064 adds a second preset number of training sets to the first preset number of training sets from the training set except the first preset number of training sets in the training set, and The test sub-module 4065 is re-executed until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy rate threshold.
In a first embodiment, the second preset number may be a preset fixed value, for example 20; that is, 20 pictures are randomly selected from the part of the training set outside the first preset number of samples to take part in training the picture recognition model.
In a second embodiment, the second preset number may be a preset ratio, for example 1/20; that is, one twentieth of the pictures in the part of the training set outside the first preset number of samples are randomly selected to take part in training the picture recognition model.
In a third embodiment, the second preset number may be a preset ratio of the first preset number, for example 1/5; that is, a number of pictures equal to one fifth of the first preset number is randomly selected from the part of the training set outside the first preset number of samples to take part in training the picture recognition model.
The picture recognition model training method provided by the present application gradually increases the number of training samples that take part in training. While ensuring the recognition rate of the picture recognition model, it trains with fewer samples, which shortens the training time of the picture recognition model as far as possible and improves training efficiency. In other words, it finds the training set size that best balances the accuracy and the efficiency of the picture recognition model.
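By way of non-limiting illustration, the incremental training procedure described above may be sketched in Python as follows. The names `resolve_count`, `train_incrementally`, `train`, and `accuracy` are hypothetical and do not appear in the application. The sketch also assumes that a preset number below 1 denotes a ratio of the pool being drawn from, and that the loop stops once no unused training samples remain; neither assumption is stated in the application.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")
Model = TypeVar("Model")


def resolve_count(spec: float, pool_size: int) -> int:
    """Turn a preset number into a sample count for a pool of pool_size items.

    Assumption: spec >= 1 is a fixed count (e.g. 60) and spec < 1 is a ratio
    of the pool (e.g. 0.1). The result is clamped to the range 1..pool_size.
    """
    count = int(spec) if spec >= 1 else int(pool_size * spec)
    return max(1, min(count, pool_size))


def train_incrementally(
    train_set: Sequence[Sample],
    test_set: Sequence[Sample],
    train: Callable[[List[Sample]], Model],
    accuracy: Callable[[Model, Sequence[Sample]], float],
    first: float,
    second: float,
    threshold: float,
    seed: int = 0,
) -> Tuple[Model, float, int]:
    """Grow the selected training samples until the accuracy threshold is met.

    Returns the last model, its accuracy, and the number of samples used.
    """
    if not train_set:
        raise ValueError("train_set must not be empty")
    rng = random.Random(seed)
    remaining = list(train_set)
    rng.shuffle(remaining)  # a random order makes each slice a random pick

    n_first = resolve_count(first, len(remaining))
    selected, remaining = remaining[:n_first], remaining[n_first:]

    while True:
        model = train(selected)
        acc = accuracy(model, test_set)
        # Stop on success, or (assumption) when no unused samples are left.
        if acc >= threshold or not remaining:
            return model, acc, len(selected)
        n_more = resolve_count(second, len(remaining))
        selected = selected + remaining[:n_more]
        remaining = remaining[n_more:]
```

Under these assumptions, the third embodiment, in which the second preset number is a ratio of the first preset number, can be expressed by computing that count beforehand and passing it as `second`.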
The integrated unit described above, when implemented in the form of a software function module, may be stored in a non-volatile readable storage medium. The software function module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a dual-screen device, a network device, or the like) or a processor to perform part of the method described in each embodiment of the present application.
Embodiment 7
FIG. 7 is a schematic diagram of the terminal provided in Embodiment 5 of the present application.
The terminal 7 includes a memory 71, at least one processor 72, computer-readable instructions 73 stored in the memory 71 and executable on the at least one processor 72, and at least one communication bus 74.
When executing the computer-readable instructions 73, the at least one processor 72 implements the steps of the above embodiments of the method for crawling data from dynamic chart pages; alternatively, when executing the computer-readable instructions 73, the at least one processor 72 implements the functions of the modules/units in the above device embodiments.
Illustratively, the computer-readable instructions 73 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the at least one processor 72 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer-readable instructions 73 in the terminal 7.
The terminal 7 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that schematic diagram 5 is merely an example of the terminal 7 and does not limit it; the terminal 7 may include more or fewer components than illustrated, combine certain components, or use different components. For example, the terminal 7 may further include input/output devices, network access devices, buses, and the like.
The at least one processor 72 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 72 may be a microprocessor or any conventional processor. The processor 72 is the control center of the terminal 7 and connects the parts of the entire terminal 7 through various interfaces and lines.
The memory 71 may be used to store the computer-readable instructions 73 and/or the modules/units. The processor 72 implements the various functions of the terminal 7 by running or executing the computer-readable instructions and/or modules/units stored in the memory 71 and by calling data stored in the memory 71. The memory 71 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function). The data storage area may store data created through use of the terminal 7 (such as audio data or a phone book). In addition, the memory 71 may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the modules/units integrated in the terminal 7 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile readable storage medium. On this understanding, all or part of the processes of the method embodiments above may also be carried out by computer-readable instructions that instruct the relevant hardware. The computer-readable instructions may be stored in a non-volatile readable storage medium and, when executed by a processor, implement the steps of each of the method embodiments above. The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
In the several embodiments provided by the present application, it should be understood that the disclosed terminal and method may be implemented in other ways. For example, the terminal embodiment described above is merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation.
In addition, the functional units in the embodiments of the present application may be integrated in the same processing unit, each unit may exist physically on its own, or two or more units may be integrated in the same unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software function modules.
It is apparent to those skilled in the art that the present application is not limited to the details of the exemplary embodiments above, and that the present application can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments are therefore to be regarded in every respect as exemplary and non-limiting. The scope of the present application is defined by the appended claims rather than by the description above, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced by the present application. No reference sign in a claim should be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units, and the singular does not exclude the plural. A plurality of units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used as names and do not denote any particular order.
Finally, it should be noted that the embodiments above are intended only to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present application may be modified or replaced by equivalents without departing from the spirit and scope of those technical solutions.

Claims (20)

  1. A method for crawling data from dynamic chart pages, characterized in that the method comprises:
    a) launching a browser by means of an automated testing tool and entering a link to the website from which data is to be crawled;
    b) crawling, from the website, page information related to a crawling keyword input by a user;
    c) rendering and parsing the crawled page;
    d) taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture, and saving the screenshot picture;
    e) recognizing the screenshot picture according to a pre-trained picture recognition model to obtain the content of the screenshot picture;
    f) determining whether the website and the pages corresponding to the crawling keyword have all been traversed; and
    when it is determined that the website and the pages corresponding to the crawling keyword have all been traversed, ending the process; or
    when it is determined that the website and the pages corresponding to the crawling keyword have not all been traversed, continuing to perform b) to f) above.
  2. The method according to claim 1, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
    determining, by means of the automated testing tool, whether a chart exists in the parsed page;
    when it is determined that no chart exists in the parsed page, crawling the information in the parsed page and saving the crawled information in a preset data format; and
    when it is determined that a chart exists in the parsed page, taking a screenshot of the chart in the parsed page to obtain the screenshot picture.
  3. The method according to claim 1 or 2, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
    calculating a perceptual hash value of the screenshot picture;
    determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold;
    when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is greater than the preset similarity threshold, deleting the screenshot picture.
  4. The method according to claim 3, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture further comprises:
    when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is less than or equal to the preset similarity threshold, storing the screenshot picture in association with the corresponding parsed page at a preset specific location.
  5. The method according to claim 1, characterized in that the training process of the pre-trained picture recognition model comprises:
    acquiring a plurality of pictures;
    preprocessing the plurality of pictures to obtain a data set for training the picture recognition model;
    dividing the data set into a training set and a test set by cross-validation;
    randomly selecting a first preset number of samples from the training set to train the picture recognition model;
    testing the accuracy of the trained picture recognition model on the test set;
    if the accuracy is greater than or equal to a preset accuracy threshold, ending the training;
    if the accuracy is less than the preset accuracy threshold, retraining the picture recognition model.
  6. The method according to claim 5, characterized in that retraining the picture recognition model comprises:
    adding a second preset number of samples, taken from the part of the training set outside the first preset number of samples, to the first preset number of samples, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
  7. The method according to claim 5, characterized in that the second preset number is a preset fixed value, a preset ratio, or a preset ratio of the first preset number.
  8. A device for crawling data from dynamic chart pages, characterized in that the device comprises:
    a launch module, configured to launch a browser by means of an automated testing tool and enter a link to the website from which data is to be crawled;
    a crawling module, configured to crawl, from the website, page information related to a crawling keyword input by a user;
    a parsing module, configured to render and parse the crawled page;
    a screenshot module, configured to take a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture, and to save the screenshot picture;
    a recognition module, configured to recognize the screenshot picture according to a pre-trained picture recognition model to obtain the content of the screenshot picture.
  9. A terminal, characterized in that the terminal comprises a processor and a memory, and the processor implements the following steps when executing computer-readable instructions stored in the memory:
    a) launching a browser by means of an automated testing tool and entering a link to the website from which data is to be crawled;
    b) crawling, from the website, page information related to a crawling keyword input by a user;
    c) rendering and parsing the crawled page;
    d) taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture, and saving the screenshot picture;
    e) recognizing the screenshot picture according to a pre-trained picture recognition model to obtain the content of the screenshot picture;
    f) determining whether the website and the pages corresponding to the crawling keyword have all been traversed; and
    when it is determined that the website and the pages corresponding to the crawling keyword have all been traversed, ending the process; or
    when it is determined that the website and the pages corresponding to the crawling keyword have not all been traversed, continuing to perform b) to f) above.
  10. The terminal according to claim 9, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
    determining, by means of the automated testing tool, whether a chart exists in the parsed page;
    when it is determined that no chart exists in the parsed page, crawling the information in the parsed page and saving the crawled information in a preset data format; and
    when it is determined that a chart exists in the parsed page, taking a screenshot of the chart in the parsed page to obtain the screenshot picture.
  11. The terminal according to claim 9 or 10, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
    calculating a perceptual hash value of the screenshot picture;
    determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold;
    when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is greater than the preset similarity threshold, deleting the screenshot picture.
  12. The terminal according to claim 11, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture further comprises:
    when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is less than or equal to the preset similarity threshold, storing the screenshot picture in association with the corresponding parsed page at a preset specific location.
  13. The terminal according to claim 9, characterized in that the training process of the pre-trained picture recognition model comprises:
    acquiring a plurality of pictures;
    preprocessing the plurality of pictures to obtain a data set for training the picture recognition model;
    dividing the data set into a training set and a test set by cross-validation;
    randomly selecting a first preset number of samples from the training set to train the picture recognition model;
    testing the accuracy of the trained picture recognition model on the test set;
    if the accuracy is greater than or equal to a preset accuracy threshold, ending the training;
    if the accuracy is less than the preset accuracy threshold, retraining the picture recognition model.
  14. The terminal according to claim 13, characterized in that retraining the picture recognition model comprises:
    adding a second preset number of samples, taken from the part of the training set outside the first preset number of samples, to the first preset number of samples, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
  15. A non-volatile readable storage medium storing computer-readable instructions, characterized in that the computer-readable instructions, when executed by a processor, implement the following steps:
    a) launching a browser by means of an automated testing tool and entering a link to the website from which data is to be crawled;
    b) crawling, from the website, page information related to a crawling keyword input by a user;
    c) rendering and parsing the crawled page;
    d) taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture, and saving the screenshot picture;
    e) recognizing the screenshot picture according to a pre-trained picture recognition model to obtain the content of the screenshot picture;
    f) determining whether the website and the pages corresponding to the crawling keyword have all been traversed; and
    when it is determined that the website and the pages corresponding to the crawling keyword have all been traversed, ending the process; or
    when it is determined that the website and the pages corresponding to the crawling keyword have not all been traversed, continuing to perform b) to f) above.
  16. The storage medium according to claim 15, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
    determining, by means of the automated testing tool, whether a chart exists in the parsed page;
    when it is determined that no chart exists in the parsed page, crawling the information in the parsed page and saving the crawled information in a preset data format; and
    when it is determined that a chart exists in the parsed page, taking a screenshot of the chart in the parsed page to obtain the screenshot picture.
  17. The storage medium according to claim 15 or 16, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture comprises:
    calculating a perceptual hash value of the screenshot picture;
    determining whether the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of a previously captured screenshot picture is greater than a preset similarity threshold;
    when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is greater than the preset similarity threshold, deleting the screenshot picture.
  18. The storage medium according to claim 17, characterized in that taking a screenshot of the parsed page by means of the automated testing tool to obtain a screenshot picture and saving the screenshot picture further comprises:
    when it is determined that the similarity between the perceptual hash value of the screenshot picture and the perceptual hash value of the previously captured screenshot picture is less than or equal to the preset similarity threshold, storing the screenshot picture in association with the corresponding parsed page at a preset specific location.
  19. The storage medium according to claim 15, characterized in that the training process of the pre-trained picture recognition model comprises:
    acquiring a plurality of pictures;
    preprocessing the plurality of pictures to obtain a data set for training the picture recognition model;
    dividing the data set into a training set and a test set by cross-validation;
    randomly selecting a first preset number of samples from the training set to train the picture recognition model;
    testing the accuracy of the trained picture recognition model on the test set;
    if the accuracy is greater than or equal to a preset accuracy threshold, ending the training;
    if the accuracy is less than the preset accuracy threshold, retraining the picture recognition model.
  20. The storage medium according to claim 19, characterized in that retraining the picture recognition model comprises:
    adding a second preset number of samples, taken from the part of the training set outside the first preset number of samples, to the first preset number of samples, until the accuracy of the trained picture recognition model is greater than or equal to the preset accuracy threshold.
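
By way of non-limiting illustration, the screenshot de-duplication recited in claims 3 and 4 (and in the corresponding terminal and storage medium claims) may be sketched in Python as follows. The claims do not name a particular perceptual hash algorithm; this sketch assumes the simple "average hash" variant and a similarity defined as the fraction of matching hash bits. The names `downscale`, `average_hash`, `similarity`, and `keep_screenshot`, the 8x8 hash size, and the 0.9 threshold are all hypothetical. The screenshot is assumed to be already decoded into rows of grayscale values.

```python
from typing import Iterable, List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale values, e.g. 0-255


def downscale(pixels: Gray, size: int = 8) -> List[List[float]]:
    """Shrink an image to size x size by averaging rectangular blocks."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for by in range(size):
        y0 = by * height // size
        y1 = max((by + 1) * height // size, y0 + 1)
        row = []
        for bx in range(size):
            x0 = bx * width // size
            x1 = max((bx + 1) * width // size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Average hash: one bit per cell, set when the cell is above the mean."""
    flat = [v for row in downscale(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def similarity(hash_a: int, hash_b: int, n_bits: int = 64) -> float:
    """Fraction of matching bits (1.0 means identical hashes)."""
    differing = bin(hash_a ^ hash_b).count("1")
    return 1.0 - differing / n_bits


def keep_screenshot(new_hash: int, saved: Iterable[int], threshold: float = 0.9) -> bool:
    """False: similarity to a saved screenshot exceeds the threshold, so delete.

    True: similarity to every saved screenshot is at or below the threshold,
    so the screenshot is stored.
    """
    return all(similarity(new_hash, old) <= threshold for old in saved)
```

In such a sketch the caller would append the hash of every stored screenshot to `saved`, so that later captures of an unchanged chart are discarded.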
PCT/CN2018/100159 2018-04-18 2018-08-13 Method for data crawling in page containing dynamic image or table, device, terminal, and storage medium WO2019200783A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810349975.3 2018-04-18
CN201810349975.3A CN108595583B (en) 2018-04-18 2018-04-18 Dynamic graph page data crawling method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2019200783A1 (en) 2019-10-24

Family

ID=63611109

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100159 WO2019200783A1 (en) 2018-04-18 2018-08-13 Method for data crawling in page containing dynamic image or table, device, terminal, and storage medium

Country Status (2)

Country Link
CN (1) CN108595583B (en)
WO (1) WO2019200783A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026392A (en) * 2019-11-14 2020-04-17 北京金山安全软件有限公司 Method and device for generating guide page and electronic equipment
CN111538887A (en) * 2020-04-30 2020-08-14 广东所能网络有限公司 Big data image-text recognition system and method based on artificial intelligence
CN111694588A (en) * 2020-06-11 2020-09-22 浙江军盾信息科技有限公司 Engine upgrade detection method and device, computer equipment and readable storage medium
CN112363919A (en) * 2020-11-02 2021-02-12 北京云测信息技术有限公司 Automatic test method, device, equipment and storage medium for user interface AI
CN112712021A (en) * 2020-12-29 2021-04-27 华信咨询设计研究院有限公司 Grain surface abnormal state identification method based on perceptual hash and connected domain analysis algorithm
WO2021184896A1 (en) * 2020-03-20 2021-09-23 支付宝(杭州)信息技术有限公司 Page screenshot method and device
CN113821747A (en) * 2021-08-31 2021-12-21 挂号网(杭州)科技有限公司 Data display method and device, storage medium and electronic equipment
CN115396237A (en) * 2022-10-27 2022-11-25 浙江鹏信信息科技股份有限公司 Webpage malicious tampering identification method and system and readable storage medium
CN117149552A (en) * 2023-10-31 2023-12-01 联通在线信息科技有限公司 Automatic interface detection method and device, electronic equipment and storage medium

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582850B (en) * 2018-12-03 2021-07-02 金瓜子科技发展(北京)有限公司 Webpage crawling method and device, storage medium and electronic equipment
CN109948020A (en) * 2019-01-14 2019-06-28 北京三快在线科技有限公司 Data capture method, device, system and readable storage medium storing program for executing
CN109901968A (en) * 2019-01-31 2019-06-18 阿里巴巴集团控股有限公司 A kind of automation page data method of calibration and device
CN110324360A (en) * 2019-08-02 2019-10-11 联永智能科技(上海)有限公司 Offline cryptogram setting, management method, device, system, server and medium
CN110807007B (en) * 2019-09-30 2022-06-24 支付宝(杭州)信息技术有限公司 Target detection model training method, device and system and storage medium
CN111475699B (en) * 2020-03-07 2023-09-08 咪咕文化科技有限公司 Website data crawling method and device, electronic equipment and readable storage medium
CN113660535B (en) * 2021-08-18 2023-03-24 海看网络科技(山东)股份有限公司 System and method for monitoring content change of EPG column of IPTV service
CN114595391A (en) * 2022-03-17 2022-06-07 北京百度网讯科技有限公司 Data processing method and device based on information search and electronic equipment
CN114691962B (en) * 2022-04-25 2024-04-19 清华大学 Mobile terminal page crawler method and device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103401835A (en) * 2013-07-01 2013-11-20 北京奇虎科技有限公司 Method and device for presenting safety detection results of microblog page
CN103902889A (en) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 Malicious message cloud detection method and server
CN104376114A (en) * 2014-12-01 2015-02-25 百度在线网络技术(北京)有限公司 Search result displaying method and device
WO2015100496A1 (en) * 2014-01-03 2015-07-09 Investel Capital Corporation User content sharing system and method with automated external content integration

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060184410A1 (en) * 2003-12-30 2006-08-17 Shankar Ramamurthy System and method for capture of user actions and use of capture data in business processes
CN101048729A (en) * 2004-08-02 2007-10-03 佳思腾软件公司 Document processing and management approach for editing a document of mark up language
CN102346736B (en) * 2010-07-28 2014-04-09 阿里巴巴集团控股有限公司 Protection method of webpage digital information and system
US9886465B2 (en) * 2014-08-08 2018-02-06 Halogen Software Inc. System and method for rendering of hierarchical data structures
CN105630780A (en) * 2014-10-27 2016-06-01 小米科技有限责任公司 Webpage information processing method and apparatus
CN104933138A (en) * 2015-06-16 2015-09-23 携程计算机技术(上海)有限公司 Webpage crawler system and webpage crawling method
US20170193569A1 (en) * 2015-12-07 2017-07-06 Brandon Nedelman Three dimensional web crawler
CN105528159B (en) * 2016-01-28 2018-12-04 深圳市创想天空科技股份有限公司 A kind of operating method and operating device of picture
CN107332805B (en) * 2016-04-29 2021-02-26 阿里巴巴集团控股有限公司 Method, device and system for detecting vulnerability
CN106453351A (en) * 2016-10-31 2017-02-22 重庆邮电大学 Financial fishing webpage detection method based on Web page characteristics
CN106599242B (en) * 2016-12-20 2019-03-26 福建六壬网安股份有限公司 A kind of webpage change monitoring method and system based on similarity calculation
CN106960062A (en) * 2017-04-12 2017-07-18 四川九鼎瑞信软件开发有限公司 Webpage capture method and system
CN107203778A (en) * 2017-05-05 2017-09-26 平安科技(深圳)有限公司 PVR intensity grade detecting system and method
CN107480176B (en) * 2017-07-01 2020-05-01 珠海格力电器股份有限公司 Picture management method and device and terminal equipment
CN107871128B (en) * 2017-12-11 2023-06-06 广州市标准化研究院(广州市组织机构代码管理中心) High-robustness image recognition method based on SVG dynamic graph

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026392A (en) * 2019-11-14 2020-04-17 北京金山安全软件有限公司 Method and device for generating guide page and electronic equipment
CN111026392B (en) * 2019-11-14 2023-08-22 北京金山安全软件有限公司 Method and device for generating guide page and electronic equipment
WO2021184896A1 (en) * 2020-03-20 2021-09-23 支付宝(杭州)信息技术有限公司 Page screenshot method and device
CN111538887B (en) * 2020-04-30 2023-11-10 贵阳杰汇数字创新中心有限公司 Big data graph and text recognition system and method based on artificial intelligence
CN111538887A (en) * 2020-04-30 2020-08-14 广东所能网络有限公司 Big data image-text recognition system and method based on artificial intelligence
CN111694588A (en) * 2020-06-11 2020-09-22 浙江军盾信息科技有限公司 Engine upgrade detection method and device, computer equipment and readable storage medium
CN111694588B (en) * 2020-06-11 2022-05-20 杭州安恒信息安全技术有限公司 Engine upgrade detection method and device, computer equipment and readable storage medium
CN112363919A (en) * 2020-11-02 2021-02-12 北京云测信息技术有限公司 Automatic test method, device, equipment and storage medium for user interface AI
CN112363919B (en) * 2020-11-02 2024-02-13 北京云测信息技术有限公司 User interface AI automatic test method, device, equipment and storage medium
CN112712021A (en) * 2020-12-29 2021-04-27 华信咨询设计研究院有限公司 Grain surface abnormal state identification method based on perceptual hash and connected domain analysis algorithm
CN112712021B (en) * 2020-12-29 2022-06-17 华信咨询设计研究院有限公司 Grain surface abnormal state identification method based on perceptual hash and connected domain analysis algorithm
CN113821747A (en) * 2021-08-31 2021-12-21 挂号网(杭州)科技有限公司 Data display method and device, storage medium and electronic equipment
CN115396237A (en) * 2022-10-27 2022-11-25 浙江鹏信信息科技股份有限公司 Webpage malicious tampering identification method and system and readable storage medium
CN117149552A (en) * 2023-10-31 2023-12-01 联通在线信息科技有限公司 Automatic interface detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN108595583B (en) 2022-12-02
CN108595583A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
WO2019200783A1 (en) Method for data crawling in page containing dynamic image or table, device, terminal, and storage medium
CN110458918B (en) Method and device for outputting information
US10613726B2 (en) Removing and replacing objects in images according to a directed user conversation
US20190163714A1 (en) Search result aggregation method and apparatus based on artificial intelligence and search engine
US9766868B2 (en) Dynamic source code generation
CN110321537B (en) Method and device for generating file
CN103988202A (en) Image attractiveness based indexing and searching
CN111291572B (en) Text typesetting method and device and computer readable storage medium
WO2016018683A1 (en) Image based search to identify objects in documents
CN110909229A (en) Webpage data acquisition and storage system based on simulated browser access
CN109710224B (en) Page processing method, device, equipment and storage medium
US9009188B1 (en) Drawing-based search queries
EP4359956A1 (en) Smart summarization, indexing, and post-processing for recorded document presentation
EP3564833B1 (en) Method and device for identifying main picture in web page
US9940320B2 (en) Plugin tool for collecting user generated document segmentation feedback
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN116611401A (en) Document generation method and related device, electronic equipment and storage medium
US11762939B2 (en) Measure GUI response time
RU2571379C2 (en) Intelligent electronic document processing
CN110879868A (en) Consultant scheme generation method, device, system, electronic equipment and medium
US11961261B2 (en) AI-based aesthetical image modification
US10878005B2 (en) Context aware document advising
US20240126978A1 (en) Determining attributes for elements of displayable content and adding them to an accessibility tree
TWI759877B (en) Method for extracting context from webpages
CN117520764A (en) Method, device and equipment for identifying low-quality image-text content

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18915550; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN EP: public notification in the EP Bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.01.2021)

122 EP: PCT application non-entry in European phase

Ref document number: 18915550; Country of ref document: EP; Kind code of ref document: A1