WO2019227705A1 - Picture entry method, server, and computer storage medium - Google Patents

Picture entry method, server, and computer storage medium

Info

Publication number
WO2019227705A1
WO2019227705A1 PCT/CN2018/102077 CN2018102077W WO2019227705A1 WO 2019227705 A1 WO2019227705 A1 WO 2019227705A1 CN 2018102077 W CN2018102077 W CN 2018102077W WO 2019227705 A1 WO2019227705 A1 WO 2019227705A1
Authority
WO
WIPO (PCT)
Prior art keywords
picture
capture
pictures
crawling
rule
Prior art date
Application number
PCT/CN2018/102077
Other languages
English (en)
French (fr)
Inventor
张师琲
侯丽
王炜
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019227705A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of picture recognition technology, and in particular, to a picture entry method, a server, and a computer storage medium.
  • the basic pictures used for general picture recognition are scarce in source.
  • for example, these basic pictures are entered by each user organization into its own data platform, and the entered information is limited in variety.
  • a large amount of manual classification and labeling of the basic pictures is required before recognition.
  • in the vast majority of projects, 70% of the time is spent on data collection and labeling, which wastes a great deal of time and manpower.
  • manual labeling and classification are also prone to operational errors and are inefficient.
  • this application proposes a picture entry method, a server, and a computer storage medium to solve the problem of how to quickly obtain a large number of pictures and efficiently classify and label these pictures.
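  • to achieve the above objective, this application proposes a picture entry method, which includes the steps of:
  • accepting a picture capture request and starting a picture capture task, where the capture task includes a capture main process, the capture main process analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts a plurality of capture sub-processes for asynchronous picture capture, and each capture sub-process corresponds to a picture capture model established based on the preset picture capture rules;
  • storing the captured pictures in a first data set, obtaining the picture attribute information and picture features of the pictures in the first data set, preliminarily classifying the pictures according to the picture attribute information, and preliminarily labeling the pictures using the picture attribute information as tag information;
  • selecting a first picture in the first data set, selecting a plurality of other pictures in the first data set that are close to the first picture in picture features, and fitting the picture features of the first picture with the picture features of the plurality of other pictures to obtain a plurality of fitting coefficients of the first picture;
  • constructing the tags of the first picture from the tags of the plurality of other pictures according to the plurality of fitting coefficients, and labeling the first picture again with the constructed tags; and
  • storing the classified and twice-labeled pictures in a distributed manner according to the classification result.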
  • the preset picture capture rules include:
  • a first capture rule, which is to capture according to a specified URL, a first capture model being established based on the first capture rule;
  • a second capture rule, which is to use regular-expression matching to capture over a range, a second capture model being established based on the second capture rule; and
  • a third capture rule, which is to capture specified page elements, a third capture model being established based on the third capture rule.
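  • during picture capture, a simulated manual access step is further included to cope with the anti-crawling restrictions of the target website.
  • the simulated manual access step specifically includes:
  • finding the hidden information for logging in to the target website and saving its content first, where the hidden information is information required to log in to the target website;
  • submitting the hidden information to simulate logging in to the website; and
  • after the simulated login succeeds, obtaining the post-login information and capturing the pictures of the target website according to the preset picture capture rules.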
  • the main process is further configured to monitor the number of picture capture tasks in the plurality of capture sub-processes.
  • when a new picture capture task arrives, the main process distributes the new task to a capture sub-process whose number of picture capture tasks is less than a preset value.
  • when the number of picture capture tasks of every capture sub-process is greater than the preset value, the main process creates a new sub-process and distributes the new task to it.
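  • the method for selecting the plurality of similar other pictures is: extracting the picture features of each picture in the first data set; calculating the distance between the features of the current picture and those of the remaining pictures; and selecting the preset number of pictures with the smallest distances as the nearest-neighbor pictures of the given picture.
  • the current picture is a picture selected randomly or sequentially.
  • the feature is a color histogram feature, a texture feature, or a shape feature, and the distance is a Euclidean distance.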
  • obtaining a plurality of fitting coefficients of the picture includes the steps:
  • the feature corresponding to the current image is $x_i$
  • the features of its $k$ nearest-neighbor images are $\{x_{i1}, \dots, x_{ik}\}$
  • the method further includes the following steps:
  • the present application further provides a server including a memory, a processor, and a picture entry system stored on the memory and operable on the processor, where the picture entry system, when executed by the processor, implements the steps of the picture entry method described above.
  • the present application further provides a computer-readable storage medium storing a picture entry system, where the picture entry system can be executed by at least one processor to cause the at least one processor to execute the steps of the picture entry method described above.
  • compared with the prior art, the picture entry method, server, and computer-readable storage medium proposed in the present application first accept a picture capture request and start a picture capture task.
  • the capture task includes a capture main process.
  • the capture main process analyzes the mapping relationship between the capture request and the preset picture capture rules, and starts several capture sub-processes for asynchronous picture capture according to the mapping relationship.
  • each capture sub-process corresponds to a picture capture model established based on the preset picture capture rules. Second, the captured pictures are stored in a first data set, the picture attribute information of the pictures in the first data set is obtained, the pictures are preliminarily classified according to the picture attribute information, and the picture attribute information is used as tag information to preliminarily label the pictures. Third, a picture in the first data set is selected, a plurality of other pictures in the first data set that are close to it in picture features are selected, the picture features of the picture are fitted with the picture features of the other pictures to obtain a plurality of fitting coefficients, the tags of the picture are constructed from the tags of the other pictures according to the fitting coefficients, and the picture is labeled again with the constructed tags. Finally, the classified and labeled pictures are stored in a distributed manner according to the classification result.
  • the picture entry method, server, and computer-readable storage medium proposed in this application can quickly obtain pictures from the network and classify and label them efficiently and quickly, which greatly reduces the manpower and material resources required and saves cost; compared with the prior art, the approach is more convenient, faster, and more accurate.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of a server of the present application
  • FIG. 2 is a schematic diagram of a program module of a first embodiment of a picture entry system of the present application
  • FIG. 3 is a schematic flowchart of a first embodiment of a picture entry method according to the present application.
  • FIG. 4 is a schematic flowchart of a second embodiment of a picture entry method according to the present application.
  • FIG. 5 is a schematic flowchart of a third embodiment of a picture entry method according to the present application.
  • FIG. 1 is a schematic diagram of an optional hardware architecture of the server 1 of the present application.
  • the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 which may communicate with each other through a system bus. It should be noted that FIG. 1 only shows the server 1 with components 11-13, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.
  • the server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server.
  • the server 1 may be an independent server or a server cluster composed of multiple servers.
  • the memory 11 includes at least one type of readable storage medium.
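  • the readable storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disc, and the like.
  • in some embodiments, the memory 11 may be an internal storage unit of the server 1, such as a hard disk or internal memory of the server 1.
  • in other embodiments, the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the server 1.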
  • the memory 11 may also include both an internal storage unit of the server 1 and an external storage device thereof.
  • the memory 11 is generally used to store an operating system and various application software installed on the server 1, such as program codes of the picture entry system 2.
  • the memory 11 may also be used to temporarily store various types of data that have been output or will be output.
  • the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip.
  • the processor 12 is generally used to control the overall operation of the server 1.
  • the processor 12 is configured to run program code or process data stored in the memory 11, for example, to run the picture entry system 2 and the like.
  • the network interface 13 may include a wireless network interface or a wired network interface.
  • the network interface 13 is generally used to establish a communication connection between the server 1 and other electronic devices.
  • this application proposes a picture entry system 2.
  • FIG. 2 is a program module diagram of the first embodiment of the picture entry system 2 of the present application.
  • the picture entry system 2 includes a series of computer program instructions stored in the memory 11.
  • when the computer program instructions are executed by the processor 12, the picture entry operations of the embodiments of the present application can be implemented.
  • in some embodiments, the picture entry system 2 may be divided into one or more modules based on the specific operations implemented by the various portions of the computer program instructions. For example, in FIG. 3, the picture entry system 2 may be divided into a picture capture module 21, a first annotation module 22, a second annotation module 23, and a storage module 24, where:
  • the picture capture module 21 is configured to accept a picture capture request and start a picture capture task.
  • the capture task includes a capture main process; the capture main process analyzes the mapping relationship between the capture request and the preset picture capture rules and, according to the mapping relationship, starts a plurality of capture sub-processes for asynchronous picture capture, where each capture sub-process corresponds to a picture capture model established based on the preset picture capture rules;
  • the capture request is input by a user.
  • depending on the need, the user can choose different ways to capture pictures on the Internet.
  • for example, the user can specify the URL from which pictures are to be captured, and the pictures present on the web page corresponding to the specified URL are captured; the user can also use regular-expression matching to define a range of URLs and capture pictures within the search range limited by the regular expression.
  • a regular expression (also called a rule expression) is a concept in computer science.
  • in code it is often abbreviated as regex, regexp, or RE. Regular expressions are commonly used to retrieve and replace text that conforms to a certain pattern (rule).
  • a regular expression is a logical formula that operates on strings, including ordinary characters (for example, the letters a through z) and special characters (called "metacharacters"); that is, certain predefined characters and combinations of them are used to form a "rule string".
  • this "rule string" expresses a filtering logic for strings.
  • the regular expression matching http URL can be:
  • the user can also specify page elements to capture.
  • the user can further specify page elements for recursive capture and specify the order in which page elements are captured.
  • a web page is composed of web page elements.
  • web page elements include navigation, website logos, advertising bars, pictures, text, animations, decorations, hyperlinks, and so on; these elements together make up a complete web page, and web pages have become an indispensable part of the Internet.
  • the preset picture capture rules include:
  • a first capture rule, which is to capture according to a specified Uniform Resource Locator (URL), a first capture model being established based on the first capture rule;
  • a second capture rule, which is to use regular-expression matching to capture over a range, a second capture model being established based on the second capture rule; and
  • a third capture rule, which is to capture specified page elements, a third capture model being established based on the third capture rule.
  • the picture capture models are established in correspondence with the preset picture capture rules.
  • the preset picture capture rules are: 1. capture according to a specified URL; 2. use regular-expression matching to capture over a range; 3. capture specified page elements. For the third rule, page elements can also be specified for recursive capture, together with the order in which page elements are captured. On this basis, a specified-URL picture capture model, a regular-matching picture capture model, and an element picture capture model are established.
  • a simulated manual access step which can include:
  • the main process is also used to monitor the number of picture capture tasks in each sub-process.
  • when a new picture capture task arrives, the main process distributes the new task to a sub-process whose number of picture capture tasks is less than a preset value.
  • when the number of picture capture tasks of every sub-process is greater than the preset value, the main process creates a new sub-process and distributes the new task to it.
  • the first annotation module 22 is configured to store the captured pictures in a first data set, obtain picture attribute information of the pictures in the first data set, preliminarily classify the pictures according to the picture attribute information, and preliminarily label the pictures using the picture attribute information as tag information.
  • the picture attribute information includes: time, place, picture name, etc.
  • pictures can be classified according to the time and place at which they were generated.
  • for time, pictures can be classified in three ways: by year, by month, and by date; for place, pictures can be classified by country, province, city, district, county, and so on.
  • the picture attribute information is stored in the picture itself and can be read by a picture attribute reading program. Obtaining the picture attribute information includes the steps of: 1. loading the picture information; 2. analyzing and filtering the picture information to obtain the picture attribute information of the picture; 3. outputting the picture attribute information of the picture.
  • the obtained picture attribute information can be filtered, and the filtered picture attribute information can be used as tag information to initially mark the picture.
  • the time, place, and picture name in the picture can be selected to mark the picture.
  • Picture classification is one of the main methods for picture annotation, and because an image can usually be labeled with multiple category tags, picture classification based on pictures is a multi-tag picture classification problem.
  • picture classification can also be used for automatic archiving of pictures, to achieve intra-class retrieval, and improve query efficiency.
  • the second annotation module 23 is configured to select a picture in the first data set, select a plurality of other pictures in the first data set that are close to the picture in picture features, fit the picture features of the picture with the picture features of the plurality of other pictures to obtain a plurality of fitting coefficients of the picture, construct the tags of the picture from the tags of the plurality of other pictures according to the fitting coefficients, and label the picture again with the constructed tags.
  • a picture is usually associated with some text description information, such as a title, subject words, and comments, which indicates the content of the picture, the shooting location, personal feelings, evaluations, and so on. Pictures can therefore be tagged based on this information, or its keywords can be used directly as tags.
  • the storage module 24 is configured to store the classified and labeled pictures in a distributed manner according to a classification result.
  • the picture attribute information includes: time, place, picture name, etc.
  • the time and place where the picture was generated are classified, and the picture can be classified according to time and place.
  • for time, pictures can be classified in three ways: by year, by month, and by date; for place, pictures can be sorted by country, province, city, district, and county.
  • this application also proposes a picture entry method.
  • FIG. 3 is a schematic flowchart of a first embodiment of a picture entry method according to the present application.
  • the execution order of the steps in the flowchart shown in FIG. 5 may be changed, and some steps may be omitted.
  • Step S110 Accept a picture capture request, and start a picture capture task.
  • the capture task includes a capture main process; the capture main process analyzes the mapping relationship between the capture request and the preset picture capture rules and, according to the mapping relationship, starts a plurality of capture sub-processes for asynchronous picture capture, where each capture sub-process corresponds to a picture capture model established based on the preset picture capture rules.
  • the crawl request is input by a user.
  • the user can choose different ways to crawl the pictures on the Internet.
  • the user can specify the URL from which pictures are to be captured, and can also use regular-expression matching to define a range of URLs and capture pictures within the search range limited by the regular expression.
  • the user can also specify page elements to capture, including page elements to capture recursively and the order in which page elements are captured.
  • Step S120: store the captured pictures in a first data set, obtain picture attribute information of the pictures in the first data set, preliminarily classify the pictures according to the picture attribute information, and preliminarily label the pictures using the picture attribute information as tag information.
  • picture classification is one of the main methods of picture labeling, and since a picture can usually be labeled with multiple category tags, picture classification based on pictures is a multi-tag picture classification problem.
  • picture classification can also be used for automatic archiving of pictures, to achieve intra-class retrieval, and improve query efficiency.
  • Step S130: select a picture in the first data set, select a plurality of other pictures in the first data set that are close to the picture in picture features, fit the picture features of the picture with the picture features of the plurality of other pictures to obtain a plurality of fitting coefficients of the picture, construct the tags of the picture from the tags of the plurality of other pictures according to the fitting coefficients, and label the picture again with the constructed tags.
  • obtaining a plurality of fitting coefficients of the picture includes steps:
  • the plurality of fitting coefficients of the picture are obtained by minimizing an error in fitting a given picture by a plurality of other pictures that are close to the given picture in picture characteristics.
  • each coefficient of the fitting coefficient vector W is normalized, that is, the value of each element of W is divided by the sum of all its elements.
  • step S140: the classified and labeled pictures are stored in a distributed manner according to the classification result.
  • the method further includes the following steps:
  • the picture attribute information includes: time, place, picture name, etc.
  • the time and place where the picture was generated are classified, and the picture can be classified according to time and place.
  • for time, pictures can be classified in three ways: by year, by month, and by date; for place, pictures can be sorted by country, province, city, district, and county.
  • step S110 of the picture entry method the step of specifying the preset picture capture rule includes:
  • Step S210 crawl according to the specified URL.
  • the user may specify a URL for image crawling, and crawl existing images on a webpage corresponding to the specified URL.
  • Step S220 use regular matching for range grabbing.
  • step S230 page elements are designated for grabbing.
  • a page element is designated for crawling.
  • a web page is composed of web page elements.
  • web page elements include navigation, website logos, advertising bars, pictures, text, animations, decorations, hyperlinks, and so on; these elements together make up a complete web page, and web pages have become an indispensable part of the Internet.
  • the method used in step S130 of the picture entry method for selecting a plurality of other similar pictures includes the following steps:
  • Step S310 extracting features of each picture in the first data set.
  • the methods of the prior art can be used for the selection and calculation of picture features. For example, color histogram features, textures, or shape features can be selected.
  • Step S320 Calculate the distance between features of the current picture and the remaining pictures.
  • the method for selecting and calculating the distance of a picture feature may adopt a method in the prior art, for example, Euclidean distance may be selected.
  • step S330 a preset number of pictures with the smallest distance is selected as the preset number of nearest neighbor pictures of a given picture.
  • a preset number of pictures with the smallest distance is selected as the preset number of nearest neighbor pictures of a given picture, and the purpose of selecting the picture with the smallest distance is to select the picture with the highest similarity.
  • the picture entry method, server, and computer-readable storage medium proposed in this application first accept a picture capture request and start a picture capture task.
  • the capture task includes a capture main process; the capture main process analyzes the mapping relationship between the capture request and the preset picture capture rules and, according to the mapping relationship, starts a plurality of capture sub-processes for asynchronous picture capture, where each capture sub-process corresponds to a picture capture model established based on the preset picture capture rules.
  • second, the captured pictures are stored in a first data set, the picture attribute information of the pictures in the first data set is obtained, the pictures are preliminarily classified according to the picture attribute information, and the picture attribute information is used as tag information to preliminarily label the pictures. Third, a picture in the first data set is selected, a plurality of other pictures in the first data set that are close to it in picture features are selected, the picture features of the picture are fitted with the picture features of the other pictures to obtain a plurality of fitting coefficients, the tags of the picture are constructed from the tags of the other pictures according to the fitting coefficients, and the picture is labeled again with the constructed tags. Finally, the classified and labeled pictures are stored in a distributed manner according to the classification result.
  • the picture entry method, server, and computer-readable storage medium proposed in this application can quickly obtain pictures from the network and classify and label them efficiently and quickly, which greatly reduces the manpower and material resources required and saves cost; compared with the prior art, the approach is more convenient, faster, and more accurate.
  • the methods in the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • based on such an understanding, the part of the technical solution of this application that is essential or that contributes to the prior art can be embodied in the form of a software product that is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

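The present application discloses a picture entry method. The method includes: accepting a picture capture request and starting a picture capture task to capture pictures asynchronously; storing the captured pictures in a first data set; obtaining picture attribute information and picture features; preliminarily classifying the pictures, and preliminarily labeling the pictures using the picture attribute information as tag information; selecting a first picture in the first data set and selecting a plurality of other pictures that are close to it in picture features; obtaining a plurality of fitting coefficients of the first picture; constructing a tag for the first picture from the tags of the other pictures according to the plurality of fitting coefficients of the first picture; and labeling the first picture again with that tag. The present application also provides a server and a computer-readable storage medium. The picture entry method, server, and computer-readable storage medium provided by the present application can classify and label the obtained pictures efficiently and quickly.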

Description

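Picture entry method, server, and computer storage medium
This application claims priority to Chinese patent application No. 201810525540.X, filed with the Chinese Patent Office on May 28, 2018 and entitled "Picture entry method, server, and computer storage medium", the entire contents of which are incorporated herein by reference.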
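Technical Field
The present application relates to the field of picture recognition technology, and in particular to a picture entry method, a server, and a computer storage medium.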
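Background
The basic pictures used for general picture recognition are scarce in source. For example, they are entered by each user organization into its own data platform, and the entered information is limited in variety. In addition, a large amount of manual classification and labeling of the basic pictures is required before recognition. In the vast majority of projects, 70% of the time is spent on data collection and labeling, which wastes a great deal of time and manpower. Manual labeling and classification are also prone to operational errors and are inefficient.
How to quickly obtain a large number of pictures and classify and label them efficiently has therefore become a problem in urgent need of a solution.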
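Summary of the Invention
In view of this, the present application proposes a picture entry method, a server, and a computer storage medium to solve the problem of how to quickly obtain a large number of pictures and classify and label them efficiently.
First, to achieve the above objective, the present application proposes a picture entry method, which includes the steps of:
accepting a picture capture request and starting a picture capture task, where the capture task includes a capture main process, the capture main process analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture sub-processes for asynchronous picture capture, and each capture sub-process corresponds to a picture capture model established based on the preset picture capture rules;
storing the captured pictures in a first data set, obtaining the picture attribute information and picture features of the pictures in the first data set, preliminarily classifying the pictures according to the picture attribute information, and preliminarily labeling the pictures using the picture attribute information as tag information;
selecting a first picture in the first data set, selecting a plurality of other pictures in the first data set that are close to the first picture in the picture features, and fitting the picture features of the first picture with the picture features of the plurality of other pictures to obtain a plurality of fitting coefficients of the first picture;
constructing the tags of the first picture from the tags of the plurality of other pictures according to the plurality of fitting coefficients of the first picture, and labeling the first picture again with the constructed tags; and
storing the classified and twice-labeled pictures in a distributed manner according to the classification result.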
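Preferably, the preset picture capture rules include:
a first capture rule, which is to capture according to a specified URL, a first capture model being established based on the first capture rule;
a second capture rule, which is to use regular-expression matching to capture over a range, a second capture model being established based on the second capture rule; and
a third capture rule, which is to capture specified page elements, a third capture model being established based on the third capture rule.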
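The three capture rules can be illustrated with a minimal Python sketch. It assumes the page HTML has already been downloaded, and every name in it (`ImageCollector`, `select_rule`, `capture_image_urls`, and the request keys `url`, `pattern`, `element_id`) is hypothetical rather than taken from the application; identifying a page element by its `id` attribute is likewise only one possible reading of "specified page element".

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link", "area",
             "base", "col", "embed", "source", "track", "wbr"}

class ImageCollector(HTMLParser):
    """Collect <img src> values, optionally only inside the element with a given id."""

    def __init__(self, element_id=None):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # > 0 while the parser is inside the target element
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inside = self.element_id is None or self.depth > 0
        if tag == "img" and inside and attrs.get("src"):
            self.sources.append(attrs["src"])
        if tag in VOID_TAGS:
            return            # void elements never have a closing tag
        if self.depth > 0:
            self.depth += 1
        elif self.element_id is not None and attrs.get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

def select_rule(request):
    """Map a capture request to one of the three preset rules (assumed request keys)."""
    if "element_id" in request:
        return "element"      # third rule: specified page element
    if "pattern" in request:
        return "regex"        # second rule: regular-expression range
    return "url"              # first rule: specified URL

def capture_image_urls(html, request):
    """Return (rule name, absolute image URLs) for an already-downloaded page."""
    rule = select_rule(request)
    collector = ImageCollector(request.get("element_id") if rule == "element" else None)
    collector.feed(html)
    urls = [urljoin(request["url"], src) for src in collector.sources]
    if rule == "regex":
        urls = [u for u in urls if re.search(request["pattern"], u)]
    return rule, urls
```

In this sketch the rule is chosen from the keys present in the request, which is only one way a main process could analyze the mapping between a request and the rules.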
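Preferably, the picture capture process further includes a simulated manual access step to cope with the anti-crawling restrictions of the target website, and the simulated manual access step specifically includes:
finding the hidden information for logging in to the target website and saving its content first, where the hidden information is information required to log in to the target website;
submitting the hidden information to simulate logging in to the website; and
after the simulated login succeeds, obtaining the post-login information and capturing the pictures of the target website according to the preset picture capture rules.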
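One plausible reading of "hidden information" is the set of hidden form fields (for example a session or anti-forgery token) that a login page expects to be posted back. The sketch below covers only the first step, collecting those fields from login-page HTML that has already been downloaded and merging them with credentials the crawler is assumed to hold. The names `HiddenFieldParser` and `build_login_payload`, and the field names `username` and `password`, are hypothetical; submitting the payload is left out.

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

def build_login_payload(login_page_html, username, password):
    """Save the hidden fields first, then add the credentials to be submitted."""
    parser = HiddenFieldParser()
    parser.feed(login_page_html)
    payload = dict(parser.fields)
    payload["username"] = username   # assumed field name
    payload["password"] = password   # assumed field name
    return payload
```

The resulting dictionary is what a later step would submit to the site's login endpoint before capture begins.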
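Preferably, the main process is further configured to monitor the number of picture capture tasks in the several capture sub-processes. When a new picture capture task arrives, the main process distributes the new task to a capture sub-process whose number of picture capture tasks is less than a preset value; when the number of picture capture tasks of every capture sub-process is greater than the preset value, the main process creates a new sub-process and distributes the new task to it.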
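The dispatch policy can be sketched as follows. This is a simplified model under stated assumptions, not the application's implementation: each sub-process is represented by a plain list of pending tasks instead of a real operating-system process, the class name `CaptureDispatcher` is hypothetical, and a sub-process that has exactly reached the preset value is treated as full.

```python
class CaptureDispatcher:
    """Model of a main process that balances capture tasks across sub-processes."""

    def __init__(self, max_tasks_per_worker):
        self.max_tasks = max_tasks_per_worker   # the "preset value"
        self.workers = []                        # one task list per modeled sub-process

    def dispatch(self, task):
        """Assign the task and return the index of the sub-process that received it."""
        for index, queue in enumerate(self.workers):
            if len(queue) < self.max_tasks:      # monitored task count is below the preset value
                queue.append(task)
                return index
        # Every existing sub-process is full: create a new one for the task.
        self.workers.append([task])
        return len(self.workers) - 1
```

With a preset value of 2, five successive tasks would be spread over three modeled sub-processes, the third being created only when the first two are full.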
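Preferably, the method for selecting the plurality of similar other pictures is:
extracting the picture features of each picture in the first data set;
calculating the distance between the features of the current picture and those of the remaining pictures; and
selecting the preset number of pictures with the smallest distances as the preset number of nearest-neighbor pictures of the given picture;
where the current picture is a picture selected randomly or sequentially.
Preferably, the feature is a color histogram feature, a texture feature, or a shape feature, and the distance is a Euclidean distance.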
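A minimal sketch of the neighbor selection, assuming each picture has already been reduced to a numeric feature vector (for instance a color histogram). The function names are hypothetical and feature extraction itself is outside the sketch.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(features, current, k):
    """Return the indices of the k pictures closest to features[current]."""
    distances = [
        (euclidean(features[current], other), index)
        for index, other in enumerate(features)
        if index != current                      # compare only with the remaining pictures
    ]
    distances.sort()                             # smallest distance first
    return [index for _, index in distances[:k]]
```

Here `k` plays the role of the preset number, and `current` is the index of the randomly or sequentially selected picture.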
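Preferably, obtaining the plurality of fitting coefficients of the picture includes the steps of:
computing a correlation matrix $C$ of size $k \times k$, whose element in row $m$ and column $n$ is $C_{mn} = (x_i - x_{im}) \cdot (x_i - x_{in})$, with $m, n = 1, \dots, k$;
solving the linear system $C \cdot W = 1$ to obtain the fitting coefficient vector $W$; and
normalizing each coefficient of the fitting coefficient vector $W$;
where $x_i$ is the feature corresponding to the current image, $\{x_{i1}, \dots, x_{ik}\}$ are the features of its $k$ nearest-neighbor images, and the fitting coefficient vector is $W = \{w_1, \dots, w_k\}$.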
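The three steps above can be sketched in plain Python. Two points are assumptions rather than statements from the text: the right-hand side "1" of the linear system is read as the all-ones vector, and the optional `reg` parameter, which adds a small multiple of the trace to the diagonal so that a singular $C$ can still be solved, is an addition of this sketch. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[r]] for r, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; consider a positive reg value")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                          # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fitting_coefficients(x, neighbors, reg=0.0):
    """Compute normalized fitting coefficients of x from its k nearest neighbors."""
    k = len(neighbors)
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    # C[m][n] = (x - x_m) . (x - x_n)
    c = [[sum(p * q for p, q in zip(diffs[m], diffs[n])) for n in range(k)]
         for m in range(k)]
    if reg > 0:                                             # assumption: not in the source text
        trace = sum(c[m][m] for m in range(k))
        for m in range(k):
            c[m][m] += reg * trace
    w = solve_linear_system(c, [1.0] * k)                   # assumption: "1" is the all-ones vector
    total = sum(w)
    return [value / total for value in w]                   # divide each element by the sum
```

For a picture at the origin with neighbors at (1, 0) and (0, 2), the sketch yields coefficients of about 0.8 and 0.2, so the nearer neighbor contributes more.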
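Preferably, in order to obtain the tags of all pictures in the first data set, the method further includes the steps of:
selecting a picture in the first data set randomly or sequentially;
fitting the tags of the selected picture from the tags of the plurality of other pictures corresponding to the selected picture, using the corresponding fitting coefficients; and
repeating the above steps until tags have been constructed for every picture in the first data set.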
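The text does not spell out how tags are fitted from the coefficients. One simple interpretation, offered here purely as an assumption, is a weighted vote: each neighbor contributes its coefficient to every tag it carries, and tags whose total score reaches a threshold are assigned. The function name `construct_tags` and the `threshold` parameter are hypothetical.

```python
def construct_tags(neighbor_tags, coefficients, threshold=0.5):
    """Build tags for a picture from its neighbors' tags, weighted by fitting coefficients."""
    scores = {}
    for tags, weight in zip(neighbor_tags, coefficients):
        for tag in tags:
            scores[tag] = scores.get(tag, 0.0) + weight
    # Keep tags whose accumulated weight reaches the (assumed) threshold.
    return sorted(tag for tag, score in scores.items() if score >= threshold)
```

Calling this once for each picture, selected randomly or in order, would give every picture in the data set a constructed tag set.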
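In addition, to achieve the above objective, the present application further provides a server, including a memory, a processor, and a picture entry system stored on the memory and operable on the processor, where the picture entry system, when executed by the processor, implements the steps of the picture entry method described above.
Further, to achieve the above objective, the present application also provides a computer-readable storage medium storing a picture entry system, where the picture entry system can be executed by at least one processor to cause the at least one processor to execute the steps of the picture entry method described above.
Compared with the prior art, the picture entry method, server, and computer-readable storage medium proposed in the present application first accept a picture capture request and start a picture capture task, where the capture task includes a capture main process, the capture main process analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture sub-processes for asynchronous picture capture, and each capture sub-process corresponds to a picture capture model established based on the preset picture capture rules. Second, the captured pictures are stored in a first data set, the picture attribute information of the pictures in the first data set is obtained, the pictures are preliminarily classified according to the picture attribute information, and the picture attribute information is used as tag information to preliminarily label the pictures. Third, a picture in the first data set is selected, a plurality of other pictures in the first data set that are close to it in picture features are selected, the picture features of the picture are fitted with the picture features of the other pictures to obtain a plurality of fitting coefficients of the picture, the tags of the picture are constructed from the tags of the other pictures according to the fitting coefficients, and the picture is labeled again with the constructed tags. Finally, the classified and labeled pictures are stored in a distributed manner according to the classification result. The picture entry method, server, and computer-readable storage medium proposed in the present application can quickly obtain pictures from the network and classify and label them efficiently and quickly, which greatly reduces the manpower and material resources required and saves cost; compared with the prior art, the approach is more convenient, faster, and more accurate.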
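Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional hardware architecture of the server of the present application;
FIG. 2 is a schematic diagram of the program modules of a first embodiment of the picture entry system of the present application;
FIG. 3 is a schematic flowchart of a first embodiment of the picture entry method of the present application;
FIG. 4 is a schematic flowchart of a second embodiment of the picture entry method of the present application;
FIG. 5 is a schematic flowchart of a third embodiment of the picture entry method of the present application.
The realization of the objectives, the functional features, and the advantages of the present application are further described with reference to the embodiments and the accompanying drawings.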
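Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined with "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that a person of ordinary skill in the art can implement the combination; when a combination of technical solutions is contradictory or cannot be implemented, the combination shall be deemed not to exist and is not within the scope of protection claimed by the present application.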
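Refer to FIG. 1, which is a schematic diagram of an optional hardware architecture of the server 1 of this application.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that are communicatively connected to one another through a system bus. It should be pointed out that FIG. 1 only shows the server 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The server 1 may be a computing device such as a rack server, a blade server, a tower server or a cabinet server, and may be a standalone server or a server cluster composed of multiple servers.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as the hard disk or main memory of the server 1. In other embodiments, the memory 11 may be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card equipped on the server 1. The memory 11 may of course include both the internal storage unit of the server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the server 1, such as the program code of the picture entry system 2. The memory 11 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the server 1. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the picture entry system 2.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 1 and other electronic devices.
The hardware structure and functions of the devices involved in this application have now been described in detail. The embodiments of this application are presented below on that basis.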
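First, this application proposes a picture entry system 2.
Refer to FIG. 2, which is a program module diagram of the first embodiment of the picture entry system 2 of this application.
In this embodiment, the picture entry system 2 includes a series of computer program instructions stored on the memory 11; when these instructions are executed by the processor 12, the picture entry operations of the embodiments of this application can be implemented. In some embodiments, the picture entry system 2 may be divided into one or more modules based on the specific operations implemented by each part of the computer program instructions. For example, in FIG. 2, the picture entry system 2 may be divided into a picture capture module 21, a first annotation module 22, a second annotation module 23 and a storage module 24, wherein:
The picture capture module 21 is configured to accept a picture capture request and start a picture capture task, the capture task including a capture main process that analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture child processes to capture pictures asynchronously, wherein the capture child processes correspond to picture capture models established based on the preset picture capture rules.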
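Specifically, the capture request is input by a user. Depending on the need, the user can choose different ways to capture pictures on the Internet. For example, the user can specify the web address from which pictures are to be captured, and the pictures present on the web page corresponding to that address are captured. The user can also use regular-expression matching to search a range of web addresses, in which case pictures are captured within the search range defined by the regular expression. A regular expression (often abbreviated in code as regex, regexp or RE) is a concept in computer science and is typically used to search for and replace text that conforms to a certain pattern (rule). It is a logical formula for operating on strings, which consist of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"): predefined specific characters and combinations of them form a "rule string", and this rule string expresses a filtering logic for strings. A regular expression is thus a text pattern that describes one or more strings to be matched when searching text. For example, a regular expression that matches a complete domain name, such as www.baidu.com, may be:
^(?=^.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+$
A regular expression that matches a web address may be:
^(?=^.{3,255}$)(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+(:\d+)*(\/\w+\.\w+)*$
A regular expression that matches an HTTP URL, such as http://www.tetet.com/index.html?q=1&m=test, may be:
^(?=^.{3,255}$)(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+(:\d+)*(\/\w+\.\w+)*([\?&]\w+=\w*)*$
The above regular expressions are written according to the DNS rules. Under these rules, each label in a domain name consists of English letters and digits, contains no more than 63 characters, and is not case-sensitive. No punctuation other than the hyphen (-) may be used in a label. The lowest-level domain name is written leftmost and the highest-level domain name rightmost. A complete domain name composed of multiple labels contains no more than 255 characters in total. The regular expressions are given here merely as examples and are not described further.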
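The patterns quoted above can be tried out directly. The following is a minimal Python sketch, assuming the domain-name and HTTP URL patterns exactly as given in the text; the helper name `filter_urls` is hypothetical and only illustrates how a regular expression could restrict a candidate list to a capture range.

```python
import re

# Domain-name pattern quoted in the text: each label is at most 63
# characters and the whole name is between 3 and 255 characters.
DOMAIN_RE = re.compile(
    r"^(?=^.{3,255}$)[a-zA-Z0-9][-a-zA-Z0-9]{0,62}"
    r"(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+$"
)

# HTTP URL pattern quoted in the text (scheme and "www." are optional).
HTTP_URL_RE = re.compile(
    r"^(?=^.{3,255}$)(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9][-a-zA-Z0-9]{0,62}"
    r"(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+(:\d+)*(\/\w+\.\w+)*([\?&]\w+=\w*)*$"
)


def filter_urls(candidates, pattern=HTTP_URL_RE):
    """Hypothetical helper: keep only the candidates the pattern accepts."""
    return [c for c in candidates if pattern.match(c)]


if __name__ == "__main__":
    print(bool(DOMAIN_RE.match("www.baidu.com")))
    print(filter_urls([
        "http://www.tetet.com/index.html?q=1&m=test",
        "ftp://x.com",
    ]))
```

Note that, as written, the URL pattern only accepts path segments of the form `/name.ext`, so nested paths such as `/images/a.jpg` are rejected; a real capture range would likely need a looser pattern.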
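Specifically, the user can also specify page elements to be captured. Page elements may be specified for recursive capture, or captured in a specified order. A web page is made up of individual page elements, including navigation, site logo, banner advertisements, pictures, text, animations, decorations, hyperlinks and so on; these various elements together make up a complete web page, and web pages in turn are an indispensable part of the Internet.
Specifically, the preset picture capture rules include:
a first capture rule, which is to capture by a specified Uniform Resource Locator (URL), a first capture model being established based on the first capture rule;
a second capture rule, which is to perform range capture using regular-expression matching, a second capture model being established based on the second capture rule; and
a third capture rule, which is to capture specified page elements, a third capture model being established based on the third capture rule.
Specifically, the picture capture models are established in correspondence with the preset picture capture rules. For example, corresponding to the preset picture capture rules (1. capture by a specified URL; 2. range capture using regular-expression matching; 3. capture of specified page elements, where page elements may also be specified for recursive capture or captured in a specified order), a specified-URL picture capture model, a regular-expression-matching picture capture model and a specified-element picture capture model are established respectively.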
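As an illustration of how the three capture rules might map to three capture models, here is a minimal Python sketch. It is not the application's implementation: it works on HTML already held in a string (page fetching is omitted), identifies the "specified page element" by its `id` attribute, and all names (`ImageCollector`, `capture_by_url`, `capture_by_regex`, `capture_by_element`, `CAPTURE_MODELS`) are assumptions made for the example.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag and so must not change nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ImageCollector(HTMLParser):
    """Collect <img src> values, optionally only inside the element
    whose id equals element_id."""

    def __init__(self, element_id=None):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # > 0 while inside the target element
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inside = self.element_id is None or self.depth > 0
        if tag == "img":
            if inside and attrs.get("src"):
                self.images.append(attrs["src"])
            return
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.element_id is not None and attrs.get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def capture_by_url(html):
    """Rule 1: every picture on the page of a specified URL."""
    collector = ImageCollector()
    collector.feed(html)
    return collector.images


def capture_by_regex(urls, pattern):
    """Rule 2: restrict candidate URLs to the range a regex defines."""
    compiled = re.compile(pattern)
    return [u for u in urls if compiled.search(u)]


def capture_by_element(html, element_id):
    """Rule 3: only pictures nested inside a specified page element."""
    collector = ImageCollector(element_id)
    collector.feed(html)
    return collector.images


# Mapping from a rule named in the capture request to its capture model.
CAPTURE_MODELS = {
    "url": capture_by_url,
    "regex": capture_by_regex,
    "element": capture_by_element,
}
```

A main process could look up `CAPTURE_MODELS[rule]` for the rule named in a capture request and hand the resulting function to a child process.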
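Specifically, during picture capture, some websites impose anti-capture restrictions, for example requiring login before a web page can be viewed. In such cases a simulated manual access step can be set up, which may include:
1. Finding the hidden information for logging in to the website and saving its content first. Specifically, open the browser's developer tools, log in manually once, and locate the data segment of the login request; this is the information required for login;
2. Submitting the information to simulate logging in to the website;
3. After the simulated login succeeds, obtaining the post-login information.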
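The first two steps can be sketched as follows. This is a hedged Python example that assumes the hidden login information lives in `<input type="hidden">` fields of the login form; the field names `username` and `password` and the helper `build_login_payload` are hypothetical. Sending the request itself is omitted so that the example runs without a network.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Step 1: find and save the hidden fields of a login form."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_login_payload(login_page_html, username, password):
    """Step 2 (preparation): combine the saved hidden fields with the
    credentials into a form-encoded body ready to be submitted."""
    parser = HiddenFieldParser()
    parser.feed(login_page_html)
    payload = dict(parser.fields)
    # "username" and "password" are assumed field names for this sketch.
    payload.update({"username": username, "password": password})
    return urlencode(payload)
```

In practice the encoded body would be POSTed to the site's login address with a cookie-aware client, and the session established that way would be reused for step 3.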
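Specifically, the main process is further configured to monitor the number of picture capture tasks in each child process. When a new picture capture task arrives, the main process distributes the new task to a child process whose number of picture capture tasks is less than a preset value; when the picture capture tasks of all child processes are greater than the preset value, the main process creates a new child process and distributes the new task to it.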
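The dispatch policy can be sketched in a few lines of Python. The sketch models each child process as a plain task list rather than a real process, and the class name `CaptureDispatcher` is hypothetical. The text does not say what happens when a child's task count equals the preset value; the sketch assumes such a child counts as full.

```python
class CaptureDispatcher:
    """Toy model of the main process's task distribution."""

    def __init__(self, preset_limit):
        self.preset_limit = preset_limit
        self.workers = []   # one task list per (simulated) child process

    def dispatch(self, task):
        """Give the task to the first child below the preset limit,
        creating a new child when none has room. Returns the index of
        the child that received the task."""
        for index, queue in enumerate(self.workers):
            if len(queue) < self.preset_limit:
                queue.append(task)
                return index
        # Assumption: a child holding exactly preset_limit tasks is full.
        self.workers.append([task])
        return len(self.workers) - 1
```

With a preset limit of 2, five successive tasks land in children 0, 0, 1, 1 and 2.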
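The first annotation module 22 is configured to store the captured pictures in a first data set, obtain picture attribute information of the pictures in the first data set, preliminarily classify the pictures according to the picture attribute information, and preliminarily annotate the pictures using the picture attribute information as label information.
Specifically, the picture attribute information includes time, place, picture name and the like. By categorizing the time and place at which pictures were produced, the pictures can be classified by time and by place. For example, pictures can be classified by time in three ways (by year, by month and by date) and by place according to country, province, city, district, county and so on. The picture attribute information is stored in the picture itself and can be read by a picture attribute reading program. Obtaining the picture attribute information includes: 1. loading the picture information; 2. analyzing and filtering the picture information to obtain the picture attribute information; 3. outputting the picture attribute information.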
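A minimal Python sketch of the preliminary classification and annotation follows. It assumes the attribute information has already been read from each picture into a dictionary with the keys `name`, `time` (in `YYYY-MM-DD` form) and `place`; these keys and the function names are illustrative, not part of the application.

```python
from collections import defaultdict
from datetime import datetime


def classify(pictures, key):
    """Group picture names by year, by (year, month) or by place."""
    groups = defaultdict(list)
    for picture in pictures:
        taken = datetime.strptime(picture["time"], "%Y-%m-%d")
        if key == "year":
            group = taken.year
        elif key == "month":
            group = (taken.year, taken.month)
        elif key == "place":
            group = picture["place"]
        else:
            raise ValueError(f"unknown classification key: {key}")
        groups[group].append(picture["name"])
    return dict(groups)


def initial_labels(picture):
    """Use the filtered attribute information as the first-pass labels."""
    taken = datetime.strptime(picture["time"], "%Y-%m-%d")
    return [
        str(taken.year),
        f"{taken.year}-{taken.month:02d}",
        picture["place"],
        picture["name"],
    ]
```

Reading the attributes out of real image files (for example from embedded metadata) would need an image library and is left out of the sketch.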
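Specifically, the obtained picture attribute information can be filtered, and the filtered picture attribute information used as label information for the preliminary annotation; for example, the time, place and picture name of a picture can be chosen to annotate it. Picture classification is one of the main methods of picture annotation, and because a picture can usually be annotated with multiple category labels, classification-based picture annotation is a multi-label picture classification problem. Picture classification can also be used for automatic archiving of pictures, enabling retrieval within a class and improving query efficiency.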
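The second annotation module 23 is configured to select a picture in the first data set, select from the first data set multiple other pictures that are close to the picture in picture features, obtain multiple fitting coefficients of the picture by fitting the picture features of the picture with the picture features of the multiple other pictures, construct a label of the picture from the labels of the multiple other pictures according to the multiple fitting coefficients, and annotate the picture again with the constructed label.
Specifically, pictures are usually associated with some textual description, such as a title, subject terms and comments, indicating the content of the picture, where it was taken, and personal impressions and evaluations. Labels can therefore be added to a picture based on this information, or the subject terms can be used directly as labels.
It should be noted that, of the pictures captured from the network, some carry labels and some do not; the central idea of this method is to label the unlabeled pictures using similar pictures that do carry labels.
The storage module 24 is configured to store the classified and annotated pictures in a distributed manner according to the classification result.
Specifically, storing pictures in a distributed manner by category facilitates picture management and search. For example, the picture attribute information includes time, place, picture name and the like; by categorizing the time and place at which pictures were produced, the pictures can be classified by time (by year, by month and by date) and by place (by country, province, city, district, county and so on).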
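In addition, this application also proposes a picture entry method.
Refer to FIG. 3, which is a schematic flowchart of the first embodiment of the picture entry method of this application. In this embodiment, depending on the requirements, the order of the steps in the flowchart shown in FIG. 3 may be changed, and some steps may be omitted.
Step S110: accept a picture capture request and start a picture capture task, the capture task including a capture main process that analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture child processes to capture pictures asynchronously, the capture child processes corresponding to picture capture models established based on the preset picture capture rules.
Specifically, the capture request is input by a user. Depending on the need, the user can choose different ways to capture pictures on the Internet: the user can specify the web address from which pictures are to be captured; the user can use regular-expression matching to search a range of web addresses, so that pictures are captured within the search range defined by the regular expression; and the user can specify page elements to be captured. Page elements may be specified for recursive capture, or captured in a specified order.
Step S120: store the captured pictures in a first data set, obtain picture attribute information of the pictures in the first data set, preliminarily classify the pictures according to the picture attribute information, and preliminarily annotate the pictures using the picture attribute information as label information.
Specifically, picture classification is one of the main methods of picture annotation, and because a picture can usually be annotated with multiple category labels, classification-based picture annotation is a multi-label picture classification problem. Picture classification can also be used for automatic archiving of pictures, enabling retrieval within a class and improving query efficiency.
Step S130: select a picture in the first data set, select from the first data set multiple other pictures that are close to the picture in picture features, obtain multiple fitting coefficients of the picture by fitting the picture features of the picture with the picture features of the multiple other pictures, construct a label of the picture from the labels of the multiple other pictures according to the multiple fitting coefficients, and annotate the picture again with the constructed label.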
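Specifically, obtaining the multiple fitting coefficients of the picture includes the following:
The multiple fitting coefficients of the picture are obtained by minimizing the error of fitting the given picture with the multiple other pictures that are close to it in picture features.
The steps for obtaining the fitting coefficients are explained below, taking a given image and its $k$ nearest-neighbor images as an example:
Assume the feature of the current image is $x_i$, the features of its $k$ nearest-neighbor images are $\{x_{i1}, \dots, x_{ik}\}$, and the fitting coefficient vector is $W = \{w_1, \dots, w_k\}$.
1. Compute a correlation matrix $C$ of size $k \times k$ whose element in row $m$ and column $n$ is $C_{mn} = (x_i - x_{im}) \cdot (x_i - x_{in})$, $m, n = 1, \dots, k$.
2. Solve the linear system $C W = \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector of length $k$, to obtain the fitting coefficient vector $W$.
3. Normalize the coefficients of the fitting coefficient vector $W$, that is, divide the value of each element of $W$ by the sum of all the elements.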
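The three steps above can be sketched in pure Python as follows. This is an illustrative sketch rather than the application's implementation: the function names are hypothetical, and the small diagonal regularization term `reg` is an assumption added here because $C$ becomes singular whenever the neighbors are linearly dependent (for instance when $k$ exceeds the feature dimension); the text itself does not mention it.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


def fitting_coefficients(x, neighbors, reg=1e-3):
    """Return normalized coefficients that fit x from its neighbors."""
    k = len(neighbors)
    diffs = [[xi - ni for xi, ni in zip(x, nb)] for nb in neighbors]
    # Step 1: C[m][n] = (x - x_m) . (x - x_n)
    C = [[sum(a * b for a, b in zip(diffs[m], diffs[n])) for n in range(k)]
         for m in range(k)]
    # Assumed regularization so that C stays invertible.
    trace = sum(C[m][m] for m in range(k))
    eps = reg * trace if trace > 0 else reg
    for m in range(k):
        C[m][m] += eps
    # Step 2: solve C W = 1 (all-ones right-hand side).
    W = solve_linear_system(C, [1.0] * k)
    # Step 3: normalize so the coefficients sum to 1.
    total = sum(W)
    return [w / total for w in W]
```

For a feature lying midway between two neighbors, such as `x = [0.5, 0.5]` with neighbors `[0, 0]` and `[1, 1]`, the two coefficients come out equal, as symmetry suggests.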
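Specifically, in order to obtain labels for all pictures in the first data set, the method further includes the steps of:
1. Selecting, randomly or sequentially, a picture in the first data set;
2. Fitting the label of the selected picture from the labels of the multiple other pictures corresponding to the selected picture, using the corresponding fitting coefficients;
3. Repeating steps 1 and 2 until a label has been constructed for every picture in the first data set.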
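One possible reading of step 2 is a weighted vote: each neighbor contributes its labels with a weight equal to its fitting coefficient. The Python sketch below follows that reading; the acceptance `threshold` and the function name `construct_labels` are assumptions, since the text does not state how the fitted label scores are turned into final labels.

```python
def construct_labels(neighbor_labels, weights, threshold=0.5):
    """Accumulate each neighbor's labels weighted by its fitting
    coefficient and keep the labels whose score reaches the threshold."""
    scores = {}
    for labels, weight in zip(neighbor_labels, weights):
        for label in labels:
            scores[label] = scores.get(label, 0.0) + weight
    return sorted(label for label, score in scores.items() if score >= threshold)
```

Step 3 then amounts to calling this function once for every picture in the first data set, each time with that picture's own neighbors and coefficients.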
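Step S140: store the classified and annotated pictures in a distributed manner according to the classification result.
Specifically, storing pictures in a distributed manner by category facilitates picture management and search. For example, the picture attribute information includes time, place, picture name and the like; by categorizing the time and place at which pictures were produced, the pictures can be classified by time (by year, by month and by date) and by place (by country, province, city, district, county and so on).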
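FIG. 4 is a schematic flowchart of the second embodiment of the picture entry method of this application. In this embodiment, in step S110 of the picture entry method, the step of specifying the preset picture capture rules includes:
Step S210: capture by a specified URL.
Specifically, the user can specify the web address from which pictures are to be captured, and the pictures present on the web page corresponding to that address are captured.
Step S220: perform range capture using regular-expression matching.
Specifically, regular-expression matching is used to search a range of web addresses, and pictures are captured within the search range defined by the regular expression.
Step S230: capture specified page elements.
Specifically, page elements are specified for capture. Page elements may be specified for recursive capture, or captured in a specified order. A web page is made up of individual page elements, including navigation, site logo, banner advertisements, pictures, text, animations, decorations, hyperlinks and so on; these various elements together make up a complete web page, and web pages in turn are an indispensable part of the Internet.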
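FIG. 5 is a schematic flowchart of the third embodiment of the picture entry method of this application. In this embodiment, the method of selecting the multiple other pictures that are close, in step S130 of the picture entry method, includes the steps of:
Step S310: extract the features of each picture in the first data set.
Specifically, the picture features can be selected and computed using methods from the prior art; for example, color histogram features, texture features or shape features may be used.
Step S320: compute the distance between the features of the current picture and those of the remaining pictures.
Specifically, the distance between picture features can be selected and computed using methods from the prior art; for example, the Euclidean distance may be used.
Step S330: select the preset number of pictures with the smallest distances as the preset number of nearest-neighbor pictures of the given picture.
Specifically, the pictures with the smallest distances are selected because they are the pictures with the greatest similarity to the given picture.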
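Steps S320 and S330 can be sketched in Python as follows, assuming each picture's feature (for example a color histogram) has already been extracted into a list of numbers of equal length. The function name `nearest_neighbors` is illustrative; `math.dist` computes the Euclidean distance and requires Python 3.8 or later.

```python
import math


def nearest_neighbors(features, current, k):
    """Return the indices of the k pictures whose features are closest
    (in Euclidean distance) to the picture at index `current`."""
    distances = []
    for index, feature in enumerate(features):
        if index == current:
            continue  # a picture is not its own neighbor
        distances.append((math.dist(features[current], feature), index))
    distances.sort()
    return [index for _, index in distances[:k]]
```

This brute-force scan costs one distance computation per remaining picture; for a large first data set an index structure would probably be preferable, but that is beyond what the text describes.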
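The picture entry method, server and computer-readable storage medium proposed in this application first accept a picture capture request and start a picture capture task, the capture task including a capture main process that analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture child processes to capture pictures asynchronously, the capture child processes corresponding to picture capture models established based on the preset picture capture rules; second, store the captured pictures in a first data set, obtain picture attribute information of the pictures in the first data set, preliminarily classify the pictures according to the picture attribute information, and preliminarily annotate the pictures using the picture attribute information as label information; third, select a picture in the first data set, select from the first data set multiple other pictures that are close to the picture in picture features, obtain multiple fitting coefficients of the picture by fitting the picture features of the picture with the picture features of the multiple other pictures, construct a label of the picture from the labels of the multiple other pictures according to the multiple fitting coefficients, and annotate the picture again with the constructed label; and finally, store the classified and annotated pictures in a distributed manner according to the classification result. With the picture entry method, server and computer-readable storage medium proposed in this application, pictures on the network can be obtained quickly and then classified and annotated efficiently, which greatly reduces the manpower and material resources required and saves cost; compared with the prior art, the approach is more convenient, faster and more accurate.
The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, a person skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk or optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.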

Claims (20)

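  1. A picture entry method applied to a server, characterized in that the method comprises the steps of:
    accepting a picture capture request and starting a picture capture task, wherein the capture task includes a capture main process, the capture main process analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture child processes to capture pictures asynchronously, the capture child processes corresponding to picture capture models established based on the preset picture capture rules;
    storing the captured pictures in a first data set, obtaining picture attribute information and picture features of the pictures in the first data set, preliminarily classifying the pictures according to the picture attribute information, and preliminarily annotating the pictures using the picture attribute information as label information;
    selecting a first picture in the first data set, selecting from the first data set multiple other pictures that are close to the first picture in the picture features, and obtaining multiple fitting coefficients of the first picture by fitting the picture features of the first picture with the picture features of the multiple other pictures; constructing a label of the first picture from the labels of the multiple other pictures according to the multiple fitting coefficients of the first picture, and annotating the first picture again with the constructed label; and
    storing the pictures, after classification and the two rounds of annotation, in a distributed manner according to the classification result.
  2. The picture entry method according to claim 1, characterized in that the preset picture capture rules include:
    a first capture rule, which is to capture by a specified URL, a first capture model being established based on the first capture rule;
    a second capture rule, which is to perform range capture using regular-expression matching, a second capture model being established based on the second capture rule; and
    a third capture rule, which is to capture specified page elements, a third capture model being established based on the third capture rule.
  3. The picture entry method according to claim 2, characterized in that the picture capture process further includes a simulated manual access step to cope with anti-capture restrictions of a target website, the simulated manual access step specifically including:
    finding the hidden information for logging in to the target website and saving its content first, the hidden information being the information required to log in to the target website;
    submitting the hidden information to simulate logging in to the website; and
    after the simulated login succeeds, obtaining the post-login information and capturing the pictures of the target website according to the preset picture capture rules.
  4. The picture entry method according to claim 3, characterized in that the main process is further configured to monitor the number of picture capture tasks in the several capture child processes; when a new picture capture task arrives, the main process distributes the new task to a child process, among the several capture child processes, whose number of picture capture tasks is less than a preset value; and when the picture capture tasks of all capture child processes are greater than the preset value, the main process creates a new child process and distributes the new task to the newly created child process.
  5. The picture entry method according to claim 1, characterized in that the method of selecting the multiple other pictures that are close is:
    extracting the picture features of each picture in the first data set;
    computing the distance between the features of the current picture and those of the remaining pictures; and
    selecting the preset number of pictures with the smallest distances as the preset number of nearest-neighbor pictures of the given picture;
    wherein the current picture is a picture selected randomly or sequentially.
  6. The picture entry method according to claim 5, characterized in that the feature is a color histogram feature, a texture feature or a shape feature, and the distance is the Euclidean distance.
  7. The picture entry method according to claim 6, characterized in that obtaining the multiple fitting coefficients of the picture includes the steps of:
    computing a correlation matrix $C$ of size $k \times k$ whose element in row $m$ and column $n$ is $C_{mn} = (x_i - x_{im}) \cdot (x_i - x_{in})$, $m, n = 1, \dots, k$;
    solving the linear system $C W = \mathbf{1}$ to obtain the fitting coefficient vector $W$; and
    normalizing the coefficients of the fitting coefficient vector $W$;
    wherein $k$ is the number of nearest-neighbor pictures, the feature of the current image is $x_i$, the features of its $k$ nearest-neighbor images are $\{x_{i1}, \dots, x_{ik}\}$, and the fitting coefficient vector is $W = \{w_1, \dots, w_k\}$.
  8. The picture entry method according to claim 7, characterized in that, in order to obtain labels for all pictures in the first data set, the method further includes the steps of:
    selecting, randomly or sequentially, a picture in the first data set;
    fitting the label of the selected picture from the labels of the multiple other pictures corresponding to the selected picture, using the corresponding fitting coefficients; and
    repeating the above steps until a label has been constructed for every picture in the first data set.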
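  9. A server, characterized in that the server includes a memory, a processor and a picture entry system stored on the memory and executable on the processor, the picture entry system implementing the following steps when executed by the processor:
    accepting a picture capture request and starting a picture capture task, wherein the capture task includes a capture main process, the capture main process analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture child processes to capture pictures asynchronously, the capture child processes corresponding to picture capture models established based on the preset picture capture rules;
    storing the captured pictures in a first data set, obtaining picture attribute information and picture features of the pictures in the first data set, preliminarily classifying the pictures according to the picture attribute information, and preliminarily annotating the pictures using the picture attribute information as label information;
    selecting a first picture in the first data set, selecting from the first data set multiple other pictures that are close to the first picture in the picture features, and obtaining multiple fitting coefficients of the first picture by fitting the picture features of the first picture with the picture features of the multiple other pictures; constructing a label of the first picture from the labels of the multiple other pictures according to the multiple fitting coefficients of the first picture, and annotating the first picture again with the constructed label; and
    storing the pictures, after classification and the two rounds of annotation, in a distributed manner according to the classification result.
  10. The server according to claim 9, characterized in that the preset picture capture rules include:
    a first capture rule, which is to capture by a specified URL, a first capture model being established based on the first capture rule;
    a second capture rule, which is to perform range capture using regular-expression matching, a second capture model being established based on the second capture rule; and
    a third capture rule, which is to capture specified page elements, a third capture model being established based on the third capture rule.
  11. The server according to claim 10, characterized in that the picture capture process further includes a simulated manual access step to cope with anti-capture restrictions of a target website, the simulated manual access step specifically including:
    finding the hidden information for logging in to the target website and saving its content first, the hidden information being the information required to log in to the target website;
    submitting the hidden information to simulate logging in to the website; and
    after the simulated login succeeds, obtaining the post-login information and capturing the pictures of the target website according to the preset picture capture rules.
  12. The server according to claim 11, characterized in that the main process is further configured to monitor the number of picture capture tasks in the several capture child processes; when a new picture capture task arrives, the main process distributes the new task to a child process, among the several capture child processes, whose number of picture capture tasks is less than a preset value; and when the picture capture tasks of all capture child processes are greater than the preset value, the main process creates a new child process and distributes the new task to the newly created child process.
  13. The server according to claim 9, characterized in that the method of selecting the multiple other pictures that are close is:
    extracting the picture features of each picture in the first data set;
    computing the distance between the features of the current picture and those of the remaining pictures; and
    selecting the preset number of pictures with the smallest distances as the preset number of nearest-neighbor pictures of the given picture;
    wherein the current picture is a picture selected randomly or sequentially.
  14. The server according to claim 13, characterized in that the feature is a color histogram feature, a texture feature or a shape feature, and the distance is the Euclidean distance.
  15. The server according to claim 14, characterized in that obtaining the multiple fitting coefficients of the picture includes the steps of:
    computing a correlation matrix $C$ of size $k \times k$ whose element in row $m$ and column $n$ is $C_{mn} = (x_i - x_{im}) \cdot (x_i - x_{in})$, $m, n = 1, \dots, k$;
    solving the linear system $C W = \mathbf{1}$ to obtain the fitting coefficient vector $W$; and
    normalizing the coefficients of the fitting coefficient vector $W$;
    wherein $k$ is the number of nearest-neighbor pictures, the feature of the current image is $x_i$, the features of its $k$ nearest-neighbor images are $\{x_{i1}, \dots, x_{ik}\}$, and the fitting coefficient vector is $W = \{w_1, \dots, w_k\}$.
  16. The server according to claim 15, characterized in that, in order to obtain labels for all pictures in the first data set, the steps further include:
    selecting, randomly or sequentially, a picture in the first data set;
    fitting the label of the selected picture from the labels of the multiple other pictures corresponding to the selected picture, using the corresponding fitting coefficients; and
    repeating the above steps until a label has been constructed for every picture in the first data set.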
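  17. A computer-readable storage medium storing a picture entry system, the picture entry system being executable by at least one processor to cause the at least one processor to perform the following steps:
    accepting a picture capture request and starting a picture capture task, wherein the capture task includes a capture main process, the capture main process analyzes the mapping relationship between the capture request and preset picture capture rules and, according to the mapping relationship, starts several capture child processes to capture pictures asynchronously, the capture child processes corresponding to picture capture models established based on the preset picture capture rules;
    storing the captured pictures in a first data set, obtaining picture attribute information and picture features of the pictures in the first data set, preliminarily classifying the pictures according to the picture attribute information, and preliminarily annotating the pictures using the picture attribute information as label information;
    selecting a first picture in the first data set, selecting from the first data set multiple other pictures that are close to the first picture in the picture features, and obtaining multiple fitting coefficients of the first picture by fitting the picture features of the first picture with the picture features of the multiple other pictures; constructing a label of the first picture from the labels of the multiple other pictures according to the multiple fitting coefficients of the first picture, and annotating the first picture again with the constructed label; and
    storing the pictures, after classification and the two rounds of annotation, in a distributed manner according to the classification result.
  18. The computer-readable storage medium according to claim 17, characterized in that the preset picture capture rules include:
    a first capture rule, which is to capture by a specified URL, a first capture model being established based on the first capture rule;
    a second capture rule, which is to perform range capture using regular-expression matching, a second capture model being established based on the second capture rule; and
    a third capture rule, which is to capture specified page elements, a third capture model being established based on the third capture rule.
  19. The computer-readable storage medium according to claim 18, characterized in that the picture capture process further includes a simulated manual access step to cope with anti-capture restrictions of a target website, the simulated manual access step specifically including:
    finding the hidden information for logging in to the target website and saving its content first, the hidden information being the information required to log in to the target website;
    submitting the hidden information to simulate logging in to the website; and
    after the simulated login succeeds, obtaining the post-login information and capturing the pictures of the target website according to the preset picture capture rules.
  20. The computer-readable storage medium according to claim 19, characterized in that the main process is further configured to monitor the number of picture capture tasks in the several capture child processes; when a new picture capture task arrives, the main process distributes the new task to a child process, among the several capture child processes, whose number of picture capture tasks is less than a preset value; and when the picture capture tasks of all capture child processes are greater than the preset value, the main process creates a new child process and distributes the new task to the newly created child process.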
PCT/CN2018/102077 2018-05-28 2018-08-24 Picture entry method, server and computer storage medium WO2019227705A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810525540.XA CN108921193B (zh) 2018-05-28 2018-05-28 Picture entry method, server and computer storage medium
CN201810525540.X 2018-05-28

Publications (1)

Publication Number Publication Date
WO2019227705A1 true WO2019227705A1 (zh) 2019-12-05

Family

ID=64419549

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102077 WO2019227705A1 (zh) 2018-05-28 2018-08-24 Picture entry method, server and computer storage medium

Country Status (2)

Country Link
CN (1) CN108921193B (zh)
WO (1) WO2019227705A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178250A (zh) * 2019-12-27 2020-05-19 深圳市越疆科技有限公司 物体识别定位方法、装置及终端设备

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125489B (zh) * 2019-12-25 2023-05-26 北京锐安科技有限公司 一种数据抓取方法、装置、设备及存储介质
CN111144416A (zh) * 2019-12-25 2020-05-12 中国联合网络通信集团有限公司 信息处理方法和装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090204637A1 (en) * 2003-04-08 2009-08-13 The Penn State Research Foundation Real-time computerized annotation of pictures
CN103645939A (zh) * 2013-11-29 2014-03-19 北京奇虎科技有限公司 一种图片抓取的方法和系统
CN106599051A (zh) * 2016-11-15 2017-04-26 北京航空航天大学 一种基于生成图像标注库的图像自动标注的方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138578A (zh) * 2015-07-30 2015-12-09 北京奇虎科技有限公司 目标图片分类存储方法及其终端
CN106528702A (zh) * 2016-10-26 2017-03-22 朱育盼 日记生成方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178250A (zh) * 2019-12-27 2020-05-19 深圳市越疆科技有限公司 物体识别定位方法、装置及终端设备
CN111178250B (zh) * 2019-12-27 2024-01-12 深圳市越疆科技有限公司 物体识别定位方法、装置及终端设备

Also Published As

Publication number Publication date
CN108921193A (zh) 2018-11-30
CN108921193B (zh) 2023-04-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18920320

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18920320

Country of ref document: EP

Kind code of ref document: A1