CN112163139A - Image data processing method and device - Google Patents


Info

Publication number
CN112163139A
Authority
CN
China
Prior art keywords
image
commodity
processing
commodity image
image data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011094010.8A
Other languages
Chinese (zh)
Inventor
陈海波
Other inventors requested that their names not be disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenlan industrial intelligent Innovation Research Institute (Ningbo) Co.,Ltd.
Original Assignee
Deep Blue Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deep Blue Technology Shanghai Co Ltd filed Critical Deep Blue Technology Shanghai Co Ltd
Priority to CN202011094010.8A priority Critical patent/CN112163139A/en
Publication of CN112163139A publication Critical patent/CN112163139A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/954Navigation, e.g. using categorised browsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0641Shopping interfaces
    • G06Q30/0643Graphical representation of items or shoppers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides an image data processing method and device. The method comprises the following steps: using an image crawler tool that simulates a user browsing a webpage to crawl original commodity images from a shopping website; processing the commodity images with a foreground segmentation algorithm; and screening and classifying the processed commodity images. Because the method simulates how a real user browses a webpage, it reduces the risk of triggering anti-crawler measures and allows a large amount of image data of different categories to be acquired quickly.

Description

Image data processing method and device
Technical Field
The present invention relates to the field of computer vision and image processing technologies, and in particular, to an image data processing method, an image data processing apparatus, a computer device, and a non-transitory computer-readable storage medium.
Background
Current web crawlers mainly crawl image data directly, for example by downloading every image on a page. This works for ordinary pages, but on pages with stricter anti-crawler measures, such as the pages of a shopping website, users are restricted from acquiring image data, which degrades the user experience.
Disclosure of Invention
To solve the above technical problem, the invention provides an image data processing method that simulates how a real user browses a webpage, thereby reducing the risk of triggering anti-crawler measures and allowing a large amount of image data of different categories to be acquired quickly.
The technical scheme adopted by the invention is as follows:
A method of processing image data comprises the following steps: using an image crawler tool that simulates a user browsing a webpage to crawl an original commodity image from a shopping website; processing the commodity image with a foreground segmentation algorithm; and screening and classifying the processed commodity images.
According to one embodiment of the invention, crawling the original commodity image from the shopping website with the image crawler tool comprises: locating the search bar of the shopping website and entering a preset keyword in the search bar; simulating a mouse pulling down the scroll bar, delaying for a first preset time, and acquiring the commodity links in the page in sequence; and acquiring the original commodity image according to the commodity links.
According to an embodiment of the present invention, the method for processing image data further includes: performing a deduplication operation on the commodity link; downloading the commodity image in the link according to the commodity link; and screening and de-duplicating the commodity image to obtain the original commodity image.
According to an embodiment of the present invention, screening and de-duplicating the commodity image to obtain the original commodity image includes: deleting the commodity image when its pixel size is smaller than a preset pixel value; and processing the commodity images that remain after deletion with an image hash algorithm to delete repeated commodity images, thereby obtaining the original commodity image.
According to an embodiment of the present invention, before locating the search bar of the shopping website, the method further comprises: locating a login page of the shopping website, and delaying for a second preset time to complete login.
According to an embodiment of the present invention, processing the commodity image by using a foreground segmentation algorithm includes: extracting a target area of the commodity image; processing the target area by adopting a semantic segmentation algorithm; and replacing the background image of the processed target area to obtain a processed commodity image.
According to one embodiment of the invention, the screening and classifying the processed commodity image comprises the following steps: when the processed commodity image meets a preset condition, classifying the commodity image into a qualified image, and labeling the commodity image; and when the processed commodity image does not meet the preset condition, deleting the commodity image, wherein the preset condition comprises the following steps: integrity of the commodity image after the foreground segmentation processing and comprehensiveness of the commodity image information.
The invention also provides an image data processing device, comprising: the crawling module is used for crawling an original commodity image from a shopping website by adopting an image crawler tool to simulate the state of a user browsing a webpage; the processing module is used for processing the commodity image by adopting a foreground segmentation algorithm; and the screening and classifying module is used for screening and classifying the processed commodity images.
The invention also provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor executes the program, the processing method of the image data is realized.
The invention also proposes a non-transitory computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the image data processing method described above.
The invention has the beneficial effects that:
By using an image crawler tool that simulates a user browsing a webpage, the invention reduces the risk of triggering anti-crawler measures and allows a large amount of image data of different categories to be acquired quickly.
Drawings
FIG. 1 is a flow chart of a method for processing image data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a commodity image before foreground segmentation processing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a commodity image during foreground segmentation processing according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a commodity image after foreground segmentation processing according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a labeling software interface according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an interface of a markup document according to an embodiment of the present invention;
FIG. 7 is a block diagram of an apparatus for processing image data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for processing image data according to an embodiment of the present invention.
As shown in fig. 1, the method for processing image data according to an embodiment of the present invention may include the following steps:
S1: Use an image crawler tool that simulates a user browsing a webpage to crawl original commodity images from a shopping website.
In the invention, a crawler script is developed, for example, based on the selenium and urllib packages of Python. The selenium package simulates a real person browsing a webpage, avoiding anti-crawler measures as far as possible, and the urllib package obtains the commodity images from the commodity links.
How to acquire the commodity image is described in detail below.
According to one embodiment of the invention, crawling the original commodity image from the shopping website with the image crawler tool comprises: locating the search bar of the shopping website and entering a preset keyword in the search bar; simulating a mouse pulling down the scroll bar, delaying for a first preset time, and acquiring the commodity links in the page in sequence; and acquiring the original commodity image according to the commodity links. The first preset time may be calibrated according to actual conditions, for example from the observed speed at which users browse commodities while shopping.
Before locating the search bar of the shopping website, the method further comprises: locating a login page of the shopping website, and delaying for a second preset time to complete login. The second preset time may be calibrated according to actual conditions, for example to simulate the time a user needs to enter a password or to log in automatically.
Specifically, to simulate the state of a user, the browser is opened automatically and a shopping website, such as "east", "party", "treasure", or "cat", is opened. If the website can be browsed only after logging in, the login interface of the shopping website is located and the script waits for a period of time (for example, the second preset time) while manual or automatic login is completed. The webpage source code is then inspected to locate the search bar of the shopping website, and preset keywords such as "evening dress", "suit", or "children's toy" are entered into the search bar. The scroll bar is pulled down at a speed that simulates a mouse, and the script waits for a period of time (for example, the first preset time) so that the commodity images on the current page load completely; each commodity link is then clicked automatically in sequence. After all links on the current page have been clicked, the script continues to pull down the scroll bar, loads the commodity images on the page, and again clicks each commodity link automatically. These steps are repeated until the commodity links for the set number of pages have been clicked. In this way, a link to each commodity image is obtained and saved automatically in a document, so that the urllib package can later be called to download the commodity images from the links.
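The scroll-and-collect loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's script: the function name `collect_product_links`, the `link_selector` argument, and the two-second default delay are introduced here for illustration. The function relies only on the `execute_script` and `find_elements` methods and on elements exposing `get_attribute`, which Selenium WebDriver objects provide, so it can also be exercised with a stand-in driver.

```python
import time

# "css selector" is the string value behind Selenium's By.CSS_SELECTOR strategy.
CSS = "css selector"


def collect_product_links(driver, link_selector, pages, scroll_delay=2.0,
                          sleep=time.sleep):
    """Scroll the results page like a user and gather commodity links in order.

    driver        -- a Selenium WebDriver, or any object with the same methods
    link_selector -- hypothetical CSS selector matching commodity links
    pages         -- number of scroll rounds to perform
    scroll_delay  -- the "first preset time", in seconds (assumed value)
    """
    links = []
    for _ in range(pages):
        # Pull the scroll bar down to the bottom of the current page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait so that lazily loaded images and links can finish loading.
        sleep(scroll_delay)
        for element in driver.find_elements(CSS, link_selector):
            href = element.get_attribute("href")
            if href:
                links.append(href)
    return links
```

With Selenium installed, a browser instance such as `webdriver.Chrome()` could be passed as `driver` once the search keyword has been submitted; the CSS selector depends on the target site and is not given in the source. Repeated links are left in the result because the description removes them in a separate de-duplication step.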
Because duplicate commodity links may exist, and in order to reduce the data loading time, in an embodiment of the present invention the above image data processing method further includes: performing a de-duplication operation on the commodity links; downloading the commodity images in the links according to the commodity links; and screening and de-duplicating the commodity images to obtain the original commodity images.
Further, screening and de-duplicating the commodity images to obtain the original commodity images includes: deleting a commodity image when its pixel size is smaller than a preset pixel value; and processing the commodity images that remain after deletion with an image hash algorithm to delete repeated commodity images, thereby obtaining the original commodity images. The preset pixel value may be calibrated according to actual conditions; for example, the preset pixel value may be an image size.
Specifically, duplicate links are removed from the document, then urllib is called to download the images to local storage, and the downloaded links are recorded. The size of each commodity image is then obtained, images that are too small are deleted, and an image hash algorithm is called to remove duplicate images, yielding the original commodity images and completing the whole crawling process.
It will be appreciated that the role of the image hash algorithm is to calculate the similarity between images. At present, commodity images on the network fall into two cases: first, when the commodity image links are the same, the image contents are necessarily the same; second, the commodity images may be the same while the links differ. Therefore, after the links are de-duplicated, an image hash algorithm is also used to remove images whose links differ but whose content is identical.
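The size check can be illustrated with a small sketch. The source does not say how the image size is read or what the preset pixel value is, so the approach here is an assumption: it reads the width and height from the IHDR chunk of a PNG file using only the standard library, and `min_side=200` is a placeholder threshold. Other formats such as JPEG would need a different parser or an imaging library.

```python
import struct

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) read from PNG bytes, or None if not a PNG."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def is_large_enough(data, min_side=200):
    """Keep an image only if both sides reach the preset pixel value."""
    size = png_size(data)
    return size is not None and min(size) >= min_side
```

Images for which `is_large_enough` returns `False` would be deleted before the hash-based de-duplication step.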
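A minimal sketch of the link de-duplication and download step is given here, assuming the links are already available as a list of strings. The function names, the hash-based file naming, and the injectable `fetch` parameter are illustrative assumptions; only `urllib.request.urlopen` comes from the package the description names.

```python
import hashlib
import os
import urllib.request


def dedupe_links(links):
    """Drop repeated links while keeping first-seen order."""
    return list(dict.fromkeys(links))


def fetch_bytes(url, timeout=10):
    """Download one URL with urllib; used as the default fetcher."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(links, out_dir, fetch=fetch_bytes):
    """Download each unique link and return {link: local_path} as the record."""
    os.makedirs(out_dir, exist_ok=True)
    record = {}
    for link in dedupe_links(links):
        # Hypothetical naming scheme: a hash of the link avoids name clashes.
        name = hashlib.sha1(link.encode("utf-8")).hexdigest() + ".img"
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:
            handle.write(fetch(link))
        record[link] = path
    return record
```

The returned dictionary plays the role of the record of downloaded links; passing a different `fetch` callable makes it possible to add headers, retries, or delays without changing the loop.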
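The source does not name a specific image hash algorithm. As one possible illustration, the sketch below uses an average hash (aHash) over grayscale pixels held in plain nested lists, and treats two images as duplicates when the Hamming distance between their hashes is small. The 8x8 grid and the `max_distance=5` threshold are assumed values, and decoding image files into pixel grids is left out.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit average hash of a 2D list of grayscale values."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the hash grid")
    cells = []
    for row in range(size):
        y0, y1 = row * height // size, (row + 1) * height // size
        for col in range(size):
            x0, x1 = col * width // size, (col + 1) * width // size
            # Average the block of source pixels that maps to this grid cell.
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if the cell is brighter than the overall mean.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def drop_duplicates(images, max_distance=5):
    """Keep the first image of each near-duplicate group; images: {name: pixels}."""
    kept = {}  # name -> hash of every image kept so far
    for name, pixels in images.items():
        image_hash = average_hash(pixels)
        if all(hamming(image_hash, other) > max_distance for other in kept.values()):
            kept[name] = image_hash
    return list(kept)
```

Comparing hashes rather than links is what lets this step catch the second case above, where the same picture is served under different links.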
S2: Process the commodity image with a foreground segmentation algorithm.
According to one embodiment of the invention, the processing of the commodity image by adopting the foreground segmentation algorithm comprises the following steps: extracting a target area of the commodity image; processing the target area by adopting a semantic segmentation algorithm; and replacing the background image of the processed target area to obtain a processed commodity image.
Specifically, taking clothing as an example, clothing is generally shown worn by a model. With the preset keywords "women's dress", "full dress", "chinese dress", and "pink color system", a search returns an image such as that shown in fig. 2. The Mask R-CNN framework is called to obtain a mask of the person, giving fig. 3. A high-precision semantic segmentation algorithm, for example the CascadePSP framework, is then called to obtain a refined person region, and a pure white background is pasted behind the person image, as shown in fig. 4.
It should be noted that a typical commodity image includes a person or another reference object. For example, when searching for a children's toy, the image usually contains both a child and a toy, so the target area comprises the toy and the child. When clothes are searched, the target area comprises the clothes and the model; similarly, when a commodity shelf is searched, the target area comprises the shelf and the articles placed on it.
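The final background-replacement step can be sketched independently of the segmentation models. The sketch assumes a mask has already been produced (for example by the frameworks named above, which are not called here) with values between 0 and 1, and blends the image over a solid white background. The function name and the nested-list image representation are illustrative assumptions.

```python
def replace_background(image, mask, background=(255, 255, 255)):
    """Blend an RGB image over a solid background using a per-pixel mask.

    image -- rows of (r, g, b) tuples
    mask  -- rows of values in [0, 1]; 1 keeps the pixel, 0 shows background
    """
    result = []
    for image_row, mask_row in zip(image, mask):
        out_row = []
        for pixel, alpha in zip(image_row, mask_row):
            # Standard alpha blend, applied channel by channel.
            out_row.append(tuple(
                round(alpha * channel + (1 - alpha) * back)
                for channel, back in zip(pixel, background)
            ))
        result.append(out_row)
    return result
```

A soft mask (fractional values along the outline) gives smoother edges than a hard 0/1 mask, which is one reason to refine the coarse mask before compositing.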
S3: Screen and classify the processed commodity images.
According to one embodiment of the invention, screening and classifying the processed commodity image comprises the following steps: when the processed commodity image meets a preset condition, classifying the commodity image as a qualified image and labeling it; and when the processed commodity image does not meet the preset condition, deleting the commodity image. The preset condition comprises: integrity of the commodity image after the foreground segmentation processing and comprehensiveness of the commodity image information. It should be noted that the comprehensiveness of the commodity image information refers to whether the commodity is shown from the front, that is, whether the target area is a front view. Taking figs. 2 to 4 as an example, it refers to whether the image is a front view of the garment on the model; if the image shows the side or the back, the commodity image is considered not to satisfy the preset condition.
Because foreground-segmented images may still have image quality problems, for example the commodity image is incomplete or is not a front view, such commodity images are classified as unqualified and deleted. If the commodity image is complete and its information is comprehensive, it is classified as qualified and labeled; for example, as shown in fig. 5, the labels are "women's dress", "full dress", "Han dress", and "pink color system".
In an embodiment of the present invention, the commodity images are labeled with software developed on PyQt. For example, the software's functions include: displaying the current image, listing the categories (which can be edited and customized), labeling, paging up and down, deleting an image, displaying the labels of the current image, displaying the path and name of the current image, displaying the total number of images and the serial number of the current image, and recording the serial number of the last image when the software is closed, so that labeling can be resumed next time.
As shown in fig. 6, labeled document 1 includes the category list, the commodity image names, and the image tags (1 indicates that the attribute is present, and 0 indicates that it is absent). The serial number of the image that was current when the software was closed is recorded in document 2, so that labeling can be resumed next time. Therefore, by classifying one or more attributes in each commodity image, the labeling task can be carried out efficiently and conveniently.
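The screening rule can be written down as a short sketch. The source does not say how integrity and front view are determined, so the two boolean fields are assumed to come from a reviewer or from some upstream check; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SegmentedImage:
    name: str
    is_complete: bool    # the foreground was not cut off by segmentation
    is_front_view: bool  # the commodity is shown from the front


def screen(images):
    """Split images into qualified names (to label) and rejected names (to delete)."""
    qualified, rejected = [], []
    for image in images:
        # Both preset conditions must hold for an image to be kept.
        if image.is_complete and image.is_front_view:
            qualified.append(image.name)
        else:
            rejected.append(image.name)
    return qualified, rejected
```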
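The two labeling documents can be sketched as plain files. The source does not specify file formats, so the CSV layout for document 1 and the single-number text file for document 2 are assumptions, as are the function names.

```python
import csv


def save_labels(path, categories, labels):
    """Write document 1: a header with the category list, then one 0/1 row per image.

    labels maps an image name to the set of categories that apply to it.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image"] + list(categories))
        for name, tags in labels.items():
            # 1 means the attribute is present, 0 means it is absent.
            writer.writerow([name] + [1 if c in tags else 0 for c in categories])


def save_progress(path, index):
    """Write document 2: the serial number of the image shown at close."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(str(index))


def load_progress(path):
    """Read document 2; start from image 0 if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return 0
```

Storing one 0/1 column per category lets a single image carry several attributes at once, which matches the multi-attribute labeling the description mentions.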
In summary, the image data processing method of the present invention uses an image crawler tool that simulates a user browsing a webpage, which reduces the risk of triggering anti-crawler measures and allows a large amount of image data of different categories to be acquired quickly. The image classification labeling software performs a classification task with one or more attributes on each image, so that the labeling task can be carried out efficiently and conveniently.
Corresponding to the image data processing method of the above embodiment, the invention also provides an image data processing device.
Fig. 7 is a block diagram of an apparatus for processing image data according to an embodiment of the present invention.
As shown in fig. 7, the image data processing apparatus according to the embodiment of the present invention may include: crawling module 10, processing module 20, and screening classification module 30.
The crawling module 10 is configured to crawl an original commodity image from a shopping website by using an image crawler tool to simulate a state where a user browses a webpage. The processing module 20 is configured to process the commodity image by using a foreground segmentation algorithm. The screening and classifying module 30 is used for screening and classifying the processed commodity images.
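The relationship between the three modules can be sketched as a simple composition. This is only an illustration of the data flow, assuming each module is supplied as a callable; the class name `ImageDataProcessor` and its interface do not come from the source.

```python
class ImageDataProcessor:
    """Chain the three modules: crawl, then process, then screen and classify."""

    def __init__(self, crawl, process, screen):
        self.crawl = crawl      # stands in for crawling module 10
        self.process = process  # stands in for processing module 20
        self.screen = screen    # stands in for screening and classification module 30

    def run(self, keyword):
        originals = self.crawl(keyword)
        processed = [self.process(image) for image in originals]
        return self.screen(processed)
```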
According to an embodiment of the present invention, when crawling the original commodity image from the shopping website, the crawling module 10 is specifically configured to: locate the search bar of the shopping website and enter a preset keyword in the search bar; simulate a mouse pulling down the scroll bar, delay for a first preset time, and acquire the commodity links in the page in sequence; and acquire the original commodity image according to the commodity links.
According to an embodiment of the present invention, the crawling module 10 is further configured to perform a deduplication operation on the commodity link; downloading the commodity image in the link according to the commodity link; and screening and de-duplicating the commodity image to obtain an original commodity image.
According to an embodiment of the present invention, the crawling module 10 is further configured to delete a commodity image when its pixel size is smaller than the preset pixel value, and to process the commodity images that remain after deletion with an image hash algorithm to delete repeated commodity images, thereby obtaining the original commodity images.
According to an embodiment of the present invention, the crawling module 10 is further configured to locate a login page of the shopping website, and delay a second preset time to complete login.
According to an embodiment of the present invention, when processing the commodity image with the foreground segmentation algorithm, the processing module 20 is specifically configured to: extract a target area of the commodity image; process the target area with a semantic segmentation algorithm; and replace the background image of the processed target area to obtain a processed commodity image.
According to an embodiment of the present invention, the screening and classifying module 30 screens and classifies the processed commodity image, and is specifically configured to classify the commodity image into a qualified image and label the commodity image when the processed commodity image meets a preset condition; and when the processed commodity image does not meet the preset conditions, deleting the commodity image, wherein the preset conditions comprise: integrity of the commodity image after the foreground segmentation processing and comprehensiveness of commodity image information.
It should be noted that, for details that are not disclosed in the image data processing apparatus according to the embodiment of the present invention, please refer to details that are disclosed in the image data processing method according to the embodiment of the present invention, and details are not repeated herein.
According to the image data processing device, the crawling module simulates the state of a user browsing a webpage by adopting an image crawler tool to crawl an original commodity image from a shopping website, the processing module processes the commodity image by adopting a foreground segmentation algorithm, and the screening and classifying module screens and classifies the processed commodity image. Therefore, the risk of the anti-crawler can be reduced, a large amount of image data of different types can be rapidly acquired, and efficient and convenient labeling tasks can be performed.
The invention further provides a computer device corresponding to the embodiment.
The computer device of the embodiment of the invention comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and when the processor executes the computer program, the processing method of the image data according to the embodiment of the invention can be realized.
According to the computer equipment provided by the embodiment of the invention, when the processor executes the computer program stored on the memory, firstly, an image crawler tool is adopted to simulate the state of a user browsing a webpage to crawl an original commodity image from a shopping website; processing the commodity image by adopting a foreground segmentation algorithm; and screening and classifying the processed commodity images. Therefore, the state of browsing the webpage by a real user can be simulated, the risk of anti-crawler is reduced, and a large amount of image data of different types can be quickly acquired.
The invention also provides a non-transitory computer readable storage medium corresponding to the above embodiment.
A non-transitory computer-readable storage medium of an embodiment of the present invention has stored thereon a computer program that, when executed by a processor, can implement the method of processing image data according to the above-described embodiment of the present invention.
According to the non-transitory computer-readable storage medium of the embodiment of the present invention, when the processor executes the computer program stored thereon, first, an image crawler tool is used to simulate the state of a user browsing a web page to crawl an original commodity image from a shopping website; processing the commodity image by adopting a foreground segmentation algorithm; and screening and classifying the processed commodity images. Therefore, the state of browsing the webpage by a real user can be simulated, the risk of anti-crawler is reduced, and a large amount of image data of different types can be quickly acquired.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The meaning of "plurality" is two or more unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
Those skilled in the art will understand that all or some of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method for processing image data, comprising the steps of:
crawling an original commodity image from a shopping website by using an image crawler tool to simulate a user browsing a webpage;
processing the commodity image by adopting a foreground segmentation algorithm;
and screening and classifying the processed commodity images.
2. The method for processing image data according to claim 1, wherein crawling the original commodity image from the shopping website by using the image crawler tool to simulate a user browsing a webpage comprises:
locating the search bar position of the shopping website, and entering a preset keyword in the search bar;
pulling down a scroll bar with a simulated mouse, waiting for a first preset time, and sequentially acquiring the commodity links in the page;
and acquiring the original commodity image according to the commodity link.
3. The method for processing image data according to claim 2, further comprising:
performing a deduplication operation on the commodity link;
downloading the commodity image in the link according to the commodity link;
and screening and de-duplicating the commodity image to obtain the original commodity image.
4. The method for processing image data according to claim 3, wherein the step of screening and de-duplicating the commodity image to obtain the original commodity image comprises:
deleting the commodity image when the pixel value of the commodity image is smaller than a preset pixel value;
and processing the commodity images remaining after the deletion by adopting an image hash algorithm to delete repeated commodity images, thereby obtaining the original commodity image.
5. The method for processing image data according to claim 2, further comprising, before locating the search bar position of the shopping website:
locating a login page of the shopping website, and waiting for a second preset time for the login to complete.
6. The method for processing the image data according to claim 1, wherein the processing the commodity image by using a foreground segmentation algorithm comprises:
extracting a target area of the commodity image;
processing the target area by adopting a semantic segmentation algorithm;
and replacing the background image of the processed target area to obtain a processed commodity image.
7. The method for processing image data according to claim 1, wherein the screening and classifying the processed commodity image comprises:
when the processed commodity image meets a preset condition, classifying the commodity image into a qualified image, and labeling the commodity image;
and when the processed commodity image does not meet the preset condition, deleting the commodity image, wherein the preset condition comprises: the integrity of the commodity image after the foreground segmentation processing and the comprehensiveness of the commodity image information.
8. An apparatus for processing image data, comprising:
a crawling module, configured to crawl an original commodity image from a shopping website by using an image crawler tool to simulate a user browsing a webpage;
a processing module, configured to process the commodity image by adopting a foreground segmentation algorithm;
and a screening and classifying module, configured to screen and classify the processed commodity images.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements a method of processing image data according to any one of claims 1-7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method of processing image data according to any one of claims 1 to 7.
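For illustration only, the link de-duplication of claim 3 and the screening and de-duplication of claim 4 can be sketched as follows. The claims do not name a particular image hash algorithm, so this sketch assumes a simple average hash computed by nearest-neighbour sampling. It also interprets the "preset pixel value" as a minimum side length in pixels, which is an assumption. All function names and parameters (`dedupe_links`, `average_hash`, `screen_and_dedupe`, `min_side`, `max_distance`) are hypothetical, and images are represented as plain nested lists of grayscale values so that the example stays self-contained.

```python
from typing import Iterable, List, Sequence, Tuple

# A grayscale image as rows of 0-255 intensities (assumed representation).
Gray = Sequence[Sequence[int]]


def dedupe_links(links: Iterable[str]) -> List[str]:
    """Drop repeated commodity links, keeping the first-seen order."""
    return list(dict.fromkeys(links))


def average_hash(img: Gray, size: int = 8) -> int:
    """Average hash: sample a size x size grid (nearest neighbour),
    then emit one bit per cell that is brighter than the grid mean."""
    h, w = len(img), len(img[0])
    cells = [img[r * h // size][c * w // size]
             for r in range(size) for c in range(size)]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def screen_and_dedupe(images: Iterable[Tuple[str, Gray]],
                      min_side: int = 64,
                      max_distance: int = 0) -> List[str]:
    """Return the names of images that are large enough and are not
    near-duplicates (by hash distance) of an image kept earlier."""
    kept: List[str] = []
    seen: List[int] = []
    for name, img in images:
        h = len(img)
        w = len(img[0]) if h else 0
        if h == 0 or w == 0 or min(h, w) < min_side:
            continue  # too small: deleted in the first screening step
        code = average_hash(img)
        if any(hamming(code, s) <= max_distance for s in seen):
            continue  # duplicate of an image that was already kept
        seen.append(code)
        kept.append(name)
    return kept
```

Raising `max_distance` above zero would also treat visually similar, but not bit-identical, hashes as duplicates. A suitable threshold depends on the data and is not specified in the claims.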

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN202011094010.8A CN112163139A (en) 2020-10-14 2020-10-14 Image data processing method and device

Publications (1)

Publication Number Publication Date
CN112163139A (en) 2021-01-01

Family

ID=73866779

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202011094010.8A Pending CN112163139A (en) 2020-10-14 2020-10-14 Image data processing method and device

Country Status (1)

Country Link
CN (1) CN112163139A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090254553A1 (en) * 2008-02-08 2009-10-08 Corbis Corporation Matching media for managing licenses to content
CN103092936A (en) * 2013-01-08 2013-05-08 华北电力大学(保定) Real-time information acquisition method of dynamic page of Internet of Things
US20140222621A1 (en) * 2011-07-06 2014-08-07 Hirenkumar Nathalal Kanani Method of a web based product crawler for products offering
CN106126697A (en) * 2016-06-30 2016-11-16 广州市皓轩软件科技有限公司 Method for automatically generating detail pages based on web dynamic information capture technology
CN106844522A (en) * 2016-12-29 2017-06-13 北京市天元网络技术股份有限公司 A kind of network data crawling method and device
CN109063784A (en) * 2018-08-23 2018-12-21 深圳码隆科技有限公司 A kind of character costume image data screening technique and its device
CN109977983A (en) * 2018-05-07 2019-07-05 广州逗号智能零售有限公司 Obtain the method and device of training image
CN110647826A (en) * 2019-09-05 2020-01-03 北京百度网讯科技有限公司 Method and device for acquiring commodity training picture, computer equipment and storage medium

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
    Effective date of registration: 20220107
    Address after: 315000 No. 138-1, Zhongshan West Road, Fenghua District, Ningbo City, Zhejiang Province (self declaration)
    Applicant after: Shenlan industrial intelligent Innovation Research Institute (Ningbo) Co.,Ltd.
    Address before: Unit 1001, 369 Weining Road, Changning District, Shanghai, 200336 (9th floor of actual floor)
    Applicant before: DEEPBLUE TECHNOLOGY (SHANGHAI) Co.,Ltd.
RJ01 Rejection of invention patent application after publication
    Application publication date: 20210101