CN102902737B - Method for autonomous collection and screening of network images - Google Patents

Info

Publication number
CN102902737B
CN102902737B CN201210336284.2A CN201210336284A CN102902737B CN 102902737 B CN102902737 B CN 102902737B CN 201210336284 A CN201210336284 A CN 201210336284A CN 102902737 B CN102902737 B CN 102902737B
Authority
CN
China
Prior art keywords
image
network
word
download
relevant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210336284.2A
Other languages
Chinese (zh)
Other versions
CN102902737A (en)
Inventor
薛建儒
王乐
高占宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201210336284.2A priority Critical patent/CN102902737B/en
Publication of CN102902737A publication Critical patent/CN102902737A/en
Application granted granted Critical
Publication of CN102902737B publication Critical patent/CN102902737B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a method for autonomous collection and screening of network images. The method draws on the increasingly abundant mass of image data on the Internet and on the powerful image retrieval capability of search engines to collect and screen network images autonomously. The invention offers an automated way to obtain data sets for image target category databases, which both avoids a large amount of manual labor and eliminates the bias introduced by manually collecting data sets.

Description

A method for autonomous collection and screening of network images
Technical field
The present invention relates to the field of building image databases for computer vision and pattern recognition, and in particular to a method for autonomous collection and screening of network images for building an image target category database.
Background art
An image target category database is a prerequisite for research in computer vision and pattern recognition, so studying methods and systems for building high-quality image target category databases is of great significance to that research. At present, most methods of building image databases share the following problems:
1) Collecting the image data and labeling it require a large amount of manual labor, which severely limits the scale of the database and becomes a bottleneck that is very hard to break when expanding the database.
2) Manually collecting and labeling an image database is not a fully objective process. People of different knowledge and cultural backgrounds collect different image data and label it differently, so the resulting image database is often biased and cannot guarantee an objective evaluation of computer vision and pattern recognition algorithms.
Summary of the invention
The object of the present invention is to provide a method for autonomous collection and screening of network images.
To achieve this object, the present invention adopts the following technical scheme:
1) Image subject extraction
Choose a single image, use a network search engine to retrieve search results for that image, and extract the image subject from the search results;
2) Automatic download of network images and related text information
According to the image subject, download images and the text information related to them from the network into a local database to obtain a data set;
3) Image data screening
Use the image information and the text information related to each image to screen the images in the data set and obtain the target image set.
The image subject extraction comprises: first, retrieving the chosen single image with a search-by-image service on the network to obtain text retrieval information related to the image; second, using prior semantic knowledge and statistical methods to extract the image subject from the text retrieval information.
Extracting the image subject from the text retrieval information comprises the following steps:
Call the WordNet semantic network to filter the text retrieval information, removing prepositions, articles, abstract nouns, verbs, adjectives and adverbs. If only one word remains after filtering, that word is the image subject. If several words remain, use each remaining word as a keyword in text-based image retrieval and obtain the network addresses (URLs) of the first 15 to 20 images; submit each image URL to the search-by-image service to obtain the text retrieval information related to that URL; segment all of this text retrieval information into words and count word frequencies. The word with the highest frequency is the image subject of the single image.
In the automatic download, if a download fails because the request is rejected or reading the stream fails, and it still fails after three retries, the image is skipped and the download of the next image begins. When an image is being downloaded, if the download completes within the set time, the download of the next image begins; if it does not complete within the set time, the result of this download is discarded and the download of the next image begins.
The image data screening comprises: first, using the normalized gray-level histogram distribution of each image to reject the non-natural images in the data set; second, using the text information related to each image to reject the images in the data set that deviate from the image subject.
The text information is the image Tag information.
The non-natural images include cartoons, icons, hand-drawn images and synthesized images.
The criterion for a non-natural image is: in the normalized gray-level histogram of the image, the frequency threshold is 0.06; that is, when fewer than 60 gray levels have a frequency of occurrence greater than 0.06, the image is judged to be a non-natural image.
Rejecting the images in the data set that deviate from the image subject comprises the following steps:
Segment the text information related to an image into multiple words; use the WordNet semantic network to select from them the words that denote an Object; if the words denoting an Object belong to the same synonym set, retain the image related to the text information, otherwise reject it.
The beneficial effects of the present invention are as follows:
First, image data on the Internet grows ever larger, network transmission protocols become more standardized and complete, and image search engines develop rapidly. Within a given time period, the results a search engine returns for a specific search condition are constant; that is, whoever performs the retrieval obtains identical results within that period, which eliminates the bias introduced by manually collecting a data set. Second, the image search results returned by the search engine are downloaded to the local database automatically and in a targeted manner, which avoids a large amount of manual collection work.
Brief description of the drawings
Fig. 1 is a block diagram of the network image autonomous collection and screening system.
Fig. 2 is a flowchart of image subject extraction.
Fig. 3 is a flowchart of the automatic download of network images and their Tag information.
Fig. 4 is a flowchart of image data screening.
Fig. 5 shows non-natural images among the results returned by Google.
Fig. 6 shows examples of Tag information and the corresponding images.
Fig. 7 is an example gray-level histogram of a natural image.
Fig. 8 is an example gray-level histogram of a non-natural image such as a cartoon.
Fig. 9 is a schematic of the WordNet word taxonomy tree.
Detailed description of the embodiments
The invention is described further below with reference to the accompanying drawings.
Autonomous collection and screening of network images requires that, given a single input image or a search keyword, the system output a large amount of image data highly relevant to the subject of the input image or keyword. Fig. 1 is the overall block diagram of the network image autonomous collection and screening system. It shows the three basic steps by which the system works: image subject extraction, download of images and their Tag information, and image screening. Here image Tag information refers to the text shown below each image (as shown in Fig. 5).
(1) The first step is image subject extraction. The image subject extraction module works in two steps: first, a suitable search engine retrieves the input image and returns text retrieval information related to it; second, the text retrieval information is processed using statistics and prior knowledge to extract the image subject, which converts the input image into text that describes it. The flow of the image subject extraction module is shown in Fig. 2. The concrete steps are:
Step 1: add the input image to a POST request in the required format and send the request to the Google server;
Step 2: receive the result returned by the Google server and obtain the best guess for the input image;
Step 3: call the WordNet semantic network (for details of WordNet, see http://wordnet.princeton.edu/) to filter the best guess (the text retrieval information), removing prepositions, articles, abstract nouns, verbs, adjectives and adverbs;
Step 4: if only one word remains after filtering, that word is the image subject; output the result, and the first stage is complete. Otherwise, use each remaining word as a keyword in Google text-based image retrieval and obtain the URLs of the first 15 images;
Step 5: submit each of the 15 image URLs to Google search by image and obtain 15 best-guess results;
Step 6: segment all the results into words and count word frequencies. The word with the highest frequency is the final subject of the input image; output the result, and the first stage is complete.
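Steps 3 to 6 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `pick_subject`, `is_concrete_noun` and `guesses_for_keyword` are hypothetical names. The part-of-speech filter (done with WordNet in the text) and the two search calls are passed in as functions so that the sketch stays self-contained. Applying the filter again before counting word frequencies is an assumption; the text does not say whether step 6 filters the segmented words.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase a string and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def pick_subject(
    best_guess: str,
    is_concrete_noun: Callable[[str], bool],
    guesses_for_keyword: Callable[[str], Iterable[str]],
) -> Optional[str]:
    """Derive the image subject from a best-guess string (steps 3 to 6).

    is_concrete_noun stands in for the WordNet filter (hypothetical).
    guesses_for_keyword stands in for "search images by keyword, then run
    search by image on each returned URL" (hypothetical).
    """
    # Step 3: keep only the words that pass the part-of-speech filter.
    kept = [w for w in tokenize(best_guess) if is_concrete_noun(w)]
    if not kept:
        return None
    # Step 4: a single surviving word is the subject.
    if len(kept) == 1:
        return kept[0]
    # Steps 4 to 6: gather the best guesses reached through each keyword,
    # segment them into words and count word frequencies.
    counts = Counter()
    for keyword in kept:
        for guess in guesses_for_keyword(keyword):
            # Assumption: reuse the same filter before counting.
            counts.update(w for w in tokenize(guess) if is_concrete_noun(w))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With a real search backend, `guesses_for_keyword` would return up to 15 best-guess strings per keyword, as in step 5.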
(2) The second step is the download of images and their Tag information. For downloading an image set, the download module is designed for strong exception handling and stability. When the image set is small, existing open-source software for downloading network data can meet the need. As the number of images grows, however, the probability of a program fault rises sharply: abnormal interruptions or hangs often occur during operation, later images can no longer be downloaded, and the automation of the whole process suffers. The present invention therefore adds the following principles to the system:
1) Sacrifice the small for the large. When an image is being downloaded and an exception occurs, for example the request is rejected or reading the stream fails, and the download still fails after three retries, the image is skipped and the download of the next image begins.
2) Limited wait. A daemon, similar to the watchdog program of an embedded system, is attached to the download of each image. A timer with a set time limit starts when the download of an image starts. If the image downloads successfully within the set time, the download of the next image begins normally; if not, the result of this download is discarded and the download of the next image begins directly.
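The two principles can be combined in one small helper, sketched below under stated assumptions. The name `download_with_limits`, the injected `fetch` callable and the 30-second default are illustrative and do not come from the patent. "Three retries" is read here as one initial attempt plus up to three more. The watchdog is approximated by a deadline shared across attempts rather than by a separate daemon process.

```python
import time
from typing import Callable, Optional


def download_with_limits(
    fetch: Callable[[str, float], bytes],
    url: str,
    retries: int = 3,
    time_limit_s: float = 30.0,
) -> Optional[bytes]:
    """Download one image, or return None if the image should be skipped.

    fetch(url, timeout) is a hypothetical callable that returns the image
    bytes or raises OSError when the request is rejected or the stream
    cannot be read.
    """
    deadline = time.monotonic() + time_limit_s
    for _attempt in range(1 + retries):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None  # limited wait: time budget used up, give up
        try:
            data = fetch(url, remaining)
        except OSError:
            continue  # request rejected or read failed: try again
        if time.monotonic() > deadline:
            return None  # finished too late: discard this result
        return data
    return None  # still failing after the retries: skip this image
```

One possible `fetch` wraps `urllib.request.urlopen(url, timeout=timeout).read()`. Note that this timeout bounds individual blocking socket operations, not the total transfer time, which is why the sketch also checks the deadline after the call returns.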
Fig. 3 is the flowchart of the download module. The concrete steps are:
Step 1: input the image subject information and the number of images to download (number of pages);
Step 2: generate a URL that meets the requirements of Google image search;
Step 3: send a GET request to Google image search and obtain the returned web page source code;
Step 4: extract the URLs of 20 images and the corresponding Tag information from the web page source code;
Step 5: download the image data from the 20 URLs and save it locally;
Step 6: if the last page has been downloaded, exit; otherwise move to the next page and return to step 2.
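The page loop of Fig. 3 might look like the sketch below. The names are hypothetical: `fetch_page` stands in for steps 2 to 4 (build the search URL, send the GET request, parse 20 image URLs and Tags from the page source), and `download` stands in for step 5 and returns None for an image that was skipped. No real search-engine URL format is assumed.

```python
from typing import Callable, List, Optional, Tuple

ImageRef = Tuple[str, str]        # (image URL, Tag text)
Saved = Tuple[str, str, bytes]    # (image URL, Tag text, image bytes)


def run_download(
    subject: str,
    pages: int,
    fetch_page: Callable[[str, int], List[ImageRef]],
    download: Callable[[str], Optional[bytes]],
    per_page: int = 20,
) -> List[Saved]:
    """Walk the result pages and keep every image that downloads."""
    saved: List[Saved] = []
    for page in range(pages):                         # step 6: next page
        refs = fetch_page(subject, page)[:per_page]   # steps 2 to 4
        for url, tag in refs:
            data = download(url)                      # step 5
            if data is None:
                continue  # skipped image: move on to the next one
            saved.append((url, tag, data))
    return saved
```

Keeping the Tag next to the image bytes matters for the later screening stage, which needs both.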
(3) The last step is the screening of the image data. The downloaded image data is divided into three classes:
First class: non-natural images, consisting of cartoons, icons, hand drawings, synthesized images and other images that can be characterized as not real;
Second class: natural images that deviate from the subject. They do not belong to the first class, but the target object matching the searched image subject is small, heavily occluded or hard to recognize in them;
Third class: images that meet the requirements for building the target database, that is, images that belong to neither the first nor the second class.
Images are screened in two steps:
First, the images are screened a first time based on image histogram information. Experiments showed that most of the non-natural images returned by Google (Fig. 5) have regular histograms. Non-natural images such as cartoons differ clearly from natural images in color distribution: once converted to grayscale, the pixel values of a non-natural image are more concentrated, and the total number of distinct pixel values is generally smaller than in a natural image. This method mainly serves to reject first-class images.
The gray-level histogram distribution of an image is obtained as follows: traverse the pixels of the image and count how often each gray value occurs; with the gray value on the horizontal axis and the frequency of occurrence on the vertical axis, this gives the normalized gray-level histogram of the image.
In theory, because of the special nature of non-natural images, their histograms are very narrowly distributed and occupy only a few gray levels, so counting the non-zero gray values in the histogram would be enough to decide whether an image is non-natural. In practice, image noise makes it rare for a gray value in the histogram to equal exactly 0; the histogram of noise is widely spread but small in value. The influence of noise can therefore be removed by counting the gray values whose pixel count exceeds a threshold instead of the non-zero gray values. In the normalized histogram the frequency threshold is 0.06 (the dotted lines in Fig. 7 and Fig. 8 mark the threshold): when fewer than 60 gray levels have a frequency greater than 0.06, the image is judged to be non-natural and rejected; otherwise it meets the requirement. This first step removes 70-80% of the non-natural images.
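The histogram test can be sketched as below. The text does not say how the histogram is normalized, so the sketch makes an assumption. If the bins were scaled to sum to 1, at most 16 gray levels could exceed 0.06 (17 × 0.06 is already greater than 1), and the count could never reach 60. The sketch therefore scales by the tallest bin by default and keeps sum normalization as an option for comparison. The function names are hypothetical, and the image is represented as a flat sequence of 8-bit gray values.

```python
from typing import List, Sequence


def normalized_histogram(gray: Sequence[int], mode: str = "peak") -> List[float]:
    """Return a 256-bin histogram of 8-bit gray values.

    mode="peak" divides by the tallest bin (an assumption, see the text);
    mode="sum" divides by the pixel count so that the bins sum to 1.
    """
    if not gray:
        raise ValueError("image has no pixels")
    counts = [0] * 256
    for value in gray:
        counts[value] += 1
    denom = max(counts) if mode == "peak" else len(gray)
    return [c / denom for c in counts]


def is_non_natural(
    gray: Sequence[int],
    freq_threshold: float = 0.06,
    min_levels: int = 60,
    mode: str = "peak",
) -> bool:
    """Judge an image non-natural when too few gray levels are significant."""
    hist = normalized_histogram(gray, mode)
    # Count the gray levels whose normalized frequency exceeds the threshold.
    significant = sum(1 for f in hist if f > freq_threshold)
    return significant < min_levels
```

A flat three-tone image such as `[0] * 500 + [128] * 300 + [255] * 200` has only three significant gray levels and is rejected, while an image that uses all 256 levels evenly passes under peak normalization.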
Second, the Tag information reflects the content of an image to some extent (Fig. 6). Used effectively, it can reject part of the unsuitable image data described under the second class. This method mainly targets second-class images.
The main steps of the second step are:
Step 1: segment the Tag into words, splitting one sentence into multiple words;
Step 2: input the words into the knowledge base (the WordNet semantic network) for computation;
Step 3: remove the words that are verbs, adjectives, adverbs, abstract nouns and the like, that is, compute the similarity between each input word and Object (the position of Object is shown in Fig. 9). The algorithm is: when two words are connected by a shorter path in the WordNet word sets, they have a relatively large semantic similarity. In other words, similarity is inversely related to path distance;
Step 4: compute the similarity among the remaining words that can denote an Object (that is, the words under the Object node shown in Fig. 9);
Step 5: if the words belong to one and only one synonym set, the image is unlikely to contain other Objects. If Objects of several different classes appear, the image probably contains multiple Objects (as Fig. 9 shows, Object is divided into 7 classes), and rejecting the image can be considered.
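The Tag check in steps 1 to 5 can be illustrated on a toy taxonomy. Everything in the sketch is a stand-in: the `PARENT` table is invented sample data in place of WordNet, the class names are not the seven classes of Fig. 9, and `1 / (1 + d)` is only one common way to make similarity fall with path length `d`. Rejecting a Tag that contains no Object word at all is also an assumption, since the text does not cover that case.

```python
import re
from typing import Dict, List, Optional

# Toy taxonomy standing in for WordNet: child -> parent (hypothetical data).
PARENT: Dict[str, str] = {
    "object": "entity",
    "idea": "entity",
    "animal": "object",
    "vehicle": "object",
    "cat": "animal",
    "dog": "animal",
    "car": "vehicle",
}


def ancestors(word: str) -> List[str]:
    """Return the chain from a word up to the root, the word itself first."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_similarity(a: str, b: str) -> float:
    """Similarity that shrinks as the path between a and b grows (step 3)."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return 1.0 / (1 + steps_a + up_b.index(node))
    return 0.0  # no common ancestor: unrelated or unknown words


def object_category(word: str) -> Optional[str]:
    """Return the class directly below 'object' that contains the word."""
    chain = ancestors(word)
    if "object" not in chain:
        return None  # verbs, adjectives, abstract nouns, unknown words
    i = chain.index("object")
    return chain[i - 1] if i > 0 else None


def keep_image(tag: str) -> bool:
    """Keep an image only if its Object words share one class (step 5)."""
    words = re.findall(r"[a-z]+", tag.lower())          # step 1
    classes = {c for c in map(object_category, words) if c is not None}
    # Assumption: a Tag with no Object word is also rejected.
    return len(classes) == 1
```

In this toy tree, "cat" and "dog" are closer than "cat" and "car", so a Tag such as "A cat and a dog" is kept while "cat on a car" is rejected.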
The overall algorithm flow is shown in Fig. 4.

Claims (5)

1. A method for autonomous collection and screening of network images, characterized in that it comprises the following steps:
1) Image subject extraction
Choose a single image, use a network search engine to retrieve search results for that image, and extract the image subject from the search results;
The image subject extraction comprises: first, retrieving the chosen single image with a search-by-image service on the network to obtain text retrieval information related to the image; second, using prior semantic knowledge and statistical methods to extract the image subject from the text retrieval information;
Extracting the image subject from the text retrieval information comprises the following steps: call the WordNet semantic network to filter the text retrieval information, removing prepositions, articles, abstract nouns, verbs, adjectives and adverbs; if only one word remains after filtering, that word is the image subject; if several words remain, use each remaining word as a keyword in text-based image retrieval and obtain the network addresses (URLs) of the first 15 to 20 images; submit each image URL to the search-by-image service to obtain the text retrieval information related to that URL; segment all of this text retrieval information into words and count word frequencies, the word with the highest frequency being the image subject of the single image;
2) Automatic download of network images and related text information
According to the image subject, download images and the text information related to them from the network into a local database to obtain a data set;
3) Image data screening
Use the image information and the text information related to each image to screen the images in the data set and obtain the target image set;
The image data screening comprises: first, using the normalized gray-level histogram distribution of each image to reject the non-natural images in the data set; second, using the text information related to each image to reject the images in the data set that deviate from the image subject;
The criterion for a non-natural image is: in the normalized gray-level histogram of the image, when fewer than 60 gray levels have a frequency of occurrence greater than 0.06, the image is judged to be a non-natural image.
2. The method for autonomous collection and screening of network images according to claim 1, characterized in that, in the automatic download, if a download fails because the request is rejected or reading the stream fails, and it still fails after three retries, the image is skipped and the download of the next image begins; when an image is being downloaded, if the download completes within the set time, the download of the next image begins, and if it does not complete within the set time, the result of this download is discarded and the download of the next image begins.
3. The method for autonomous collection and screening of network images according to claim 1, characterized in that the text information is the image Tag information.
4. The method for autonomous collection and screening of network images according to claim 1, characterized in that the non-natural images include cartoons, icons, hand-drawn images and synthesized images.
5. The method for autonomous collection and screening of network images according to claim 1, characterized in that rejecting the images in the data set that deviate from the image subject comprises the following steps:
Segment the text information related to an image into multiple words; use the WordNet semantic network to select from them the words that denote an Object; if the words denoting an Object belong to the same synonym set, retain the image related to the text information, otherwise reject it.
CN201210336284.2A 2012-09-12 2012-09-12 Method for autonomous collection and screening of network images Expired - Fee Related CN102902737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210336284.2A CN102902737B (en) 2012-09-12 2012-09-12 Method for autonomous collection and screening of network images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210336284.2A CN102902737B (en) 2012-09-12 2012-09-12 Method for autonomous collection and screening of network images

Publications (2)

Publication Number Publication Date
CN102902737A CN102902737A (en) 2013-01-30
CN102902737B true CN102902737B (en) 2015-08-05

Family

ID=47574970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210336284.2A Expired - Fee Related CN102902737B (en) 2012-09-12 2012-09-12 Method for autonomous collection and screening of network images

Country Status (1)

Country Link
CN (1) CN102902737B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447846B (en) * 2014-08-25 2020-06-23 联想(北京)有限公司 Image processing method and electronic equipment
CN108959304B (en) * 2017-05-22 2022-03-25 阿里巴巴集团控股有限公司 Label prediction method and device
CN107909088B (en) * 2017-09-27 2022-06-28 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer storage medium for obtaining training samples

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853295A (en) * 2010-05-28 2010-10-06 天津大学 Image search method
CN101901249A (en) * 2009-05-26 2010-12-01 复旦大学 Text-based query expansion and sort method in image retrieval
CN102270234A (en) * 2011-08-01 2011-12-07 北京航空航天大学 Image search method and search engine

Also Published As

Publication number Publication date
CN102902737A (en) 2013-01-30

Similar Documents

Publication Publication Date Title
CN100405371C (en) Method and system for abstracting new word
CN102760151B (en) Implementation method of open source software acquisition and searching system
CN104504150A (en) News public opinion monitoring system
CN105893583A (en) Data acquisition method and system based on artificial intelligence
CN103577478A (en) Web page pushing method and system
CN104133830A (en) Data obtaining method
CN110633594A (en) Target detection method and device
CN111563382A (en) Text information acquisition method and device, storage medium and computer equipment
CN111695014A (en) Method, system, device and storage medium for automatically generating manuscripts based on AI (artificial intelligence)
CN111726336A (en) Method and system for extracting identification information of networked intelligent equipment
CN102902737B (en) Method for autonomous collection and screening of network images
KR102124935B1 (en) Disaster Monitoring System, Method Using Crowd Sourcing, and Computer Program therefor
CN103078854A (en) Message filtering method and device
CN110866172B (en) Data analysis method for block chain system
CN102902790A (en) Web page classification system and method
CN103475532A (en) Hardware detection method and system thereof
CN104778232B (en) Searching result optimizing method and device based on long query
CN106484913A (en) Method and server that a kind of Target Photo determines
CN116032741A (en) Equipment identification method and device, electronic equipment and computer storage medium
CN102902792A (en) List page recognition system and method
CN102929948B (en) list page identification system and method
CN109948015B (en) Meta search list result extraction method and system
CN114706948A (en) News processing method and device, storage medium and electronic equipment
CN114841155A (en) Intelligent theme content aggregation method and device, electronic equipment and storage medium
CN104572767A (en) Method and system for language classification of sites

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150805

Termination date: 20170912