CN101308508A - Method and device for processing picture, and method for searching picture - Google Patents

Method and device for processing picture, and method for searching picture

Info

Publication number
CN101308508A
Authority
CN
China
Prior art keywords
field
picture header
website
invalid
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008101164554A
Other languages
Chinese (zh)
Other versions
CN101308508B (en
Inventor
贾梦雷
张阔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN2008101164554A priority Critical patent/CN101308508B/en
Publication of CN101308508A publication Critical patent/CN101308508A/en
Application granted granted Critical
Publication of CN101308508B publication Critical patent/CN101308508B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a picture title processing method, comprising the following steps: setting an invalid field recognition rule; identifying, according to the recognition rule, the invalid fields contained in the picture titles of a page's website; and removing the invalid fields contained in the picture titles of that website. The invention also provides a picture title processing device, a search engine and a picture searching method. An embodiment of the invention has the following advantages: first, ranking quality is noticeably improved; second, user experience improves because the search results are more relevant.

Description

Method and device for processing pictures, and method for searching pictures
Technical field
The present invention relates to the field of network technology, and in particular to a method and device for processing pictures and a method for searching pictures.
Background art
Analyzing the picture content contained in a page and extracting data from it is an important part of a search engine's work. However, prior-art page analysis techniques operate on a single page and lack statistical information about the website as a whole, so they cannot effectively remove invalid fields from picture titles, such as the website name, forum name, board name, moderator name, time and post tags. The negative effects are as follows:
1. Irrelevant results appear. This happens when a query term hits an invalid field, for example when the query is "phoenix" and the picture title contains "phoenix report". Such a result is not what the searcher wants.
2. Highly relevant results are ranked low. Because invalid fields are numerous, the useful information that is actually relevant to the picture is drowned out by them, which lowers the computed score. For example, a picture of a Benz car is titled "benz drifting fragrance network>>center picture>>Ai Che gang".
3. Irrelevant content appears in the fields displayed to the user, which degrades the user experience.
In the course of making the present invention, the inventors found that the prior art has at least the following problem:
Analysis limited to a single page leads to poor search result relevance and a poor user experience.
Summary of the invention
In view of this, one or more embodiments of the present invention aim to provide a method and device for processing pictures and a method for searching pictures, so as to improve the relevance of search results and the user experience.
To solve the above problem, an embodiment of the invention provides a picture title processing method, comprising:
setting an invalid field recognition rule;
identifying, according to the recognition rule, the invalid fields contained in the picture titles of the page's website;
removing the invalid fields contained in the picture titles of the page's website.
A picture title processing device is also provided, comprising:
a setting unit, configured to set the invalid field recognition rule;
a recognition unit, configured to identify, according to the recognition rule, the invalid fields contained in the picture titles of the page's website;
a first removal unit, configured to remove the invalid fields contained in the picture titles of the page's website.
A search engine is also provided, comprising any of the devices disclosed in the picture processing device embodiments of the present invention.
A method for searching pictures is also provided, comprising:
setting an invalid field recognition rule;
identifying, according to the recognition rule, the invalid fields contained in the picture titles of the page's website;
removing the invalid fields contained in the picture titles of the page's website;
obtaining the picture titles relevant to a query term;
outputting the links corresponding to those picture titles.
Compared with the prior art, the embodiments of the invention have the following advantages:
First, ranking quality is noticeably improved.
Removing invalid fields reduces the number of results that match only on an invalid field. Because such matches are irrelevant results, irrelevant results no longer appear at the top of the search results.
The useful information that is relevant to the picture carries more weight when the score is computed, so genuinely relevant results move up in the ranking.
Second, because the search results are more relevant, user experience improves.
Description of drawings
To describe the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of embodiment one of the picture title processing method of the present invention;
Fig. 2 is a block diagram of embodiment one of the picture title processing device of the present invention;
Fig. 3 is a block diagram of embodiment one of the search engine provided by an embodiment of the invention;
Fig. 4 is a flowchart of embodiment one of the method for searching pictures of the present invention;
Fig. 5 is a flowchart of the processing performed by module A;
Fig. 6 is a flowchart of the processing performed by module B.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
An "invalid field" is a field that has little or no relation to the picture content, or that acts as interference. Examples include website names and forum names embedded in the title, tags and moderator signatures that forum moderators attach to picture posts, and timestamps that forums add to picture titles automatically.
The negative effects of invalid fields include:
1) A query may hit an invalid field unrelated to the picture, producing irrelevant results;
2) Meaningful fields that are truly relevant to the picture are drowned out by invalid fields, so they are either not found or are hit at a low rate, which lowers the score of a picture that is in fact relevant;
3) Irrelevant content appears in the fields displayed to the user, which degrades the user experience.
A picture title generally includes the "page title", the "in-page title", the "picture alternate text (alt)", the "picture text link (anchor)" and so on. "Picture alternate text" is the text that pops up when the mouse moves over the picture. Current mainstream picture search retrieves pictures by their associated text, the most important of which is the picture title. The picture title is therefore critical to the relevance of picture search. Picture titles commonly contain many invalid fields, and these strongly affect the relevance of search results; because picture title text is short, the effect is amplified.
The core idea of the present invention is to set judgment rules based on statistical regularities and use them to identify invalid fields in picture titles. When the inverted index is built, invalid fields are removed from the picture title and placed in a special domain; during online search, hits in the special domain that holds the invalid fields are down-weighted. Down-weighting assigns invalid fields different weights, and the weight of some invalid fields may even be zero, so that they do not adversely affect highly relevant fields. This improves the relevance of search results and the user experience.
A field is "invalid" only with respect to relevance to the picture content: it is not text about the picture content, but it is not entirely unrelated to the picture either. Moreover, the website name, forum name and board name are still useful to some users in some situations. Invalid fields are therefore not simply discarded; they are moved into the special domain and down-weighted.
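The down-weighting idea can be illustrated with a minimal sketch. The patent does not give a scoring formula, so the additive hit-count score, the `IndexedPicture` record and the weights `title_weight` and `special_weight` below are all assumptions made for illustration; the special domain is modelled simply as a second list of terms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexedPicture:
    """Hypothetical index record: cleaned title terms plus the special domain."""
    url: str
    title_terms: List[str]
    special_terms: List[str] = field(default_factory=list)


def score(query_terms, pic, title_weight=1.0, special_weight=0.1):
    """Count query hits per domain; hits in the special domain are down-weighted."""
    title_hits = sum(pic.title_terms.count(t) for t in query_terms)
    special_hits = sum(pic.special_terms.count(t) for t in query_terms)
    return title_weight * title_hits + special_weight * special_hits


def search(query_terms, pictures, **weights):
    """Return URLs of pictures with a positive score, best first."""
    scored = [(score(query_terms, p, **weights), p.url) for p in pictures]
    scored.sort(key=lambda pair: -pair[0])
    return [url for s, url in scored if s > 0]
```

With `special_weight=0.0`, a picture whose only match is in the special domain drops out of the results entirely, which corresponds to giving an invalid field a weight of zero.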
Page analysis techniques that operate on a single page lack statistical information about the website, so they cannot effectively remove invalid fields from picture titles, such as the website name, forum name, board name, moderator name, time and post tags. The negative effects are:
(1) Irrelevant results are hit. This happens when the query hits an invalid field, for example when the query is "phoenix" and the picture title contains "phoenix report".
(2) Highly relevant results are ranked low. The useful information relevant to the picture is drowned out by invalid fields, which lowers the computed score. For example, a picture of a Benz car is titled "benz drifting fragrance network>>center picture>>Ai Che gang".
(3) Irrelevant content appears in the fields displayed to the user, which degrades the user experience.
As shown in Fig. 1, embodiment one of the picture title processing method of the present invention comprises:
101. Setting an invalid field recognition rule.
Setting the invalid field recognition rule specifically means:
if a field contained in the picture titles of the website meets a preset condition, the field is set as an invalid field.
A field contained in the picture titles of the website meets the preset condition when:
the number of occurrences of the field reaches a predetermined value, and the ratio of the number of occurrences of the field to the total number of pictures on the website reaches a predetermined value; or
the number of pictures on the website reaches a predetermined value, and the ratio of the number of occurrences of the field to the number of occurrences of all fields reaches a predetermined value; or
the number of occurrences of the field, or the ratio of the number of occurrences of the field to the number of occurrences of all fields, reaches a predetermined value, and the word segmentation result of the field shows that the field is invalid information.
A field is invalid information when:
the field contains: forum, community, photograph album, registration, daily record, pinup picture, browse or reprint.
Alternatively, the condition for recognizing valid fields can be set instead, and all fields that are not valid are set as invalid fields. In that case, setting the invalid field recognition rule specifically means:
if the number of occurrences of the field is less than a preset threshold, the field is treated as a valid field.
Before the invalid field recognition rule is set, the method further comprises:
dividing the titles of all pictures into groups by the website of the page on which each picture appears.
After the division, the method further comprises:
removing from each picture title the fields enclosed in brackets;
splitting the picture title into several fields at separators;
counting the number of occurrences of each field in the picture titles of the same website.
Identifying, according to the recognition rule, the invalid fields contained in the picture titles of the website then specifically means:
if the number of occurrences of a field meets the preset condition, the field is identified as an invalid field.
102. Identifying, according to the recognition rule, the invalid fields contained in the picture titles of the page's website.
103. Removing the invalid fields contained in the picture titles of the page's website.
After the invalid fields contained in the picture titles of the website are removed, the method further comprises:
establishing a correspondence between the page's website and the invalid fields.
After the correspondence between the page's website and the invalid fields is established, the method further comprises:
storing the bracketed fields of the picture title in a special domain;
for the website of the page on which the picture appears, looking up the invalid fields in the picture title according to the correspondence between the website and its invalid fields, and moving those invalid fields to the special domain;
using the text remaining in the picture title as the picture title.
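The three indexing-time steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_title`, the bracket pattern (ASCII square brackets only) and the separator set (`>>`, `|`, `_`) are assumptions, and the special domain is represented as a plain list.

```python
import re

# Assumption: only ASCII square brackets are handled in this sketch.
BRACKET_RE = re.compile(r"\[([^\[\]]*)\]")
# Assumption: a hypothetical separator set; the patent does not enumerate one.
SEPARATOR_RE = re.compile(r"\s*(?:>>|\||_)\s*")


def split_title(title, site_invalid_fields):
    """Return (cleaned_title, special_domain) for one picture title.

    site_invalid_fields is the set of invalid fields recorded for the
    website of the page on which the picture appears.
    """
    # Bracketed text is stored in the special domain.
    special = BRACKET_RE.findall(title)
    remainder = BRACKET_RE.sub(" ", title)

    kept = []
    for part in SEPARATOR_RE.split(remainder):
        part = part.strip()
        if not part:
            continue
        if part in site_invalid_fields:
            special.append(part)   # known invalid field: move it
        else:
            kept.append(part)      # remaining text stays in the title
    return " ".join(kept), special
```

Keeping the moved text in `special` rather than dropping it is what later allows hits on it to be down-weighted instead of ignored.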
After the invalid fields are removed, a query process with a better user experience can be offered. That is, after the invalid fields contained in the picture titles of the page's website are removed, the method further comprises:
obtaining the picture titles relevant to a query term;
outputting the links corresponding to those picture titles.
After the remaining text is used as the picture title, the method further comprises:
down-weighting the special domain that holds the invalid fields;
obtaining the picture titles relevant to a query term;
outputting the links corresponding to those picture titles.
The above process achieves the following beneficial technical effects:
First, ranking quality is noticeably improved.
Removing invalid fields reduces the number of results that match only on an invalid field. Because such matches are irrelevant results, irrelevant results no longer appear at the top of the search results.
The useful information that is relevant to the picture carries more weight when the score is computed, so genuinely relevant results move up in the ranking.
Second, user experience improves.
Less irrelevant content appears in the text displayed to the user, which improves the user experience.
As shown in Fig. 2, embodiment one of the picture title processing device of the present invention comprises:
a setting unit 201, configured to set the invalid field recognition rule.
Setting the invalid field recognition rule specifically means:
if a field contained in the picture titles of the website meets a preset condition, the field is set as an invalid field.
Alternatively, setting the invalid field recognition rule specifically means:
if the number of occurrences of the field is less than a preset threshold, the field is treated as a valid field.
A field contained in the picture titles of the website meets the preset condition when:
the number of occurrences of the field reaches a predetermined value, and the ratio of the number of occurrences of the field to the total number of pictures on the website reaches a predetermined value; or
the number of pictures on the website reaches a predetermined value, and the ratio of the number of occurrences of the field to the number of occurrences of all fields reaches a predetermined value; or
the number of occurrences of the field, or the ratio of the number of occurrences of the field to the number of occurrences of all fields, reaches a predetermined value, and the word segmentation result of the field shows that the field is invalid information.
A field is invalid information when:
the field contains: forum, community, photograph album, registration, daily record, pinup picture, browse or reprint.
a recognition unit 202, configured to identify, according to the recognition rule, the invalid fields contained in the picture titles of the page's website;
a first removal unit 203, configured to remove the invalid fields contained in the picture titles of the page's website.
The above process achieves the following beneficial technical effects:
First, ranking quality is noticeably improved.
Removing invalid fields reduces the number of results that match only on an invalid field. Because such matches are irrelevant results, irrelevant results no longer appear at the top of the search results.
The useful information that is relevant to the picture carries more weight when the score is computed, so genuinely relevant results move up in the ranking.
Second, user experience improves.
Less irrelevant content appears in the text displayed to the user, which improves the user experience.
The above embodiment may further comprise:
a website division unit, configured to divide, before the invalid field recognition rule is set, the titles of all pictures into groups by the website of the page on which each picture appears.
In addition to the website division unit, the device may further comprise:
a second removal unit, configured to remove, after the division, the bracketed fields from each picture title;
a separation unit, configured to split the picture title into several fields at separators;
a statistics unit, configured to count the number of occurrences of each field in the picture titles of the same website.
The recognition unit is specifically:
a second recognition unit, configured to identify a field as an invalid field if its number of occurrences meets the preset condition.
The above embodiment may further comprise:
an establishing unit, configured to establish, after the invalid fields contained in the picture titles of the website are removed, a correspondence between the page's website and the invalid fields.
On the basis of the establishing unit, the above embodiment may further comprise:
a storage unit, configured to store, after the correspondence between the page's website and the invalid fields is established, the bracketed fields of the picture title in a special domain;
a moving unit, configured to look up, for the website of the page on which the picture appears, the invalid fields in the picture title according to the correspondence between the website and its invalid fields, and to move those invalid fields to the special domain;
a processing unit, configured to use the text remaining in the picture title as the picture title.
The device may further comprise:
a first acquisition unit, configured to obtain, after the invalid fields contained in the picture titles of the page's website are removed, the picture titles relevant to a query term;
a first output unit, configured to output the links corresponding to those picture titles.
Unlike the above approach, which matches the index directly against the valid fields in the picture title, the invalid fields may first be down-weighted and the index matched afterwards. In that case the device may further comprise:
a down-weighting unit, configured to down-weight, after the remaining text is used as the picture title, the special domain that holds the invalid fields;
a second acquisition unit, configured to obtain the picture titles relevant to a query term;
a second output unit, configured to output the links corresponding to those picture titles.
The above process achieves the following beneficial technical effects:
First, ranking quality is noticeably improved.
Removing invalid fields reduces the number of results that match only on an invalid field. Because such matches are irrelevant results, irrelevant results no longer appear at the top of the search results.
The useful information that is relevant to the picture carries more weight when the score is computed, so genuinely relevant results move up in the ranking.
Second, user experience improves.
Less irrelevant content appears in the text displayed to the user, which improves the user experience.
As shown in Fig. 3, embodiment one of the search engine provided by an embodiment of the invention comprises any of the devices disclosed in the picture title processing device embodiments of the present invention.
As shown in Fig. 4, embodiment one of the method for searching pictures of the present invention comprises:
401. Setting an invalid field recognition rule.
Setting the invalid field recognition rule specifically means:
if a field contained in the picture titles of the page's website meets a preset condition, the field is set as an invalid field.
Before the invalid field recognition rule is set, the method further comprises:
dividing the titles of all pictures into groups by the website of the page on which each picture appears.
The method further comprises:
removing from each picture title the fields enclosed in brackets;
splitting the picture title into several fields at separators;
counting the number of occurrences of each field in the picture titles of the same website.
Identifying, according to the recognition rule, the invalid fields contained in the picture titles of the website then specifically means:
if the number of occurrences of a field meets the preset condition, the field is identified as an invalid field.
402. Identifying, according to the recognition rule, the invalid fields contained in the picture titles of the page's website.
403. Removing the invalid fields contained in the picture titles of the page's website.
After the invalid fields contained in the picture titles of the website are removed, the method further comprises:
establishing a correspondence between the page's website and the invalid fields.
After the correspondence between the page's website and the invalid fields is established, the method further comprises:
storing the bracketed fields of the picture title in a special domain;
for the website of the page on which the picture appears, looking up the invalid fields in the picture title according to the correspondence between the website and its invalid fields, and moving those invalid fields to the special domain;
using the text remaining in the picture title as the picture title.
After the remaining text is used as the picture title, the method further comprises:
down-weighting the special domain that holds the invalid fields;
obtaining the picture titles relevant to a query term;
outputting the links corresponding to those picture titles.
404. Obtaining the picture titles relevant to a query term.
405. Outputting the links corresponding to those picture titles.
The above process achieves the following beneficial technical effects:
First, ranking quality is noticeably improved.
Removing invalid fields reduces the number of results that match only on an invalid field. Because such matches are irrelevant results, irrelevant results no longer appear at the top of the search results.
The useful information that is relevant to the picture carries more weight when the score is computed, so genuinely relevant results move up in the ranking.
Second, user experience improves.
Less irrelevant content appears in the text displayed to the user, which improves the user experience.
Corresponding to the above embodiments, the invention provides embodiment two of the method for searching pictures. This embodiment explains its working process in terms of three functional modules; a person skilled in the art may divide the modules differently and still implement the technical solution of the invention. The three modules and their working processes are as follows:
A. Invalid field identification module.
This module uses statistics over the picture titles that belong to the same website to find the invalid fields of that website. For example, the titles of all pictures under the website "www.kongfz.com" contain "Kong Zi's secondhand book net"; statistically, this field contributes almost nothing to understanding the content of the picture itself. "Kong Zi's secondhand book net" can therefore be treated as an invalid field of the website "www.kongfz.com".
As noted above, picture titles in the broad sense come in several kinds, including the "page title", "in-page title", "picture alternate text (alt)", "picture text link (anchor)" and so on. Fig. 5 shows the processing performed by module A.
501. Divide the picture titles of all pictures into groups by page website.
In general, the website that hosts the picture itself is not necessarily the website of the page on which the picture appears. Because the picture title is extracted from the page on which the picture appears, invalid fields must be determined per page website. That is, taking the website as the unit, the invalid fields of the picture titles that belong to the same website are found.
Page analysis on a single page lacks statistical information about the website, so it cannot effectively remove invalid fields such as the website name, forum name and board name from picture titles. For instance, every page of the "phoenix report" website carries the field "phoenix report", but from a single page alone we cannot tell whether it is an invalid field. Only when the website is taken as the unit do we find that all pages of this website contain the field, so the "phoenix report" field contributes nothing to distinguishing among the many pages of the site. What users are interested in is the individual page, so this field carries little useful information and is an invalid field.
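Step 501 can be sketched as a grouping of titles keyed by the website of the page, not of the picture file. The sketch below assumes that "website" means the host name of the page URL and that input arrives as `(page_url, picture_title)` pairs; both are illustrative assumptions, since the patent does not define how a website is delimited.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_titles_by_page_site(records):
    """Group picture titles by the host name of the page they were found on.

    records: iterable of (page_url, picture_title) pairs (hypothetical input shape).
    """
    groups = defaultdict(list)
    for page_url, title in records:
        # Assumption: the host name of the page URL identifies the website.
        site = urlsplit(page_url).hostname or ""
        groups[site].append(title)
    return dict(groups)
```

Each resulting group is the unit over which field occurrences are later counted.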
502. Remove the fields in brackets.
In picture titles, the content enclosed in brackets ("[...]" and similar bracket pairs) falls into the following categories:
a) Time. Example: [2006-10-15]
b) Moderator signature. Example: [read in the human world], [the gentle and serene heart is read]
c) Category tags. Example: [recommendation], [original], [repost], [sharing], [picture group], [pouring water]
d) Bonus points. Example: [elite+30]
e) Number of pictures in a set. Example: [16P], [5p]
f) Website, forum or board name. Example: [Eight Diagrams rivers and lakes], [star]
In most cases this text has very little to do with the picture content and can be removed as invalid fields.
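Step 502 amounts to a pattern substitution. A minimal sketch follows, assuming ASCII square brackets and full-width 【】 pairs; which bracket characters the original rules cover is not recoverable from this text, so the pattern is an assumption, as is the function name `strip_bracketed`.

```python
import re

# Assumption: ASCII [ ] and full-width 【 】 pairs; nesting is not handled.
BRACKET_RE = re.compile(r"\[[^\[\]]*\]|【[^【】]*】")


def strip_bracketed(title):
    """Return (title without bracketed parts, list of removed bracketed parts)."""
    removed = BRACKET_RE.findall(title)
    cleaned = BRACKET_RE.sub(" ", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # tidy leftover spaces
    return cleaned, removed
```

Returning the removed parts alongside the cleaned title keeps them available for storage elsewhere instead of discarding them.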
503. Split the picture titles, with the bracketed text removed, into several fields at separators, and count the number of times each field occurs under the same website. (Separators are punctuation marks other than connecting marks such as the comma and the enumeration comma.)
Splitting titles appropriately before determining invalid fields makes it possible to work on short strings rather than on the whole long title.
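Step 503 can be sketched with a regular-expression split and a counter. The separator set used here (`>>`, `|`, `_`) is a hypothetical choice for illustration; the text only says that separators are punctuation marks other than connecting marks such as commas.

```python
import re
from collections import Counter

# Assumption: a hypothetical separator set chosen for this sketch.
SEPARATOR_RE = re.compile(r"\s*(?:>>|\||_)\s*")


def split_fields(title):
    """Split one bracket-free title into non-empty, trimmed fields."""
    return [part.strip() for part in SEPARATOR_RE.split(title) if part.strip()]


def count_fields(site_titles):
    """Count how many times each field occurs across one website's titles."""
    counts = Counter()
    for title in site_titles:
        counts.update(split_fields(title))
    return counts
```

For example, `split_fields("benz drifting fragrance network>>center picture>>Ai Che gang")` yields three fields, and a field shared by many titles of the same site accumulates a high count.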
504. Set the rules and identify invalid fields.
Depending on the actual conditions of the website on which the pictures appear, the specific rules may be:
a) If the number of occurrences is less than 3, the field is not considered an invalid field.
b) If the number of occurrences exceeds 100 and its ratio to the total number of pictures under the website reaches 10%, the field is judged invalid.
c) If the number of occurrences reaches 40 and its ratio to the total number of pictures under the website reaches 30%, the field is judged invalid.
d) If the website is large enough (at least 50 pictures of the website have been indexed) and the occurrence ratio of the field reaches 50%, the field is judged invalid.
e) If the number of occurrences exceeds 5 and the word segmentation result of the field meets one of the following conditions, the field is judged invalid:
i. It contains one of the following terms: "forum", "community", "photograph album", "registration", "daily record", "pinup picture", "browsing", "reprinting".
ii. It ends with one of the following terms: "read", "net", "district", "version".
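Rules a) to e) can be combined into a single predicate, sketched below with the thresholds stated above. Several points are assumptions: the order in which the rules are checked, the reading of the ratio in rule d) as occurrences of the field divided by occurrences of all fields, and the function and parameter names. Word segmentation is outside the sketch, so the caller passes the segmented tokens, and the default term lists are the English renderings given in the text, not the original terms.

```python
INVALID_TERMS = ("forum", "community", "photograph album", "registration",
                 "daily record", "pinup picture", "browsing", "reprinting")
INVALID_ENDINGS = ("read", "net", "district", "version")


def is_invalid_field(count, site_pictures, all_field_occurrences, tokens=(),
                     invalid_terms=INVALID_TERMS, invalid_endings=INVALID_ENDINGS):
    """Apply rules a) to e) to one field of one website.

    count: occurrences of the field on the site
    site_pictures: number of indexed pictures of the site
    all_field_occurrences: total occurrences of all fields on the site
    tokens: word segmentation result of the field (supplied by the caller)
    """
    if count < 3:                                      # rule a
        return False
    picture_ratio = count / site_pictures if site_pictures else 0.0
    field_ratio = count / all_field_occurrences if all_field_occurrences else 0.0
    if count > 100 and picture_ratio >= 0.10:          # rule b
        return True
    if count >= 40 and picture_ratio >= 0.30:          # rule c
        return True
    if site_pictures >= 50 and field_ratio >= 0.50:    # rule d
        return True
    if count > 5 and tokens:                           # rule e
        if any(tok in invalid_terms for tok in tokens):
            return True
        if tokens[-1] in invalid_endings:
            return True
    return False
```

Checking rule a) first makes rare fields valid regardless of the other rules, which matches the intent that fields with very few occurrences are kept.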
Generalizing the above rules gives the following:
For a given website, if a field:
1) occurs too few times, meeting a preset condition, it is considered a valid field rather than an invalid field.
For example, a picture title on the website "www.kongfz.com" contains the field "The Analects of Confucius justice"; this field occurs only 2 times, so it is not considered an invalid field.
2) occurs many times, meeting the count-related preset condition, and its ratio to the total number of pictures under the website reaches a certain level, meeting the ratio-related preset condition, it is judged to be an invalid field.
For example, some picture titles on the website "www.kongfz.com" contain the field "Kong Zi's secondhand book net"; the field occurs more than 1000 times, and its ratio to the total number of this website's pictures included in the repository reaches 10%, so the field is considered an invalid field.
3) reaches a certain number of occurrences, meeting the count-related preset condition, and its ratio to the total number of pictures under the website reaches a fairly high level, meeting the ratio-related preset condition, it is judged to be an invalid field.
For example, some picture titles on the website "gcforum.org" contain the field "animation pinup picture"; the field occurs 53 times, and 100 pictures of this website are included in the repository, so the ratio reaches 53% and the field is considered an invalid field.
4) belongs to a website that is large enough, meaning that enough of the website's pictures are indexed to meet the quantity-related preset condition, and its occurrence ratio reaches a certain level, where the ratio is the number of occurrences of this field divided by the total number of occurrences of all fields, it is judged to be an invalid field.
For example, some picture titles on the website "jk360.bolaa.com" contain the field "blog hand in hand"; the field occurs 15 times (which does not satisfy the count conditions above), 15 pictures of this website are included in the repository (meeting the "large enough" standard), and the pictures under this website have only 4 titles, with 4 fields in total, so the field reaches a ratio of 25%. Therefore, according to the above rule, this field can be considered an invalid field.
5) occurs more than a certain number of times, or its occurrence ratio reaches a certain level, and the result of word segmentation on the field satisfies one of the following conditions, it is judged to be an invalid field. For example:
i. The field contains one of the following terms: "forum", "community", "photograph album", "registration", "daily record", "pinup picture", "browsing", "reprinting".
For instance, picture titles on the website "bbs.arsenal.com.cn" contain the field "gunman community", and picture titles on the website "niweiqiu.photo.ipart.cn" contain the field "free photograph album"; both are invalid fields.
ii. The field ends with one of the following terms: "readding", "net", "district", "version".
For instance, picture titles on the website "www.bbs818.com" contain "China's business net", and picture titles on the website "www.coolshrimp.com" contain "discussion sharing area"; both are invalid fields.
According to the above rules, "Kong Zi's secondhand book net" can be identified as an invalid field of the website "www.kongfz.com".
Finally, through the above steps, the invalid field identification module obtains a "page website-invalid field" list, in which one website may correspond to several invalid fields. The list contains each page website, all the invalid fields that the page website contains, and the correspondence between the two. The list can cover a large number of page websites, each with its own invalid fields.
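One possible in-memory shape for the "page website-invalid field" list is a mapping from site name to a set of fields. The sketch below assumes that per-site field counts are already available; the function name, the argument layout, and the caller-supplied predicate are hypothetical, and the predicate stands in for whichever recognition rules are configured.

```python
from collections import Counter
from typing import Callable, Dict, Set


def build_invalid_field_table(
    field_counts: Dict[str, Counter],
    pictures_per_site: Dict[str, int],
    is_invalid: Callable[[int, int, int], bool],
) -> Dict[str, Set[str]]:
    """Map each site to the set of its fields that the predicate flags.

    `is_invalid(count, site_pictures, total_field_occurrences)` is supplied
    by the caller; sites with no flagged field are left out of the table.
    """
    table: Dict[str, Set[str]] = {}
    for site, counts in field_counts.items():
        total = sum(counts.values())
        pictures = pictures_per_site.get(site, 0)
        flagged = {f for f, n in counts.items() if is_invalid(n, pictures, total)}
        if flagged:
            table[site] = flagged
    return table
```

Using a set per site keeps the later per-title lookup cheap, and one site can naturally hold several invalid fields.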
The recognition rules above jointly consider the occurrence ratio of a field, its occurrence count, the number of the website's pictures that are indexed, the words the field contains, and the word the field ends with. This helps improve the accuracy of invalid field identification and raises recall.
B. Data generation module
Figure 6 shows the processing procedure of module B, which generates the data. The title of each picture is processed as follows:
601. Move the fields in brackets to the special domain.
For each picture, the fields in brackets are deleted from the title and placed in the "invalid field zone" domain of that picture (a domain on the same level as "title", "surrounding text", and so on). If the link of the picture belongs to the website "www.kongfz.com" and the picture title contains the field "Kong Zi's secondhand book net", this field is deleted from the title and placed in the "invalid field zone".
602. Using the page link of the picture, look up the "page website-invalid field" list; if the picture title is found to contain an invalid field, move the invalid field to the special domain.
603. Write the text remaining after the invalid fields are removed into the domain of the data file that holds the picture title.
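Steps 601 to 603 can be sketched as a single function that turns one picture title into a small record. This is an illustrative sketch under stated assumptions: the record keys, the square-bracket pattern, and the separator set ("--", "|", "_") are hypothetical, and `invalid_table` is assumed to map a site name to its set of invalid fields.

```python
import re

BRACKETED = re.compile(r"\[([^\]]*)\]")   # assumption: square brackets only
SEPARATOR = re.compile(r"-{2,}|[|_]")     # assumption: illustrative separators


def generate_record(title, site, invalid_table):
    """Split a title into a cleaned title and an invalid field zone."""
    # Step 601: bracketed fields go to the special domain.
    invalid_zone = [b.strip() for b in BRACKETED.findall(title)]
    rest = BRACKETED.sub("", title)

    # Step 602: fields listed for this site also go to the special domain.
    site_invalid = invalid_table.get(site, set())
    kept = []
    for field in SEPARATOR.split(rest):
        field = field.strip()
        if not field:
            continue
        if field in site_invalid:
            invalid_zone.append(field)
        else:
            kept.append(field)

    # Step 603: the remaining text becomes the stored title.
    return {"title": " ".join(kept), "invalid_field_zone": invalid_zone}
```

The invalid fields are moved rather than discarded, so a later query for a specific site or board name can still match them.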
Through the data generation module, the relevant information of each input picture is processed into a data file and an index file that contain no invalid fields, and these files are output.
C. Score calculation module
When scores are calculated during online search, the special domain holding the invalid fields is down-weighted.
In the past, invalid fields were usually identified during page analysis (where they were generally treated as junk fields and removed). However, invalid fields tied to one specific website or forum, such as "phoenix report" or "Ai Che gang", can only be identified once statistics over the whole website's pages have been obtained, that is, after page analysis. This is the main point that distinguishes the present invention from other anti-spam techniques.
During online search, if a search term hits the special domain holding the invalid fields, a very low score is given.
In this way, when a user wants to find posts from a specific website, board, or moderator, they can still be found, while for general queries this information no longer affects the pictures that really should rank first.
The score calculation module takes the index data and the query term as input, down-weights the invalid fields, removes the invalid fields from the page title, and outputs the corresponding ranking result.
a) When a user searches for "The Analects of Confucius justice", because "Kong Zi's secondhand book net" has been moved out of the title, the title "The Analects of Confucius justice" scores higher than "The Analects of Confucius justice--Kong Zi's secondhand book net" would, so the rank of this picture improves, and less relevant pictures are kept from ranking too high.
b) When a user searches for "Kong Zi", pictures containing "Kong Zi's secondhand book net" do not hit on the title and only hit the "invalid field zone", so their score is reduced and they rank below pictures showing portraits or statues of Kong Zi.
That is, with embodiments of the invention, a search for "The Analects of Confucius justice" does not show the words "Kong Zi's secondhand book net"; for a search for "Kong Zi", the titles of the top-ranked pictures all have the form "Kong Zi * * *" and are unrelated to this bookselling website.
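The effect described in cases a) and b) can be illustrated with a toy scoring function. The source does not give a scoring formula, so everything here is an assumption: the whitespace tokenization, the title score (query hits divided by title length), and the `zone_weight` value of 0.01 are chosen only to show the down-weighting idea.

```python
def score(query, title, invalid_field_zone, zone_weight=0.01):
    """Toy relevance score: title hits count fully, zone hits count very little."""
    query_tokens = query.lower().split()
    title_tokens = title.lower().split()
    zone_tokens = " ".join(invalid_field_zone).lower().split()

    title_hits = sum(1 for t in query_tokens if t in title_tokens)
    zone_hits = sum(1 for t in query_tokens if t in zone_tokens)

    # Shorter titles that match the query score higher (case a).
    title_score = title_hits / len(title_tokens) if title_tokens else 0.0
    # A hit in the invalid field zone still counts, but only slightly (case b).
    return title_score + zone_weight * zone_hits


# Illustrative data; the site name is simplified to plain tokens.
book = ("The Analects of Confucius justice", ["Kong Zi secondhand book net"])
portrait = ("Kong Zi portrait", [])
```

With these sample records, a query for "Kong Zi" ranks the portrait above the book picture, while the book picture keeps a small non-zero score and so remains findable.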
In addition, each of the above processes of this embodiment can be applied correspondingly in the method and device for processing pictures, and can also be applied in a search engine.
The above process yields the following beneficial technical effects:
First, the ranking quality is clearly improved.
Removing invalid fields reduces the number of results that match only on invalid fields. Because invalid fields represent irrelevant results, irrelevant results no longer appear at the top of the search results.
Effective information related to the picture carries a higher weight when scores are calculated, which helps truly relevant results rank first and moves results with good relevance forward.
Second, the user experience is improved.
Because less irrelevant content appears in the text shown to the user, the user experience improves.
From the description of the embodiments above, those skilled in the art can clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The embodiments of the present invention described above do not limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (30)

1. A picture title processing method, characterized by comprising:
setting an invalid field recognition rule;
identifying, according to the recognition rule, the invalid fields contained in picture titles in a page website;
removing the invalid fields contained in the picture titles in the page website.
2. The method of claim 1, characterized in that setting the invalid field recognition rule specifically comprises:
if a field contained in a picture title of the website meets a preset condition, setting the field as an invalid field.
3. The method of claim 1, characterized in that setting the invalid field recognition rule specifically comprises:
if the occurrence count of the field is less than a preset threshold, treating the field as a valid field.
4. The method of claim 2, characterized in that the field contained in a picture title of the website meeting the preset condition specifically comprises:
if the occurrence count of the field reaches a predetermined value, and the ratio of the occurrence count of the field to the total number of pictures contained in the website reaches a predetermined value, the field contained in the picture title of the website meets the preset condition; or
if the number of pictures contained in the website reaches a predetermined value, and the ratio of the occurrence count of the field to the occurrence count of all fields reaches a predetermined value, the field contained in the picture title of the website meets the preset condition; or
if the occurrence count of the field, or the ratio of the occurrence count of the field to the occurrence count of all fields, reaches a predetermined value, and the result of word segmentation on the field shows that the field belongs to invalid information, the field contained in the picture title of the website meets the preset condition.
5. The method of claim 4, characterized in that the field belonging to invalid information specifically comprises:
the field contains: forum, community, photograph album, registration, daily record, pinup picture, browsing, or reprinting.
6. The method of claim 1, characterized in that, before setting the invalid field recognition rule, the method further comprises:
dividing the titles of all pictures by the website of the page where each picture is located.
7. The method of claim 6, characterized in that, after the dividing, the method further comprises:
removing the fields in brackets from the picture title;
dividing the picture title into several fields according to separators;
counting the number of times each field contained in the picture titles under the same website occurs;
and identifying, according to the recognition rule, the invalid fields contained in picture titles in the website specifically comprises:
if the number of times the field occurs meets the preset condition, identifying the field as an invalid field.
8. The method of claim 1, characterized in that, after removing the invalid fields contained in the picture titles in the website, the method further comprises:
establishing a correspondence between the page website and the invalid fields.
9. The method of claim 8, characterized in that, after establishing the correspondence between the page website and the invalid fields, the method further comprises:
saving the fields in brackets in the picture title to a special domain;
for the page website where the picture is located, looking up the invalid fields in the picture title according to the correspondence between the page website and the invalid fields, and moving the invalid fields to the special domain;
taking the remaining text in the picture title as the picture title.
10. The method of claim 1, characterized in that, after removing the invalid fields contained in the picture titles in the page website, the method further comprises:
obtaining the picture titles relevant to a query term;
outputting the links corresponding to the picture titles.
11. The method of claim 1, characterized in that, after taking the remaining text in the picture title as the picture title, the method further comprises:
down-weighting the special domain where the invalid fields are located;
obtaining the picture titles relevant to a query term;
outputting the links corresponding to the picture titles.
12. A picture title processing device, characterized by comprising:
a setting unit, configured to set an invalid field recognition rule;
a recognition unit, configured to identify, according to the recognition rule, the invalid fields contained in picture titles in a page website;
a first removal unit, configured to remove the invalid fields contained in the picture titles in the page website.
13. The device of claim 12, characterized in that setting the invalid field recognition rule specifically comprises:
if a field contained in a picture title of the website meets a preset condition, setting the field as an invalid field.
14. The device of claim 12, characterized in that setting the invalid field recognition rule specifically comprises:
if the occurrence count of the field is less than a preset threshold, treating the field as a valid field.
15. The device of claim 13, characterized in that the field contained in a picture title of the website meeting the preset condition specifically comprises:
if the occurrence count of the field reaches a predetermined value, and the ratio of the occurrence count of the field to the total number of pictures contained in the website reaches a predetermined value, the field contained in the picture title of the website meets the preset condition; or
if the number of pictures contained in the website reaches a predetermined value, and the ratio of the occurrence count of the field to the occurrence count of all fields reaches a predetermined value, the field contained in the picture title of the website meets the preset condition; or
if the occurrence count of the field, or the ratio of the occurrence count of the field to the occurrence count of all fields, reaches a predetermined value, and the result of word segmentation on the field shows that the field belongs to invalid information, the field contained in the picture title of the website meets the preset condition.
16. The device of claim 15, characterized in that the field belonging to invalid information specifically comprises:
the field contains: forum, community, photograph album, registration, daily record, pinup picture, browsing, or reprinting.
17. The device of claim 12, characterized by further comprising:
a website division unit, configured to divide, before the invalid field recognition rule is set, the titles of all pictures by the website of the page where each picture is located.
18. The device of claim 17, characterized by further comprising:
a second removal unit, configured to remove, after the dividing, the fields in brackets from the picture title;
a separation unit, configured to divide the picture title into several fields according to separators;
a statistics unit, configured to count the number of times each field contained in the picture titles under the same website occurs;
wherein the recognition unit is specifically:
a second recognition unit, configured to identify the field as an invalid field if the number of times the field occurs meets the preset condition.
19. The device of claim 12, characterized by further comprising:
an establishing unit, configured to establish a correspondence between the page website and the invalid fields after the invalid fields contained in the picture titles in the website are removed.
20. The device of claim 19, characterized by further comprising:
a saving unit, configured to save the fields in brackets in the picture title to a special domain after the correspondence between the page website and the invalid fields is established;
a moving unit, configured to, for the page website where the picture is located, look up the invalid fields in the picture title according to the correspondence between the page website and the invalid fields, and move the invalid fields to the special domain;
a processing unit, configured to take the remaining text in the picture title as the picture title.
21. The device of claim 12, characterized by further comprising:
a first obtaining unit, configured to obtain the picture titles relevant to a query term after the invalid fields contained in the picture titles in the page website are removed;
a first output unit, configured to output the links corresponding to the picture titles.
22. The device of claim 12, characterized by further comprising:
a down-weighting unit, configured to down-weight the special domain where the invalid fields are located after the remaining text in the picture title is taken as the picture title;
a second obtaining unit, configured to obtain the picture titles relevant to a query term;
a second output unit, configured to output the links corresponding to the picture titles.
23. A search engine, characterized by comprising the device of any one of claims 12-22.
24. A method of searching pictures, characterized by comprising:
setting an invalid field recognition rule;
identifying, according to the recognition rule, the invalid fields contained in picture titles in a page website;
removing the invalid fields contained in the picture titles in the page website;
obtaining the picture titles relevant to a query term;
outputting the links corresponding to the picture titles.
25. The method of claim 24, characterized in that setting the invalid field recognition rule specifically comprises:
if a field contained in a picture title of the page website meets a preset condition, setting the field as an invalid field.
26. The method of claim 24, characterized in that, before setting the invalid field recognition rule, the method further comprises:
dividing the titles of all pictures by the website of the page where each picture is located.
27. The method of claim 26, characterized in that, after the dividing, the method further comprises:
removing the fields in brackets from the picture title;
dividing the picture title into several fields according to separators;
counting the number of times each field contained in the picture titles under the same website occurs;
and identifying, according to the recognition rule, the invalid fields contained in picture titles in the website specifically comprises:
if the number of times the field occurs meets the preset condition, identifying the field as an invalid field.
28. The method of claim 24, characterized in that, after removing the invalid fields contained in the picture titles in the website, the method further comprises:
establishing a correspondence between the page website and the invalid fields.
29. The method of claim 28, characterized in that, after establishing the correspondence between the page website and the invalid fields, the method further comprises:
saving the fields in brackets in the picture title to a special domain;
for the page website where the picture is located, looking up the invalid fields in the picture title according to the correspondence between the page website and the invalid fields, and moving the invalid fields to the special domain;
taking the remaining text in the picture title as the picture title.
30. The method of claim 24, characterized in that, after taking the remaining text in the picture title as the picture title, the method further comprises:
down-weighting the special domain where the invalid fields are located;
obtaining the picture titles relevant to a query term;
outputting the links corresponding to the picture titles.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101164554A CN101308508B (en) 2008-07-10 2008-07-10 Method and device for processing picture, and method for searching picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101164554A CN101308508B (en) 2008-07-10 2008-07-10 Method and device for processing picture, and method for searching picture

Publications (2)

Publication Number Publication Date
CN101308508A true CN101308508A (en) 2008-11-19
CN101308508B CN101308508B (en) 2011-11-02

Family

ID=40124962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101164554A Active CN101308508B (en) 2008-07-10 2008-07-10 Method and device for processing picture, and method for searching picture

Country Status (1)

Country Link
CN (1) CN101308508B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3606729B2 (en) * 1997-12-10 2005-01-05 松下電器産業株式会社 Rich text material display method and video information providing system
CN100371934C (en) * 2005-05-30 2008-02-27 北大方正集团有限公司 Index structuring method for fast searching mass picture based on content
CN100511230C (en) * 2006-05-29 2009-07-08 北京万网志成科技有限公司 Webpage-text based image search and display method thereof

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793509A (en) * 2014-01-27 2014-05-14 北京奇虎科技有限公司 Picture capturing method and device
CN103942272A (en) * 2014-03-27 2014-07-23 北京百度网讯科技有限公司 Image search method and device
CN103942272B (en) * 2014-03-27 2017-08-18 北京百度网讯科技有限公司 Image searching method and device
CN105117448A (en) * 2015-08-14 2015-12-02 新一站保险代理有限公司 Picture-based product exposure rate calculating method and system for online shopping
CN105117448B (en) * 2015-08-14 2018-06-01 新一站保险代理股份有限公司 Product exposure rate algorithm and system based on picture in a kind of shopping at network
CN107766365A (en) * 2016-08-18 2018-03-06 北京京东尚科信息技术有限公司 webpage generating method and device

Also Published As

Publication number Publication date
CN101308508B (en) 2011-11-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant