CN103377199B - Information processor and information processing method - Google Patents

Publication number
CN103377199B
Authority
CN
China
Prior art keywords
term
webpage
classification
unit
webpage classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210112493.9A
Other languages
Chinese (zh)
Other versions
CN103377199A (en)
Inventor
夏迎炬
杨宇航
葛付江
孙健
潘屹峰
陈思源
何源
孙俊
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201210112493.9A
Publication of CN103377199A
Application granted
Publication of CN103377199B
Legal status: Active

Abstract

An information processor and an information processing method are provided. The information processing method includes: recognizing character strings from a picture as candidate character strings; in response to the candidate character strings being obtained, obtaining a search term from them; in response to the search term being obtained, retrieving webpages using the search term; in response to the webpages being retrieved, clustering the retrieved webpages; when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category; when the relevance between the webpage category and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category; in response to the selection of the first webpage category, verifying the search term against the first webpage category, and using the verified search term as a candidate character string for obtaining a further search term; and identifying the topic type of the picture content based on the search term corresponding to the second webpage category and a pre-established picture classification system.

Description

Information processor and information processing method
Technical field
The present invention relates to the field of information processing, and in particular to an information processor and an information processing method for identifying the topic type of picture content and performing information queries based on that topic type.
Background art
Offline information carriers (such as paper, light boxes, and signboards) often cannot provide detailed information because space is limited. A user who wants to learn more, for example detailed activity rules, detailed product information, or information about a company, usually has to search further. In addition, a user who wants to compare related products (technical specifications, prices, and so on) or check word-of-mouth information has to search repeatedly. Locating this information among the massive amount of information on the Internet is difficult for an ordinary user.
In one existing approach, specific pictures are stored in a database; when a user uploads a picture, the most similar content is retrieved by image matching and its details are presented to the user. For example, when an information publisher releases offline information, it also saves the corresponding picture in a database. A user who sees the offline information and is interested in it can photograph it and upload the picture to the publisher's server. On receiving the retrieval request, the publisher uses image matching to return the best-matching advertising information in the database to the user. Another approach adds a barcode or QR code to the advertisement, so that the user only needs to upload a picture of the barcode or QR code to the server. Because barcodes and QR codes are easy to recognize, the precision of picture matching on the server improves considerably. This partly compensates for defects such as the low resolution of the user's camera (on a smart terminal such as a mobile phone), poor lighting, and reflections.
Summary of the invention
In essence, the systems described above use image matching to find, in an information picture database, the content that best matches the information picture uploaded by the user, and provide it to the user in various forms.
The main problems of these existing methods are as follows. The value-added service they provide for published information covers only part of the information: no service can be provided for information that does not appear in the information picture database. In addition, because there is no unified database or website that stores information pictures and the related information, the user does not know to whom an information picture should be sent. These problems limit existing value-added services for information and advertising.
To address these problems, an information processor and method are proposed that identify the topic of a picture without requiring a picture database to be built. The processor and method are not limited to information value-added service scenarios.
According to one embodiment of the present invention, an information processor is provided, including: a character recognition unit for recognizing at least one character string from a picture and inputting it to a search term acquisition unit as a candidate character string; the search term acquisition unit, for obtaining, in response to the input of the candidate character string, a search term for retrieval from the candidate character string; a retrieval unit for retrieving webpages using the obtained search term in response to the search term being obtained; a webpage selection unit for clustering the retrieved webpages in response to the webpages being retrieved, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, inputting that webpage category to a verification unit as a first webpage category, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, inputting that webpage category to a type identification unit as a second webpage category; the verification unit, for verifying, in response to the input of the first webpage category, the search term obtained by the search term acquisition unit against the first webpage category, and inputting the verified search term to the search term acquisition unit as a candidate character string; and the type identification unit, for identifying the topic type of the picture content based on the search term corresponding to the second webpage category and a pre-established picture classification system.
According to another embodiment of the present invention, an information processor is provided, including: a character recognition unit for recognizing at least one character string from a picture and inputting it to a search term acquisition unit as a candidate character string; the search term acquisition unit, for obtaining, in response to the input of the candidate character string, a search term for retrieval from the candidate character string; a retrieval unit for retrieving webpages using the obtained search term in response to the search term being obtained; a webpage selection unit for clustering the retrieved webpages in response to the webpages being retrieved, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, inputting that webpage category to a verification unit as a first webpage category, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, inputting that webpage category to a type identification unit as a second webpage category; the verification unit, for verifying, in response to the input of the first webpage category, the search term obtained by the search term acquisition unit against the first webpage category, and inputting the verified search term to the search term acquisition unit as a candidate character string; the type identification unit, for identifying the topic type of the picture content based on the search term corresponding to the second webpage category and a pre-established picture classification system; and a query unit for performing a data query based on the identified topic type of the picture content.
According to another embodiment of the present invention, an information processing method is provided, including: recognizing at least one character string from a picture as a candidate character string; in response to the candidate character string being obtained, obtaining a search term for retrieval from the candidate character string; in response to the search term being obtained, retrieving webpages using the obtained search term; in response to the webpages being retrieved, clustering the retrieved webpages, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category; in response to the selection of the first webpage category, verifying the search term against the first webpage category, and using the verified search term as a candidate character string for obtaining a further search term; and identifying the topic type of the picture content based on the search term corresponding to the second webpage category and a pre-established picture classification system.
According to another embodiment of the present invention, an information processing method is provided, including: recognizing at least one character string from a picture as a candidate character string; in response to the candidate character string being obtained, obtaining a search term for retrieval from the candidate character string; in response to the search term being obtained, retrieving webpages using the obtained search term; in response to the webpages being retrieved, clustering the retrieved webpages, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category; in response to the selection of the first webpage category, verifying the search term against the first webpage category, and using the verified search term as a candidate character string for obtaining a further search term; identifying the topic type of the picture content based on the search term corresponding to the second webpage category and a pre-established picture classification system; and performing a data query based on the identified topic type of the picture content.
Brief description of the drawings
Embodiments of the present invention are described below with reference to the accompanying drawings, from which the above and other objects, features, and advantages of the invention will be more readily understood. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. The sizes and relative positions of the units are not necessarily drawn to scale.
Fig. 1 is a block diagram illustrating the structure of a picture content topic type identification device according to an embodiment of the present invention.
Fig. 2 is a block diagram illustrating the structure of a search term acquisition unit according to an embodiment of the present invention.
Fig. 3 is a block diagram illustrating the structure of a webpage selection unit according to an embodiment of the present invention.
Fig. 4 is a block diagram illustrating the structure of an information query device based on picture content topic type according to an embodiment of the present invention.
Fig. 5 is a flow chart illustrating a picture content topic type identification method according to an embodiment of the present invention.
Fig. 6 is a flow chart illustrating an information query method based on picture content topic type according to an embodiment of the present invention.
Fig. 7 is a block diagram illustrating an example structure of a computer that implements the present invention.
Fig. 8 is an example of an information picture taken by a user with a camera built into, for example, a portable device.
Detailed description of the embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings. It should be noted that, for clarity, representations and descriptions of components and processes that are unrelated to the invention or known to those skilled in the art are omitted from the drawings and the description.
Fig. 1 is a block diagram illustrating the structure of a picture content topic type identification device 100 according to an embodiment of the present invention. The picture content topic type identification device 100 includes: a character recognition unit 101, a search term acquisition unit 102, a retrieval unit 103, a webpage selection unit 104, a verification unit 105, and a type identification unit 106.
The character recognition unit 101 recognizes at least one character string from the picture input to the picture content topic type identification device 100 and inputs it to the search term acquisition unit 102 as a candidate character string.
A picture can be input to the picture content topic type identification device 100 through various optical devices, such as an image scanner, a fax machine, or any photographic equipment. Photographic equipment may include a camera, or a camera built into a portable device such as a mobile phone. Fig. 8 shows an example of an information picture taken by a user with a camera built into, for example, a portable device. For convenience, this example picture is used below to describe the embodiments of the invention. It should be clear that the invention can be applied to any application that requires identifying the topic type of picture content, and is not limited to identifying the topic type of information picture content.
The character recognition unit 101 can use any of the widely used optical character recognition (OCR) techniques to recognize the characters in the picture. In one embodiment, the character recognition unit 101 first performs text localization to identify the text regions of the picture, and then recognizes the characters in those regions. For the information picture shown in Fig. 8, the character recognition unit 101 may recognize character strings such as the following:
Logical w1F1 mobile phone Lee in the whole nation
The logical w1F1 intelligent machine set meal in the whole nation
Combine huge offering
XZY company
Two fa.1.hy
Buddhist nun's cinder thing hands is laughed at without shallow " Gong Letong
Adjoin 5 words
It should be noted that, because of picture quality, artistic fonts, distorted characters, and similar factors, the recognition results of the character recognition unit 101 may fail to provide satisfactory keywords from which to form search terms for retrieving webpages. The character recognition unit 101 therefore inputs the recognized character strings (all or some of them) to the search term acquisition unit 102 as candidate character strings.
In response to the input of the candidate character strings, the search term acquisition unit 102 obtains search terms for retrieval from the input candidate character strings.
Specifically, the search term acquisition unit 102 selects keywords from the candidate character strings according to predetermined rules, and determines a keyword or a combination of keywords as a search term for retrieval.
The predetermined rules are, for example: excluding preset undesired words (that is, stop words) from the candidate character strings; excluding character strings whose word segmentation results are not recorded in a pre-prepared vocabulary; excluding character strings whose probability of occurrence, computed with a corpus-based transition probability method, is below a predetermined threshold; and/or ranking the character strings by at least one of word frequency, recognition confidence, position, font, named entity recognition result, and part of speech, and selecting the more important character strings as keywords. It should be understood that the predetermined rules for selecting keywords are not limited to these, and other rules can be adopted as needed.
Keywords or various combinations of keywords can be used as search terms. For example, "whole nation is logical", "the logical XYZ company in the whole nation", "the logical intelligent machine in the whole nation", and so on can be given to the retrieval unit 103 as search terms for retrieval.
An embodiment of the search term acquisition unit 102 is described below with reference to Fig. 2. Fig. 2 is a block diagram illustrating the structure of the search term acquisition unit according to an embodiment of the present invention. In this embodiment, the search term acquisition unit 102 includes a filtering unit 201 and a sorting unit 202.
The filtering unit 201 removes noise from the candidate character strings, such as "Buddhist nun's cinder thing hands is laughed at without shallow " Gong Letong" and "Adjoin 5 words"; some stop words are filtered out as well.
Specifically, the filtering unit 201 can filter stop words out of the candidate character strings based on a pre-established stop word dictionary. Stop words include, for example, auxiliary words and prepositions such as "in", and may also be any other character string that is not intended to be used as a search term.
Whether or not stop words are filtered, the filtering unit 201 can perform word segmentation on a candidate character string and look up the segmentation results in a vocabulary; if a result cannot be found in the vocabulary, the character string to which that segmentation result belongs is filtered out.
For example, the filtering unit 201 performs word segmentation on the character string "Buddhist nun's cinder thing hands is laughed at without shallow " Gong Letong", dividing it into, say, "Buddhist nun's cinder thing", "hands is laughed at without shallow", """ and "Gong Letong". It then looks up each of these segments in the pre-prepared vocabulary. Because words such as "Buddhist nun's cinder thing" cannot be found in the vocabulary, the filtering unit 201 filters out the character string "Buddhist nun's cinder thing hands is laughed at without shallow " Gong Letong". In this example, none of the segmentation results can be found in the vocabulary, so the corresponding character string is filtered out. It should be understood that, in some embodiments, the corresponding character string may also be filtered out when only one of several segmentation results cannot be found in the vocabulary.
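The stop word and vocabulary checks described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `filter_candidates`, the `max_unknown` parameter, and the assumption that each candidate arrives already segmented into words are all hypothetical.

```python
from typing import Iterable, List, Set


def filter_candidates(
    candidates: Iterable[List[str]],
    stop_words: Set[str],
    vocabulary: Set[str],
    max_unknown: int = 0,
) -> List[List[str]]:
    """Keep candidate strings whose segments are found in the vocabulary.

    Each candidate is an already-segmented list of words; the word
    segmenter itself is outside the scope of this sketch.
    """
    kept = []
    for segments in candidates:
        # Drop stop words first; they never become keywords.
        content = [w for w in segments if w not in stop_words]
        if not content:
            continue
        unknown = sum(1 for w in content if w not in vocabulary)
        # Discard the whole string if too many segments are unknown.
        if unknown <= max_unknown:
            kept.append(content)
    return kept
```

With the default `max_unknown=0`, a single out-of-vocabulary segment is enough to discard the string, which corresponds to the stricter variant mentioned above; a larger value tolerates some unknown segments.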
Optionally, the filtering unit 201 can use a corpus-based transition probability method to filter out candidate character strings whose probability of occurrence is below a predetermined threshold.
For example, based on statistics from a large-scale corpus, for the character string "Buddhist nun's cinder thing hands is laughed at without shallow " Gong Letong", the filtering unit 201 can obtain the probability of each character that may appear to the right of the character "hands". Characters such as "machine" (forming "mobile phone") and "art" (forming "operation") appear with relatively high probability, whereas "laughing at" appears with very low probability, and the probability that "hands is laughed at without shallow " Gong Letong" appears is zero. The filtering unit 201 can therefore filter out noise such as "hands is laughed at without shallow " Gong Letong".
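A corpus-based transition probability filter of this kind might look like the sketch below. The class `BigramModel`, the function `is_noise`, and the default threshold are hypothetical names and values; scoring a string by its weakest character transition is an assumption made here, since the text does not say how the per-character probabilities are combined.

```python
from collections import Counter
from typing import Iterable


class BigramModel:
    """Character-level transition probabilities estimated from a corpus."""

    def __init__(self, corpus: Iterable[str]) -> None:
        self.pair_counts: Counter = Counter()
        self.left_counts: Counter = Counter()
        for sentence in corpus:
            for left, right in zip(sentence, sentence[1:]):
                self.pair_counts[(left, right)] += 1
                self.left_counts[left] += 1

    def transition(self, left: str, right: str) -> float:
        """P(right | left); zero when `left` was never seen with a successor."""
        total = self.left_counts[left]
        return self.pair_counts[(left, right)] / total if total else 0.0

    def min_transition(self, text: str) -> float:
        """Smallest transition probability along the string (an assumption)."""
        if len(text) < 2:
            return 1.0
        return min(self.transition(a, b) for a, b in zip(text, text[1:]))


def is_noise(model: BigramModel, text: str, threshold: float = 0.01) -> bool:
    """Flag a string whose weakest transition falls below the threshold."""
    return model.min_transition(text) < threshold
```

A string containing a character pair that never occurs in the corpus gets a score of zero and is treated as noise, mirroring the zero-probability case in the example above.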
The filtering unit 201 outputs the candidate character strings, with the noise filtered out, to the sorting unit 202. The sorting unit 202 can rank the keywords according to at least one of the word frequency, recognition confidence, position, font, named entity recognition result, and part of speech of the candidate character strings.
The main purpose of ranking is to pick out, from a large number of candidate character strings, the important keywords that are most helpful for finding information relevant to the topic of the picture. The sorting unit 202 can base its ranking on the word frequency, recognition confidence, position, font, named entity recognition result, part of speech, and so on of each candidate character string. For example, a named entity (such as the trade name in the example of Fig. 8) is more important than a common word. Information such as the word frequency, recognition confidence, position, and font of a candidate character string can be obtained directly from the OCR results.
Based on the ranking produced by the sorting unit 202, keywords or various combinations of keywords can be used as search terms. As mentioned above, "whole nation is logical", "the logical XYZ company in the whole nation", "the logical intelligent machine in the whole nation", and so on can be given to the retrieval unit 103 as search terms for retrieval.
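The ranking and term-building step can be illustrated with a small sketch. The `Candidate` fields, the weights in `importance`, and the choice to combine only pairs of keywords are all assumptions for illustration; the text lists the features but gives no weighting scheme.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass
class Candidate:
    text: str
    frequency: int          # how often the string occurs in the picture
    confidence: float       # OCR recognition confidence in [0, 1]
    font_size: float        # relative font size in [0, 1]
    is_named_entity: bool   # e.g. a brand or company name


def importance(c: Candidate) -> float:
    # Hypothetical weights; the source only names the features.
    score = 1.0 * c.frequency + 2.0 * c.confidence + 1.0 * c.font_size
    if c.is_named_entity:
        score += 3.0  # named entities outrank common words
    return score


def build_search_terms(cands: List[Candidate], top_k: int = 3) -> List[str]:
    """Rank candidates, keep the top_k as keywords, and add pairwise combinations."""
    ranked = sorted(cands, key=importance, reverse=True)[:top_k]
    keywords = [c.text for c in ranked]
    terms = list(keywords)
    terms += [" ".join(pair) for pair in combinations(keywords, 2)]
    return terms
```

Single keywords come first in the returned list, followed by two-keyword combinations in ranking order.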
Returning to Fig. 1, the retrieval unit 103, in response to the search term being obtained, retrieves webpages using the obtained search term. The retrieval unit 103 can send the obtained search term to a search engine for retrieval.
The results retrieved by the retrieval unit 103 may contain information relevant to the topic of the picture. It is also possible that, because the search term is inaccurate, the retrieval results contain no information relevant to the topic of the picture. Subsequent processing is therefore needed to perform data mining on the retrieved webpages.
In response to the webpages retrieved by the retrieval unit 103, the webpage selection unit 104 clusters the retrieved webpages. When the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, the webpage category is input to the verification unit as a first webpage category; when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, the webpage category is input to the type identification unit as a second webpage category. A detailed example is given below.
Fig. 3 is a block diagram illustrating the structure of the webpage selection unit 104 according to an embodiment of the present invention. In this embodiment, the webpage selection unit 104 includes: a webpage filtering unit 301, a clustering unit 302, and a relevance judging unit 303.
The webpage filtering unit 301 is optional. It can filter out information on a webpage that is unrelated to the content of the page, for example by content extraction. Such irrelevant information includes, for example, extraneous information and links on the webpage. The webpage filtering unit 301 then inputs the webpages, with the irrelevant information filtered out, to the clustering unit 302.
The clustering unit 302 clusters the input webpages. Clustering subdivides the retrieval results obtained for a particular keyword into several categories whose content is similar. For example, the set of webpages retrieved with "whole nation is logical" as the search term may be divided into a category whose main content introduces the "whole nation is logical" brand itself, and a category whose main content is cooperation between "whole nation is logical" and a certain mobile phone manufacturer. Categorizing the results in this way helps further information mining.
The clustering unit 302 can use any of the commonly used clustering algorithms to cluster the retrieved webpages. The clustering unit 302 then inputs the webpage categories obtained to the relevance judging unit 303.
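Since the text leaves the clustering algorithm open, the sketch below uses a deliberately simple stand-in: greedy single-pass clustering on the Jaccard similarity of word sets. The function names and the `min_similarity` default are hypothetical, and a production system would more likely use an established algorithm over TF-IDF or similar features.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two word sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(pages: List[str], min_similarity: float = 0.3) -> List[List[int]]:
    """Greedy single-pass clustering; returns lists of page indices."""
    token_sets = [set(p.lower().split()) for p in pages]
    clusters: List[List[int]] = []
    for i, tokens in enumerate(token_sets):
        for cluster in clusters:
            # Compare against the first page of the cluster (its seed).
            if jaccard(tokens, token_sets[cluster[0]]) >= min_similarity:
                cluster.append(i)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([i])
    return clusters
```

Pages that share most of their vocabulary end up in the same category, while a page about a different aspect of the same keyword starts a category of its own.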
After the webpage categories are obtained, the relevance judging unit 303 performs a topic relevance judgment on them to determine the relevance between each webpage category and the search term.
This relevance reflects, overall, how many of the webpages in a webpage category are associated with the search term. The lower the relevance, the fewer webpages in the category are associated with the search term; the higher the relevance, the more webpages in the category are associated with it. The relevance can be measured in various ways. In one example, the relevance judging unit 303 can use the KL (Kullback-Leibler) distance to judge the relevance between a webpage category obtained as a clustering result and the search term, as shown in equation (1):
$$\mathrm{DIS}(Q, C) = \sum_{w_i \in Q} P(w_i \mid Q)\,\log\frac{P(w_i \mid Q)}{P(w_i \mid C)} \qquad (1)$$
Here, Q denotes the set of all search terms, w_i denotes one search term in that set, and C denotes one webpage category obtained by clustering the webpages. Equation (1) gives the KL distance DIS(Q, C) between Q and C. The larger the KL distance, the smaller the relevance R between the webpage category obtained as a clustering result and the search term; conversely, the smaller the KL distance, the larger the relevance R.
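The KL distance between Q and C can be computed along the following lines. This is a sketch under stated assumptions: P(w | Q) and P(w | C) are estimated from raw word counts, and additive smoothing (the `smoothing` parameter, not mentioned in the text) keeps P(w | C) positive so the logarithm is always defined.

```python
import math
from collections import Counter
from typing import Iterable, List


def kl_distance(query_terms: List[str], category_words: Iterable[str],
                smoothing: float = 1.0) -> float:
    """KL distance between the term distribution Q and a webpage category C."""
    q_counts = Counter(query_terms)
    c_counts = Counter(category_words)
    q_total = sum(q_counts.values())
    vocab = set(q_counts) | set(c_counts)
    c_total = sum(c_counts.values()) + smoothing * len(vocab)
    distance = 0.0
    for w, count in q_counts.items():
        p_q = count / q_total
        # Additive smoothing keeps P(w | C) strictly positive (an assumption).
        p_c = (c_counts[w] + smoothing) / c_total
        distance += p_q * math.log(p_q / p_c)
    return distance
```

A category whose word distribution matches the search terms yields a distance near zero, while a category that never mentions them yields a clearly larger value.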
In this embodiment, after obtaining the KL distance between Q and C, the relevance judging unit 303 compares it with two preset thresholds KL1 and KL2, where KL1 > KL2.
For example, the KL distance is first compared with the threshold KL1. A KL distance greater than KL1 indicates that the relevance between the webpage category obtained as a clustering result and the search term is very small. The webpage category obtained and the corresponding search term used for retrieval therefore cannot meet requirements such as accuracy.
A KL distance less than or equal to KL1 but greater than KL2 (corresponding to the case in which the relevance between the webpage category obtained as a clustering result and the search term is greater than or equal to the first predetermined degree but less than the second predetermined degree) indicates that the relevance between the webpage category and the search term is fairly good. The relevance judging unit 303 therefore inputs this webpage category to the verification unit as the first webpage category, so that the obtained search term can be verified.
A KL distance less than or equal to KL2 (corresponding to the case in which the relevance between the webpage category obtained as a clustering result and the search term is greater than or equal to the larger, second predetermined degree) indicates that the relevance between the webpage category and the search term essentially meets the requirement for identifying the topic type of the picture content. The relevance judging unit 303 therefore inputs this webpage category to the type identification unit as the second webpage category, so that the topic type of the picture content can be identified.
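The two-threshold decision can be summarized in a few lines. The `Route` enum and `route_category` function are hypothetical names; the comparisons follow the three cases described above.

```python
from enum import Enum


class Route(Enum):
    DISCARD = "relevance too low"
    VERIFY = "first category -> verification unit"
    IDENTIFY = "second category -> type identification unit"


def route_category(kl: float, kl1: float, kl2: float) -> Route:
    """Map a KL distance to one of the three cases, given KL1 > KL2."""
    if not kl1 > kl2:
        raise ValueError("expected KL1 > KL2")
    if kl > kl1:
        return Route.DISCARD   # relevance below the first predetermined degree
    if kl > kl2:
        return Route.VERIFY    # KL2 < distance <= KL1
    return Route.IDENTIFY      # distance <= KL2
```

Because a smaller distance means higher relevance, the larger threshold KL1 corresponds to the first (lower) predetermined degree of relevance and KL2 to the second (higher) one.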
Returning again to Fig. 1, the processing performed in the verification unit 105 and the type identification unit 106 is described in detail below.
In response to the input of the first webpage category, the verification unit 105 verifies the search term obtained by the search term acquisition unit 102 against the first webpage category, and inputs the verified search term to the search term acquisition unit 102 again as a candidate character string, so that the processing in the search term acquisition unit 102, the retrieval unit 103, and the webpage selection unit 104 is repeated.
The verification unit 105 verifies the search term obtained by the search term acquisition unit 102 (specifically, the keywords that make up the search term) against the first webpage category, that is, a webpage category whose relevance to the search term is fairly good, in order to help obtain a more accurate search term.
Consider the example information picture in Fig. 8. Because of the way information spreads on the Internet and the variety of places where it is published, the information in a picture may correspond to multiple similar or identical webpages (for reasons such as reposting). These webpages converge into one category when the search results are clustered.
Suppose the search term is wrong: for example, the text in the information is "whole nation is logical" but the search term obtained after processing is "current global mechanism". When "current global mechanism" is sent to the search engine, the results will still converge (some of the webpages will, with high probability, be grouped into one category). The clustering result retrieved with "current global mechanism" then needs to be checked against the keywords obtained in the search term acquisition unit 102. If only the keyword "current global mechanism" appears in the retrieval results, while the other keywords, such as "intelligent machine" and "XYZ company", do not appear or appear rarely, the keyword "current global mechanism" can be judged to be problematic, and so can any search term composed of it.
The verification unit 105 can also verify other search terms against the clustering results obtained with correct search terms. For example, in the retrieval results obtained with the keyword "intelligent machine" or "XYZ company", character-sequence matching can be used to find where the search terms obtained by the search term acquisition unit 102 occur in the clustering results. We may find, for instance, that "full * is logical", a wildcard form of "current global mechanism", occurs frequently in the clustering results and that the closest character sequence is "whole nation is logical". "Current global mechanism" can then be corrected to "whole nation is logical". The corrected search term in turn improves the retrieval results.
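The wildcard-style correction can be sketched with regular expressions. The function `correct_term` is hypothetical, and the sketch assumes the misrecognized term differs from the correct one in exactly one character; the placeholder strings in the usage below are invented for illustration and do not come from the source.

```python
import re
from collections import Counter
from typing import Iterable, Optional


def correct_term(term: str, pages: Iterable[str]) -> Optional[str]:
    """Return the most frequent string in `pages` that differs from `term`
    in exactly one character position, or None if there is none."""
    votes: Counter = Counter()
    texts = list(pages)
    for i in range(len(term)):
        # Replace position i with a one-character wildcard.
        pattern = re.escape(term[:i]) + "(.)" + re.escape(term[i + 1:])
        for text in texts:
            for match in re.finditer(pattern, text):
                if match.group(1) != term[i]:
                    votes[match.group(0)] += 1
    return votes.most_common(1)[0][0] if votes else None
```

For instance, `correct_term("natiorwide", ["nationwide pass", "the nationwide plan"])` proposes `"nationwide"`, because that is the one-character variant seen most often in the clustered pages.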
The specific verification approach of verification unit 105 described above is illustrative; other verification approaches may of course be used, as long as the term can be corrected.
It should be noted that the filtering, sorting, retrieval, clustering, topic correlation judgment and cross-checking performed in term acquiring unit 102, retrieval unit 103, webpage selection unit 104 and verification unit 105 form an iterative process of repeated optimization, which continues until a cluster result with satisfactory topic correlation is obtained. In the embodiment above, a cluster result with satisfactory topic correlation corresponds to the case where the KL distance is less than the small threshold KL2.
When the KL distance is less than (or equal to) the small threshold KL2, that is, when the correlation between the webpage classification obtained as a cluster result and the term exceeds a certain predetermined extent and is therefore sufficiently high, webpage selection unit 104 inputs this webpage classification to type identification unit 106.
Type identification unit 106 identifies the image content topic type based on the pre-established picture classification system and on the term corresponding to the second webpage classification, that is, the webpage classification obtained as a cluster result whose correlation with the term is greater than or equal to the larger, second predetermined extent.
A picture classification system can be established in various ways. Take establishing an information classification system for the information picture shown in Fig. 8 as an example. Information can be classified into categories such as "activity", "product" and "company promotion", or into "public-welfare information" and "non-public-welfare information".
Different categories correspond to different processing. For example, for "activity" information, elements such as "activity name", "time" and "place" are extracted during topic identification, while for "product" information, elements such as "product name" and "model" are extracted.
Type identification unit 106 performs identification based on the term corresponding to the webpage classification whose correlation with the term is sufficiently high and on the established picture classification system. In one embodiment, for an information picture, type identification unit 106 bases its identification on the distance between the term set Q and a given information topic category. In one example, this distance DIS(Q, T) can be calculated by formula (2) below:
$$\mathrm{DIS}(Q,T)=\prod_{W_i\in Q}P(W_i\mid Q)\,\log\frac{P(W_i\mid Q)}{P_s(W_i\mid T)}\qquad(2)$$
Here, T is the extensible semantic vocabulary set of a given information topic. For example, "product" can be expanded to "commodity", "brand", "model" and so on. Ps(Wi | T) is the probability that a word belongs to a semantic category; for example, Ps(promise is strange | product) is the probability that "promise is strange" belongs to the "product" category.
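As a rough illustration, the distance of formula (2) could be computed as in the sketch below. The dictionary inputs, the function names and the small `floor` used for words missing from T are assumptions. The per-word terms are combined with a product because that is the operator shown in formula (2) as reproduced here; summing the same terms would give a KL-divergence-style distance, and the text in this section does not settle which is intended.

```python
import math


def dis_terms(p_query, p_topic, floor=1e-9):
    """Per-word terms P(W_i|Q) * log(P(W_i|Q) / P_s(W_i|T)) for every W_i in Q.

    p_query maps each word of Q to P(W_i|Q) (assumed > 0);
    p_topic maps words to P_s(W_i|T).
    """
    terms = []
    for word, p_q in p_query.items():
        # floor avoids division by zero for unseen words (an assumption)
        p_s = max(p_topic.get(word, 0.0), floor)
        terms.append(p_q * math.log(p_q / p_s))
    return terms


def dis(p_query, p_topic):
    """Combine the per-word terms with a product, as formula (2) is printed."""
    return math.prod(dis_terms(p_query, p_topic))
```

The topic category T with the smallest distance to Q would then be the natural choice for the identified type, under the same assumptions.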
Note that the above is only one example of implementing type identification unit 106; various other approaches and calculations may be used to identify the image content topic type based on the specific term and the pre-established picture classification system.
The image content topic type identification device 100 according to an embodiment of the present invention has been described above with reference to Figs. 1 to 3. Another information processor according to an embodiment of the present invention is described below with reference to Fig. 4.
Fig. 4 is a block diagram illustrating the structure of information query device 400 based on image content topic type according to an embodiment of the present invention. Information query device 400 includes: character recognition unit 401, term acquiring unit 402, retrieval unit 403, webpage selection unit 404, verification unit 405, type identification unit 406 and query unit 407.
Character recognition unit 401 identifies at least one character string from the picture and inputs it to term acquiring unit 402 as an alternative character string. In response to the input of the alternative character string, term acquiring unit 402 obtains from it a term to be used for retrieval. In response to the acquisition of the term, retrieval unit 403 uses the acquired term to retrieve webpages. In response to the retrieved webpages, webpage selection unit 404 clusters them. When the correlation between a webpage classification obtained as a cluster result and the term is greater than or equal to the first predetermined extent but less than the second predetermined extent, webpage selection unit 404 inputs this webpage classification to verification unit 405 as a first webpage classification. When the correlation is greater than or equal to the second predetermined extent, webpage selection unit 404 inputs this webpage classification to type identification unit 406 as a second webpage classification. In response to the input of the first webpage classification, verification unit 405 verifies the term obtained by term acquiring unit 402 against the first webpage classification and inputs the verified term to term acquiring unit 402 as an alternative character string. Type identification unit 406 identifies the image content topic type based on the term corresponding to the second webpage classification and the pre-established picture classification system.
Character recognition unit 401, term acquiring unit 402, retrieval unit 403, webpage selection unit 404, verification unit 405 and type identification unit 406 have the same structure and function as character recognition unit 101, term acquiring unit 102, retrieval unit 103, webpage selection unit 104, verification unit 105 and type identification unit 106 described with reference to Fig. 1; their detailed description is therefore omitted.
Query unit 407 carries out data query based on the image content type of theme identified by type identification unit 406.
In one embodiment, query unit 407 may include an extraction unit (not shown). The extraction unit extracts, from the second webpage classification, information related to the identified image content topic type.
Again taking information publication as an example, although the application is not limited to it, the topic of a piece of information is expressed using words, phrases, consecutive characters and the like from the webpage classification obtained as the cluster result. This information can serve as keywords for subsequently providing value-added services.
Here, topic mining is carried out mainly in the set of webpages (cluster) most relevant to the information. Specifically, the content of the webpages is first extracted, and topic identification is performed on the extraction result, mainly extracting the information related to the information topic category. For example, for "activity" information, elements such as "activity name", "time" and "place" are extracted during topic identification. What is mainly performed here is named entity recognition, which extracts the named entities belonging to each category.
For "product" information, elements such as "product name" and "model" need to be extracted during topic identification. The value-added services available also differ by information type: for "product" information, product details, price comparison information, word-of-mouth information and the like can be provided, while for "performance" information, background introductions, price comparison information and the like can be provided.
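The category-dependent extraction can be pictured as a lookup from topic category to the elements of interest, applied to the output of a named entity recognizer. In the sketch below, the mapping name, the function name and the `(text, label)` tuple format of the recognizer output are assumptions; the recognizer itself is not shown.

```python
# Elements to extract per topic category, following the examples in the text.
ELEMENTS_BY_CATEGORY = {
    "activity": ["activity name", "time", "place"],
    "product": ["product name", "model"],
}


def extract_elements(category, entities):
    """Keep only the named entities whose label is relevant to the category.

    entities is a list of (text, label) pairs produced by some named entity
    recognizer (hypothetical input format).
    """
    wanted = ELEMENTS_BY_CATEGORY.get(category, [])
    return {
        label: [text for text, kind in entities if kind == label]
        for label in wanted
    }
```

An unknown category simply yields an empty result, which leaves room for adding further categories such as "performance" to the mapping.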
For subsequent value-added services, the words in the result document set can also be ranked and expanded to obtain more vocabulary related to the information, and in-depth mining can be carried out according to the category of service to be provided.
Fig. 5 is the flow chart illustrating image content type of theme recognition methods according to embodiments of the present invention.
In step S501, at least one character string is identified from the picture to be processed, as an alternative character string from which a term for retrieval is obtained. This identification step can be performed using OCR.
In step S502, in response to the identified alternative character string, a term for retrieval is obtained from it. Specifically, in step S502, keywords are selected from the alternative character string according to a predetermined rule, and a keyword or a combination of keywords is determined as the term for retrieval.
In one embodiment, the alternative character string is filtered in step S502. For example, stop words are filtered out of the alternative character string based on a pre-established stop-word dictionary. In addition, the alternative character string can be segmented into words, the segmentation results looked up in a vocabulary, and the character strings containing segments that cannot be found in the vocabulary filtered out.
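The two filters just described might be combined as in the following sketch. The function name and the `segment` parameter are assumptions; whitespace splitting stands in for a real word segmenter, which Chinese text would require.

```python
def filter_candidates(candidates, stop_words, vocabulary, segment=str.split):
    """Drop stop words, then drop every candidate string that still contains
    a segment missing from the vocabulary.

    segment is a placeholder for a word segmenter (assumption: whitespace
    splitting is enough for this illustration).
    """
    kept = []
    for text in candidates:
        words = [w for w in segment(text) if w not in stop_words]
        if words and all(w in vocabulary for w in words):
            kept.append(" ".join(words))
    return kept
```

A candidate produced by a recognition error usually contains at least one segment that is not a known word, so it is removed as a whole.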
Optionally, a corpus-based transition probability calculation can be used to filter out alternative character strings whose occurrence probability is less than a predetermined threshold.
In addition, after the alternative character strings are filtered, the remaining alternative character strings can be ranked according to at least one of word frequency, recognition confidence, position, font, named entity recognition result and part of speech, and the most important character strings chosen as keywords.
The keywords, or various combinations of them, are then used to form terms.
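A minimal sketch of ranking and term construction is given below. Only two of the listed ranking features are used, and the score (word frequency multiplied by recognition confidence), the class name and the parameters `top_k` and `max_size` are illustrative assumptions rather than the rule used in the embodiment.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    text: str
    frequency: int      # word frequency of the character string
    confidence: float   # recognition confidence reported by OCR


def rank_keywords(candidates, top_k=3):
    """Rank candidates by a simple score and keep the top_k as keywords."""
    ranked = sorted(candidates, key=lambda c: c.frequency * c.confidence, reverse=True)
    return [c.text for c in ranked[:top_k]]


def build_terms(keywords, max_size=2):
    """Form terms from single keywords and from combinations of keywords."""
    terms = []
    for size in range(1, max_size + 1):
        for combo in combinations(keywords, size):
            terms.append(" ".join(combo))
    return terms
```

Each resulting term can then be submitted to retrieval in turn, starting with the single highest-ranked keyword.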
In step S503, in response to the acquisition of term, the term obtained is used to carry out searching web pages.
In step S504, in response to the retrieved webpages, the retrieved webpages are clustered. Any commonly used clustering algorithm can be used for this clustering.
In step S505, the degree of correlation R between the webpage classification obtained as the clustering result and the term is computed. This degree of correlation can be computed using formula (1) described above. When the KL distance obtained by formula (1) is large, the degree of correlation R is small; conversely, when the KL distance is small, R is large.
In step S506, it is judged whether the degree of correlation R is greater than or equal to the first predetermined extent R1. If the judgment is NO, it is determined in step S507 that the degree of correlation R between the webpage classification obtained as the clustering result and the term is low, and that this webpage classification and its corresponding term are not suitable for identifying the topic type of the picture. Processing of this webpage classification then ends, and the system obtains another term and carries out the subsequent processing.
If R >= R1, it is judged in step S508 whether the degree of correlation R is greater than or equal to the second predetermined extent R2. If the judgment is NO, that is, R is less than R2, this webpage classification is selected as a first webpage classification, and in step S509 the term is verified against this first webpage classification. The verified term is then used as an alternative character string for obtaining a further term; that is, processing returns to step S502.
If the judgment in step S508 is YES, that is, R is greater than or equal to R2, this webpage classification is selected as a second webpage classification. In step S510, the image content topic type is identified based on the term corresponding to the second webpage classification and the pre-established picture classification system. Once the content topic type of the picture has been identified, the image content topic type identification method according to the embodiment of the present invention is complete.
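The two-threshold decision of steps S506 to S510 can be summarized in a few lines. The function name and the returned labels are hypothetical, and R1 < R2 is assumed.

```python
def route_cluster(relevance, r1, r2):
    """Decide how one webpage classification is handled, given its degree of
    correlation R with the term and the two predetermined extents R1 < R2."""
    if relevance < r1:
        return "discard"    # S507: not suitable, obtain another term
    if relevance < r2:
        return "verify"     # S509: first webpage classification, back to S502
    return "identify"       # S510: second webpage classification
```

Because verification feeds the corrected term back into step S502, the "verify" branch is what makes the overall method iterative.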
For the pre-establishment of the picture classification system, see the description given above with reference to Fig. 1.
Fig. 6 is a flowchart illustrating an information query method based on image content topic type according to an embodiment of the present invention. Since steps S601 to S610 in Fig. 6 are the same as steps S501 to S510 in Fig. 5, their detailed description is omitted.
In the step S611 of Fig. 6, carry out data query based on the image content type of theme identified in step S610.
In one embodiment, in step S611, information related to the identified image content topic type is extracted from the webpage classification that correlates well with the term. Topic identification is performed on the extraction result. For an information picture, mainly the information related to the information topic category is extracted; for example, for "activity" information, elements such as "activity name", "time" and "place" are extracted during topic identification. What is mainly performed here is named entity recognition, which extracts the named entities belonging to each category.
For "product" information, elements such as "product name" and "model" need to be extracted during topic identification. The value-added services available also differ by information type: for "product" information, product details, price comparison information, word-of-mouth information and the like can be provided, while for "performance" information, background introductions, price comparison information and the like can be provided.
For subsequent value-added services, the words in the result document set can also be ranked and expanded to obtain more vocabulary related to the information, and in-depth mining can be carried out according to the category of service to be provided.
Compared with conventional methods, embodiments of the present invention have the following advantages: pictures need not be stored in a database in advance, and information pictures need not be collected and organized in advance. The embodiments are widely applicable, and the value-added services they provide are more flexible.
In addition, in application scenarios such as information value-added services, further information about a piece of information can be obtained, for example, by taking a photo and uploading the picture. In providing value-added services, a third-party information service provider can be more objective and more flexible.
Hereinafter, the example arrangement of the computer of the data handling equipment realizing the present invention is described with reference to Fig. 7.Fig. 7 is the block diagram of the example arrangement illustrating the computer realizing the present invention.
In Fig. 7, a central processing unit (CPU) 701 performs various processes according to programs stored in read-only memory (ROM) 702 or programs loaded from storage part 708 into random access memory (RAM) 703. RAM 703 also stores, as needed, the data required when CPU 701 performs the various processes.
CPU701, ROM702 and RAM703 are connected to each other via bus 704.Input/output interface 705 is also connected to bus 704.
Components described below is connected to input/output interface 705: importation 706, including keyboard, mouse etc.;Output part 707, including display, such as cathode ray tube (CRT), liquid crystal display (LCD) etc., and speaker etc.;Storage part 708, including hard disk etc.;And communications portion 709, including NIC such as LAN card, modem etc..Communications portion 709 performs communication process via network such as the Internet.
As required, driver 710 is also connected to input/output interface 705.Detachable media 711 such as disk, CD, magneto-optic disk, semiconductor memory etc. are installed in driver 710 as required so that the computer program read out is installed in storage part 708 as required.
When being realized above-mentioned steps and process by software, the program constituting software is installed from network such as the Internet or storage medium such as detachable media 711.
Those skilled in the art will understand that this storage medium is not limited to the detachable media 711 shown in Fig. 7, in which the program is stored and which is distributed separately from the device to provide the program to the user. Examples of detachable media 711 include magnetic disks, optical disks (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD)) and semiconductor memory. Alternatively, the storage medium may be ROM 702, a hard disk contained in storage part 708, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
In the above description of specific embodiments of the invention, features described and/or illustrated for one embodiment can be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "include/comprise", when used herein, refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components. Ordinal terms such as "first" and "second" do not indicate an order of implementation or a degree of importance of the features, elements, steps or components they qualify; they are used only to distinguish between these features, elements, steps or components for clarity of description.
In addition, the methods of the embodiments of the present invention are not limited to being performed in the time order described in the specification or shown in the drawings; they may also be performed in another time order, in parallel, or independently. Therefore, the execution order of the methods described in this specification does not limit the technical scope of the present invention.
In summary, embodiments of the present invention provide the following schemes:
Remark 1. An information processor, including:
Character recognition unit, for identifying at least one character string from picture, and it can be used as alternative characters string to be input to term acquiring unit;
Term acquiring unit, for the input in response to alternative characters string, obtains for carrying out the term retrieved according to described alternative characters string;
Retrieval unit, for the acquisition in response to term, uses acquired term to carry out searching web pages;
Webpage selects unit, in response to the webpage retrieved, the webpage retrieved being clustered;Further, when the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the first predetermined extent but less than the second predetermined extent, this webpage classification is input to verification unit as the first webpage classification;When the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the second predetermined extent, this webpage classification is input to type identification unit as the second webpage classification;
Said verification unit, for, in response to the input of the first webpage classification, verifying the term obtained by said term acquiring unit against said first webpage classification, and inputting the verified term to said term acquiring unit as an alternative character string; and
Described type identification unit, for being identified image content type of theme based on the term corresponding with described second webpage classification and the picture classification system pre-build.
Remark 2. The information processor according to remark 1, wherein said term acquiring unit selects keywords from said alternative character string according to a predetermined rule, and determines a keyword or a combination of keywords as the term for retrieval.
Remark 3. The information processor according to remark 2, wherein said term acquiring unit includes a filter unit, and said filter unit is for filtering stop words out of said alternative character string based on a pre-established stop-word dictionary.
Remark 4. The information processor according to remark 2 or 3, wherein said filter unit is for segmenting said alternative character string into words, looking up the segmentation results in a vocabulary, and filtering out the character strings containing segmentation results that cannot be found in the vocabulary.
Remark 5. The information processor according to remark 2 or 3, wherein said filter unit is for using a corpus-based transition probability calculation to filter out alternative character strings whose occurrence probability is less than a predetermined threshold.
Remark 6. The information processor according to any one of remarks 3 to 5, wherein said term acquiring unit includes a sorting unit, and said sorting unit is for ranking character strings according to at least one of the word frequency, recognition confidence, position, font, named entity recognition result and part of speech of said alternative character string.
Remark 7. The information processor according to any one of remarks 1 to 6, wherein said webpage selection unit includes a webpage filter unit, and said webpage filter unit is for filtering out information on a webpage that is unrelated to the webpage content.
Remark 8. The information processor according to any one of remarks 1 to 7, wherein said type identification unit identifies the image content topic type based on the distance between the set of input terms and a particular picture category defined in the pre-established picture classification system.
Remark 9. The information processor according to any one of remarks 1 to 8, wherein said picture is an information picture, and said information processor is an information content topic type identification device.
Remark 10. An information processor, including:
Character recognition unit, for identifying at least one character string from picture, and it can be used as alternative characters string to be input to term acquiring unit;
Term acquiring unit, for the input in response to alternative characters string, obtains for carrying out the term retrieved according to described alternative characters string;
Retrieval unit, for the acquisition in response to term, uses acquired term to carry out searching web pages;
Webpage selects unit, in response to the webpage retrieved, the webpage retrieved being clustered;Further, when the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the first predetermined extent but less than the second predetermined extent, this webpage classification is input to verification unit as the first webpage classification;When the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the second predetermined extent, this webpage classification is input to type identification unit as the second webpage classification;
Said verification unit, for, in response to the input of the first webpage classification, verifying the term obtained by said term acquiring unit against said first webpage classification, and inputting the verified term to said term acquiring unit as an alternative character string; and
Described type identification unit, for being identified image content type of theme based on the term corresponding with described second webpage classification and the picture classification system pre-build;And
Query unit, for carrying out data query based on the image content type of theme identified.
Remark 11. The information processor according to remark 10, wherein said query unit includes an extraction unit, and said extraction unit is for extracting, from said second webpage classification, information related to the identified image content topic type.
Remark 12. The information processor according to remark 10 or 11, wherein said picture is an information picture, and said information processor is an information query device based on information content topic type.
Remark 13. An information processing method, including:
Identifying at least one character string from a picture as an alternative character string;
In response to the alternative characters string obtained, obtain for carrying out the term retrieved according to described alternative characters string;
In response to the acquisition of term, acquired term is used to carry out searching web pages;
In response to the webpage retrieved, the webpage retrieved is clustered;Further, when the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the first predetermined extent but less than the second predetermined extent, this webpage classification is chosen as the first webpage classification;When the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the second predetermined extent, this webpage classification is chosen as the second webpage classification;
In response to the selection of the first webpage classification, verifying the term against said first webpage classification, and using the verified term as an alternative character string for obtaining a further term; and
Based on the term corresponding with described second webpage classification and the picture classification system pre-build, image content type of theme is identified.
Remark 14. An information processing method, including:
Identifying at least one character string from a picture as an alternative character string;
In response to the alternative characters string obtained, obtain for carrying out the term retrieved according to described alternative characters string;
In response to the acquisition of term, acquired term is used to carry out searching web pages;
In response to the webpage retrieved, the webpage retrieved is clustered;Further, when the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the first predetermined extent but less than the second predetermined extent, this webpage classification is chosen as the first webpage classification;When the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the second predetermined extent, this webpage classification is chosen as the second webpage classification;
In response to the selection of the first webpage classification, verifying the term against said first webpage classification, and using the verified term as an alternative character string for obtaining a further term;
Based on the term corresponding with described second webpage classification and the picture classification system pre-build, image content type of theme is identified;And
Data query is carried out based on the image content type of theme identified.

Claims (10)

1. an information processor, including:
Character recognition unit, for identifying at least one character string from picture, and it can be used as alternative characters string to be input to term acquiring unit;
Term acquiring unit, for the input in response to alternative characters string, obtains for carrying out the term retrieved according to described alternative characters string;
Retrieval unit, for the acquisition in response to term, uses acquired term to carry out searching web pages;
Webpage selects unit, in response to the webpage retrieved, the webpage retrieved being clustered;Further, when the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the first predetermined extent but less than the second predetermined extent, this webpage classification is input to verification unit as the first webpage classification;When the dependency of the webpage classification obtained as cluster result and term is be more than or equal to the second predetermined extent, this webpage classification is input to type identification unit as the second webpage classification;
Said verification unit, for, in response to the input of the first webpage classification, verifying the term obtained by said term acquiring unit against said first webpage classification, and inputting the verified term to said term acquiring unit as an alternative character string; and
Described type identification unit, for being identified image content type of theme based on the term corresponding with described second webpage classification and the picture classification system pre-build.
2. The information processor according to claim 1, wherein said term acquiring unit includes a filter unit, and said filter unit is for filtering stop words out of said alternative character string based on a pre-established stop-word dictionary.
3. The information processor according to claim 2, wherein said filter unit is for segmenting said alternative character string into words, looking up the segmentation results in a vocabulary, and filtering out the character strings containing segmentation results that cannot be found in the vocabulary.
4. The information processor according to claim 2, wherein said filter unit is for using a corpus-based transition probability calculation to filter out alternative character strings whose occurrence probability is less than a predetermined threshold.
5. The information processor according to any one of claims 2 to 4, wherein said term acquiring unit includes a sorting unit, and said sorting unit is for ranking character strings according to at least one of the word frequency, recognition confidence, position, font, named entity recognition result and part of speech of said alternative character string.
6. The information processor according to any one of claims 1 to 4, wherein said type identification unit identifies the image content topic type based on the distance between the set of input terms and a particular picture category defined in the pre-established picture classification system.
7. An information processor, including:
A character recognition unit, configured to recognize at least one character string from a picture and input it to a search term acquiring unit as a candidate character string;
The search term acquiring unit, configured to obtain, in response to the input of the candidate character string, a search term for retrieval according to the candidate character string;
A retrieval unit, configured to retrieve web pages using the obtained search term in response to the search term being obtained;
A web page selecting unit, configured to cluster the retrieved web pages in response to the web pages being retrieved; and, when the relevance between a web page category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, to input this web page category to a verification unit as a first web page category; and, when the relevance between a web page category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, to input this web page category to a type identification unit as a second web page category;
The verification unit, configured to verify, in response to the input of the first web page category, the search term obtained by the search term acquiring unit against the first web page category, and to input the verified search term to the search term acquiring unit as a candidate character string;
The type identification unit, configured to identify the theme type of the picture content based on the search term corresponding to the second web page category and a pre-established picture classification system; and
A query unit, configured to perform a data query based on the identified theme type of the picture content.
8. The information processor according to claim 7, wherein the query unit includes an extraction unit, and the extraction unit is configured to extract, from the second web page category, information related to the identified theme type of the picture content.
9. An information processing method, including:
Recognizing at least one character string from a picture as a candidate character string;
In response to the candidate character string being obtained, obtaining a search term for retrieval according to the candidate character string;
In response to the search term being obtained, retrieving web pages using the obtained search term;
In response to the web pages being retrieved, clustering the retrieved web pages; and, when the relevance between a web page category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting this web page category as a first web page category; and, when the relevance between a web page category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting this web page category as a second web page category;
In response to the selection of the first web page category, verifying the search term against the first web page category, and using the verified search term as a candidate character string for further obtaining search terms; and
Identifying the theme type of the picture content based on the search term corresponding to the second web page category and a pre-established picture classification system.
10. An information processing method, including:
Recognizing at least one character string from a picture as a candidate character string;
In response to the candidate character string being obtained, obtaining a search term for retrieval according to the candidate character string;
In response to the search term being obtained, retrieving web pages using the obtained search term;
In response to the web pages being retrieved, clustering the retrieved web pages; and, when the relevance between a web page category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting this web page category as a first web page category; and, when the relevance between a web page category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting this web page category as a second web page category;
In response to the selection of the first web page category, verifying the search term against the first web page category, and using the verified search term as a candidate character string for further obtaining search terms;
Identifying the theme type of the picture content based on the search term corresponding to the second web page category and a pre-established picture classification system; and
Performing a data query based on the identified theme type of the picture content.
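The two-threshold selection step recited in claims 7, 9 and 10 can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `WebPageCategory` and `route_categories`, the representation of relevance as a single number, and the decision to simply drop categories below the first predetermined degree are all assumptions made for illustration (the claims do not say how relevance is computed or what happens to categories below the first degree).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WebPageCategory:
    """One cluster of retrieved web pages (hypothetical structure)."""
    name: str
    relevance: float  # assumed: a numeric relevance score to the search term


def route_categories(
    categories: List[WebPageCategory],
    first_degree: float,
    second_degree: float,
) -> Tuple[List[WebPageCategory], List[WebPageCategory]]:
    """Split clustered categories using the two predetermined degrees.

    Returns (first_categories, second_categories):
      - first:  first_degree <= relevance < second_degree
                (would be passed on for search term verification)
      - second: relevance >= second_degree
                (would be passed on for theme type identification)
    Categories below first_degree are dropped here, which is an assumption.
    """
    if first_degree > second_degree:
        raise ValueError("first_degree must not exceed second_degree")

    first: List[WebPageCategory] = []
    second: List[WebPageCategory] = []
    for category in categories:
        if category.relevance >= second_degree:
            second.append(category)
        elif category.relevance >= first_degree:
            first.append(category)
    return first, second
```

With hypothetical degrees of 0.4 and 0.8, a category scoring 0.5 would fall into the first group and be used to verify the search terms, while a category scoring 0.8 or higher would fall into the second group and feed theme type identification.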
CN201210112493.9A 2012-04-16 2012-04-16 Information processor and information processing method Active CN103377199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210112493.9A CN103377199B (en) 2012-04-16 2012-04-16 Information processor and information processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210112493.9A CN103377199B (en) 2012-04-16 2012-04-16 Information processor and information processing method

Publications (2)

Publication Number Publication Date
CN103377199A CN103377199A (en) 2013-10-30
CN103377199B CN103377199B (en) 2016-06-29

Family

ID=49462329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210112493.9A Active CN103377199B (en) 2012-04-16 2012-04-16 Information processor and information processing method

Country Status (1)

Country Link
CN (1) CN103377199B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108234347B (en) 2017-12-29 2020-04-07 北京神州绿盟信息安全科技股份有限公司 Method, device, network equipment and storage medium for extracting feature string
CN110889028A (en) * 2018-08-15 2020-03-17 北京嘀嘀无限科技发展有限公司 Corpus processing and model training method and system
CN111726336B (en) * 2020-05-14 2021-10-29 北京邮电大学 Method and system for extracting identification information of networked intelligent equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0752673A1 (en) * 1995-07-03 1997-01-08 Canon Kabushiki Kaisha Information processing method and apparatus for searching image or text information
CN101419673A (en) * 2004-04-12 2009-04-29 富士施乐株式会社 Image dictionary creating apparatus and method
CN101556584A (en) * 2008-04-10 2009-10-14 深圳市万水千山网络发展有限公司 Computer system and method for achieving picture transaction

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0752673A1 (en) * 1995-07-03 1997-01-08 Canon Kabushiki Kaisha Information processing method and apparatus for searching image or text information
US6310971B1 (en) * 1995-07-03 2001-10-30 Canon Kabushiki Kaisha Information processing method and apparatus, and storage medium storing medium storing program for practicing this method
CN101419673A (en) * 2004-04-12 2009-04-29 富士施乐株式会社 Image dictionary creating apparatus and method
CN101556584A (en) * 2008-04-10 2009-10-14 深圳市万水千山网络发展有限公司 Computer system and method for achieving picture transaction

Also Published As

Publication number Publication date
CN103377199A (en) 2013-10-30

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
US20220261427A1 (en) Methods and system for semantic search in large databases
US9372920B2 (en) Identifying textual terms in response to a visual query
US7386438B1 (en) Identifying language attributes through probabilistic analysis
CN101467145B (en) Method and apparatus for automatically annotating images
US7739258B1 (en) Facilitating searches through content which is accessible through web-based forms
US8577882B2 (en) Method and system for searching multilingual documents
CN102779140B (en) Keyword acquisition method and device
US8290269B2 (en) Image document processing device, image document processing method, program, and storage medium
Khusro et al. On methods and tools of table detection, extraction and annotation in PDF documents
CN109145110B (en) Label query method and device
CN106708940B (en) Method and device for processing pictures
US20080201131A1 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
CN102402604A (en) Effective Forward Ordering Of Search Engine
US10152540B2 (en) Linking thumbnail of image to web page
US20220382975A1 (en) Self-supervised document representation learning
CN103377199B (en) Information processor and information processing method
CN106372232B (en) Information mining method and device based on artificial intelligence
CN112241463A (en) Search method based on fusion of text semantics and picture information
CN115186240A (en) Social network user alignment method, device and medium based on relevance information
CN114706948A (en) News processing method and device, storage medium and electronic equipment
CN113486148A (en) PDF file conversion method and device, electronic equipment and computer readable medium
CN111752922A (en) Method and device for establishing knowledge database and realizing knowledge query
Vadivukarassi et al. A framework of keyword based image retrieval using proposed Hog_Sift feature extraction method from Twitter Dataset
CN111241313A (en) Retrieval method and device supporting image input

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant