CN103377199A - Information processing device and information processing method - Google Patents

Publication number
CN103377199A
Authority
CN
China
Prior art keywords
term
webpage
classification
unit
webpage classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101124939A
Other languages
Chinese (zh)
Other versions
CN103377199B (en)
Inventor
夏迎炬
杨宇航
葛付江
孙健
潘屹峰
陈思源
何源
孙俊
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201210112493.9A
Publication of CN103377199A
Application granted
Publication of CN103377199B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an information processing device and an information processing method. The method includes: recognizing character strings in a picture as candidate character strings; in response to obtaining the candidate character strings, obtaining a search term from the candidate character strings; in response to obtaining the search term, using the search term to retrieve webpages; in response to the retrieved webpages, clustering the retrieved webpages; when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category; when the relevance between the webpage category and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category; in response to the selection of the first webpage category, verifying the search term against the first webpage category and using the verified search term as a candidate character string from which a further search term is obtained; and identifying the picture content topic type on the basis of the search term corresponding to the second webpage category and a pre-built picture classification system.

Description

Information processing device and information processing method
Technical field
The present invention relates to the field of information processing, and in particular to an information processing device and an information processing method for identifying the topic type of picture content and performing information queries based on that topic type.
Background art
Offline information carriers (such as paper, light boxes and signboards) often cannot provide detailed information because space is limited. A user who wants to learn more, for example event rules, product details or information about a company, usually has to search further. In addition, a user who wants to compare related products (technical specifications, prices and so on) or check word-of-mouth reviews has to search repeatedly. Locating this information in the massive volume of information on the Internet is rather difficult for ordinary users.
In one existing approach, specific pictures are stored in a database. When a user uploads a picture, the most similar content is retrieved by image matching, and the details of that content are presented to the user. For example, when publishing offline information, the information publisher also saves the corresponding picture in the database. A user who sees the offline information and is interested in it can take a photograph and upload the picture to the information publisher's server. On receiving the retrieval request, the information publisher uses image matching to return the best-matching advertising information in the database to the user. Another approach adds a barcode or two-dimensional code to the advertisement, so that the user only needs to upload a picture of the barcode or two-dimensional code to the server. Because barcodes and two-dimensional codes are easy to recognize, the server can match pictures with much higher precision. This can partly compensate for shortcomings such as the low resolution of the user's camera equipment (a smart terminal such as a mobile phone), poor lighting and reflections.
Summary of the invention
The essence of the systems described above is to use image matching to find, in an information picture database, the content that best matches the information picture uploaded by the user, and to provide that content to the user in various forms.
The main problems of these existing methods are as follows. The value-added service for information publishing provided in this way covers only part of the information: no service can be provided for information that does not appear in the information picture database. In addition, because there is no unified database or website that stores information pictures and the related information, the user does not know to whom the information picture should be sent. These problems limit existing value-added services for informational advertising.
To address these problems, an information processing device and method are proposed that identify the topic of a picture without requiring a picture database to be built. The device and method are not limited to information value-added service scenarios.
According to one embodiment of the present invention, an information processing device is provided, comprising: a character recognition unit for recognizing at least one character string in a picture and inputting it to a search term acquiring unit as a candidate character string; the search term acquiring unit for obtaining, in response to the input of the candidate character string, a search term for retrieval from the candidate character string; a retrieval unit for retrieving webpages, in response to the obtaining of the search term, using the obtained search term; a webpage selection unit for clustering the retrieved webpages in response to the retrieved webpages and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, inputting that webpage category to a verification unit as a first webpage category, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, inputting that webpage category to a type identification unit as a second webpage category; the verification unit for verifying, in response to the input of the first webpage category, the search term obtained by the search term acquiring unit against the first webpage category, and inputting the verified search term to the search term acquiring unit as a candidate character string; and the type identification unit for identifying the picture content topic type based on the search term corresponding to the second webpage category and a pre-built picture classification system.
According to another embodiment of the present invention, an information processing device is provided, comprising: a character recognition unit for recognizing at least one character string in a picture and inputting it to a search term acquiring unit as a candidate character string; the search term acquiring unit for obtaining, in response to the input of the candidate character string, a search term for retrieval from the candidate character string; a retrieval unit for retrieving webpages, in response to the obtaining of the search term, using the obtained search term; a webpage selection unit for clustering the retrieved webpages in response to the retrieved webpages and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, inputting that webpage category to a verification unit as a first webpage category, and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, inputting that webpage category to a type identification unit as a second webpage category; the verification unit for verifying, in response to the input of the first webpage category, the search term obtained by the search term acquiring unit against the first webpage category, and inputting the verified search term to the search term acquiring unit as a candidate character string; the type identification unit for identifying the picture content topic type based on the search term corresponding to the second webpage category and a pre-built picture classification system; and a query unit for performing a data query based on the identified picture content topic type.
According to another embodiment of the present invention, an information processing method is provided, comprising: recognizing at least one character string in a picture as a candidate character string; in response to the obtained candidate character string, obtaining a search term for retrieval from the candidate character string; in response to the obtaining of the search term, retrieving webpages using the obtained search term; in response to the retrieved webpages, clustering the retrieved webpages; when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category; when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category; in response to the selection of the first webpage category, verifying the search term against the first webpage category and using the verified search term as a candidate character string from which a further search term is obtained; and identifying the picture content topic type based on the search term corresponding to the second webpage category and a pre-built picture classification system.
According to another embodiment of the present invention, an information processing method is provided, comprising: recognizing at least one character string in a picture as a candidate character string; in response to the obtained candidate character string, obtaining a search term for retrieval from the candidate character string; in response to the obtaining of the search term, retrieving webpages using the obtained search term; in response to the retrieved webpages, clustering the retrieved webpages; when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category; when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category; in response to the selection of the first webpage category, verifying the search term against the first webpage category and using the verified search term as a candidate character string from which a further search term is obtained; identifying the picture content topic type based on the search term corresponding to the second webpage category and a pre-built picture classification system; and performing a data query based on the identified picture content topic type.
Description of the drawings
The above and other objects, features and advantages of the present invention will be more easily understood from the following description of embodiments of the invention with reference to the accompanying drawings. In the drawings, identical or corresponding technical features or components are denoted by identical or corresponding reference numerals. The sizes and relative positions of the units in the drawings are not necessarily drawn to scale.
Fig. 1 is a block diagram illustrating the structure of a picture content topic type identification device according to an embodiment of the invention.
Fig. 2 is a block diagram illustrating the structure of a search term acquiring unit according to an embodiment of the invention.
Fig. 3 is a block diagram illustrating the structure of a webpage selection unit according to an embodiment of the invention.
Fig. 4 is a block diagram illustrating the structure of an information query device based on the picture content topic type according to an embodiment of the invention.
Fig. 5 is a flowchart illustrating a picture content topic type identification method according to an embodiment of the invention.
Fig. 6 is a flowchart illustrating an information query method based on the picture content topic type according to an embodiment of the invention.
Fig. 7 is a block diagram illustrating an example structure of a computer that implements the present invention.
Fig. 8 illustrates an example of an information picture taken by a user with, for example, a camera provided on a portable device.
Embodiments
Embodiments of the present invention are described below with reference to the drawings. It should be noted that, for the sake of clarity, representations and descriptions of components and processing that are unrelated to the present invention or known to those skilled in the art are omitted from the drawings and the description.
Fig. 1 is a block diagram illustrating the structure of a picture content topic type identification device 100 according to an embodiment of the invention. Picture content topic type identification device 100 comprises: character recognition unit 101, search term acquiring unit 102, retrieval unit 103, webpage selection unit 104, verification unit 105 and type identification unit 106.
Character recognition unit 101 recognizes at least one character string in the picture input to picture content topic type identification device 100 and inputs it to search term acquiring unit 102 as a candidate character string.
A picture can be input to picture content topic type identification device 100 through various optical instruments, such as an image scanner, a facsimile machine or any photographic equipment. Photographic equipment can include a standalone camera or the camera provided on a portable device such as a mobile phone. Fig. 8 illustrates an example of an information picture taken by a user with, for example, a camera provided on a portable device. For convenience of description, this example picture is used below to explain the embodiments of the invention. It should be understood, however, that the present invention can be applied to any application that needs to identify the topic type of picture content, and is not limited to identifying the topic type of the content of information pictures.
Character recognition unit 101 can use any of the widely used optical character recognition (OCR) technologies to recognize the characters in the picture. In one embodiment, character recognition unit 101 first performs text localization to identify the text regions of the picture, and then recognizes the characters in those regions. Taking the information picture shown in Fig. 8 as an example, character recognition unit 101 may recognize, for example, the following character strings:
Logical w1F1 mobile phone Lee in the whole nation
The logical w1F1 intelligent machine set meal in the whole nation
Unite huge offering
XZY company
Two f are y a.1.h
Buddhist nun's cinder thing hand is laughed at without shallow " Gong Letong
Adjoin 5 words
It should be noted that, because of picture quality and deformed characters such as artistic fonts, the recognition result of character recognition unit 101 may not provide satisfactory keywords from which to form a search term for retrieving webpages. Character recognition unit 101 therefore inputs the recognized character strings (all of them or only some of them) to search term acquiring unit 102 as candidate character strings.
In response to the input of the candidate character strings, search term acquiring unit 102 obtains a search term for retrieval from the input candidate character strings.
Specifically, search term acquiring unit 102 selects keywords from the candidate character strings according to predetermined rules, and determines a keyword or a combination of keywords as the search term for retrieval.
The predetermined rules are, for example: excluding predefined unacceptable words (that is, stop words) from the candidate character strings; excluding character strings whose word segmentation results are not recorded in a vocabulary list prepared in advance; excluding character strings whose occurrence probability, as calculated by a corpus-based transition probability method, is less than a predetermined threshold; and/or ranking the character strings according to at least one of their word frequency, recognition confidence, position, font, named entity recognition result and part of speech, and selecting the more important character strings as keywords. It will be understood that the predetermined rules for selecting keywords are not limited to these, and other rules may be adopted as needed.
A keyword or any combination of keywords can be used as a search term. For example, "whole nation logical", "the logical XYZ company in the whole nation" and "the logical intelligent machine in the whole nation" can each be passed to retrieval unit 103 as a search term.
An embodiment of search term acquiring unit 102 is described below with reference to Fig. 2. Fig. 2 is a block diagram illustrating the structure of the search term acquiring unit according to an embodiment of the invention. In this embodiment, search term acquiring unit 102 comprises filtering unit 201 and ranking unit 202.
The role of filtering unit 201 is to remove noise from the candidate character strings, such as "Buddhist nun's cinder thing hand is laughed at without shallow " Gong Letong" and "Adjoin 5 words". Some stop words can also be filtered out.
Specifically, filtering unit 201 can remove stop words from the candidate character strings based on a stop word dictionary built in advance. Stop words are, for example, auxiliary words or prepositions, and may also be any other character strings that are not intended to be used as search terms.
Whether or not stop words have been filtered out, filtering unit 201 can perform word segmentation on a candidate character string and look up the segmentation results in a vocabulary list. If a segmentation result cannot be found in the vocabulary list, the character string to which that segmentation result belongs is filtered out.
For example, filtering unit 201 performs word segmentation on the character string "Buddhist nun's cinder thing hand is laughed at without shallow " Gong Letong", dividing it into, say, "Buddhist nun's cinder thing", "hand is laughed at without shallow", """ and "Gong Letong". It then looks up each of these segments in the vocabulary list prepared in advance. Because words such as "Buddhist nun's cinder thing" cannot be found in the vocabulary list, filtering unit 201 filters out the character string "Buddhist nun's cinder thing hand is laughed at without shallow " Gong Letong". In this example none of the segmentation results can be found in the vocabulary list, so the corresponding character string is filtered out. It will be appreciated, however, that in some embodiments the corresponding character string can also be filtered out when only one of several segmentation results cannot be found in the vocabulary list.
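The stop word and vocabulary checks performed by unit 201 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `filter_candidates`, the whitespace-based default segmenter and the `require_all_known` switch are all hypothetical, and a real system for Chinese text would plug in a proper word segmenter.

```python
from typing import Callable, Iterable, List, Set


def filter_candidates(
    candidates: Iterable[str],
    stop_words: Set[str],
    vocabulary: Set[str],
    segment: Callable[[str], List[str]] = str.split,  # assumed segmenter
    require_all_known: bool = False,
) -> List[str]:
    """Drop stop words, then drop strings whose segments are not in the vocabulary.

    require_all_known=False: a string is dropped only when none of its
    segments is found in the vocabulary (the case in the worked example).
    require_all_known=True: a string is dropped as soon as one segment is
    missing (the stricter variant mentioned in the text).
    """
    kept = []
    for text in candidates:
        tokens = [t for t in segment(text) if t not in stop_words]
        if not tokens:
            continue  # nothing left once stop words are removed
        known = [t in vocabulary for t in tokens]
        accepted = all(known) if require_all_known else any(known)
        if accepted:
            kept.append(" ".join(tokens))
    return kept
```

With a vocabulary of `{"nationwide", "smartphone", "plan"}`, a noisy string such as `"qzx vbn"` is removed in both modes, while `"nationwide qzx"` survives only in the lenient mode.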
Optionally, filtering unit 201 can use a corpus-based transition probability method to filter out candidate character strings whose occurrence probability is less than a predetermined threshold.
For example, based on statistics from a large-scale corpus, for the character string "Buddhist nun's cinder thing hand is laughed at without shallow " Gong Letong", filtering unit 201 can obtain the probability distribution of the character that appears to the right of the character "hand": the probability of "machine" (mobile phone) or "art" (operation) appearing is relatively high, whereas the probability of "laughing at" appearing is very small, and the probability of "hand is laughed at without shallow " Gong Letong" appearing is zero. Filtering unit 201 can therefore filter out noise such as "hand is laughed at without shallow " Gong Letong".
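A corpus-based transition probability filter of this kind might look like the sketch below. It is illustrative only: it works on whitespace-separated tokens rather than Chinese characters, and scoring a string by its weakest transition is an assumption made here, since the text does not say how per-transition probabilities are combined. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def train_transitions(corpus: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count how often each token occurs as a left neighbour and each adjacent pair."""
    left_counts, pair_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for left, right in zip(tokens, tokens[1:]):
            left_counts[left] += 1
            pair_counts[(left, right)] += 1
    return left_counts, pair_counts


def min_transition_probability(text: str, left_counts: Counter, pair_counts: Counter) -> float:
    """Return the smallest P(right | left) over adjacent tokens (1.0 for a single token)."""
    tokens = text.split()
    probs = [
        pair_counts[(left, right)] / left_counts[left] if left_counts[left] else 0.0
        for left, right in zip(tokens, tokens[1:])
    ]
    return min(probs) if probs else 1.0


def filter_by_transition(candidates: Iterable[str], left_counts: Counter,
                         pair_counts: Counter, threshold: float) -> List[str]:
    """Keep only candidates whose weakest transition reaches the threshold."""
    return [c for c in candidates
            if min_transition_probability(c, left_counts, pair_counts) >= threshold]
```

A string containing a transition never seen in the corpus scores zero and is removed for any positive threshold, which mirrors the zero-probability case in the example above.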
Filtering unit 201 outputs the candidate character strings from which noise has been filtered to ranking unit 202. Ranking unit 202 can rank the keywords according to at least one of the word frequency, recognition confidence, position, font, named entity recognition result and part of speech of the candidate character strings.
The main purpose of ranking is to pick out, from a large number of candidate character strings, the important keywords that are most helpful for finding information related to the topic of the picture. Ranking unit 202 can rank on the basis of the word frequency, recognition confidence, position, font, named entity recognition result, part of speech and so on of each candidate character string. For example, a named entity (such as the brand name in the example of Fig. 8) is more important than a common word. Information such as the word frequency, recognition confidence, position and font of a candidate character string can be obtained directly from optical character recognition.
On the basis of the ranking produced by ranking unit 202, a keyword or any combination of keywords can be used as a search term. As mentioned above, "whole nation logical", "the logical XYZ company in the whole nation" and "the logical intelligent machine in the whole nation" can each be passed to retrieval unit 103 as a search term.
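The ranking and combination step can be sketched as below. The scoring formula (frequency times confidence plus a bonus for named entities), the `Candidate` fields and the `top_k` and `max_combo` parameters are assumptions chosen for illustration; the text only lists the features that may be used and does not prescribe how they are weighted.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List


@dataclass
class Candidate:
    text: str
    frequency: int             # how often the string was recognized
    confidence: float          # OCR recognition confidence in [0, 1]
    is_named_entity: bool = False


def score(candidate: Candidate, entity_bonus: float = 1.0) -> float:
    """Hypothetical importance score; named entities outrank common words."""
    bonus = entity_bonus if candidate.is_named_entity else 0.0
    return candidate.frequency * candidate.confidence + bonus


def build_search_terms(candidates: List[Candidate], top_k: int = 3,
                       max_combo: int = 2) -> List[str]:
    """Rank candidates, keep the top_k keywords, and emit single keywords and combinations."""
    ranked = sorted(candidates, key=score, reverse=True)[:top_k]
    keywords = [c.text for c in ranked]
    terms = []
    for size in range(1, max_combo + 1):
        for combo in combinations(keywords, size):
            terms.append(" ".join(combo))
    return terms
```

Each returned string is one search term that could be handed to the retrieval step; low-scoring leftovers never appear in any combination.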
Returning to Fig. 1, retrieval unit 103 retrieves webpages using the obtained search term in response to the obtaining of the search term. Retrieval unit 103 can send the obtained search term to a search engine for retrieval.
The results retrieved by retrieval unit 103 may include information related to the topic of the picture. It is also possible that, because the search term is inaccurate, the retrieved results contain no information related to the topic of the picture. Subsequent processing is therefore needed to perform data mining on the retrieved webpages.
In response to the webpages retrieved by retrieval unit 103, webpage selection unit 104 clusters the retrieved webpages. When the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the first predetermined degree but less than the second predetermined degree, that webpage category is input to the verification unit as a first webpage category. When the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, that webpage category is input to the type identification unit as a second webpage category. This is described in detail below by way of example.
Fig. 3 is a block diagram illustrating the structure of webpage selection unit 104 according to an embodiment of the invention. In this embodiment, webpage selection unit 104 comprises: webpage filtering unit 301, clustering unit 302 and relevance judging unit 303.
Webpage filtering unit 301 is optional. It can, for example, perform content extraction on a webpage to filter out information on the page that is unrelated to the page content, such as advertisements and links. Webpage filtering unit 301 then inputs the webpages from which unrelated information has been filtered to clustering unit 302.
Clustering unit 302 clusters the input webpages. The purpose of clustering is to subdivide the retrieval results obtained for a particular keyword into several categories of similar content. For example, the set of webpages retrieved with "whole nation logical" as the search term can be divided into a category whose main content is an introduction to the "whole nation logical" brand itself and a category whose main content is the cooperation between "whole nation logical" and a certain mobile phone manufacturer. Categorizing the results in this way helps further information mining.
Clustering unit 302 can use any of the commonly used clustering algorithms to cluster the retrieved webpages. Clustering unit 302 then inputs the resulting webpage categories to relevance judging unit 303.
After obtaining the webpage categories, relevance judging unit 303 performs a topic relevance judgment on them to determine the relevance between each webpage category and the search term.
This relevance reflects, as a whole, how many of the webpages in a webpage category are associated with the search term. The lower the relevance, the fewer webpages in the category are associated with the search term; the higher the relevance, the more webpages in the category are associated with the search term. The relevance can be measured by various methods. For example, relevance judging unit 303 can use the KL distance to judge the relevance between a webpage category obtained as a clustering result and the search term, as shown in equation (1):
$$\mathrm{DIS}(Q, C) = \sum_{w_i \in Q} P(w_i \mid Q)\,\log\frac{P(w_i \mid Q)}{P(w_i \mid C)} \qquad (1)$$
Here Q denotes the set of all search terms, w_i denotes one of those search terms, and C denotes one of the webpage categories obtained by clustering the webpages. Equation (1) gives the KL distance DIS(Q, C) between Q and C. The larger the KL distance, the smaller the relevance R between the webpage category obtained as a clustering result and the search term; conversely, the smaller the KL distance, the larger the relevance R between the webpage category and the search term.
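As a rough numerical illustration of equation (1), the sketch below estimates P(w | Q) by relative frequency over the search terms and P(w | C) by smoothed relative frequency over the tokens of a webpage category. The additive smoothing constant is an assumption introduced here so that terms absent from the category do not cause a division by zero; the text does not specify how the probabilities are estimated, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def kl_distance(query_tokens: Sequence[str], category_tokens: Sequence[str],
                smoothing: float = 1e-3) -> float:
    """Sum over w in Q of P(w|Q) * log(P(w|Q) / P(w|C)), with smoothed P(w|C)."""
    if not query_tokens:
        raise ValueError("query_tokens must not be empty")
    q_counts = Counter(query_tokens)
    c_counts = Counter(category_tokens)
    q_total = sum(q_counts.values())
    c_total = sum(c_counts.values())
    vocab_size = len(set(q_counts) | set(c_counts))
    distance = 0.0
    for word, count in q_counts.items():
        p_q = count / q_total
        # Additive smoothing (assumed) keeps P(w|C) strictly positive.
        p_c = (c_counts[word] + smoothing) / (c_total + smoothing * vocab_size)
        distance += p_q * math.log(p_q / p_c)
    return distance
```

A category that mentions the search terms often yields a small distance, while a category that never mentions them yields a large one, matching the inverse relation between KL distance and relevance R described above.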
In this embodiment, after obtaining the KL distance between Q and C, relevance judging unit 303 compares it with two predefined thresholds KL1 and KL2, where KL1 > KL2.
For example, the KL distance is first compared with threshold KL1. When the KL distance is greater than KL1, the relevance between the webpage category obtained as a clustering result and the search term is very small. The resulting webpage category and the corresponding search term used for retrieval therefore cannot satisfy requirements such as accuracy.
When the KL distance is less than or equal to KL1 but greater than KL2 (corresponding to the case in which the relevance between the webpage category obtained as a clustering result and the search term is greater than or equal to the first predetermined degree but less than the second predetermined degree), the relevance between the webpage category and the search term is fairly good. Relevance judging unit 303 therefore inputs this webpage category to the verification unit as a first webpage category, so that the obtained search term can be verified.
When the KL distance is less than or equal to KL2 (corresponding to the case in which the relevance between the webpage category obtained as a clustering result and the search term is greater than or equal to the larger second predetermined degree), the relevance between the webpage category and the search term essentially satisfies the requirement for identifying the picture content topic type. Relevance judging unit 303 therefore inputs this webpage category to the type identification unit as a second webpage category, so that the picture content topic type can be identified.
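The three-way decision on the KL distance can be written compactly as below. The `Route` enum and the function name are hypothetical labels for the three outcomes described above; the thresholds KL1 and KL2 are parameters whose values the text does not specify.

```python
from enum import Enum


class Route(Enum):
    DISCARD = "discard"    # distance > KL1: relevance too low
    VERIFY = "verify"      # KL2 < distance <= KL1: first webpage category
    IDENTIFY = "identify"  # distance <= KL2: second webpage category


def route_category(kl_distance: float, kl1: float, kl2: float) -> Route:
    """Map a KL distance to the unit that should receive the webpage category."""
    if not kl1 > kl2:
        raise ValueError("KL1 must be greater than KL2")
    if kl_distance > kl1:
        return Route.DISCARD
    if kl_distance > kl2:
        return Route.VERIFY
    return Route.IDENTIFY
```

Because a smaller distance means higher relevance, the smaller threshold KL2 corresponds to the larger, second predetermined degree of relevance.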
Returning to Fig. 1, the processing performed in verification unit 105 and type identification unit 106 is described in detail below.
In response to the input of the first webpage classification, verification unit 105 verifies the term obtained by term acquiring unit 102 against the first webpage classification, and inputs the verified term back to term acquiring unit 102 as an alternative characters string, so that the processing performed in term acquiring unit 102, retrieval unit 103 and webpage selected cell 104 is repeated.
Verification unit 105 verifies the term obtained by term acquiring unit 102 (specifically, the keywords that make up the term) against the first webpage classification, that is, a webpage classification with fairly good correlation to the term, which helps to obtain the term more accurately.
Take the information picture of Fig. 8 as an example. Because of the way content propagates on the Internet and the diversity of places where information is published, the information in a picture may correspond to multiple similar or identical webpages (for reasons such as reposting). These webpages are gathered into one classification when the retrieval results are clustered.
Consider the case where the term is wrong. For example, the words in the information are "whole nation is logical", but the term obtained after processing is "current global mechanism". When "current global mechanism" is sent to the search engine, the results obtained may still converge (with fairly high probability, some of the webpages gather into one class). In this case, the retrieval cluster result for "current global mechanism" needs to be checked against the keywords obtained in term acquiring unit 102. If only the keyword "current global mechanism" appears in the retrieval result, while the other keywords such as "intelligent machine" and "XYZ company" do not appear or appear rarely, it can be judged that the keyword "current global mechanism" is problematic, and therefore that the term formed from it is problematic as well.
Within the cluster result obtained with a correct term, verification unit 105 can also verify other terms. For example, in the retrieval results obtained with the keyword "intelligent machine" or "XYZ company", character sequence matching is used to find how the term obtained by term acquiring unit 102 appears in the cluster result. It may be found, for instance, that the pattern "full * is logical" from "current global mechanism" occurs in large numbers in the cluster result, and that the closest character sequence is "whole nation is logical". "Current global mechanism" can then be corrected to "whole nation is logical". The corrected term, in turn, also improves the retrieval results.
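The character sequence matching described above can be sketched as follows. This is only an illustration under stated assumptions, not the verification procedure of the embodiment: the function name `correct_term`, the sliding-window comparison and the `max_mismatch` parameter are all hypothetical, and the sample strings are invented Latin-script stand-ins for the recognized text.

```python
from collections import Counter

def correct_term(term, texts, max_mismatch=1):
    """Return the most frequent same-length sequence in `texts` that differs
    from `term` in at most `max_mismatch` character positions.

    Hypothetical helper: the source only says that character sequence
    matching is used, not how it is implemented.
    """
    n = len(term)
    counts = Counter()
    for text in texts:
        # Slide a window of the term's length over each clustered text.
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            mismatches = sum(a != b for a, b in zip(term, window))
            if mismatches <= max_mismatch:
                counts[window] += 1
    if not counts:
        return term  # nothing close enough was found; keep the term as it is
    return counts.most_common(1)[0][0]

# Invented example: the recognizer read "natlon", the cluster says "nation".
cluster_texts = ["the nation plan", "nation wide nation"]
print(correct_term("natlon", cluster_texts))
```

Counting occurrences rather than taking the first close match reflects the observation in the text that the correct sequence occurs in large numbers in the cluster result.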
The specific way in which verification unit 105 performs verification described above is only an example; other verification methods can of course be used, as long as the term can be corrected.
It should be noted that the processing in term acquiring unit 102, retrieval unit 103, webpage selected cell 104 and verification unit 105, such as filtering, sorting, retrieval, clustering, topic correlation judgment and cross-checking, is an iterative process of repeated optimization that continues until a cluster result with satisfactory topic correlation is finally obtained. In the above embodiment, a cluster result with satisfactory topic correlation corresponds to the case where the KL distance is less than the smaller threshold KL2.
When the KL distance is less than (or equal to) the smaller threshold KL2, that is, when the correlation between the webpage classification obtained as a cluster result and the term is greater than a certain predetermined extent and the degree of correlation is sufficiently high, webpage selected cell 104 inputs this webpage classification to type identification unit 106.
Type identification unit 106 identifies the image content type of theme based on the picture classification system established in advance and on the term corresponding to the second webpage classification, that is, a webpage classification obtained as a cluster result whose correlation with the term is greater than or equal to the larger second predetermined extent.
The picture classification system can be established in various ways. Take as an example the establishment of an information classification system for the information picture shown in Fig. 8. Information can be classified into classes such as "activity", "product" and "company promotion", or into classes such as "public welfare information" and "non-public welfare information".
Different classes correspond to different processing modes. For example, for "activity" information, elements such as "activity name", "time" and "place" are extracted during theme identification. For "product" information, elements such as "product name" and "model" need to be extracted.
In type identification unit 106, identification is performed according to the established picture classification system and the term corresponding to the webpage classification whose degree of correlation with the term is sufficiently high. In one embodiment, taking the information picture as an example, the basis on which type identification unit 106 performs identification is the distance between the term set Q and a given message subject classification. In one example, this distance DIS(Q, T) can be calculated with the following equation (2):
$$\mathrm{DIS}(Q, T) = \prod_{W_i \in Q} P(W_i \mid Q)\,\log\frac{P(W_i \mid Q)}{P_s(W_i \mid T)} \qquad (2)$$
Here, T is the expandable semantic vocabulary set of a given message subject. For example, "product" can be expanded into "commodity", "brand", "model" and so on. Ps(Wi|T) is the probability that a word belongs to a given semantic class; for example, Ps(promise is strange | product) is the probability that "promise is strange" belongs to the "product" class.
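As a rough illustration, equation (2) can be evaluated as follows, taking the formula exactly as printed (a product over the terms in Q). The function name `dis`, the dictionary representation of the two probability distributions and the `floor` value that guards against a zero Ps(Wi|T) are assumptions made for this sketch; the source does not say how unseen words are handled.

```python
import math

def dis(p_q, p_s_t, floor=1e-9):
    """Evaluate DIS(Q, T) from equation (2) as printed.

    p_q   : dict mapping each word Wi in Q to P(Wi|Q)
    p_s_t : dict mapping words to Ps(Wi|T) for one subject class T
    floor : hypothetical lower bound used when a word is missing from p_s_t
    """
    result = 1.0
    for word, p in p_q.items():
        ps = max(p_s_t.get(word, 0.0), floor)
        result *= p * math.log(p / ps)
    return result

# Invented numbers, for illustration only.
q = {"brand": 0.5, "model": 0.5}
product_class = {"brand": 0.25, "model": 0.25}
print(dis(q, product_class))
```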
It should be noted that the above is only one example of implementing type identification unit 106; various other ways and calculation methods can also be used to identify the image content type of theme based on the specific term and the picture classification system established in advance.
The image content type of theme recognition device 100 according to the embodiment of the invention has been described above with reference to Fig. 1 to Fig. 3. Another information processing device according to the embodiment of the invention is described below with reference to Fig. 4.
Fig. 4 is a block diagram illustrating the structure of an information query device 400 based on image content type of theme according to the embodiment of the invention. Information query device 400 comprises: character recognition unit 401, term acquiring unit 402, retrieval unit 403, webpage selected cell 404, verification unit 405, type identification unit 406, and query unit 407.
Character recognition unit 401 recognizes at least one character string from a picture and inputs it to term acquiring unit 402 as an alternative characters string. In response to the input of the alternative characters string, term acquiring unit 402 obtains a term for retrieval according to the alternative characters string. In response to the term being obtained, retrieval unit 403 searches for webpages using the obtained term. In response to the retrieved webpages, webpage selected cell 404 clusters the retrieved webpages. When the correlation between a webpage classification obtained as a cluster result and the term is greater than or equal to the first predetermined extent but less than the second predetermined extent, webpage selected cell 404 inputs this webpage classification to verification unit 405 as the first webpage classification. When the correlation between the webpage classification obtained as a cluster result and the term is greater than or equal to the second predetermined extent, webpage selected cell 404 inputs this webpage classification to type identification unit 406 as the second webpage classification. In response to the input of the first webpage classification, verification unit 405 verifies the term obtained by term acquiring unit 402 against the first webpage classification, and inputs the verified term to term acquiring unit 402 as an alternative characters string. Type identification unit 406 identifies the image content type of theme based on the term corresponding to the second webpage classification and the picture classification system established in advance.
Character recognition unit 401, term acquiring unit 402, retrieval unit 403, webpage selected cell 404, verification unit 405 and type identification unit 406 have the same structure and function as character recognition unit 101, term acquiring unit 102, retrieval unit 103, webpage selected cell 104, verification unit 105 and type identification unit 106 described with reference to Fig. 1; a detailed description of character recognition unit 401, term acquiring unit 402, retrieval unit 403, webpage selected cell 404, verification unit 405 and type identification unit 406 is therefore omitted.
Query unit 407 is carried out data query based on the image content type of theme that is identified by type identification unit 406.
In one embodiment, query unit 407 can comprise the extracting unit (not shown).This extracting unit extracts the information relevant with the image content type of theme of identifying from the second webpage classification.
Still taking information publishing as an example, although the application is not limited to this, the theme of a piece of information is expressed with the words, phrases, continuous characters and the like in the webpage classification obtained as a cluster result. This information can serve as keywords for subsequently providing value-added services.
Here, topic mining is mainly carried out in the set (cluster) of webpages most relevant to this information. Specifically, the content of the webpages is first extracted, and theme identification is performed on the extracted result, mainly by extracting the information in it that is related to the message subject classification. For example, for "activity" information, elements such as "activity name", "time" and "place" are extracted during theme identification. What is mainly performed here is named entity recognition, and the named entities belonging to each class are extracted.
For "product" information, elements such as "product name" and "model" need to be extracted during theme identification. The value-added services that can be provided also differ between information types. For example, for "product" information, product details, price comparison information, word-of-mouth information and so on can be provided; for "performance" information, related background introductions, price comparison information and so on can be provided.
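The type-dependent extraction described above can be sketched as a lookup from the identified type of theme to the elements to be extracted. This is a minimal sketch under assumptions: the table `ELEMENTS_BY_TYPE`, the function `extract_elements` and the (text, label) pairs standing in for the output of a named entity recognizer are hypothetical, and the sample entities are invented.

```python
# Elements to extract for each type of theme, as listed in the text.
ELEMENTS_BY_TYPE = {
    "activity": ["activity name", "time", "place"],
    "product": ["product name", "model"],
}

def extract_elements(topic_type, entities):
    """Keep only the named entities whose label is wanted for `topic_type`.

    `entities` is assumed to be a list of (text, label) pairs produced by
    some named entity recognizer; the recognizer itself is out of scope.
    """
    wanted = ELEMENTS_BY_TYPE.get(topic_type, [])
    found = {label: [] for label in wanted}
    for text, label in entities:
        if label in found:
            found[label].append(text)
    return found

# Invented recognizer output, for illustration only.
entities = [("Spring Fair", "activity name"), ("May 1", "time"),
            ("X100", "model"), ("City Hall", "place")]
print(extract_elements("activity", entities))
```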
For subsequent value-added services, the words in the result document set can also be sorted and expanded to obtain more vocabulary related to the information, and in-depth mining can be performed according to the class of service to be provided.
Fig. 5 is a flowchart illustrating the image content type of theme recognition method according to the embodiment of the invention.
In step S501, at least one character string is recognized from the picture to be processed, as the alternative characters string from which the term for retrieval will be obtained. This recognition step can be performed using optical character recognition.
In step S502, in response to the recognized alternative characters string, a term for retrieval is obtained according to this alternative characters string. Specifically, in step S502, keywords are selected from the alternative characters string according to a predetermined rule, and a keyword or a combination of keywords is determined as the term for retrieval.
In one embodiment, the alternative characters string is filtered in step S502. For example, stop words are filtered out of the alternative characters string based on a stop-word dictionary established in advance. Further, word segmentation can be performed on the alternative characters string, the segmentation results looked up in a vocabulary, and the character strings to which segments not found in the vocabulary belong filtered out.
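The two filtering steps above can be sketched as follows. The function name `filter_candidates` and its parameters are hypothetical, and whitespace splitting stands in for the word segmentation step (a real system for Chinese text would use a proper segmenter); the sample strings are invented.

```python
def filter_candidates(candidates, stop_words, vocabulary, segment=str.split):
    """Drop stop words, then drop any string with a segment outside the vocabulary.

    `segment` is an assumed segmentation function; whitespace splitting is
    used here only as a placeholder.
    """
    kept = []
    for s in candidates:
        # Step 1: remove stop words using the pre-built stop-word dictionary.
        words = [w for w in segment(s) if w not in stop_words]
        # Step 2: keep the string only if every remaining segment is known.
        if words and all(w in vocabulary for w in words):
            kept.append(" ".join(words))
    return kept

candidates = ["the smart phone", "sm@rt ph0ne", "of the"]
print(filter_candidates(candidates, {"the", "of"}, {"smart", "phone"}))
```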
Alternatively, a corpus-based transition probability calculation method can be used to filter out alternative characters strings whose probability of occurrence is less than a predetermined threshold.
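One possible reading of this filter is a character bigram model estimated from a corpus, as sketched below. The source does not specify the model, so everything here is an assumption: the function names, the use of character-to-character transitions, and the choice to score a string by its lowest transition probability.

```python
from collections import Counter

def train_bigrams(corpus):
    """Count character pairs and their leading characters in the corpus."""
    pair_counts, char_counts = Counter(), Counter()
    for text in corpus:
        for a, b in zip(text, text[1:]):
            pair_counts[(a, b)] += 1
            char_counts[a] += 1
    return pair_counts, char_counts

def min_transition_prob(s, pair_counts, char_counts):
    """Lowest estimated P(next char | current char) along the string."""
    probs = [pair_counts[(a, b)] / char_counts[a] if char_counts[a] else 0.0
             for a, b in zip(s, s[1:])]
    return min(probs) if probs else 0.0

def filter_by_transition(candidates, corpus, threshold=0.1):
    """Keep candidates whose weakest transition is at least `threshold`."""
    pair_counts, char_counts = train_bigrams(corpus)
    return [s for s in candidates
            if min_transition_prob(s, pair_counts, char_counts) >= threshold]

# Invented corpus and candidates, for illustration only.
corpus = ["smart phone", "smart plan"]
print(filter_by_transition(["smart", "smxrt"], corpus))
```

A string containing a character pair never seen in the corpus gets a score of zero and is filtered out, which matches the intent of discarding unlikely recognition results.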
In addition, after the alternative characters strings have been filtered, the keywords can be sorted according to at least one of the word frequency, recognition confidence, position, font, named entity recognition result and part of speech of the remaining alternative characters strings, and the most important character strings are chosen as keywords.
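The sorting step can be sketched with two of the listed features, word frequency and recognition confidence. The `Candidate` class, the linear scoring formula and the weights are assumptions for illustration; the source lists the features but not how they are combined.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    frequency: int      # how often the string occurs in the picture
    confidence: float   # recognition confidence in [0, 1]

def top_keywords(candidates, k=3, w_freq=1.0, w_conf=1.0):
    """Rank candidates by a weighted sum of normalized frequency and confidence."""
    max_freq = max((c.frequency for c in candidates), default=1) or 1

    def score(c):
        return w_freq * c.frequency / max_freq + w_conf * c.confidence

    ranked = sorted(candidates, key=score, reverse=True)
    return [c.text for c in ranked[:k]]

# Invented candidates, for illustration only.
cands = [Candidate("a", 1, 0.5), Candidate("b", 4, 0.9), Candidate("c", 2, 0.9)]
print(top_keywords(cands, k=2))
```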
Then, the keywords or different combinations of them are used to form terms.
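Forming terms from keywords and their combinations can be sketched with the standard library. The function name `build_terms`, the `max_size` limit and the choice to join keywords with a space are assumptions; the source does not say how many keywords a term may contain.

```python
from itertools import combinations

def build_terms(keywords, max_size=2):
    """Return single keywords first, then every combination up to `max_size`."""
    terms = []
    for size in range(1, max_size + 1):
        for combo in combinations(keywords, size):
            terms.append(" ".join(combo))
    return terms

print(build_terms(["a", "b", "c"]))
```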
In step S503, in response to the term being obtained, webpages are searched using the obtained term.
In step S504, in response to the retrieved webpages, the retrieved webpages are clustered. Any of the commonly used clustering algorithms can be used for this clustering.
In step S505, the degree of correlation R between the webpage classification obtained as the clustering result and the term is determined. This degree of correlation can be determined using equation (1) described above. When the KL distance given by equation (1) is large, the degree of correlation R is small; conversely, when the KL distance given by equation (1) is small, the degree of correlation R is large.
In step S506, it is judged whether the degree of correlation R is greater than or equal to the first predetermined extent R1. If not, it is determined in step S507 that the degree of correlation R between the webpage classification obtained as the clustering result and the term is small, and that this webpage classification and its corresponding term are not suitable for identifying the type of theme of the picture. Processing of this webpage classification then ends, and the system obtains another term and carries out the subsequent processing.
When it is judged that R is greater than or equal to R1, it is judged in step S508 whether the degree of correlation R is greater than or equal to the second predetermined extent R2. If not, that is, when the degree of correlation R is less than the second predetermined extent R2, this webpage classification is chosen as the first webpage classification, and in step S509 the term is verified against this first webpage classification. The verified term is then used as an alternative characters string for further obtaining a term, that is, the process returns to step S502.
When the judgment in step S508 is yes, that is, when the degree of correlation R is greater than or equal to the second predetermined extent R2, this webpage classification is chosen as the second webpage classification. In step S510, the image content type of theme is identified based on the term corresponding to the second webpage classification and the picture classification system established in advance. After the content type of theme of the picture has been identified, the image content type of theme recognition method according to the embodiment of the invention ends.
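The branching in steps S506 to S510 can be summarized in a short sketch. The function name `route_cluster` and the returned labels are hypothetical; the sketch assumes R1 is smaller than R2, as the two-threshold description implies.

```python
def route_cluster(r, r1, r2):
    """Decide what happens to one webpage classification given its correlation R.

    Assumes r1 < r2. The returned labels are illustrative names only.
    """
    if r < r1:
        return "discard"   # step S507: not suitable; obtain another term
    if r < r2:
        return "verify"    # step S509: first webpage classification
    return "identify"      # step S510: second webpage classification

for r in (0.1, 0.5, 0.9):
    print(r, route_cluster(r, r1=0.3, r2=0.7))
```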
For the establishment of the picture classification system in advance, see the explanation given above with reference to Fig. 1.
Fig. 6 is a flowchart illustrating the information query method based on image content type of theme according to the embodiment of the invention. Because steps S601 to S610 in Fig. 6 perform the same processing as steps S501 to S510 in Fig. 5, a detailed description of steps S601 to S610 is omitted.
In the step S611 of Fig. 6, carry out data query based on the image content type of theme of in step S610, identifying.
In one embodiment, in step S611, information related to the identified image content type of theme is extracted from the webpage classification that has good correlation with the term. Theme identification is performed on the extracted result. Taking the information picture as an example, this mainly means extracting the information in it that is related to the message subject classification. For example, for "activity" information, elements such as "activity name", "time" and "place" are extracted during theme identification. What is mainly performed here is named entity recognition, and the named entities belonging to each class are extracted.
For "product" information, elements such as "product name" and "model" need to be extracted during theme identification. The value-added services that can be provided also differ between information types. For example, for "product" information, product details, price comparison information, word-of-mouth information and so on can be provided; for "performance" information, related background introductions, price comparison information and so on can be provided.
For subsequent value-added services, the words in the result document set can also be sorted and expanded to obtain more vocabulary related to the information, and in-depth mining can be performed according to the class of service to be provided.
Compared with traditional methods, embodiments of the invention have the following advantages: they do not need pictures to be stored in a database in advance, and do not need information pictures to be collected in advance. They are widely applicable, and the value-added services provided are also more flexible.
In addition, in application scenarios such as information value-added services, further information about a piece of information can be obtained by taking a photo and uploading the picture. In providing value-added services, a third-party information service provider can be more objective and more flexible.
Hereinafter, an example structure of a computer that implements the data processing device of the invention is described with reference to Fig. 7. Fig. 7 is a block diagram illustrating an example structure of a computer that implements the invention.
In Fig. 7, a central processing unit (CPU) 701 performs various processing according to programs stored in read-only memory (ROM) 702 or programs loaded from storage section 708 into random access memory (RAM) 703. Data required when CPU701 performs the various processing is also stored in RAM703 as needed.
CPU701, ROM702 and RAM703 are connected to each other via bus 704.Input/output interface 705 also is connected to bus 704.
The following components are connected to input/output interface 705: input section 706, including a keyboard, a mouse and the like; output section 707, including a display such as a cathode-ray tube (CRT) or liquid crystal display (LCD), a loudspeaker and the like; storage section 708, including a hard disk and the like; and communication section 709, including a network interface card such as a LAN card, a modem and the like. Communication section 709 performs communication processing via a network such as the Internet.
As needed, drive 710 is also connected to input/output interface 705. Removable medium 711, such as a magnetic disk, optical disc, magneto-optical disk or semiconductor memory, is mounted on drive 710 as needed, so that a computer program read from it is installed into storage section 708 as needed.
When the above steps and processing are implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as removable medium 711.
Those skilled in the art will understand that this storage medium is not limited to removable medium 711 shown in Fig. 7, which stores the program and is distributed separately from the device to provide the program to the user. Examples of removable medium 711 include magnetic disks, optical discs (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD)) and semiconductor memory. Alternatively, the storage medium may be ROM702, a hard disk included in storage section 708, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.
In the above description of specific embodiments of the invention, features described and/or illustrated for one embodiment can be used in the same or a similar way in one or more other embodiments, combined with features in other embodiments, or substituted for features in other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of features, elements, steps or components, but does not exclude the presence or addition of one or more other features, elements, steps or components. The ordinal terms "first", "second" and so on do not indicate the order of implementation or the importance of the features, elements, steps or components they qualify, and are used only to distinguish between these features, elements, steps or components for clarity of description.
In addition, the methods of the embodiments of the invention are not limited to being performed in the time order described in the specification or shown in the drawings; they can also be performed in another time order, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the invention.
In summary, embodiments according to the invention provide the following schemes:
Remark 1. An information processing device, comprising:
A character recognition unit, configured to recognize at least one character string from a picture and input it to a term acquiring unit as an alternative characters string;
The term acquiring unit, configured to, in response to the input of the alternative characters string, obtain a term for retrieval according to the alternative characters string;
A retrieval unit, configured to, in response to the term being obtained, search for webpages using the obtained term;
A webpage selected cell, configured to, in response to the retrieved webpages, cluster the retrieved webpages; and, when the correlation between a webpage classification obtained as a cluster result and the term is greater than or equal to a first predetermined extent but less than a second predetermined extent, input this webpage classification to a verification unit as a first webpage classification; and when the correlation between the webpage classification obtained as a cluster result and the term is greater than or equal to the second predetermined extent, input this webpage classification to a type identification unit as a second webpage classification;
The verification unit, configured to, in response to the input of the first webpage classification, verify the term obtained by the term acquiring unit against the first webpage classification, and input the verified term to the term acquiring unit as an alternative characters string; and
The type identification unit, configured to identify the image content type of theme based on the term corresponding to the second webpage classification and a picture classification system established in advance.
Remark 2. The information processing device according to Remark 1, wherein the term acquiring unit selects keywords from the alternative characters string according to a predetermined rule, and determines a keyword or a combination of keywords as the term for retrieval.
Remark 3. The information processing device according to Remark 2, wherein the term acquiring unit comprises a filter element, and the filter element is configured to filter stop words out of the alternative characters string based on a stop-word dictionary established in advance.
Remark 4. The information processing device according to Remark 2 or 3, wherein the filter element is configured to perform word segmentation on the alternative characters string, look up the segmentation results in a vocabulary, and filter out the character strings to which segmentation results not found in the vocabulary belong.
Remark 5. The information processing device according to Remark 2 or 3, wherein the filter element is configured to use a corpus-based transition probability calculation method to filter out alternative characters strings whose probability of occurrence is less than a predetermined threshold.
Remark 6. The information processing device according to any one of Remarks 3 to 5, wherein the term acquiring unit comprises a sequencing unit, and the sequencing unit is configured to sort character strings according to at least one of the word frequency, recognition confidence, position, font, named entity recognition result and part of speech of the alternative characters string.
Remark 7. The information processing device according to any one of Remarks 1 to 6, wherein the webpage selected cell comprises a home page filter unit, and the home page filter unit is configured to filter out information on a webpage that is irrelevant to the webpage content.
Remark 8. The information processing device according to any one of Remarks 1 to 7, wherein the type identification unit identifies the image content type of theme based on the distance between the set of input terms and a particular picture class defined in the picture classification system established in advance.
Remark 9. The information processing device according to any one of Remarks 1 to 8, wherein the picture is an information picture, and the information processing device is an information content type of theme recognition device.
Remark 10. An information processing device, comprising:
A character recognition unit, configured to recognize at least one character string from a picture and input it to a term acquiring unit as an alternative characters string;
The term acquiring unit, configured to, in response to the input of the alternative characters string, obtain a term for retrieval according to the alternative characters string;
A retrieval unit, configured to, in response to the term being obtained, search for webpages using the obtained term;
A webpage selected cell, configured to, in response to the retrieved webpages, cluster the retrieved webpages; and, when the correlation between a webpage classification obtained as a cluster result and the term is greater than or equal to a first predetermined extent but less than a second predetermined extent, input this webpage classification to a verification unit as a first webpage classification; and when the correlation between the webpage classification obtained as a cluster result and the term is greater than or equal to the second predetermined extent, input this webpage classification to a type identification unit as a second webpage classification;
The verification unit, configured to, in response to the input of the first webpage classification, verify the term obtained by the term acquiring unit against the first webpage classification, and input the verified term to the term acquiring unit as an alternative characters string;
The type identification unit, configured to identify the image content type of theme based on the term corresponding to the second webpage classification and a picture classification system established in advance; and
A query unit, configured to carry out a data query based on the identified image content type of theme.
Remark 11. The information processing device according to Remark 10, wherein the query unit comprises an extracting unit, and the extracting unit is configured to extract information related to the identified image content type of theme from the second webpage classification.
Remark 12. The information processing device according to Remark 10 or 11, wherein the picture is an information picture, and the information processing device is an information query device based on information content type of theme.
Remark 13. An information processing method, comprising:
Recognizing at least one character string from a picture as an alternative characters string;
In response to the alternative characters string being obtained, obtaining a term for retrieval according to the alternative characters string;
In response to the term being obtained, searching for webpages using the obtained term;
In response to the retrieved webpages, clustering the retrieved webpages; and, when the correlation between a webpage classification obtained as a cluster result and the term is greater than or equal to a first predetermined extent but less than a second predetermined extent, choosing this webpage classification as a first webpage classification; and when the correlation between the webpage classification obtained as a cluster result and the term is greater than or equal to the second predetermined extent, choosing this webpage classification as a second webpage classification;
In response to the selection of the first webpage classification, verifying the term against the first webpage classification, and using the verified term as an alternative characters string for further obtaining a term; and
Identifying the image content type of theme based on the term corresponding to the second webpage classification and a picture classification system established in advance.
Remark 14. An information processing method, comprising:
Recognizing at least one character string from a picture as an alternative characters string;
In response to the alternative characters string being obtained, obtaining a term for retrieval according to the alternative characters string;
In response to the term being obtained, searching for webpages using the obtained term;
In response to the retrieved webpages, clustering the retrieved webpages; and, when the correlation between a webpage classification obtained as a cluster result and the term is greater than or equal to a first predetermined extent but less than a second predetermined extent, choosing this webpage classification as a first webpage classification; and when the correlation between the webpage classification obtained as a cluster result and the term is greater than or equal to the second predetermined extent, choosing this webpage classification as a second webpage classification;
In response to the selection of the first webpage classification, verifying the term against the first webpage classification, and using the verified term as an alternative characters string for further obtaining a term;
Identifying the image content type of theme based on the term corresponding to the second webpage classification and a picture classification system established in advance; and
Carrying out a data query based on the identified image content type of theme.

Claims (10)

1. An information processing device, comprising:
A character recognition unit, configured to recognize at least one character string from a picture and input it to a term acquiring unit as an alternative characters string;
The term acquiring unit, configured to, in response to the input of the alternative characters string, obtain a term for retrieval according to the alternative characters string;
A retrieval unit, configured to, in response to the term being obtained, search for webpages using the obtained term;
A webpage selected cell, configured to, in response to the retrieved webpages, cluster the retrieved webpages; and, when the correlation between a webpage classification obtained as a cluster result and the term is greater than or equal to a first predetermined extent but less than a second predetermined extent, input this webpage classification to a verification unit as a first webpage classification; and when the correlation between the webpage classification obtained as a cluster result and the term is greater than or equal to the second predetermined extent, input this webpage classification to a type identification unit as a second webpage classification;
The verification unit, configured to, in response to the input of the first webpage classification, verify the term obtained by the term acquiring unit against the first webpage classification, and input the verified term to the term acquiring unit as an alternative characters string; and
The type identification unit, configured to identify the image content type of theme based on the term corresponding to the second webpage classification and a picture classification system established in advance.
2. signal conditioning package according to claim 1, wherein, described term acquiring unit comprises filter element, described filter element is used for based on the stop words dictionary of setting up in advance from described alternative characters string filtering stop words.
3. The information processing device according to claim 2, wherein the filter unit is configured to perform word segmentation on the alternative character strings, look up the segmentation results in a vocabulary, and filter out any character string containing a segmentation result that cannot be found in the vocabulary.
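The vocabulary check in claim 3 might look like the following sketch. Whitespace splitting stands in for word segmentation purely as an assumption; Chinese text would need a real segmenter, which can be passed in through the hypothetical `segment` parameter.

```python
from typing import Callable, Iterable, List, Set


def filter_by_vocabulary(
    candidates: Iterable[str],
    vocabulary: Set[str],
    segment: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Drop every candidate string that contains a segment missing from the vocabulary."""
    kept = []
    for candidate in candidates:
        words = segment(candidate)
        # A string is kept only if it segments into at least one word
        # and every resulting word is a known vocabulary entry.
        if words and all(word in vocabulary for word in words):
            kept.append(candidate)
    return kept
```

A misrecognized string such as "gr4nd hotel" is discarded as a whole, because one of its segments is not a vocabulary entry.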
4. The information processing device according to claim 2, wherein the filter unit is configured to filter out, using a corpus-based transition probability calculation method, alternative character strings whose probability of occurrence is less than a predetermined threshold.
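Claim 4 does not specify the transition probability model. One possible reading, sketched below, is a character bigram model with add-one smoothing, scored by the geometric mean of the per-transition probabilities so that long strings are not penalized for length alone. All names, the smoothing scheme, and the scoring choice are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigrams(corpus: Iterable[str]) -> Model:
    """Count character-to-character transitions in a corpus."""
    pair_counts: Counter = Counter()
    first_counts: Counter = Counter()
    alphabet = set()
    for text in corpus:
        alphabet.update(text)
        for a, b in zip(text, text[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts, max(len(alphabet), 1)


def mean_transition_probability(s: str, model: Model) -> float:
    """Geometric mean of smoothed P(next char | current char) over the string."""
    pair_counts, first_counts, alphabet_size = model
    if len(s) < 2:
        return 1.0  # no transitions to score, so the string is not penalized
    log_sum = 0.0
    for a, b in zip(s, s[1:]):
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size)
        log_sum += math.log(p)
    return math.exp(log_sum / (len(s) - 1))


def filter_by_probability(candidates: Iterable[str], model: Model, threshold: float) -> List[str]:
    """Keep candidates whose score is at least the predetermined threshold."""
    return [c for c in candidates if mean_transition_probability(c, model) >= threshold]
```

Strings whose character sequences are rare in the corpus, as is typical of recognition errors, score low and fall below the threshold.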
5. The information processing device according to any one of claims 2 to 4, wherein the term acquiring unit comprises a sorting unit configured to sort the character strings according to at least one of the word frequency, recognition confidence, position, font, named entity recognition result, and part of speech of the alternative character strings.
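The sorting in claim 5 may use any one or more of the listed features. The sketch below uses just two of them, recognition confidence and word frequency, with confidence as the primary key; the `Candidate` type and the key order are illustrative assumptions rather than anything the claim prescribes.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Candidate:
    text: str
    confidence: float  # recognition confidence reported by the character recognizer
    frequency: int     # how often the string occurs in the recognized text


def rank_candidates(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Sort by confidence, then by word frequency, both in descending order."""
    return sorted(candidates, key=lambda c: (c.confidence, c.frequency), reverse=True)
```

Other features named in the claim, such as position or font, could be added as further elements of the sort key.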
6. The information processing device according to any one of claims 1 to 4, wherein the type identification unit identifies the picture content topic type based on the distance between the set of input search terms and a particular picture category defined in the pre-established picture classification system.
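Claim 6 leaves the distance measure open. As one hedged example, the sketch below represents each picture category by a keyword set and picks the category with the smallest Jaccard distance to the set of search terms. The Jaccard measure, the keyword-set representation, and the sample categories are all assumptions.

```python
from typing import Dict, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 minus the ratio of shared elements to all elements; 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def identify_topic(terms: Set[str], classification: Dict[str, Set[str]]) -> str:
    """Return the name of the category closest to the search-term set."""
    return min(classification, key=lambda name: jaccard_distance(terms, classification[name]))


# Hypothetical classification system with two categories.
CLASSIFICATION = {
    "restaurant": {"menu", "dish", "price"},
    "hotel": {"room", "booking", "price"},
}
topic = identify_topic({"room", "price"}, CLASSIFICATION)
```

With these sample sets the hotel category shares two of three distinct words with the search terms, so it is the nearest one; `identify_topic` raises `ValueError` if the classification system is empty.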
7. An information processing device, comprising:
A character recognition unit, configured to recognize at least one character string from a picture and to input it to a term acquiring unit as an alternative character string;
The term acquiring unit, configured to, in response to the input of the alternative character string, acquire a search term for retrieval from the alternative character string;
A retrieval unit, configured to, in response to the acquisition of the search term, search for webpages using the acquired search term;
A webpage selection unit, configured to, in response to the retrieved webpages, cluster the retrieved webpages; and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, to input that webpage category to a verification unit as a first webpage category; and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, to input that webpage category to a type identification unit as a second webpage category;
The verification unit, configured to, in response to the input of the first webpage category, verify the search term acquired by the term acquiring unit against the first webpage category, and to input the verified search term to the term acquiring unit as an alternative character string;
The type identification unit, configured to identify the picture content topic type based on the search term corresponding to the second webpage category and a pre-established picture classification system; and
A query unit, configured to perform a data query based on the identified picture content topic type.
8. The information processing device according to claim 7, wherein the query unit comprises an extraction unit configured to extract, from the second webpage category, information relevant to the identified picture content topic type.
9. An information processing method, comprising:
Recognizing at least one character string from a picture as an alternative character string;
In response to the obtained alternative character string, acquiring a search term for retrieval from the alternative character string;
In response to the acquisition of the search term, searching for webpages using the acquired search term;
In response to the retrieved webpages, clustering the retrieved webpages; and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category; and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category;
In response to the selection of the first webpage category, verifying the search term against the first webpage category, and using the verified search term as an alternative character string for further acquiring a search term; and
Identifying the picture content topic type based on the search term corresponding to the second webpage category and a pre-established picture classification system.
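The feedback loop in the method of claim 9 can be sketched as below, with every processing step injected as a callable. The function and parameter names are hypothetical, and the `max_rounds` cap is an added assumption, since the claim does not state when the verification loop ends.

```python
from typing import Callable, List, Optional, Tuple


def identify_picture_topic(
    candidates: List[str],
    acquire_term: Callable[[List[str]], str],
    search_and_cluster: Callable[[str], List[Tuple[str, float]]],
    verify: Callable[[str, List[str]], str],
    classify: Callable[[str, List[str]], str],
    first_degree: float,
    second_degree: float,
    max_rounds: int = 3,  # assumed cap; not part of the claim
) -> Optional[str]:
    """Repeat term acquisition until a category reaches the second degree."""
    for _ in range(max_rounds):
        term = acquire_term(candidates)
        categories = search_and_cluster(term)  # (category, relevance) pairs
        second = [c for c, r in categories if r >= second_degree]
        if second:
            # Relevant enough: identify the topic type from these categories.
            return classify(term, second)
        first = [c for c, r in categories if first_degree <= r < second_degree]
        if not first:
            return None  # nothing usable for verification
        # The verified term is fed back as the new alternative character string.
        candidates = [verify(term, first)]
    return None
```

In this reading, a misrecognized string that only reaches the first degree is corrected against the moderately relevant pages and searched again, and the topic type is identified once some category reaches the second degree.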
10. An information processing method, comprising:
Recognizing at least one character string from a picture as an alternative character string;
In response to the obtained alternative character string, acquiring a search term for retrieval from the alternative character string;
In response to the acquisition of the search term, searching for webpages using the acquired search term;
In response to the retrieved webpages, clustering the retrieved webpages; and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to a first predetermined degree but less than a second predetermined degree, selecting that webpage category as a first webpage category; and, when the relevance between a webpage category obtained as a clustering result and the search term is greater than or equal to the second predetermined degree, selecting that webpage category as a second webpage category;
In response to the selection of the first webpage category, verifying the search term against the first webpage category, and using the verified search term as an alternative character string for further acquiring a search term;
Identifying the picture content topic type based on the search term corresponding to the second webpage category and a pre-established picture classification system; and
Performing a data query based on the identified picture content topic type.
CN201210112493.9A 2012-04-16 2012-04-16 Information processor and information processing method Active CN103377199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210112493.9A CN103377199B (en) 2012-04-16 2012-04-16 Information processor and information processing method

Publications (2)

Publication Number Publication Date
CN103377199A (en) 2013-10-30
CN103377199B (en) 2016-06-29

Family

ID=49462329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210112493.9A Active CN103377199B (en) 2012-04-16 2012-04-16 Information processor and information processing method

Country Status (1)

Country Link
CN (1) CN103377199B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0752673A1 (en) * 1995-07-03 1997-01-08 Canon Kabushiki Kaisha Information processing method and apparatus for searching image or text information
US6310971B1 (en) * 1995-07-03 2001-10-30 Canon Kabushiki Kaisha Information processing method and apparatus, and storage medium storing medium storing program for practicing this method
CN101419673A (en) * 2004-04-12 2009-04-29 富士施乐株式会社 Image dictionary creating apparatus and method
CN101556584A (en) * 2008-04-10 2009-10-14 深圳市万水千山网络发展有限公司 Computer system and method for achieving picture transaction

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108234347A (en) * 2017-12-29 2018-06-29 北京神州绿盟信息安全科技股份有限公司 Method, apparatus, network device and storage medium for extracting feature string
CN108234347B (en) * 2017-12-29 2020-04-07 北京神州绿盟信息安全科技股份有限公司 Method, device, network equipment and storage medium for extracting feature string
US11379687B2 (en) 2017-12-29 2022-07-05 Nsfocus Technologies Group Co., Ltd. Method for extracting feature string, device, network apparatus, and storage medium
CN110889028A (en) * 2018-08-15 2020-03-17 北京嘀嘀无限科技发展有限公司 Corpus processing and model training method and system
CN111726336A (en) * 2020-05-14 2020-09-29 北京邮电大学 Method and system for extracting identification information of networked intelligent equipment

Also Published As

Publication number Publication date
CN103377199B (en) 2016-06-29

Similar Documents

Publication Publication Date Title
CN109947909B (en) Intelligent customer service response method, equipment, storage medium and device
US9245243B2 (en) Concept-based analysis of structured and unstructured data using concept inheritance
CN102779140B (en) Keyword acquisition method and device
CN101364239B (en) Method for auto constructing classified catalogue and relevant system
US8108413B2 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
US20090319449A1 (en) Providing context for web articles
CN100583082C (en) Methods and systems for information extraction
US20140101544A1 (en) Displaying information according to selected entity type
CN111125086B (en) Method, device, storage medium and processor for acquiring data resources
CN1307705A (en) Data retrieval method and apparatus with multiple source capability
CN101606152A (en) The mechanism of the content of automatic matching of host to guest by classification
KR20090033989A (en) Method for advertising local information based on location information and system for executing the method
US20050138079A1 (en) Processing, browsing and classifying an electronic document
CN111814481B (en) Shopping intention recognition method, device, terminal equipment and storage medium
CN103377199B (en) Information processor and information processing method
CN117874249A (en) Authentication knowledge base construction system and method
CN101894158B (en) Intelligent retrieval system
Maiya et al. Exploratory analysis of highly heterogeneous document collections
CN113297482B (en) User portrayal describing method and system of search engine data based on multiple models
CN115210708B (en) Method and system for processing text data, and non-transitory computer readable medium
CN113095078A (en) Associated asset determination method and device and electronic equipment
EP1361524A1 (en) Method and system for processing classified advertisements
CN112241463A (en) Search method based on fusion of text semantics and picture information
CN117688162B (en) Full text retrieval method and system based on OCR (optical character recognition)
CN110020029B (en) Method and device for acquiring correlation between document and query term

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant