CN103544186B - Method and apparatus for mining subject keywords in a picture - Google Patents


Publication number
CN103544186B
Authority
CN
China
Prior art keywords
candidate keywords
term
picture
key word
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210246688.2A
Other languages
Chinese (zh)
Other versions
CN103544186A (en)
Inventor
孙健
夏迎炬
潘屹峰
葛付江
杨宇航
张明明
陈思源
何源
孙俊
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201210246688.2A
Publication of CN103544186A
Application granted
Publication of CN103544186B
Expired - Fee Related
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/242 Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/244 Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font

Abstract

The present invention relates to a method and apparatus for mining subject keywords in a picture. The method includes: an initial search term recognition step of recognizing keywords in the picture as initial search terms; a candidate keyword extraction step of using the search terms to retrieve subject web pages related to the picture and extracting candidate keywords from them; a search term selection step of selecting, according to the link relationship between the candidate keywords and the search terms used to find them, a part of the candidate keywords as the search terms for the next candidate keyword extraction step; and repeating the candidate keyword extraction step and the search term selection step until a predetermined condition is met.

Description

Method and apparatus for mining subject keywords in a picture
Technical field
The present invention relates to the field of information processing, and in particular to a method and apparatus for mining subject keywords in a picture.
Background
The text in a picture is often essential to understanding the picture's content. For example, the text in an advertising picture plays an important role in helping customers understand the advertisement. By combining character recognition results (for example, OCR results) with information from the network, the text content of an advertisement can be extracted more completely; by mining this information and extracting the theme of the advertisement, extended applications or services can be recommended to customers.
Because character recognition technology alone cannot pin down the keywords that represent the theme of a picture (for example, an advertising picture), the large amount of text available on the Internet is used to verify and extract the text in the advertising image. Applying data mining techniques such as keyword retrieval, text clustering, and text matching to the character recognition results yields subject web pages related to the advertisement (that is, retrieved web pages that express the same content as the advertisement itself). However, character recognition results are to some degree incomplete or incorrect, so the web pages retrieved with some keywords may diverge and introduce noisy data. Moreover, if the web pages retrieved with a keyword diverge, a correctly recognized input keyword is discarded and cannot be recalled.
A technique that solves the above problems is therefore needed.
Summary of the invention
A brief overview of the present invention is given below to provide a basic understanding of certain aspects of the invention. This overview is not an exhaustive summary of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description that follows.
A main object of the present invention is to provide a method and apparatus for mining subject keywords in a picture.
According to one aspect of the invention, a method for mining subject keywords in a picture is provided, including: an initial search term recognition step of recognizing keywords in the picture as initial search terms; a candidate keyword extraction step of using the search terms to retrieve subject web pages related to the picture and extracting candidate keywords from them; a search term selection step of selecting, according to the link relationship between the candidate keywords and the search terms used to find them, a part of the candidate keywords as the search terms for the next candidate keyword extraction step; and repeating the candidate keyword extraction step and the search term selection step until a predetermined condition is met.
According to another aspect of the invention, an apparatus for mining subject keywords in a picture is provided, including: an initial search term recognition module configured to recognize keywords in the picture as initial search terms; a candidate keyword extraction module configured to use the search terms to search for subject web pages related to the picture and to extract candidate keywords from them; a search term selection module configured to select, according to the link relationship between the candidate keywords and the search terms used to find them, a part of the candidate keywords as the search terms for the next search by the candidate keyword extraction module; and a control module configured to control the candidate keyword extraction module and the search term selection module to operate in a loop until a predetermined condition is met.
In addition, embodiments of the invention provide a computer program for implementing the above method.
Furthermore, embodiments of the invention provide a computer program product, at least in the form of a computer-readable medium, on which computer program code for implementing the above method is recorded.
These and other advantages of the invention will become more apparent from the following detailed description of preferred embodiments of the invention, taken in conjunction with the accompanying drawings.
Brief description of the drawings
The above and other objects, features, and advantages of the invention will be more readily understood from the following description of embodiments of the invention with reference to the accompanying drawings. The components in the drawings are intended only to illustrate the principles of the invention. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals.
Fig. 1 is a flowchart illustrating a method of mining subject keywords in a picture according to an embodiment of the invention;
Fig. 2 is a schematic diagram illustrating a method of mining subject keywords in a picture according to one example of the invention;
Fig. 3 is a schematic diagram illustrating the selection of candidate keywords by feature fusion;
Fig. 4 illustrates an example of a picture according to the invention;
Fig. 5 illustrates an example of a retrieved web page according to the invention;
Fig. 6 is a schematic diagram illustrating the link relationship between search terms and candidate keywords;
Fig. 7 is a block diagram illustrating an apparatus for mining subject keywords in a picture according to an embodiment of the invention;
Fig. 8 is a block diagram illustrating the configuration of the search term selection module;
Fig. 9 is a block diagram illustrating an apparatus for mining subject keywords in a picture according to another embodiment of the invention;
Fig. 10 is a block diagram illustrating the configuration of the candidate keyword extraction module; and
Fig. 11 is a structural diagram of an example computing device that can be used to implement the method and apparatus for mining subject keywords in a picture according to the invention.
Detailed description
Embodiments of the invention are described below with reference to the accompanying drawings. Elements and features described in one drawing or one embodiment of the invention may be combined with elements and features shown in one or more other drawings or embodiments. Note that, for clarity, representations and descriptions of components and processes that are unrelated to the invention and known to those of ordinary skill in the art are omitted from the drawings and the description.
Fig. 1 is a flowchart illustrating a method 100 of mining subject keywords in a picture according to an embodiment of the invention.
As shown in Fig. 1, in step S102, keywords in the picture can be recognized as initial search terms. For example, the keywords in the picture can be recognized by an OCR (Optical Character Recognition) method. The character recognition method is not limited to this, however, and any suitable character recognition method can be used. The picture can be any picture to be processed, for example an advertising picture, a frame captured from a video, or any other picture.
In step S104, the search terms can be used to retrieve subject web pages related to the picture, from which candidate keywords are extracted.
In step S106, a part of the candidate keywords can be selected, according to the link relationship between the candidate keywords and the search terms used to find them, as the search terms for the next candidate keyword extraction step. For example, candidate keywords that are retrieved by more search terms can be selected preferentially as the search terms for the next candidate keyword extraction step.
In step S108, it is determined whether a predetermined condition is satisfied.
If it is determined in step S108 that the predetermined condition is not satisfied, the flow returns to step S104.
If it is determined in step S108 that the predetermined condition is satisfied, the flow ends.
The predetermined condition here can be any suitable condition, including but not limited to a predetermined convergence condition, a predetermined number of iterations, or a combination of these.
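The loop of steps S104 to S108 can be sketched as follows. This is a minimal illustration only: the function name `mine_subject_keywords`, the callables `extract_candidates` and `select_terms`, and the "no new terms" convergence test are assumptions made for the sketch, not details given in the text.

```python
from typing import Callable, Iterable, List, Set


def mine_subject_keywords(
    initial_terms: Iterable[str],
    extract_candidates: Callable[[List[str]], List[str]],
    select_terms: Callable[[List[str], List[str]], List[str]],
    max_rounds: int = 10,
) -> Set[str]:
    """Repeat extraction (S104) and selection (S106) until a stop condition (S108)."""
    terms = list(initial_terms)
    seen: Set[str] = set(terms)
    for _ in range(max_rounds):  # predetermined number of iterations
        candidates = extract_candidates(terms)  # S104 (hypothetical callable)
        next_terms = select_terms(candidates, terms)  # S106 (hypothetical callable)
        new_terms = [t for t in next_terms if t not in seen]
        if not new_terms:  # assumed convergence test: nothing new was selected
            break
        seen.update(new_terms)
        terms = new_terms
    return seen
```

Passing the extraction and selection steps in as callables keeps the loop independent of any particular search engine or scoring scheme.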
When the search term selection step S106 is performed, the similarity between the keywords recognized from the picture and the candidate keywords can also be used. For example, a part of the candidate keywords can be selected as the search terms for the next candidate keyword extraction step S104 according to both the similarity between the keywords recognized from the picture and the candidate keywords and the link relationship between the candidate keywords and the search terms used to find them.
The flow 200 of mining subject keywords in a picture according to one example of the invention is described below with reference to Fig. 2.
First, in step S202, the characters in the picture are recognized by a suitable text recognition method such as OCR (Optical Character Recognition).
Then, in step S204-1, keywords in the picture are extracted from the recognized characters (hereinafter, keywords recognized from the picture). Initially, the keywords recognized from the picture are used directly as the results of steps S206 and S208, that is, as part of the initial search terms in step S210.
In addition, entity names can be extracted from the recognized characters in step S204-2. Entity names can include person names, place names, organization names, times, quantities, and other user-defined entity names, such as a brand name appearing in the picture. Because these entity names are strong indicators when searching for related web pages, the search terms are generated in step S210 by combining the entity names extracted in step S204-2 with the OCR keywords extracted in step S204-1. In other words, a search term generated in step S210 can take the form of one keyword combined with one or more entity names. The form of the search term is not limited to this, however. For example, a search term can consist of one or more keywords only, without any entity name.
Then, in step S212, the search terms generated in step S210 are submitted to a search engine for retrieval.
Subject web pages are extracted by text clustering in step S214 and by text matching in step S216.
Specifically, text clustering clusters the retrieved web pages, because web pages that can be clustered together are more likely to describe the theme related to the picture.
Although the clustered web pages are similar to one another, there is no guarantee that they all describe the theme related to the picture. For example, if entity names such as person names, place names, and organization names are input, the clustered web pages may describe only the details of the input entity names rather than the theme related to the picture. Referring to the picture in Fig. 4, if "bank" is used as the search term to retrieve web pages and clustering is performed, the clustered web pages may describe only "bank" and not the theme "coffee" related to the picture. Therefore, in step S216, the subject web pages related to the picture are mined further by text matching. Specifically, in step S216, on the basis of the text clustering of step S214, each web page is matched against the OCR recognition result of the picture.
Then, in step S218, the web pages are ranked by their text matching scores in order to select the web pages that describe the theme related to the picture, that is, the subject web pages.
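The matching and ranking of steps S216 and S218 can be illustrated with a deliberately simple score. The text does not specify the matching computation, so the scoring rule below (the fraction of OCR strings that occur verbatim in the page text) and the name `rank_pages_by_match` are assumptions made only for this sketch.

```python
from typing import Dict, List, Tuple


def rank_pages_by_match(
    ocr_strings: List[str], pages: Dict[str, str]
) -> List[Tuple[str, float]]:
    """Score each page by the share of OCR strings it contains; best first."""
    ranked = []
    for page_id, text in pages.items():
        # Count OCR strings that appear verbatim in the page text (assumed rule).
        hits = sum(1 for s in ocr_strings if s and s in text)
        ranked.append((page_id, hits / max(len(ocr_strings), 1)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

A real system would more likely use a tolerant similarity (for instance, the confidence-weighted edit distance described later) instead of exact substring hits, since OCR output is noisy.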
Although obtaining subject web pages by text clustering and text matching is described here, it should be understood that the subsequent steps can be performed directly on the retrieved web pages without text clustering and text matching, or that only one of text clustering and text matching can be performed to filter the web pages.
Then, in step S220, it is determined whether a predetermined condition is satisfied. The predetermined condition here can be any suitable condition, including but not limited to a predetermined convergence condition, a predetermined number of iterations, or a combination of these.
If it is determined in step S220 that the predetermined condition is not satisfied, the flow proceeds to step S206.
In step S206, candidate keywords are extracted from the subject web pages according to the similarity between the characters in the subject web pages and the keywords recognized from the picture. Preferably, the similarity can be calculated using the specific edit distance formula described later together with the fusion of multiple features.
In step S208, a part of the candidate keywords can be selected according to the link relationship between the candidate keywords and the search terms used to find them. For example, one or more candidate keywords retrieved by more search terms can be selected preferentially as subsequent search terms or as part of a search term (the other part can be an entity name), as described in detail later.
For example, the candidate keyword retrieved by the most search terms can be combined with an entity name to generate the search term used the next time step S210 is performed.
Steps S212 to S220 are then performed. If it is determined in step S220 that the predetermined condition is not satisfied, the flow again proceeds to step S206. When it is determined in step S220 that the predetermined condition is satisfied, for example when the keywords meet the predetermined condition, the flow ends. Here, the predetermined condition can be a manually set threshold.
Next, the calculation of the similarity between the keywords recognized from the picture and the candidate keywords is described. The similarity calculation involves the edit distance as well as the selection and fusion of multiple features.
First, an edit distance calculation based on the confidence of the keywords recognized from the picture is described.
Because a character recognition algorithm may not be entirely accurate (for example, it may produce errors or noise), an edit distance algorithm can be used when extracting the keywords recognized from the picture (that is, the initial search terms or part of them). The edit distance is computed by dynamic programming, which finds the minimum edit cost at each step. There are three kinds of edit cost: the cost of inserting a character, the cost of deleting a character, and the cost of replacing a character.
In one embodiment of the invention, the general edit distance algorithm is improved.
Each recognized character has a confidence. The confidence value represents the accuracy of the character recognition: the higher the confidence, the more accurate the recognition. In the present invention, the edit cost function is therefore modified so that the replacement cost of each character is the confidence of that character.
Suppose the keyword string recognized from the picture is O = O_1, O_2, ..., O_m and the corresponding candidate keyword string is C = C_1, C_2, ..., C_n. The edit distance \delta(O, C) from string O to string C is then:
\delta(O, C) = \min\{\gamma(S) \mid S \text{ is an edit sequence from } O \text{ to } C\}    (1)
The above formula can be defined recursively, where \gamma(S) denotes the cost of edit sequence S, \varepsilon denotes the empty string, \gamma(O_i \to \varepsilon) denotes the cost of deleting character O_i, and the modified replacement cost is the confidence value confidence(O_i).
Fig. 4 is the example illustrating the picture according to the present invention.
The picture in Fig. 4 is an advertising picture. One of the keywords recognized from this picture is "cangue 1 afternoon,", and each of its characters ("cangue", "1", "", "noon", "afterwards", ",") has a confidence. Specifically, the overall confidence of "cangue 1 afternoon," is 0.8827; the confidence of "cangue" is 0.3346, of "1" is 0.7777, of "" is 0.8571, of "noon" is 0.9577, of "afterwards" is 0.9417, and of "," is -1.0000.
The edit distance between this keyword and a candidate keyword is computed as follows.
edit(i, j) denotes the edit distance between the substring O_i of O over [0..i] and the substring C_j of C over [0..j]. f(i, j) denotes the cost of transforming the i-th character O(i) of O into the j-th character C(j) of C. If O(i) = C(j), no operation is needed and f(i, j) = 0; otherwise a replacement is needed and f(i, j) = conf(i, j).
If i = 0 and j = 0, then edit(0, 0) = 1
If i = 0 and j > 0, then edit(0, j) = edit(0, j-1) + 1
If i > 0 and j = 0, then edit(i, 0) = edit(i-1, 0) + 1
If i > 0 and j > 0, then edit(i, j) = min(edit(i-1, j) + 1, edit(i, j-1) + 1, edit(i-1, j-1) + conf(i, j))
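The recurrence above can be sketched in Python as a standard dynamic-programming table. Two points in this sketch are assumptions rather than statements from the text: it uses the conventional base case of 0 for two empty prefixes (which differs from the base value printed above), and it passes confidence values through unchanged, so special values such as -1.0 would need to be handled by the caller. The name `confidence_edit_distance` is hypothetical.

```python
from typing import Sequence


def confidence_edit_distance(ocr: str, conf: Sequence[float], cand: str) -> float:
    """Edit distance from an OCR string to a candidate string.

    Insertions and deletions cost 1; replacing OCR character i costs conf[i],
    the recognition confidence of that character.
    """
    if len(conf) != len(ocr):
        raise ValueError("one confidence value is required per OCR character")
    m, n = len(ocr), len(cand)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + 1.0  # delete all of ocr[:i]
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + 1.0  # insert all of cand[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Matching characters cost nothing; otherwise pay the confidence.
            sub = 0.0 if ocr[i - 1] == cand[j - 1] else conf[i - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,      # deletion
                dist[i][j - 1] + 1.0,      # insertion
                dist[i - 1][j - 1] + sub,  # match or replacement
            )
    return dist[m][n]
```

For example, `confidence_edit_distance("cat", [0.9, 0.2, 0.9], "cot")` charges only the confidence of the middle character for the single replacement.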
The selection and fusion of multiple features are described below. Fig. 3 is a schematic diagram illustrating the selection of candidate keywords by feature fusion.
The features of the keywords recognized from the picture and of the subject web pages play an important role in selecting candidate keywords. These features are shown in Fig. 3.
The similarity Sim(O, C) between a keyword O recognized from the picture and a candidate keyword C can be calculated by feature fusion as follows:
Sim(O, C) = \alpha_1 f_1 + \alpha_2 f_2 + \dots + \alpha_n f_n    (3)
where \alpha_1, \alpha_2, \dots, \alpha_n are the feature parameters, f_1, f_2, \dots, f_n are the features that can be selected, O is a keyword recognized from the picture, and C is a candidate keyword.
The features f_1, f_2, \dots, f_n can include at least one of the following: the size of the keyword recognized from the picture; the position of the candidate keyword in its text; the common substring of the candidate keyword and the keyword recognized from the picture; the geometric distance in the picture between keywords recognized from the picture; the mutual information of the candidate keywords in their text; and the edit distance between the keyword recognized from the picture and the candidate keyword.
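Formula (3) is a weighted sum, which can be sketched as below. The function name `fused_similarity` and the example weights are illustrative assumptions; the text does not say how the feature parameters are chosen.

```python
from typing import Sequence


def fused_similarity(features: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of feature values, one weight per feature."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(a * f for a, f in zip(weights, features))


# Hypothetical example: two features with equal weights.
score = fused_similarity([1.0, 0.5], [0.5, 0.5])
```

Features where a smaller value means a better match (such as an edit distance) would need to be inverted or negatively weighted before fusion; that choice is left open here.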
The size of a keyword recognized from the picture reflects the importance of its information. The larger the keyword, the better it indicates what the picture itself is meant to present to the user, and the better it represents the meaning of the picture. For example, the size of a keyword recognized from the picture can be normalized by formula (4) and used as one of the above features.
Here, Normalization_i denotes the normalized size of the i-th keyword recognized from the picture, Size_i denotes the unnormalized size of the i-th keyword, and Max(Size) denotes the size of the largest keyword.
Those skilled in the art will understand that normalization is not mandatory and that the keyword size can be used directly.
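The normalization can be sketched as follows. Formula (4) is not reproduced in the text above, so dividing each size by the largest size is an assumption inferred from the definitions of Size_i and Max(Size); the name `normalize_sizes` is hypothetical.

```python
from typing import List, Sequence


def normalize_sizes(sizes: Sequence[float]) -> List[float]:
    """Scale keyword sizes so that the largest keyword has size 1.0."""
    largest = max(sizes)
    if largest <= 0:
        raise ValueError("the largest size must be positive")
    # Assumed form of the normalization: Size_i / Max(Size).
    return [s / largest for s in sizes]
```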
Candidate keywords come from the text of web page content, and their location carries different weight: the title, the abstract, and the body have different significance. The position of a candidate keyword in the text is therefore a key feature.
The common substring of a candidate keyword C and a keyword O recognized from the picture indicates how similar the candidate keyword extracted from the web page is to the keyword recognized from the picture. The length of the common substring therefore also affects the credibility of the selected candidate keyword.
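One way to compute a common-substring feature is the classic dynamic-programming search for the longest common substring, sketched below. Using the longest common substring specifically, and the name `longest_common_substring`, are assumptions for illustration.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match within a
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return a[best_end - best_len:best_end]
```

The length of the returned string, possibly divided by the length of the OCR keyword, could then serve as one of the fused features.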
The layout of the text in a picture reflects how closely the important pieces of information in the picture depend on one another. From a geometric point of view, character strings that are arranged close together in the picture either express the same meaning or supplement the description of an activity or the characteristics of a product. The degree to which multiple character strings co-occur therefore explains the information in the picture in more detail. The coordinate information from character recognition is used to extract the Euclidean distance between character strings as a feature, as follows:
X and Y are keywords recognized from the picture, and the subscripts left, right, on, and down denote the left, right, upper, and lower coordinates of a keyword recognized from the picture, respectively.
The mutual information between candidate keywords in the text of a subject web page describes how strongly they depend on each other in the text. The larger the mutual information, the higher the degree of co-occurrence and the more complete the picture information. The mutual information I(A, B) can be calculated as follows:
P(A) denotes the probability of word A in the text, and P(A, B) denotes the joint probability of A and B in the text.
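The mutual information formula itself is not reproduced in the text above. A minimal sketch, assuming the usual pointwise mutual information log(P(A, B) / (P(A) P(B))) with a natural logarithm, is shown below; the function name and the choice of logarithm base are assumptions.

```python
import math


def mutual_information(p_a: float, p_b: float, p_ab: float) -> float:
    """Pointwise mutual information of two words from their probabilities."""
    if min(p_a, p_b, p_ab) <= 0:
        raise ValueError("probabilities must be positive")
    # Assumed form: log of the joint probability over the product of marginals.
    return math.log(p_ab / (p_a * p_b))
```

Under this definition the value is zero when the two words occur independently and positive when they co-occur more often than chance.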
Based on this analysis, one or more of the above features can be fused to select candidate keywords from the text of the subject web pages on the basis of the keywords recognized from the picture. An example of producing candidate keywords by fusing the above features is described with reference to Fig. 4 and Fig. 5. Fig. 4 illustrates an example of a picture according to the invention. Fig. 5 illustrates an example of a retrieved web page according to the invention.
As shown in Fig. 4, the characters in the rectangular boxes are the recognized character strings. The coordinates and normalized sizes of these character strings are shown in Table 1 below.
Table 1:
By the feature fusion described above, candidate keywords (marked with rectangular boxes) are extracted from the web page in Fig. 5, namely "the pleasantly surprised courteous reception or treatment of the brush credit card is shared only needs half cost", "leisurely afternoon, please work together and have a cup of coffee", "tacit agreement is more Further, cost only needs half", and "satisfied originally simple thing".
As shown in Fig. 5, "leisurely afternoon, please work together and have a cup of coffee", "further, cost only needs half for tacit agreement", and "satisfied originally simple thing" have the longest common substrings with the keywords recognized from the picture that are numbered 4, 5, and 6, respectively. In addition, the Euclidean distances between the coordinates of the characters in these candidate keywords are small. Feature fusion thus makes it straightforward to extract candidate keywords from the web page.
An example method of calculating the vocabulary score is described below.
To mine the subject keywords of a picture, the candidate keyword information computed in each round is used in full: on the one hand, the candidate keywords serve as the search terms for the next round; on the other hand, the pointing relationship between search terms and candidate keywords is analyzed. When a search term produces a candidate keyword, the search term has one link pointing to that candidate keyword. As the loop repeats, a candidate keyword accumulates links from multiple search terms. The more links point to a candidate keyword, the better it describes the information in the picture and the better it represents the subject keywords of the picture. For this scenario, a new vocabulary scoring method is proposed. The algorithm uses the information of the keywords recognized from the picture to compute the relationship between search terms and candidate keywords and to mine the keywords of the picture's theme. The algorithm involves two kinds of words: search terms S and candidate keywords C. At initialization, the algorithm involves only search terms S. Each search term S_i (i = 1, 2, ..., n) is input to the retrieval system and, through retrieval, web page clustering, and web page matching, produces candidate keywords. Each candidate keyword produced is then used as a new search term, and the above operations are repeated.
Fig. 6 is the schematic diagram of the linking relationship illustrating term and candidate keywords.
In Fig. 6, each box represents a search term (or candidate keyword). There are two kinds of boxes in Fig. 6: boxes with a blank background represent the search terms in the initial state (numbered 1, 2, 3, 4, 5, and 6), and boxes with a shaded background represent newly discovered candidate keywords, which also serve as the search terms for the next retrieval (numbered 7 and 8). When a search term a produces a candidate keyword and that candidate keyword is used as search term b, there is a directed arrow from a to b. The size of each box is determined by the number of boxes pointing to it and by the sizes of those boxes. As shown in Fig. 6, the size of box 4 is determined by the sum of the sizes of boxes 1, 2, 3, and 7; the size of box 5 is determined by the sum of the sizes of boxes 2 and 3; the size of box 7 is determined only by the size of box 4; and the size of box 8 is determined by the size of box 5. In this example, the size of a box can be understood as the magnitude of the vocabulary score. In other words, a candidate keyword retrieved by more search terms has a larger vocabulary score. When the next candidate keyword extraction step is performed, candidate keywords with larger vocabulary scores are selected preferentially as search terms.
According to one embodiment of the invention, suppose that n search terms (S_1, S_2, ..., S_n) point to candidate keyword C, that is, candidate keyword C can be retrieved by each of these n search terms. The vocabulary score PR(C) can be calculated as:
PR(C) = (1 - d) + d \left( \frac{PR(S_1)}{O(S_1)} + \frac{PR(S_2)}{O(S_2)} + \dots + \frac{PR(S_n)}{O(S_n)} \right)    (7)
where PR(S_i) denotes the vocabulary score of the i-th search term S_i pointing to candidate keyword C, O(S_i) denotes the number of candidate keywords produced by retrieving with search term S_i, and d denotes the damping coefficient. In formula (7), the factor 1/O(S_i) expresses that the candidate keywords produced by search term S_i are equally probable.
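The scoring described above resembles a PageRank-style update, and can be sketched iteratively as below. The sketch assumes the equal-probability case, in which each search term splits its score evenly over the candidate keywords it produced; the link data mirrors the arrows described for Fig. 6, and the defaults d = 0.5 and 13 rounds follow the values the text reports for its example. The function name and the initial score of 1.0 are assumptions.

```python
from typing import Dict, List


def vocabulary_scores(
    links: Dict[str, List[str]], d: float = 0.5, rounds: int = 13
) -> Dict[str, float]:
    """links[s] lists the candidate keywords produced by search term s."""
    nodes = set(links) | {c for targets in links.values() for c in targets}
    score = {node: 1.0 for node in nodes}  # assumed initial score
    for _ in range(rounds):
        updated = {}
        for node in nodes:
            # Each term pointing at this node contributes its score divided
            # by the number of candidate keywords it produced.
            incoming = sum(
                score[s] / len(targets)
                for s, targets in links.items()
                if node in targets
            )
            updated[node] = (1 - d) + d * incoming
        score = updated
    return score


# Links as described for Fig. 6: boxes 1, 2, 3 and 7 point to 4, and so on.
fig6_links = {"1": ["4"], "2": ["4", "5"], "3": ["4", "5"],
              "4": ["7"], "5": ["8"], "7": ["4"]}
fig6_scores = vocabulary_scores(fig6_links)
```

With these links, box 4 ends up with the highest score, which matches the description that a keyword reached from more search terms scores higher.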
According to another embodiment of the invention, the probabilities with which candidate keywords occur in the text can differ. The vocabulary score PR(C) can then be calculated as:
PR(C) = (1 - d) + d \left( P(S_1 \to C) \times PR(S_1) + P(S_2 \to C) \times PR(S_2) + \dots + P(S_n \to C) \times PR(S_n) \right)    (8)
where P(S_i \to C) is the probability that search term S_i produces candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, i = 1, 2, ..., n, and d is the damping coefficient.
In computing this probability, each candidate keyword C_j produced by a search term S_i is obtained by verification against the character recognition result, so its similarity value is used as its weight.
Here, O_k denotes a keyword recognized from the picture. The remaining quantities in the calculation are the candidate keyword that is compared with O_k, the similarity between O_k and that candidate keyword, and the probability with which that candidate keyword occurs. Because candidate keywords come from the text of the subject web pages, this probability is the candidate keyword's probability within the set of candidate keywords.
Because the number of web pages actually processed is enormous and the number of candidate keywords keeps growing, the vocabulary score can be calculated iteratively. Table 2 shows the results for the candidate keywords in the example picture after 13 iterations, with the damping coefficient d set to 0.5.
Table 2: Vocabulary scores of the candidate keywords in the example picture
The table shows that, after multiple iterations, the vocabulary score of each candidate keyword converges to a stable value. Analysis also shows that the magnitude of the vocabulary score reflects how strongly a keyword influences the theme of the picture. Ranked by influence, the keywords are: "COSTA", "credit card", "leisurely in the afternoon please work together and have a cup of coffee", "share and only need half cost", "pleasantly surprised courteous reception or treatment", "satisfied The originally simple thing of meaning", "meeting originally simple thing", and "the originally simple thing of joy". "leisurely In the afternoon please work together and have a cup of coffee" and "satisfied originally simple thing" were not recognized correctly in the text recognition stage, but they are recovered by the solution of this patent and become subject keywords, which improves the recall rate and solves the intended problem. The selected part of the candidate keywords can thus be used to mine the subject keywords in the picture.
Fig. 7 is a block diagram illustrating an apparatus 700 for mining subject keywords in a picture according to an embodiment of the present invention.
As shown in Fig. 7, the apparatus 700 includes an initial search term identification module 702, a candidate keyword extraction module 704, a search term selection module 706, and a control module 708.
The initial search term identification module 702 can identify keywords in the picture as initial search terms.
The candidate keyword extraction module 704 can use the search terms to search for subject web pages related to the picture and extract candidate keywords from them.
The search term selection module 706 can select a part of the candidate keywords as the search terms for the next execution of the candidate keyword extraction module 704, according to the link relationship between the candidate keywords extracted by the candidate keyword extraction module 704 and the search terms that module used for its search. For example, the search term selection module 706 can preferentially select candidate keywords that were retrieved by more search terms as the search terms for the next search performed by the candidate keyword extraction module 704.
When selecting a part of the candidate keywords as the search terms for the next search performed by the candidate keyword extraction module 704, the search term selection module 706 can also take into account the similarity between the keywords recognized from the picture and the candidate keywords. In other words, the search term selection module 706 can make the selection according to both the similarity between the keywords recognized from the picture and the candidate keywords and the link relationship between the candidate keywords and the search terms used to search for them.
The control module 708 can control the candidate keyword extraction module 704 and the search term selection module 706 to operate cyclically until a predetermined condition is met. The predetermined condition includes a predetermined convergence condition and/or a predetermined number of iterations.
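The cyclic operation of the extraction and selection modules can be sketched as a simple loop. This is only an illustration under stated assumptions: `run_mining_loop`, `extract`, and `select` are hypothetical names, the "convergence condition" is assumed to mean that the selected search terms stop changing, and the small `web` dictionary stands in for a real web search.

```python
def run_mining_loop(initial_terms, extract, select, max_rounds=10):
    """Alternate extraction and selection until convergence or max_rounds.

    extract(terms) -> dict mapping each search term to the candidates it found
    select(links)  -> list of candidates to use as the next search terms
    """
    terms = sorted(set(initial_terms))
    links = {}  # search term -> set of candidate keywords it retrieved
    for _ in range(max_rounds):  # predetermined number of iterations
        for term, candidates in extract(terms).items():
            links.setdefault(term, set()).update(candidates)
        next_terms = sorted(set(select(links)))
        if next_terms == terms:  # assumed convergence: selection is stable
            break
        terms = next_terms
    return terms, links


# Made-up stand-in for a search engine: term -> candidates found on its pages.
web = {"coffee": ["latte", "cup"], "latte": ["cup"], "cup": ["cup"]}


def extract(terms):
    return {t: web.get(t, []) for t in terms}


def select(links):
    # Prefer the candidate retrieved by the most search terms (ties by name).
    counts = {}
    for candidates in links.values():
        for c in candidates:
            counts[c] = counts.get(c, 0) + 1
    return sorted(counts, key=lambda c: (-counts[c], c))[:1]


final_terms, final_links = run_mining_loop(["coffee"], extract, select)
```

Passing the two modules in as functions keeps the control logic independent of how pages are searched or how candidates are scored.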
Fig. 8 is a block diagram illustrating the configuration of the search term selection module 706.
As shown in Fig. 8, the search term selection module 706 can include a vocabulary score calculation unit 706-2 and a search term selection unit 706-4.
The vocabulary score calculation unit 706-2 can calculate the vocabulary score PR(C) of each candidate keyword C as
PR(C) = (1 - d) + d × (PR(S_1)/O(S_1) + PR(S_2)/O(S_2) + … + PR(S_n)/O(S_n))
where S_i is the i-th search term used to retrieve candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, O(S_i) is the number of candidate keywords produced by retrieval with search term S_i, i = 1, 2, …, n, and d is the damping coefficient.
Alternatively, the vocabulary score calculation unit 706-2 can calculate the vocabulary score PR(C) of each candidate keyword C according to the following equation:
PR(C) = (1 - d) + d × (P(S_1→C) × PR(S_1) + P(S_2→C) × PR(S_2) + … + P(S_n→C) × PR(S_n))
where P(S_i→C) is the probability that search term S_i produces candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, i = 1, 2, …, n, and d is the damping coefficient. In the calculation of P(S_i→C), O_k denotes a keyword recognized from the picture, and the calculation uses the candidate keyword compared with O_k, the similarity between O_k and that candidate keyword, and the probability that the candidate keyword occurs.
The search term selection unit 706-4 can preferentially select candidate keywords C with a high vocabulary score PR(C) as the search terms for the next search performed by the candidate keyword extraction module 704.
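The weighted score and the selection of high-scoring candidates can be sketched as below. This is a hedged illustration: the text does not reproduce how P(S_i→C) is computed, so the sketch simply takes those probabilities as a precomputed input, and the names `weighted_scores`, `select_terms`, `transition`, and the sample values are hypothetical.

```python
def weighted_scores(transition, d=0.5, iterations=20):
    """Compute PR(C) = (1 - d) + d * sum_i P(S_i -> C) * PR(S_i) iteratively.

    transition[s][c] is an assumed, precomputed probability P(s -> c).
    """
    words = set(transition)
    for targets in transition.values():
        words.update(targets)
    scores = {w: 1.0 for w in words}
    for _ in range(iterations):
        # The right-hand side reads the previous scores; the new dict
        # replaces them only after every word has been updated.
        scores = {
            c: (1 - d) + d * sum(
                probs[c] * scores[s]
                for s, probs in transition.items()
                if c in probs
            )
            for c in words
        }
    return scores


def select_terms(scores, candidates, k):
    """Return the k candidates with the highest score (ties broken by name)."""
    return sorted(candidates, key=lambda c: (-scores[c], c))[:k]


# Made-up transition probabilities for two search terms.
transition = {
    "term_a": {"cand_x": 0.75, "cand_y": 0.25},
    "term_b": {"cand_x": 1.0},
}
pr = weighted_scores(transition)
next_terms = select_terms(pr, ["cand_x", "cand_y"], k=1)
```

Setting every P(S_i→C) to 1/O(S_i) makes this weighted form equivalent to a score that divides PR(S_i) evenly among the candidates S_i produced.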
The similarity is calculated from features of the keywords recognized from the picture and of the candidate keywords.
The features used to calculate the similarity include at least one of the following: the size of the keyword recognized from the picture, the position of the candidate keyword in the corresponding text, the common substring of the candidate keyword and the keyword recognized from the picture, the geometric distance within the picture of the keywords recognized from the picture, the mutual information of the candidate keyword in the corresponding text, and the edit distance between the keyword recognized from the picture and the candidate keyword.
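As an illustration of one of the listed features, the common substring of a recognized keyword and a candidate keyword can be found with a standard dynamic-programming routine. The sketch below is an assumption about how such a feature might be computed, not the patent's method; the function name and the sample strings (a recognized string with one misread character) are hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # match lengths for the previous row of a
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


# Made-up example: candidate from a web page vs. a string with an OCR error.
candidate = "credit card"
recognized = "credlt card offer"
shared = longest_common_substring(candidate, recognized)
# One possible normalisation of the feature into the range [0, 1].
feature = len(shared) / max(len(candidate), len(recognized))
```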
Preferably, the cost of character substitution in the edit distance can be calculated according to the confidence of the keyword recognized from the picture.
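A confidence-dependent substitution cost can be sketched as follows. The text does not give the exact cost function, so this example assumes that each recognized character carries a confidence in [0, 1] and that substituting it costs exactly that confidence (a doubtful character is cheap to replace, a confident one is expensive). Insertions and deletions are assumed to cost 1. All names are hypothetical.

```python
def weighted_edit_distance(recognized, confidences, candidate):
    """Edit distance where substituting recognized[i] costs confidences[i]."""
    if len(recognized) != len(confidences):
        raise ValueError("one confidence value is needed per recognized character")
    n, m = len(recognized), len(candidate)
    # dist[i][j]: cost of turning recognized[:i] into candidate[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = float(i)  # delete i characters
    for j in range(1, m + 1):
        dist[0][j] = float(j)  # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = recognized[i - 1] == candidate[j - 1]
            substitution = 0.0 if same else confidences[i - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,               # deletion
                dist[i][j - 1] + 1.0,               # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m]


# Made-up OCR output in which the second character was read with low confidence.
distance = weighted_edit_distance("C0STA", [0.9, 0.2, 0.9, 0.9, 0.9], "COSTA")
```

With every confidence set to 1.0 the function reduces to the ordinary Levenshtein distance, so the weighting only changes results where the recognizer itself was unsure.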
The search terms used when performing retrieval can also include entity names, where the entity names include words recognized from the picture that relate to time, place, and names.
Fig. 9 is a block diagram illustrating an apparatus 700' for mining subject keywords in a picture according to another embodiment of the present invention.
The apparatus 700' in Fig. 9 differs from the apparatus 700 in Fig. 7 in that the apparatus 700' further includes a subject keyword mining module 710.
The subject keyword mining module 710 can mine the subject keywords in the picture using the selected part of the candidate keywords.
Fig. 10 is a block diagram illustrating the configuration of the candidate keyword extraction module 704.
As shown in Fig. 10, the candidate keyword extraction module 704 can include a text matching unit 704-2, a subject web page selection unit 704-4, and a candidate keyword extraction unit 704-6.
The text matching unit 704-2 can perform text matching between the web pages found with the search terms and the recognition result of the picture.
The subject web page selection unit 704-4 can select, according to the text matching result, the subject web pages related to the picture from the web pages found.
The candidate keyword extraction unit 704-6 can extract candidate keywords from the subject web pages.
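The three units above can be pictured with a very small sketch. Everything here is an assumption made for illustration: text matching is reduced to the fraction of recognized keywords that appear in a page, a page counts as a subject web page when that fraction reaches a hypothetical threshold, and candidate keywords are simply the punctuation-delimited phrases of the selected pages.

```python
import re


def match_score(page_text, recognized_keywords):
    """Fraction of recognized keywords that occur in the page text."""
    if not recognized_keywords:
        return 0.0
    hits = sum(1 for keyword in recognized_keywords if keyword in page_text)
    return hits / len(recognized_keywords)


def select_subject_pages(pages, recognized_keywords, threshold=0.5):
    """Keep pages whose match score reaches the (assumed) threshold."""
    return [p for p in pages if match_score(p, recognized_keywords) >= threshold]


def extract_candidates(subject_pages):
    """Split the selected pages into phrases and return them without duplicates."""
    candidates = []
    for page in subject_pages:
        for phrase in re.split(r"[.,;:!?\n]+", page):
            phrase = phrase.strip()
            if phrase and phrase not in candidates:
                candidates.append(phrase)
    return candidates


# Made-up page texts and recognition result.
pages = [
    "COSTA coffee. Share with your credit card",
    "Weather report: rain tomorrow",
]
recognized = ["COSTA", "credit card"]
subject_pages = select_subject_pages(pages, recognized)
candidate_keywords = extract_candidates(subject_pages)
```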
In summary, in the above embodiments, starting from the result of recognizing the picture with character recognition (for example, OCR) technology, Internet techniques are used to perform retrieval on the OCR result, web page construction, web page matching, and candidate keyword selection. According to the link relationship between the search keywords and the candidate keywords, a part of the candidate keywords is selected as new search terms, and the web search and search term selection steps are repeated until a predetermined condition is met.
The overall framework flow is composed of one or more of several technical schemes, including entity name recognition, search technology, text clustering algorithms, document matching, and candidate keyword verification, together with the improved edit distance algorithm, feature selection, and vocabulary scoring method that this patent describes for the candidate keyword verification step.
According to embodiments of the present invention, the character recognition (for example, OCR) result can be fused with various features of Internet text to extract keywords similar to the text of a picture (such as an advertising picture). The new vocabulary scoring method is used to score the multiple keywords, and the keywords of the advertisement's theme are finally determined.
The basic principles of the present invention have been described above in connection with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and apparatus of the present invention can be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including processors, storage media, and the like) or network of computing devices. Those of ordinary skill in the art can achieve this with their basic programming skills after reading the description of the present invention.
Therefore, the object of the present invention can also be achieved by running a program or a set of programs on any computing device. The computing device can be a well-known general-purpose device. The object of the present invention can thus also be achieved merely by providing a program product containing program code that implements the method or apparatus. That is, such a program product also constitutes the present invention, and a storage medium storing such a program product also constitutes the present invention. Obviously, the storage medium can be any known storage medium or any storage medium developed in the future.
In the case where embodiments of the present invention are implemented by software and/or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, for example the general-purpose computer 1100 shown in Fig. 11. With the various programs installed, this computer is able to perform various functions.
In Fig. 11, a central processing unit (CPU) 1101 performs various processes according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores, as needed, the data required when the CPU 1101 performs the various processes. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to one another via a bus 1104. An input/output interface 1105 is also connected to the bus 1104.
The following components are connected to the input/output interface 1105: an input section 1106 (including a keyboard, a mouse, and the like), an output section 1107 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like), the storage section 1108 (including a hard disk and the like), and a communication section 1109 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1109 performs communication processing via a network such as the Internet. A drive 1110 can also be connected to the input/output interface 1105 as needed. A removable medium 1111 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory is mounted on the drive 1110 as needed, so that a computer program read from it is installed into the storage section 1108 as needed.
In the case where the above series of processes is implemented by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1111.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 1111 shown in Fig. 11, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 1111 include a magnetic disk (including a floppy disk (registered trademark)), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium can be the ROM 1102, a hard disk contained in the storage section 1108, or the like, in which the program is stored and which is distributed to the user together with the device that contains it.
The present invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to embodiments of the present invention can be performed.
Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disc, a magneto-optical disc, a memory card, a memory stick, and the like.
Those skilled in the art should understand that what is listed here is exemplary, and the present invention is not limited to it.
In this specification, expressions such as "first", "second", and "n-th" are used to distinguish the described features in wording so that the present invention is described clearly. They should therefore not be regarded as having any limiting meaning.
As an example, each step of the above method and each component module and/or unit of the above apparatus can be implemented as software, firmware, hardware, or a combination thereof, and as a part of the corresponding device. The specific means or manner that can be used when the component modules and units of the above apparatus are configured by software, firmware, hardware, or a combination thereof is well known to those skilled in the art and is not described again here.
As an example, in the case of implementation by software or firmware, a program constituting the software can be installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 1100 shown in Fig. 11). With the various programs installed, this computer is able to perform various functions.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment can be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", when used herein, refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
In addition, the method of the present invention is not limited to being performed in the chronological order described in the specification; it can also be performed in another chronological order, in parallel, or independently. Therefore, the order of execution of the method described in this specification does not limit the technical scope of the present invention.
Although the present invention and its advantages have been described, it should be understood that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the present invention as defined by the appended claims. Moreover, the scope of the present invention is not limited to the specific embodiments of the processes, devices, means, methods, and steps described in the specification. From the disclosure of the present invention, one of ordinary skill in the art will readily appreciate that processes, devices, means, methods, or steps, whether existing now or developed in the future, that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described here can be used according to the present invention. Accordingly, the appended claims are intended to include such processes, devices, means, methods, or steps within their scope.
With regard to the implementations including the above embodiments, the following supplementary notes are also disclosed.
Supplementary note 1. A method for mining subject keywords in a picture, comprising:
an initial search term identification step of identifying keywords in the picture as initial search terms;
a candidate keyword extraction step of using the search terms to search for subject web pages related to the picture and extracting candidate keywords from them;
a search term selection step of selecting, according to the link relationship between the candidate keywords and the search terms used to search for the candidate keywords, a part of the candidate keywords as the search terms for the next candidate keyword extraction step; and
repeating the candidate keyword extraction step and the search term selection step until a predetermined condition is met.
Supplementary note 2. The method according to supplementary note 1, wherein the search term selection step comprises:
selecting a part of the candidate keywords as the search terms for the next candidate keyword extraction step according to the similarity between the keywords recognized from the picture and the candidate keywords and according to the link relationship between the candidate keywords and the search terms used to search for the candidate keywords.
Supplementary note 3. The method according to supplementary note 1 or 2, wherein selecting a part of the candidate keywords as the search terms for the next candidate keyword extraction step according to the link relationship between the candidate keywords and the search terms used to search for the candidate keywords comprises: when selecting a part of the candidate keywords as the search terms for the next candidate keyword extraction step, preferentially selecting candidate keywords retrieved by more search terms.
Supplementary note 4. The method according to supplementary note 3, wherein preferentially selecting candidate keywords retrieved by more search terms as the search terms for the next candidate keyword extraction step comprises:
calculating the vocabulary score PR(C) of each candidate keyword C as PR(C) = (1 - d) + d × (PR(S_1)/O(S_1) + PR(S_2)/O(S_2) + … + PR(S_n)/O(S_n)), where S_i is the i-th search term used to retrieve the candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, O(S_i) is the number of candidate keywords produced by retrieval with the search term S_i, i = 1, 2, …, n, and d is the damping coefficient; and
the higher the vocabulary score PR(C) of a candidate keyword C, the more preferentially that candidate keyword C is selected as a search term for the next candidate keyword extraction step.
Supplementary note 5. The method according to supplementary note 3, wherein preferentially selecting candidate keywords retrieved by more search terms as the search terms for the next candidate keyword extraction step comprises:
calculating the vocabulary score PR(C) of each candidate keyword C as PR(C) = (1 - d) + d × (P(S_1→C) × PR(S_1) + P(S_2→C) × PR(S_2) + … + P(S_n→C) × PR(S_n)),
where P(S_i→C) is the probability that search term S_i produces candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, i = 1, 2, …, n, and d is the damping coefficient,
where O_k denotes a keyword recognized from the picture, and the calculation of P(S_i→C) uses the candidate keyword compared with O_k, the similarity between O_k and that candidate keyword, and the probability that the candidate keyword occurs; and
the higher the vocabulary score PR(C) of a candidate keyword C, the more preferentially that candidate keyword C is selected as a search term for the next candidate keyword extraction step.
Supplementary note 6. The method according to supplementary note 2 or 5, wherein the similarity is calculated from features of the keywords recognized from the picture and of the candidate keywords.
Supplementary note 7. The method according to supplementary note 6, wherein the features include at least one of the following: the size of the keyword recognized from the picture, the position of the candidate keyword in the corresponding text, the common substring of the candidate keyword and the keyword recognized from the picture, the geometric distance within the picture of the keywords recognized from the picture, the mutual information of the candidate keyword in the corresponding text, and the edit distance between the keyword recognized from the picture and the candidate keyword.
Supplementary note 8. The method according to supplementary note 7, wherein the cost of character substitution in the edit distance is calculated according to the confidence of the keyword recognized from the picture.
Supplementary note 9. The method according to any one of supplementary notes 1 to 8, wherein the search terms also include entity names, and the entity names include words recognized from the picture that relate to time, place, and names.
Supplementary note 10. The method according to any one of supplementary notes 1 to 8, further comprising:
mining the subject keywords in the picture using the selected part of the candidate keywords.
Supplementary note 11. The method according to any one of supplementary notes 1 to 8, wherein the candidate keyword extraction step comprises:
performing text matching between the web pages found with the search terms and the recognition result of the picture;
selecting, according to the text matching result, the subject web pages related to the picture from the web pages found; and
extracting the candidate keywords from the subject web pages.
Supplementary note 12. The method according to any one of supplementary notes 1 to 8, wherein the predetermined condition includes a predetermined convergence condition and/or a predetermined number of iterations.
Supplementary note 13. An apparatus for mining subject keywords in a picture, comprising:
an initial search term identification module configured to identify keywords in the picture as initial search terms;
a candidate keyword extraction module configured to use the search terms to search for subject web pages related to the picture and extract candidate keywords from them;
a search term selection module configured to select, according to the link relationship between the candidate keywords and the search terms the candidate keyword extraction module used for its search, a part of the candidate keywords as the search terms for the next execution of the candidate keyword extraction module; and
a control module configured to control the candidate keyword extraction module and the search term selection module to operate cyclically until a predetermined condition is met.
Supplementary note 14. The apparatus according to supplementary note 13, wherein the search term selection module is configured to:
select a part of the candidate keywords as the search terms for the next search performed by the candidate keyword extraction module according to the similarity between the keywords recognized from the picture and the candidate keywords and according to the link relationship between the candidate keywords and the search terms used to search for the candidate keywords.
Supplementary note 15. The apparatus according to supplementary note 13 or 14, wherein the search term selection module is configured to preferentially select candidate keywords retrieved by more search terms as the search terms for the next search performed by the candidate keyword extraction module.
Supplementary note 16. The apparatus according to supplementary note 15, wherein the search term selection module comprises:
a vocabulary score calculation unit configured to calculate the vocabulary score PR(C) of each candidate keyword C as PR(C) = (1 - d) + d × (PR(S_1)/O(S_1) + PR(S_2)/O(S_2) + … + PR(S_n)/O(S_n)), where S_i is the i-th search term used to retrieve the candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, O(S_i) is the number of candidate keywords produced by retrieval with the search term S_i, i = 1, 2, …, n, and d is the damping coefficient; and
a search term selection unit configured to preferentially select candidate keywords C with a high vocabulary score PR(C) as the search terms for the next search performed by the candidate keyword extraction module.
Supplementary note 17. The apparatus according to supplementary note 15, wherein the search term selection module comprises:
a vocabulary score calculation unit configured to calculate the vocabulary score PR(C) of each candidate keyword C according to the following equation:
PR(C) = (1 - d) + d × (P(S_1→C) × PR(S_1) + P(S_2→C) × PR(S_2) + … + P(S_n→C) × PR(S_n))
where P(S_i→C) is the probability that search term S_i produces candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, i = 1, 2, …, n, and d is the damping coefficient, and where O_k denotes a keyword recognized from the picture and the calculation of P(S_i→C) uses the candidate keyword compared with O_k, the similarity between O_k and that candidate keyword, and the probability that the candidate keyword occurs; and
a search term selection unit configured to preferentially select candidate keywords C with a high vocabulary score PR(C) as the search terms for the next search performed by the candidate keyword extraction module.
Supplementary note 18. The apparatus according to supplementary note 14 or 17, wherein the similarity is calculated from features of the keywords recognized from the picture and of the candidate keywords.
Supplementary note 19. The apparatus according to supplementary note 18, wherein the features include at least one of the following: the size of the keyword recognized from the picture, the position of the candidate keyword in the corresponding text, the common substring of the candidate keyword and the keyword recognized from the picture, the geometric distance within the picture of the keywords recognized from the picture, the mutual information of the candidate keyword in the corresponding text, and the edit distance between the keyword recognized from the picture and the candidate keyword.
Supplementary note 20. The apparatus according to supplementary note 19, wherein the cost of character substitution in the edit distance is calculated according to the confidence of the keyword recognized from the picture.
Supplementary note 21. The apparatus according to any one of supplementary notes 13 to 20, wherein the search terms also include entity names, and the entity names include words recognized from the picture that relate to time, place, and names.
Supplementary note 22. The apparatus according to any one of supplementary notes 13 to 20, further comprising:
a subject keyword mining module configured to mine the subject keywords in the picture using the selected part of the candidate keywords.
Supplementary note 23. The apparatus according to any one of supplementary notes 13 to 20, wherein the candidate keyword extraction module comprises:
a text matching unit configured to perform text matching between the web pages found with the search terms and the recognition result of the picture;
a subject web page selection unit configured to select, according to the text matching result, the subject web pages related to the picture from the web pages found; and
a candidate keyword extraction unit configured to extract the candidate keywords from the subject web pages.
Supplementary note 24. The apparatus according to any one of supplementary notes 13 to 20, wherein the predetermined condition includes a predetermined convergence condition and/or a predetermined number of iterations.

Claims (10)

1. A method for mining subject keywords in a picture, comprising:
an initial search term identification step of identifying keywords in the picture as initial search terms;
a candidate keyword extraction step of using the search terms to search for subject web pages related to the picture and extracting candidate keywords from them;
a search term selection step of selecting, according to the link relationship between the candidate keywords and the search terms used to search for the candidate keywords, a part of the candidate keywords as the search terms for the next candidate keyword extraction step; and
repeating the candidate keyword extraction step and the search term selection step until a predetermined condition is met.
2. The method according to claim 1, wherein the search term selection step comprises:
selecting a part of the candidate keywords as the search terms for the next candidate keyword extraction step according to the similarity between the keywords recognized from the picture and the candidate keywords and according to the link relationship between the candidate keywords and the search terms used to search for the candidate keywords.
3. The method according to claim 1 or 2, wherein selecting a part of the candidate keywords as the search terms for the next candidate keyword extraction step according to the link relationship between the candidate keywords and the search terms used to search for the candidate keywords comprises: when selecting a part of the candidate keywords as the search terms for the next candidate keyword extraction step, preferentially selecting candidate keywords retrieved by more search terms.
4. The method according to claim 3, wherein preferentially selecting candidate keywords retrieved by more search terms as the search terms for the next candidate keyword extraction step comprises:
calculating the vocabulary score PR(C) of each candidate keyword C as PR(C) = (1 - d) + d × (PR(S_1)/O(S_1) + PR(S_2)/O(S_2) + … + PR(S_n)/O(S_n)), where S_i is the i-th search term used to retrieve the candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, O(S_i) is the number of candidate keywords produced by retrieval with the search term S_i, i = 1, 2, …, n, and d is the damping coefficient; and
the higher the vocabulary score PR(C) of a candidate keyword C, the more preferentially that candidate keyword C is selected as a search term for the next candidate keyword extraction step.
5. The method according to claim 3, wherein preferentially selecting candidate keywords retrieved by more search terms as the search terms for the next candidate keyword extraction step comprises:
calculating the vocabulary score PR(C) of each candidate keyword C as PR(C) = (1 - d) + d × (P(S_1→C) × PR(S_1) + P(S_2→C) × PR(S_2) + … + P(S_n→C) × PR(S_n)),
where P(S_i→C) is the probability that search term S_i produces candidate keyword C, PR(S_i) is the vocabulary score of search term S_i, i = 1, 2, …, n, and d is the damping coefficient,
where P(S_i→j) is the probability that search term S_i produces candidate keyword C_j, j = 1, …, m, O_k denotes a keyword recognized from the picture, and the calculation of P(S_i→j) uses the candidate keyword compared with O_k, the similarity between O_k and that candidate keyword, and the probability that the candidate keyword occurs; and
the higher the vocabulary score PR(C) of a candidate keyword C, the more preferentially that candidate keyword C is selected as a search term for the next candidate keyword extraction step.
6. The method according to claim 2 or 5, wherein the similarity is calculated from features of the keywords recognized from the picture and of the candidate keywords.
7. The method according to claim 6, wherein the features include at least one of the following: the size of the keyword recognized from the picture, the position of the candidate keyword in the corresponding text, the common substring of the candidate keyword and the keyword recognized from the picture, the geometric distance within the picture of the keywords recognized from the picture, the mutual information of the candidate keyword in the corresponding text, and the edit distance between the keyword recognized from the picture and the candidate keyword.
8. The method according to claim 7, wherein the cost of character substitution in the edit distance is calculated according to the confidence of the keyword recognized from the picture.
9. The method according to claim 1, wherein the candidate keyword extraction step comprises:
performing text matching between the web pages found with the search terms and the recognition result of the picture;
selecting, according to the text matching result, the subject web pages related to the picture from the web pages found; and
extracting the candidate keywords from the subject web pages.
10. An apparatus for mining subject keywords in a picture, comprising:
an initial search term identification module, configured to identify keywords in the picture as initial search terms;
a candidate keyword extraction module, configured to use the search terms to search for subject web pages related to the picture and to extract candidate keywords therefrom;
a search term selection module, configured to select, according to the link relationships between the candidate keywords and the search terms used to search for the candidate keywords, a subset of the candidate keywords as the search terms to be used by the candidate keyword extraction module in its next search for candidate keywords; and
a control module, configured to control the candidate keyword extraction module and the search term selection module to operate in a loop until a predetermined condition is met.
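The cooperation of the modules of claim 10 can be sketched as the loop below. This is an illustration under stated assumptions, not the claimed implementation: the link relationship is modeled as the set of distinct search terms that led to each candidate keyword, the selection rule picks the unused candidates with the most such links, and the predetermined condition is a round limit or the absence of new search terms. The callable `extract_candidates` stands in for the candidate keyword extraction module and is hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def mine_subject_keywords(
    initial_terms: Iterable[str],
    extract_candidates: Callable[[str], Iterable[str]],
    terms_per_round: int = 2,
    max_rounds: int = 3,
) -> Dict[str, int]:
    """Return each candidate keyword with its number of linking search terms."""
    links: Dict[str, Set[str]] = defaultdict(set)  # candidate -> terms that found it
    used: Set[str] = set()
    terms: List[str] = list(dict.fromkeys(initial_terms))  # de-duplicate, keep order
    for _ in range(max_rounds):  # assumed stop condition 1: round limit
        if not terms:            # assumed stop condition 2: nothing new to search
            break
        for term in terms:
            used.add(term)
            for candidate in extract_candidates(term):
                if candidate != term:
                    links[candidate].add(term)
        # Selection step: unused candidates ranked by number of inbound links,
        # ties broken alphabetically so the result is deterministic.
        ranked = sorted(
            (c for c in links if c not in used),
            key=lambda c: (-len(links[c]), c),
        )
        terms = ranked[:terms_per_round]
    return {candidate: len(found_by) for candidate, found_by in links.items()}
```

Candidates reached from many different search terms accumulate more links, which is one simple way to let repeated rounds converge on keywords that are central to the subject of the picture.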
CN201210246688.2A 2012-07-16 2012-07-16 The method and apparatus excavating the subject key words in picture Expired - Fee Related CN103544186B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210246688.2A CN103544186B (en) 2012-07-16 2012-07-16 The method and apparatus excavating the subject key words in picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210246688.2A CN103544186B (en) 2012-07-16 2012-07-16 The method and apparatus excavating the subject key words in picture

Publications (2)

Publication Number Publication Date
CN103544186A CN103544186A (en) 2014-01-29
CN103544186B true CN103544186B (en) 2017-03-01

Family

ID=49967649

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210246688.2A Expired - Fee Related CN103544186B (en) 2012-07-16 2012-07-16 The method and apparatus excavating the subject key words in picture

Country Status (1)

Country Link
CN (1) CN103544186B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105812231B (en) * 2014-12-29 2019-11-05 阿里巴巴集团控股有限公司 The method for quickly identifying and its device of chat record
CN108572971B (en) * 2017-03-09 2022-11-01 百度在线网络技术(北京)有限公司 Method and device for mining keywords related to search terms
CN110020042B (en) * 2017-08-25 2021-09-10 杭州海康威视数字技术股份有限公司 Image acquisition method and device based on webpage
CN107633460A (en) * 2017-09-18 2018-01-26 北京奇艺世纪科技有限公司 Content distribution control method and device
CN107798070A (en) * 2017-09-26 2018-03-13 平安普惠企业管理有限公司 A kind of web data acquisition methods and terminal device
CN111488512A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Target to be collected obtaining method, device, equipment and storage medium
CN111859095A (en) * 2019-04-02 2020-10-30 搜狗(杭州)智能科技有限公司 Picture identification method and device
CN113590861A (en) * 2020-04-30 2021-11-02 北京搜狗科技发展有限公司 Picture information processing method and device and electronic equipment
CN112199545B (en) * 2020-11-23 2021-09-07 湖南蚁坊软件股份有限公司 Keyword display method and device based on picture character positioning and storage medium
CN114547404B (en) * 2022-01-10 2023-02-17 普瑞纯证医疗科技(苏州)有限公司 Big data platform system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1763798A1 (en) * 2004-06-17 2007-03-21 Nokia Corporation System and method for search operations
CN101464903A (en) * 2009-01-09 2009-06-24 江阴明伦科技有限公司 OCR picture and text recognition and retrieval method and system through web mode
CN102073653A (en) * 2009-11-20 2011-05-25 富士通株式会社 Information extraction method and device

Also Published As

Publication number Publication date
CN103544186A (en) 2014-01-29

Similar Documents

Publication Publication Date Title
CN103544186B (en) The method and apparatus excavating the subject key words in picture
CN104239300B (en) The method and apparatus that semantic key words are excavated from text
Culotta et al. Reducing labeling effort for structured prediction tasks
US8812299B1 (en) Class-based language model and use
US8082151B2 (en) System and method of generating responses to text-based messages
US9245243B2 (en) Concept-based analysis of structured and unstructured data using concept inheritance
US20110231347A1 (en) Named Entity Recognition in Query
CN101799802B (en) Method and system for extracting entity relationship by using structural information
US20080005051A1 (en) Lexicon generation methods, computer implemented lexicon editing methods, lexicon generation devices, lexicon editors, and articles of manufacture
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN103365849B (en) Keyword retrieval method and apparatus
US10949452B2 (en) Constructing content based on multi-sentence compression of source content
US20070233668A1 (en) Method, system, and computer program product for semantic annotation of data in a software system
CN103577414B (en) Data processing method and device
KR100835290B1 (en) System and method for classifying document
WO2014081762A1 (en) Mobile-commerce store generator that automatically extracts and converts data
CN110110218B (en) Identity association method and terminal
CN107329770A (en) The personalized recommendation method repaired for software security BUG
JP2004318510A (en) Original and translation information creating device, its program and its method, original and translation information retrieval device, its program and its method
US11663407B2 (en) Management of text-item recognition systems
US20050065947A1 (en) Thesaurus maintaining system and method
CN107784019A (en) Word treatment method and system are searched in a kind of searching service
JP2012221489A (en) Method and apparatus for efficiently processing query
CN117252186A (en) XAI-based information processing method, device, equipment and storage medium
CN103514194B (en) Determine method and apparatus and the classifier training method of the dependency of language material and entity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170301

Termination date: 20180716