CN110019675A

CN110019675A - Method and device for keyword extraction

Info

Publication number: CN110019675A
Application number: CN201711251357.7A
Authority: CN
Inventors: 谢泽华; 周泽南; 苏雪峰; 佟子健
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2019-07-16
Anticipated expiration: 2037-12-01
Also published as: CN110019675B

Abstract

The invention discloses a method and device for keyword extraction. To extract keywords for a target picture, text is obtained from the webpage containing the target picture, and a candidate word set is obtained from that text. For each candidate word in the set, the feature values of its descriptive features are determined. The descriptive features include a picture-text feature, whose feature value characterizes the degree of similarity between the candidate word and the target picture. Keywords are then extracted from the candidate word set according to the feature values of the descriptive features of each candidate word. Because the descriptive features used for extraction include the picture-text feature, the extracted keywords are candidate words that are highly similar to the target picture. Such keywords characterize the content of the target picture well and match it closely, so using them as index terms improves the accuracy of retrieving the target picture.

Description

Method and device for keyword extraction

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and device for keyword extraction.

Background art

A keyword is a term used when performing a web search. Keyword search is the process of finding the target objects that have the keyword as an index term. Taking the search for a target picture A as an example, an inverted index between keyword B and target picture A is created in advance, with keyword B as the index term of target picture A. When keyword B, or a query word similar to keyword B, is entered, the search engine automatically finds target picture A, which has keyword B as its index term.

Before the inverted index of a target picture is created, the keywords corresponding to the target picture must first be obtained as index terms. Typically, this is done in four steps. First, the text field of the target picture, which contains text related to the target picture, is obtained from the webpage containing the picture. Next, candidate words are obtained from that text. Then, the text features of each candidate word are determined, where a text feature characterizes attribute information of the candidate word in the text. Finally, keywords are extracted from the candidate words according to their text features.

However, keywords obtained from the text features of candidate words can only characterize the content of the text related to the target picture; they cannot characterize the content of the target picture itself well. That is, they are highly relevant to the text but only weakly relevant to the target picture. When such a keyword is used as the index term of the target picture, the pictures retrieved are not accurate enough.

Summary of the invention

The technical problem solved by the present invention is to provide a method and device for keyword extraction in which keywords are extracted according to the feature values of the descriptive features of candidate words. The descriptive features include a picture-text feature that characterizes the similarity between a candidate word and the target picture, so the extracted keywords characterize the content of the target picture and improve the accuracy with which the target picture is retrieved by those keywords.

To this end, the technical solution of the present invention is as follows:

In a first aspect, to solve the above problem, an embodiment of the present invention provides a method of keyword extraction, the method comprising:

obtaining a candidate word set from the text of the webpage containing a target picture；

determining the feature values of the descriptive features of the candidate words in the candidate word set, the descriptive features including a picture-text feature whose feature value characterizes the degree of similarity between the candidate word and the target picture；

extracting, from the candidate word set, the keywords corresponding to the target picture according to the feature values of the descriptive features of the candidate words.

Optionally,

the descriptive features further include text features, whose feature values characterize attribute information of the candidate word in the text.

Optionally,

the text features include any one or more of the following: the word frequency of the candidate word in the text, the part of speech of the candidate word in the text, the TF-IDF value of the candidate word in the text, the length of the candidate word, and whether the text containing the candidate word belongs to a key text field.

Optionally, determining the feature value of the picture-text feature of a candidate word in the candidate word set comprises:

extracting a high-dimensional feature vector of the target picture；

extracting a word embedding vector of the candidate word；

mapping the high-dimensional feature vector and the word embedding vector to the same feature space；

obtaining, in the feature space, the similarity between the high-dimensional feature vector and the word embedding vector as the feature value of the picture-text feature of the candidate word.

Optionally, extracting the keywords corresponding to the target picture from the candidate word set according to the feature values of the descriptive features of the candidate words comprises:

determining the relevance between the candidate word and the target picture from the feature values of the descriptive features of the candidate word, using a preset trained model that characterizes the correspondence between the feature values of the descriptive features and the relevance to the target picture；

extracting, from the candidate word set, at least one keyword corresponding to the target picture according to the relevance between each candidate word and the target picture.
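The two steps above (scoring each candidate word with a preset trained model, then extracting the highest-scoring words) can be illustrated with a minimal sketch. The text does not specify the model type; the logistic scorer, the feature names, and the values of `WEIGHTS` and `BIAS` below are hypothetical stand-ins chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical parameters standing in for the preset trained model.
# The source does not state the model type or its parameters.
WEIGHTS = {"image_text_similarity": 3.0, "tf_idf": 1.5, "in_key_field": 0.5}
BIAS = -2.0


def relevance(features: Dict[str, float]) -> float:
    """Map the feature values of one candidate word to a relevance in (0, 1)."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def extract_keywords(
    candidates: Dict[str, Dict[str, float]], top_k: int = 1
) -> List[Tuple[str, float]]:
    """Rank candidate words by relevance and keep the top_k as keywords."""
    scored = [(word, relevance(feats)) for word, feats in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In this sketch a candidate word with a high picture-text feature value can outrank a word that is merely frequent in the text, which is the behaviour the method aims for.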

Optionally, obtaining the candidate word set from the text of the webpage containing the target picture comprises:

obtaining the text field of the target picture from the webpage containing the target picture；

obtaining the text in the text field；

performing word segmentation on the text to obtain the candidate word set.

Optionally, after word segmentation is performed on the text, the method further comprises:

performing word merging and/or stop-word removal on the segmented text.

In a second aspect, to solve the above problem, an embodiment of the present invention provides a device for keyword extraction, the device comprising:

an obtaining module, configured to obtain a candidate word set from the text of the webpage containing a target picture；

a determining module, configured to determine the feature values of the descriptive features of the candidate words in the candidate word set, the descriptive features including a picture-text feature whose feature value characterizes the degree of similarity between the candidate word and the target picture；

an extraction module, configured to extract, from the candidate word set, the keywords corresponding to the target picture according to the feature values of the descriptive features of the candidate words.

Optionally,

Optionally, the determining module includes:

a first extraction unit, configured to extract a high-dimensional feature vector of the target picture；

a second extraction unit, configured to extract a word embedding vector of the candidate word；

a mapping unit, configured to map the high-dimensional feature vector and the word embedding vector to the same feature space；

an obtaining unit, configured to obtain, in the feature space, the similarity between the high-dimensional feature vector and the word embedding vector as the feature value of the picture-text feature of the candidate word.

Optionally, the extraction module includes:

a determination unit, configured to determine the relevance between the candidate word and the target picture from the feature values of the descriptive features of the candidate word, using a preset trained model that characterizes the correspondence between the feature values of the descriptive features and the relevance to the target picture；

an extraction unit, configured to extract, from the candidate word set, at least one keyword corresponding to the target picture according to the relevance between each candidate word and the target picture.

Optionally, the obtaining module includes:

a first acquisition unit, configured to obtain the text field of the target picture from the webpage containing the target picture；

a second acquisition unit, configured to obtain the text in the text field；

a third acquisition unit, configured to perform word segmentation on the text to obtain the candidate word set.

Optionally, after word segmentation is performed on the text, the device further comprises:

a processing unit, configured to perform word merging and/or stop-word removal on the segmented text.

In a third aspect, an embodiment of the present invention provides an electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for performing the following operations:

Optionally,

extracting a high-dimensional feature vector of the target picture；

extracting a word embedding vector of the candidate word；

obtaining the text in the text field；

Optionally, after word segmentation is performed on the text, the instructions further include:

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a method of keyword extraction, the method comprising:

Optionally,

extracting a high-dimensional feature vector of the target picture；

extracting a word embedding vector of the candidate word；

obtaining the text in the text field；

As can be seen from the above technical solution, the present invention has the following beneficial effects:

To extract keywords for a target picture, text is obtained from the webpage containing the target picture, and a candidate word set is obtained from that text. For each candidate word in the set, the feature values of its descriptive features are determined. The descriptive features include a picture-text feature, whose feature value characterizes the degree of similarity between the candidate word and the target picture. Keywords are then extracted from the candidate word set according to the feature values of the descriptive features of each candidate word. Because the descriptive features used for extraction include the picture-text feature, the extracted keywords are candidate words that are highly similar to the target picture. Such keywords characterize the content of the target picture well and match it closely, so using them as index terms improves the accuracy of retrieving the target picture.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the keyword extraction method provided by an embodiment of the present invention；

Fig. 2 is a schematic diagram of text fields provided by an embodiment of the present invention；

Fig. 3 is a schematic diagram of an anchor text field provided by an embodiment of the present invention；

Fig. 4 is a schematic structural diagram of the keyword extraction device provided by an embodiment of the present invention；

Fig. 5 is a schematic diagram of the hardware structure of the electronic device provided by an embodiment of the present invention.

Detailed description

To provide an implementation of keyword extraction that improves picture retrieval accuracy, embodiments of the present invention provide a method and device for keyword extraction. Preferred embodiments of the invention are described below with reference to the accompanying drawings.

Keyword retrieval is one of the most common retrieval modes: a keyword is entered, and the target objects that have that keyword as an index term are looked up. To create an inverted index for a target object, keywords that characterize the target object must be obtained to serve as the index terms of the inverted index. When the target object is a picture, the text of the webpage containing the target picture is obtained, a candidate word set is obtained from the text, and, according to the attributes of each candidate word in the text, keywords that characterize the content of the text well are extracted. Such keywords are highly relevant to the text. Since the text of the webpage containing a target picture is generally related to the target picture, the keywords extracted in this way are used as the keywords of the target picture.

In practice, however, using keywords extracted in this way as the keywords of the target picture has the following problem: the keywords are highly relevant to the text and characterize its content well, but they do not characterize the content of the target picture well and match it poorly. Using such a keyword as the index term of the target picture reduces the accuracy of retrieving the target picture.

For example, suppose the target picture shows Tom, the cat in the cartoon "Cat and Mouse", while the text of the webpage containing the picture is mainly about the development history of the cartoon. The keyword extracted from the text in the manner described above is "Cat and Mouse", so the inverted index created maps "Cat and Mouse" to the target picture. Although "Cat and Mouse" characterizes the content of the text well, it does not characterize the content of the target picture well. A user who wants to retrieve this picture would generally think of the keyword "Tom cat" rather than "Cat and Mouse", and so cannot retrieve the target picture. The accuracy of retrieving the target picture is therefore reduced.

To solve the above problem, in the keyword extraction method provided by the present invention, a candidate word set is obtained from the text, and the feature values of the descriptive features of each candidate word in the set are determined. The descriptive features include a picture-text feature that characterizes the degree of similarity between the candidate word and the target picture. Keywords are extracted from the candidate word set according to the feature values of the descriptive features of each candidate word. Because the feature value of the picture-text feature is taken into account during extraction, the extracted keywords are candidate words that are highly similar to the target picture. Such keywords characterize the content of the target picture well and match it closely, so using them as index terms improves the accuracy of retrieving the target picture.

Embodiments of the present invention are described in detail below.

Exemplary method

Fig. 1 is a flow chart of the keyword extraction method provided by an embodiment of the present invention. The method includes:

101: Obtain a candidate word set from the text of the webpage containing the target picture.

When an inverted index is to be created for a picture, the keywords that can serve as index terms of the picture must first be obtained. The picture may be any picture that can be found on a webpage; this picture is taken as the target picture.

In one example, obtaining the candidate word set from the text of the webpage containing the target picture includes:

obtaining the text field of the target picture from the webpage containing the target picture；

obtaining the text in the text field；

performing word segmentation on the text to obtain the candidate word set.

To obtain the text of the webpage containing the target picture, the text fields of the target picture in that webpage are obtained first. A text field is a region of a webpage in which text is placed. The text fields of the target picture are the text fields in the webpage that are related to the target picture, including the title text field, the description (describe) text field, the anchor text field, and the surrounding (surround) text field. The title text field contains the title of the webpage containing the target picture. The description text field contains the textual description of the target picture. The anchor text field contains the text that describes the target picture in the source code of the webpage. The surrounding text field contains the other text around the target picture in the webpage, apart from the textual description. In a specific implementation, any one or more of the title text field, the description text field, the anchor text field, and the surrounding text field may be obtained as the text fields of the target picture; this is not specifically limited here.

For example, as shown in Fig. 2, for target picture 202 in webpage 201, the text at the top of webpage 201 is the title text field 203; the line of text below target picture 202 is the description text field 204 of target picture 202; and the paragraph of text at the bottom of webpage 201 is the surrounding text field 205. The anchor text field of the target picture is shown in Fig. 3. This example is given only to better illustrate the text fields of the target picture, which are not limited to what the example describes.
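As a rough illustration of collecting the four text fields from a webpage, the following sketch uses Python's standard `html.parser` module. The text does not say which HTML elements correspond to each text field, so the mapping used here is purely an assumption for illustration: `<title>` for the title text field, `<figcaption>` for the description text field, the `alt` attribute of the target `<img>` for the anchor text field, and `<p>` elements for the surrounding text field. The names `TextFieldParser` and `extract_text_fields` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, Optional

# Assumed mapping from HTML tags to text fields (not given in the source).
TAG_TO_FIELD = {"title": "title", "figcaption": "description", "p": "surrounding"}


class TextFieldParser(HTMLParser):
    """Collects the text fields related to one target picture."""

    def __init__(self, target_src: str) -> None:
        super().__init__()
        self.target_src = target_src
        self.fields: Dict[str, str] = {
            "title": "", "description": "", "anchor": "", "surrounding": ""
        }
        self._current: Optional[str] = None  # field currently receiving text

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in TAG_TO_FIELD:
            self._current = TAG_TO_FIELD[tag]
        elif tag == "img" and attr_map.get("src") == self.target_src:
            # Assumption: the alt text plays the role of the anchor text.
            self.fields["anchor"] = attr_map.get("alt") or ""

    def handle_endtag(self, tag):
        if tag in TAG_TO_FIELD:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            joined = self.fields[self._current] + " " + text
            self.fields[self._current] = joined.strip()


def extract_text_fields(html: str, target_src: str) -> Dict[str, str]:
    """Return the text of each text field for the picture with the given src."""
    parser = TextFieldParser(target_src)
    parser.feed(html)
    parser.close()
    return parser.fields
```

A real page would need a more robust notion of which caption and which surrounding paragraphs belong to the target picture; the sketch simply takes every matching element on the page.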

After the text fields of the target picture are obtained, the text in those text fields is obtained. The text is segmented using any word segmentation algorithm, and the words resulting from segmentation form the candidate word set. For example, take the text in the title text field 203 shown in Fig. 2, "refreshing dog small seven, small seven and betel nut younger sister got married, Ah mew and small seven are together". Segmenting this text yields "refreshing dog / small seven / small seven / and / betel nut / younger sister / got married / [function word] / Ah mew / and / small seven / [function word] / together / [function word]", and the words in the segmented text form the candidate word set.

In the above example, after word segmentation is performed on the text, word merging and/or stop-word removal may also be performed on the segmented text.

First scenario: the text fields of the target picture are obtained from the webpage containing the target picture, the text in the text fields is obtained, word segmentation is performed on the text, and word merging is performed on the segmented text to obtain the candidate word set.

When word merging is performed on the segmented text, a preset merge dictionary may be used: at least two adjacent words that hit the merge dictionary are merged into one word. The preset merge dictionary includes a named entity dictionary, a picture search hot word dictionary, and the like. The named entity dictionary contains a large number of common names crawled from the Internet. The picture search hot word dictionary contains crawled keywords that are used very frequently in picture searches. The preset merge dictionary may also include other dictionaries as actually needed, and may be updated at any time to meet actual demand.

Compared with candidate words obtained by segmentation alone, candidate words obtained by merging the segmented text with the preset merge dictionary express a correct, complete meaning. Candidate words obtained by merging according to the named entity dictionary characterize entities identified by a name that appear on the Internet, such as person names, place names, and organization names. Candidate words obtained by merging according to the picture search hot word dictionary are words entered very frequently in Internet picture searches.

For example, again take the text in the title text field 203 shown in Fig. 2, "refreshing dog small seven, small seven and betel nut younger sister got married, Ah mew and small seven are together". Segmenting this text yields "refreshing dog / small seven / small seven / and / betel nut / younger sister / got married / [function word] / Ah mew / and / small seven / [function word] / together / [function word]". Word merging is then performed on the segmented text using the preset merge dictionary, and the resulting candidate word set is "refreshing dog small seven / small seven / and / betel nut younger sister / got married / [function word] / Ah mew and small seven / [function word] / together / [function word]". Here "refreshing dog small seven" and "betel nut younger sister" hit the named entity dictionary, so the two adjacent words "refreshing dog / small seven" are merged into "refreshing dog small seven", and the two adjacent words "betel nut / younger sister" are merged into "betel nut younger sister". "Ah mew and small seven" hits the hot word dictionary, so the three adjacent words "Ah mew / and / small seven" are merged into "Ah mew and small seven".

Of course, this example is given only to better illustrate word merging, which is not limited to what the example describes.
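Word merging against a preset dictionary can be sketched as a greedy longest-match pass over the segmented tokens. The matching strategy is not specified in the text; longest-match-first, the `max_len` limit, and the function name `merge_words` are assumptions made for this illustration, and the dictionary is represented as a set of token tuples.

```python
from typing import List, Set, Tuple


def merge_words(
    tokens: List[str],
    merge_dictionary: Set[Tuple[str, ...]],
    max_len: int = 4,
    joiner: str = " ",
) -> List[str]:
    """Merge runs of adjacent tokens that hit the merge dictionary.

    At each position the longest run (up to max_len tokens) found in the
    dictionary is merged into one word; otherwise the token is kept as is.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            run = tuple(tokens[i:i + n])
            if run in merge_dictionary:
                merged.append(joiner.join(run))
                i += n
                break
        else:
            # No run of two or more tokens starting here is in the dictionary.
            merged.append(tokens[i])
            i += 1
    return merged
```

For Chinese text the `joiner` would normally be the empty string, since merged words are written without spaces.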

Second scenario: the text fields of the target picture are obtained from the webpage containing the target picture, the text in the text fields is obtained, word segmentation is performed on the text, and stop-word removal is performed on the segmented text to obtain the candidate word set.

Stop words are function words that carry no actual meaning in the text, including English characters, digits, mathematical characters, and Chinese characters used with extremely high frequency. Removing the stop words from the text after word segmentation reduces the number of candidate words that carry no actual meaning and improves the efficiency of keyword extraction.

For example, again take the text in the title text field 203 shown in Fig. 2, "refreshing dog small seven, small seven and betel nut younger sister got married, Ah mew and small seven are together". Segmenting this text yields "refreshing dog / small seven / small seven / and / betel nut / younger sister / got married / [function word] / Ah mew / and / small seven / [function word] / together / [function word]". Stop-word removal then deletes the stop words from the segmented text, here the tokens shown as "[function word]", so the resulting candidate word set includes "refreshing dog / small seven / small seven / and / betel nut / younger sister / got married / Ah mew / and / small seven / together".
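Stop-word removal can be sketched as a simple filter over the segmented tokens. The stop-word list below is a hypothetical two-entry example of high-frequency Chinese function words, and treating every purely ASCII token as a stop word is an assumption used here to approximate "English characters, digits, and mathematical characters"; a production list would be far larger.

```python
from typing import Iterable, List, Set

# Hypothetical stop-word list; the source does not enumerate its entries.
STOP_WORDS: Set[str] = {"了", "的"}


def remove_stop_words(
    tokens: Iterable[str], stop_words: Set[str] = STOP_WORDS
) -> List[str]:
    """Drop listed stop words and tokens made only of ASCII characters.

    str.isascii() is also True for the empty string, so empty tokens are
    dropped as well. Requires Python 3.7 or later.
    """
    return [t for t in tokens if t not in stop_words and not t.isascii()]
```

Because the filter does not depend on token order, it can be applied either before or after word merging.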

Third scenario: the text fields of the target picture are obtained from the webpage containing the target picture, the text in the text fields is obtained, word segmentation is performed on the text, and both word merging and stop-word removal are performed on the segmented text to obtain the candidate word set.

In the third scenario, after the text is segmented, both word merging and stop-word removal are performed on the segmented text. The order of the two operations is not limited: word merging may be performed first, or stop-word removal may be performed first. Word merging is performed as described in the first scenario, and stop-word removal as described in the second scenario; the details are not repeated here.

It can be understood that the candidate word set obtained after word merging and stop-word removal contains fewer candidate words than the set obtained by segmentation alone, which improves the efficiency of extracting keywords from the candidate words. Moreover, the resulting candidate words better match word usage in actual Internet scenarios.

Using any of the above methods, a candidate word set is obtained from the text of the webpage containing the target picture. The resulting set contains multiple candidate words, from which the keywords of the target picture are extracted as follows.

102: Determine the feature values of the descriptive features of the candidate words in the candidate word set. The descriptive features include a picture-text feature, whose feature value characterizes the degree of similarity between a candidate word and the target picture.

103: Extract keywords from the candidate word set according to the feature values of the descriptive features of the candidate words.

Before keywords are extracted from the candidate word set, the feature values of the descriptive features of each candidate word must first be analyzed. The descriptive features of a candidate word may be preset features, chosen from practical experience, that characterize attribute information of the candidate word. The types of descriptive features are the same for all candidate words, but the feature values of different candidate words are not necessarily the same. In one scenario, the descriptive features include only the picture-text feature. In another scenario, they include at least one text feature in addition to the picture-text feature, such as the word frequency of the candidate word in the text, its part of speech in the text, or its TF-IDF value in the text. This is not specifically limited here.

In the embodiments of the present invention, to ensure that the extracted keywords characterize the content of the target picture well, the picture-text feature is added to the descriptive features on which keyword extraction is based. For a given candidate word, the feature value of its picture-text feature characterizes the degree of similarity between that candidate word and the target picture. Keywords are extracted from the candidate word set according to the feature values of the descriptive features of each candidate word. Because the descriptive features include the picture-text feature, the similarity between each candidate word and the target picture is taken into account during extraction, and candidate words that are highly similar to the target picture are extracted as keywords. The resulting keywords therefore characterize the content of the target picture well. When an inverted index is built for the target picture with such a keyword as the index term, the accuracy of retrieving the target picture is higher.

In one example, for any candidate word in the candidate word set, the feature value of the picture-text feature of the candidate word may be determined as follows:

extracting a high-dimensional feature vector of the target picture；

extracting a word embedding vector of the candidate word；

mapping the high-dimensional feature vector and the word embedding vector to the same feature space；

obtaining, in the feature space, the similarity between the high-dimensional feature vector and the word embedding vector as the feature value of the picture-text feature of the candidate word.

Determining the feature value of the picture-text feature of a candidate word means determining the similarity between the candidate word and the target picture. On the one hand, a neural network is used to extract the high-dimensional feature vector of the target picture, for example a 1024-dimensional feature vector. On the other hand, word embedding is used to extract the word embedding vector of the candidate word, for example a 128-dimensional word embedding vector. Then, using multimodal learning, the high-dimensional feature vector of the target picture and the word embedding vector of the candidate word are mapped to the same feature space, for example a 512-dimensional feature space. In this feature space, the similarity between the high-dimensional feature vector and the word embedding vector is obtained and taken as the feature value of the picture-text feature of the candidate word. The feature value of the picture-text feature of a candidate word therefore characterizes the degree of similarity between the candidate word and the target picture.
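The mapping-and-similarity step can be sketched with plain Python lists. Two points are assumptions made for illustration, since the text does not fix them: the mappings into the shared feature space are taken to be linear projections, and the similarity is taken to be cosine similarity. In a real system the projection matrices would be learned by multimodal training; the `random_projection` helper below only produces placeholder matrices so the sketch runs.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def random_projection(out_dim: int, in_dim: int, seed: int) -> Matrix:
    """Placeholder for a learned projection matrix (out_dim x in_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix: Matrix, vector: Vector) -> Vector:
    """Map a vector into the shared feature space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def picture_text_feature(
    image_vec: Vector, word_vec: Vector, w_image: Matrix, w_word: Matrix
) -> float:
    """Similarity of the picture vector and the word vector in the shared space."""
    return cosine_similarity(project(w_image, image_vec), project(w_word, word_vec))
```

With the dimensions given as examples in the text, `w_image` would be 512 x 1024 and `w_word` would be 512 x 128.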

In another example, Expressive Features not only include picture and text feature, further include text feature.The figure of one candidate word The characteristic value of literary feature can characterize the similarity degree between the candidate word and Target Photo；And the text of a candidate word is special The characteristic value of sign can then characterize attribute information of the candidate word in the text of the webpage where Target Photo.At this point, again according to When extracting keyword from candidate word set according to the Expressive Features, candidate word journey similar to Target Photo can be not only considered Degree, further accounts for the attribute information of the candidate word in the text, obtained keyword both with the similarity degree of the Target Photo It is relatively high, it can preferably embody the content of the Target Photo；It is also relatively high with the degree of correlation of text simultaneously, it also can be better Characterize the content of text.

In a specific implementation, the text feature includes any one or more of the following: the word frequency of the candidate word in the text, the part of speech of the candidate word in the text, the TF-IDF (term frequency-inverse document frequency) value of the candidate word in the text, the length of the candidate word, and whether the text where the candidate word is located belongs to a key text field.

The word frequency of a candidate word in the text is the number of times the candidate word occurs in the text. It can be understood that a candidate word is a word in the text that carries actual meaning, not a function word without actual meaning. In this case, the higher the word frequency of the candidate word in the text, the higher the relevance between the candidate word and the text.

The TF-IDF value of a candidate word in the text mainly reflects the relevance between the candidate word and the text. The TF-IDF value increases in proportion to the number of times the candidate word occurs in the text and decreases in inverse proportion to the number of times the candidate word occurs in the corpus. Even if a candidate word occurs many times in the text, if it also occurs many times in the corpus, its relevance to the text is not very high. Conversely, when a candidate word occurs many times in the text but only rarely in the corpus, its relevance to the text is very high.
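The word frequency and TF-IDF features can be illustrated with a short sketch. It assumes one common formulation, raw term count multiplied by a smoothed logarithmic inverse document frequency; the text does not fix an exact formula, and the sketch counts the corpus documents that contain the word rather than the word's total occurrences in the corpus. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(candidate, tokens):
    """Number of times the candidate word occurs in the segmented text."""
    return Counter(tokens)[candidate]


def tf_idf(candidate, tokens, corpus):
    """TF-IDF under an assumed smoothed formulation.

    corpus is a list of token sets, one per document.
    """
    tf = term_frequency(candidate, tokens)
    docs_with_word = sum(1 for doc in corpus if candidate in doc)
    # Smoothed IDF: zero when the word appears in every document.
    idf = math.log((1 + len(corpus)) / (1 + docs_with_word))
    return tf * idf


# Toy data (assumption): already segmented, stop words not yet removed.
tokens = ["dog", "dog", "park", "the"]
corpus = [set(tokens), {"the", "cat"}, {"the", "park"}, {"the", "bird"}]

print(term_frequency("dog", tokens))
print(tf_idf("dog", tokens, corpus))
print(tf_idf("the", tokens, corpus))
```

In this toy data the word "dog" is frequent in the text and rare in the corpus, so it scores higher than "the", which appears in every document.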

Whether the text where the candidate word is located belongs to a key text field. The webpage where the target picture is located contains multiple text fields; as described above, these are the title text field, the description text field, the anchor text field, and the surrounding text field. The key text fields are the title text field, the description text field, and the anchor text field; the surrounding text field is a non-key text field. To determine whether the text where a candidate word is located belongs to a key text field, first determine the text to which the candidate word belongs, then determine the text field to which that text belongs, and then determine whether that text field is a key text field.

For example, the candidate word "Divine Dog Little Seven" belongs to the text "Divine Dog Little Seven, Little Seven and Betel Nut Sister got married, A-Miao and Little Seven are together". This text belongs to the title text field, and the title text field is a key text field, so the text where the candidate word "Divine Dog Little Seven" is located belongs to a key text field.
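The three-step check (candidate word to text, text to text field, text field to key or non-key) can be sketched as below. The field names, the page layout as a dictionary, and the rule that one occurrence in any key text field is enough are assumptions made for illustration; the text does not say how a candidate word occurring in several texts should be handled.

```python
# Key text fields named in the description; the identifiers are hypothetical.
KEY_TEXT_FIELDS = {"title", "description", "anchor"}


def fields_containing(candidate, texts_by_field):
    """Return the text fields that hold a text containing the candidate word.

    texts_by_field maps a field name to a list of segmented texts,
    where each text is a list of tokens.
    """
    return {
        field
        for field, texts in texts_by_field.items()
        if any(candidate in tokens for tokens in texts)
    }


def in_key_text_field(candidate, texts_by_field):
    """True if any text containing the candidate lies in a key text field."""
    return bool(fields_containing(candidate, texts_by_field) & KEY_TEXT_FIELDS)


# Toy page (assumption): one title text and one surrounding text.
page = {
    "title": [["small", "dog", "runs", "in", "park"]],
    "surrounding": [["more", "pictures", "below"]],
}

print(in_key_text_field("dog", page))
print(in_key_text_field("pictures", page))
```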

In addition to the text features described above, the text feature may also include the importance of the candidate word. The importance of a candidate word characterizes the number of times the candidate word was entered as a keyword when the text was hit.

The text feature may also include whether the candidate word hits a hot word dictionary. When the candidate word hits the hot word dictionary, this indicates that the candidate word is frequently used as a keyword in web searches.

After the descriptive features of each candidate word are determined, keywords need to be extracted from the candidate word set according to the feature values of the descriptive features of each candidate word, which includes:

According to the feature values of the descriptive features of a candidate word, determining the degree of relevance between the candidate word and the target picture using a preset trained model, where the preset trained model characterizes the correspondence between the feature values of the descriptive features and the degree of relevance to the target picture;

The preset trained model is obtained in advance by training any algorithm model for keyword extraction on a large number of training samples. The preset trained model characterizes the correspondence between the feature values of the descriptive features and the degree of relevance to the target picture. If only the image-text feature is considered when keywords are extracted from the candidate word set, that is, the descriptive features include only the image-text feature, then the descriptive features on which training of the preset trained model is based also include only the image-text feature. If both the image-text feature and the text feature are considered when keywords are extracted from the candidate word set, that is, the descriptive features include both the image-text feature and the text feature, then the descriptive features on which training is based also include both. To ensure that the preset trained model extracts keywords effectively, the types of descriptive features used when training the preset trained model must be consistent with the types of descriptive features used when extracting keywords with that model.

For example, when the preset trained model is obtained, the descriptive features include the image-text feature, the word frequency of the candidate word in the text, the part of speech of the candidate word in the text, the TF-IDF value of the candidate word in the text, the length of the candidate word, and whether the text where the candidate word is located belongs to a key text field. Then, before keywords are extracted, determining the descriptive features of a candidate word requires determining the feature values of the same features: the image-text feature of the candidate word, the word frequency of the candidate word in the text, the part of speech of the candidate word in the text, the TF-IDF value of the candidate word in the text, the length of the candidate word, and whether the text where the candidate word is located belongs to a key text field. In these two different steps (obtaining the preset trained model, and determining the descriptive features of the candidate words before extracting keywords), the types of descriptive features used are the same.
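One way to keep the feature types consistent between the two steps is to build every feature vector from a single shared schema. The sketch below assumes hypothetical feature names, a numeric code for the part of speech, and a strict policy of rejecting any missing or unexpected feature; none of these details come from the text.

```python
# Shared schema (hypothetical names) used both when training the model
# and when extracting keywords with it.
FEATURE_NAMES = (
    "image_text",
    "term_frequency",
    "part_of_speech",   # assumed to be encoded as a numeric code
    "tf_idf",
    "length",
    "in_key_text_field",
)


def to_feature_vector(features):
    """Order feature values by the shared schema, rejecting mismatches."""
    missing = [name for name in FEATURE_NAMES if name not in features]
    extra = [name for name in features if name not in FEATURE_NAMES]
    if missing or extra:
        raise ValueError(f"feature mismatch: missing={missing}, extra={extra}")
    return [float(features[name]) for name in FEATURE_NAMES]


# Illustrative values only (assumption).
vector = to_feature_vector({
    "image_text": 0.97,
    "term_frequency": 3,
    "part_of_speech": 1,
    "tf_idf": 1.5,
    "length": 4,
    "in_key_text_field": True,
})
print(vector)
```

Calling the same function in both steps means that a model trained on six feature types cannot silently be fed a vector built from a different set.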

In a specific implementation, the preset trained model may be a LambdaMART algorithm model trained on a large number of training samples. The LambdaMART algorithm model can take into account the correlation between candidate words and, compared with other algorithm models, requires fewer training samples.

After the preset trained model is obtained, the feature values of the descriptive features of each candidate word in the candidate word set are used as input, and the preset trained model outputs the degree of relevance between each candidate word and the target picture as the result. In one scenario, the preset trained model sorts the candidate words in the candidate word set from high to low by their degree of relevance to the target picture and outputs the ranking of the candidate words. In another scenario, the preset trained model scores the candidate words, giving a high score to a candidate word that is highly relevant to the target picture and a low score to one that is not, and outputs the scores of the candidate words.

It should be noted that, given a preset number of keywords N, the preset trained model may also output the top N candidate words as keywords when it sorts the candidate words from high to low by their degree of relevance to the target picture, or output the N candidate words with the highest scores as keywords when it scores the candidate words. N keywords are thereby obtained.
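Selecting the top N candidate words from model scores can be sketched as follows. The scores here are made-up placeholders standing in for the output of the preset trained model, and the function name is hypothetical.

```python
def top_n_keywords(scores, n):
    """Return the n candidate words with the highest relevance scores.

    scores maps each candidate word to its degree of relevance to the
    target picture, as output by a trained model.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Placeholder scores (assumption), not the output of any real model.
scores = {"dog": 0.9, "park": 0.5, "advert": 0.1}

print(top_n_keywords(scores, 2))
```

The same function covers both scenarios in the text: the full sorted list is the ranking output, and truncating it to N entries yields the N keywords.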

As can be seen from the above, in the keyword extraction method provided by the present invention, the descriptive features on which keyword extraction is based include an image-text feature that characterizes the degree of similarity between a candidate word and the target picture. The extracted keywords are therefore candidate words that are highly similar to the target picture; they characterize the content of the target picture well and match that content closely. When such a keyword is used as an index term, the accuracy of retrieving the target picture can be improved.

For example, for the target picture in the webpage shown in Fig. 2, if the prior-art method is used and keywords are extracted only according to the feature values of the text features of the candidate words, the resulting keywords are "Divine Dog Little Seven" and "getting rid of bulky younger sister". When the keyword extraction method provided by the present invention is used and the image-text feature between each candidate word and the target picture is considered, the image-text feature value of "Divine Dog Little Seven" is 0.976659922765, while that of "getting rid of bulky younger sister" is 0.0276197218635. The resulting keyword is "Divine Dog Little Seven". It can be seen that "Divine Dog Little Seven" is exactly the dog shown in the target picture; this keyword characterizes the content of the target picture and matches the target picture more closely.

Device embodiment

Fig. 4 is a schematic structural diagram of a keyword extraction device provided in an embodiment of the present invention, which includes:

An acquisition module 401, configured to obtain a candidate word set from the text of the webpage where the target picture is located;

A determining module 402, configured to determine the feature values of the descriptive features of the candidate words in the candidate word set, where the descriptive features include an image-text feature, and the feature value of the image-text feature is used to characterize the degree of similarity between the candidate word and the target picture;

An extraction module 403, configured to extract the keyword corresponding to the target picture from the candidate word set according to the feature values of the descriptive features of the candidate words.
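The division into three modules can be pictured as a small pipeline in which each module is a replaceable function. The sketch below is only an illustration of that structure: the class name, the whitespace-based segmentation, and the use of word length as a stand-in feature value are all assumptions and do not reflect how the modules are actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class KeywordExtractionDevice:
    """Hypothetical wiring of modules 401, 402 and 403."""
    acquire: Callable[[str], List[str]]                  # acquisition module
    determine: Callable[[List[str]], Dict[str, float]]   # determining module
    extract: Callable[[Dict[str, float]], List[str]]     # extraction module

    def run(self, page_text: str) -> List[str]:
        candidates = self.acquire(page_text)
        feature_values = self.determine(candidates)
        return self.extract(feature_values)


# Placeholder module bodies (assumption): real modules would segment the
# page text, compute descriptive features, and apply a trained model.
device = KeywordExtractionDevice(
    acquire=lambda text: text.split(),
    determine=lambda words: {w: float(len(w)) for w in words},
    extract=lambda values: sorted(values, key=values.get, reverse=True)[:1],
)

print(device.run("a dog park"))
```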

Optionally,

Optionally, the determining module includes:

Optionally, the extraction module includes:

Optionally, the acquisition module includes:

A second acquisition unit, configured to obtain the text in the text field;

The device shown in Fig. 4 corresponds to the method shown in Fig. 1. Its specific implementation is similar to that of the method shown in Fig. 1; refer to the description of that method, which is not repeated here.

As can be seen from the above, in the keyword extraction device provided by the present invention, the descriptive features on which keyword extraction is based include an image-text feature that characterizes the degree of similarity between a candidate word and the target picture. The extracted keywords are therefore candidate words that are highly similar to the target picture; they characterize the content of the target picture well and match that content closely. When such a keyword is used as an index term, the accuracy of retrieving the target picture can be improved.

Referring to Fig. 5, the electronic device 500 may include one or more of the following components: a processing component 502, a memory 504, a power supply component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.

The processing component 502 generally controls the overall operation of the electronic device 500, such as operations associated with display, telephone calls, data communication, camera operation, and recording operation. The processing component 502 may include one or more processors 520 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 502 may include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 may include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.

The memory 504 is configured to store various types of data to support operation of the device 500. Examples of such data include instructions for any application or method operated on the electronic device 500, contact data, phonebook data, messages, pictures, videos, and so on. The memory 504 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The power supply component 506 provides power to the various components of the electronic device 500. The power supply component 506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 500.

The multimedia component 508 includes a screen that provides an output interface between the electronic device 500 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 508 includes a front camera and/or a rear camera. When the device 500 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 500 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 504 or sent via the communication component 516. In some embodiments, the audio component 510 further includes a loudspeaker for outputting audio signals.

The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and so on. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.

The sensor component 514 includes one or more sensors for providing status assessments of various aspects of the electronic device 500. For example, the sensor component 514 can detect the open/closed state of the device 500 and the relative positioning of components, such as the display and keypad of the electronic device 500. The sensor component 514 can also detect a change in position of the electronic device 500 or of one of its components, the presence or absence of user contact with the electronic device 500, the orientation or acceleration/deceleration of the electronic device 500, and a change in temperature of the electronic device 500. The sensor component 514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 514 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 516 is configured to facilitate wired or wireless communication between the electronic device 500 and other devices. The electronic device 500 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 516 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 500 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above method.

Specifically, an embodiment of the present invention provides an electronic device, which may specifically be the electronic device 500. It includes a memory 504 and one or more programs, where the one or more programs are stored in the memory 504 and are configured to be executed by one or more processors 520, and the one or more programs include instructions for performing the following operations:

Optionally,

Extract the high-dimensional feature vector of the target picture;

Extract the word embedding vector of the candidate word;

Obtain the text in the text field;

An embodiment of the present invention also provides a non-transitory computer-readable storage medium including instructions, for example the memory 504 including instructions, where the instructions can be executed by the processor 520 of the device 500 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and so on.

A non-transitory computer-readable storage medium, where, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a keyword extraction method, the method comprising:

Optionally,

Extract the high-dimensional feature vector of the target picture;

Extract the word embedding vector of the candidate word;

Obtain the text in the text field;

Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this application. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the invention are indicated by the following claims.

It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims. The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A keyword extraction method, characterized in that the method comprises:

Determining the feature value of a descriptive feature of a candidate word in the candidate word set, wherein the descriptive feature includes an image-text feature, and the feature value of the image-text feature is used to characterize the degree of similarity between the candidate word and the target picture;

Extracting, from the candidate word set, the keyword corresponding to the target picture according to the feature value of the descriptive feature of the candidate word.

2. The method according to claim 1, characterized in that

The descriptive feature further includes a text feature, and the feature value of the text feature is used to characterize attribute information of the candidate word in the text.

3. The method according to claim 2, characterized in that

The text feature includes any one or more of the following: the word frequency of the candidate word in the text, the part of speech of the candidate word in the text, the TF-IDF value of the candidate word in the text, the length of the candidate word, and whether the text where the candidate word is located belongs to a key text field.

4. The method according to any one of claims 1 to 3, characterized in that determining the feature value of the image-text feature of a candidate word in the candidate word set comprises:

Extracting the high-dimensional feature vector of the target picture;

Extracting the word embedding vector of the candidate word;

In the feature space, obtaining the similarity between the high-dimensional feature vector and the word embedding vector as the feature value of the image-text feature of the candidate word.

5. The method according to any one of claims 1 to 3, characterized in that extracting, from the candidate word set, the keyword corresponding to the target picture according to the feature value of the descriptive feature of the candidate word comprises:

According to the feature value of the descriptive feature of the candidate word, determining the degree of relevance between the candidate word and the target picture using a preset trained model, wherein the preset trained model is used to characterize the correspondence between the feature value of the descriptive feature and the degree of relevance to the target picture;

According to the degree of relevance between the candidate word and the target picture, extracting at least one keyword corresponding to the target picture from the candidate word set.

6. The method according to any one of claims 1 to 3, characterized in that obtaining the candidate word set from the text of the webpage where the target picture is located comprises:

Obtaining the text in the text field;

7. The method according to claim 6, characterized in that, after word segmentation is performed on the text, the method further comprises:

Performing word-merging processing and/or stop-word removal on the segmented text.

8. A keyword extraction device, characterized in that the device comprises:

A determining module, configured to determine the feature value of a descriptive feature of a candidate word in the candidate word set, wherein the descriptive feature includes an image-text feature, and the feature value of the image-text feature is used to characterize the degree of similarity between the candidate word and the target picture;

An extraction module, configured to extract the keyword corresponding to the target picture from the candidate word set according to the feature value of the descriptive feature of the candidate word.

9. An electronic device, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:

10. A non-transitory computer-readable storage medium, characterized in that, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform a keyword extraction method, the method comprising: