CN110083729A - Method and system for image search - Google Patents

Method and system for image search

Info

Publication number
CN110083729A
CN110083729A
Authority
CN
China
Prior art keywords
feature
search
initial
target image
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910345750.5A
Other languages
Chinese (zh)
Other versions
CN110083729B (en)
Inventor
李长亮
廖敏鹏
宋振旗
唐剑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Kingsoft Interactive Entertainment Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Original Assignee
Chengdu Kingsoft Interactive Entertainment Co Ltd
Beijing Jinshan Digital Entertainment Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Kingsoft Interactive Entertainment Co Ltd, Beijing Jinshan Digital Entertainment Technology Co Ltd filed Critical Chengdu Kingsoft Interactive Entertainment Co Ltd
Priority to CN201910345750.5A priority Critical patent/CN110083729B/en
Publication of CN110083729A publication Critical patent/CN110083729A/en
Application granted granted Critical
Publication of CN110083729B publication Critical patent/CN110083729B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/55: Clustering; Classification
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval characterised by using metadata automatically derived from the content
    • G06F16/5866: Retrieval characterised by using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a method and system for image search. The method includes: upon obtaining a search instruction, matching the search sentence and/or search words of the search instruction against a database, where the database stores target images together with labels generated from those target images; and outputting the target image corresponding to the matched label. Because the database contains labels that are descriptive sentences of the target images, and a descriptive sentence carries a relatively complete semantic description of the image scene, a user can likewise find a target image through a semantically similar descriptive sentence. By supporting sentence-based retrieval, the method not only enriches the ways of searching for images, but also improves search efficiency and quality, and enhances the user's image-search experience.

Description

Method and system for image search
Technical field
This application relates to the field of computer technology, and in particular to a method and system for image search, a computing device, a storage medium, and a chip.
Background technique
Image search retrieves images by inputting words or sentences similar to an image's name or content, and outputs the retrieved images for the user to use.
With the spread of Internet applications, users' demand scenarios for images keep growing. For example, users can upload images over the network, and vendors can crawl images from the network. In most cases, however, these images have no labels; they are hard to find on the network, and image resources are therefore wasted.
In the prior art, a picture contains complex semantic information. If a user wants more accurate results and therefore searches with a descriptive sentence, the vendor must manually annotate the images in the database with corresponding sentences in advance. Manual sentence annotation is cumbersome and error-prone, and is highly inefficient when large numbers of images need to be annotated.
Summary of the invention
In view of this, embodiments of the present application provide a method and system for image search, a computing device, a storage medium, and a chip, to solve the technical deficiencies in the prior art.
An embodiment of the present application provides an image search method, the method comprising:
upon obtaining a search instruction, matching the search sentence and/or search words of the search instruction against a database, where the database stores target images and labels generated from the target images;
outputting the target image corresponding to the matched label.
Optionally, the method further includes:
generating a descriptive sentence corresponding to a target image;
obtaining keywords from the descriptive sentence;
using the keywords and/or the descriptive sentence as the label of the target image, and storing the target image and the label in the database.
Optionally, generating the descriptive sentence corresponding to the target image comprises:
encoding the target image to obtain corresponding encoded features and a global pooled feature;
obtaining an initial aggregated feature according to the encoded features, the global pooled feature, and an initial reference feature of a first language model; inputting the initial aggregated feature into a second language model to generate an initial reference feature of the second language model; and generating the 1st output word according to the initial reference feature of the second language model;
obtaining the t-th aggregated feature according to the encoded features, the global pooled feature, and the t-th output word, and inputting the t-th aggregated feature into the second language model to generate the t-th reference feature of the second language model and obtain the (t+1)-th output word, until an iteration termination condition is met, where t ≥ 1 and t is a positive integer;
generating the descriptive sentence corresponding to the target image according to the 1st through the (t+1)-th output words.
Optionally, encoding the target image to obtain the corresponding encoded features and global pooled feature comprises:
encoding the target image with a convolutional neural network model to obtain the corresponding encoded features;
pooling the encoded features with a pooling layer of the convolutional neural network model to obtain the corresponding global pooled feature.
Optionally, obtaining the initial aggregated feature according to the encoded features, the global pooled feature, and the initial reference feature of the first language model comprises:
processing the encoded features according to the global pooled feature and the initial reference feature of the first language model to obtain an initial local feature;
aggregating the initial local feature and the initial reference feature to obtain the initial aggregated feature.
Optionally, obtaining the t-th aggregated feature according to the encoded features, the global pooled feature, and the t-th output word, inputting the t-th aggregated feature into the second language model to generate the t-th reference feature of the second language model, and obtaining the (t+1)-th output word until the iteration termination condition is met, comprises:
S1: inputting the t-th output word into the first language model to obtain the t-th non-initial reference feature of the first language model;
S2: processing the encoded features according to the global pooled feature and the t-th non-initial reference feature to obtain the t-th local feature;
S3: aggregating the t-th local feature and the t-th non-initial reference feature to obtain the t-th aggregated feature;
S4: inputting the t-th aggregated feature into the second language model to generate the t-th non-initial reference feature of the second language model, and generating the (t+1)-th output word according to it;
S5: judging whether the iteration termination condition is met; if not, executing step S6; if so, ending;
S6: incrementing t by 1 and returning to step S1.
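Under stated assumptions, the loop S1–S6 can be sketched as plain control flow. The functions first_lm, grid_select, aggregate, and second_lm are hypothetical stand-ins (toy arithmetic stubs, not the patent's models) for the first language model, grid selector, aggregation step, and second language model; only the loop structure mirrors the claimed steps, and the termination condition is assumed to be a maximum sentence length.

```python
def decode(encode_feats, global_feats, first_word, max_words=5):
    """Iterate steps S1-S6 until the termination condition is met."""
    words = [first_word]
    t = 1
    while True:
        ref_t = first_lm(words[-1])                               # S1: t-th reference feature
        local_t = grid_select(encode_feats, global_feats, ref_t)  # S2: t-th local feature
        agg_t = aggregate(local_t, ref_t)                         # S3: t-th aggregated feature
        words.append(second_lm(agg_t))                            # S4: (t+1)-th output word
        if len(words) > max_words:                                # S5: termination condition
            break
        t += 1                                                    # S6: increment t, loop back
    return words

# Toy stand-ins so the loop runs end to end.
first_lm = lambda w: w + 1
grid_select = lambda enc, glob, ref: enc + glob + ref
aggregate = lambda loc, ref: loc * ref
second_lm = lambda agg: agg % 10
```

With the stub functions above, decoding from word 0 simply alternates between two toy "words" until the length cap is reached.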
Optionally, obtaining keywords from the descriptive sentence comprises: scoring the words of the descriptive sentence against the database with the term frequency-inverse document frequency (TF-IDF) algorithm, and taking the words whose score exceeds a score threshold as keywords.
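A minimal sketch of this keyword-selection step, assuming a plain TF-IDF scoring with smoothed IDF (the smoothing constant and the threshold value are illustrative choices, not specified by the application):

```python
import math

def tfidf_keywords(sentence_words, corpus, threshold=0.1):
    """Score each word of a descriptive sentence by TF-IDF against a corpus
    of documents (lists of words) and keep words scoring above the threshold."""
    n_docs = len(corpus)
    scores = {}
    for w in set(sentence_words):
        tf = sentence_words.count(w) / len(sentence_words)        # term frequency
        df = sum(1 for doc in corpus if w in doc)                 # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1               # smoothed inverse document frequency
        scores[w] = tf * idf
    return [w for w, s in scores.items() if s > threshold]
```

For example, against a two-document corpus, words that appear in other documents (such as "a" or "dog") score low, while words unique to the sentence ("brown", "fast") exceed a threshold of 0.3 and are kept as keywords.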
Optionally, matching the search sentence and/or search words of the search instruction against the database comprises: performing similarity matching between the search sentence and/or search words in the search instruction and the descriptive sentences and/or keywords in the database;
and outputting the target image corresponding to the matched label comprises: determining the descriptive sentences and/or keywords whose similarity to the search sentence and/or search words exceeds a threshold, and outputting the target images corresponding to the determined descriptive sentences and/or keywords.
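The similarity matching above can be sketched, under stated assumptions, as cosine similarity between bag-of-words vectors; the similarity measure and the threshold of 0.5 are illustrative assumptions, since the application does not fix a particular metric:

```python
from collections import Counter
import math

def cosine_sim(a_words, b_words):
    """Cosine similarity between two bag-of-words representations."""
    a, b = Counter(a_words), Counter(b_words)
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query_words, database, threshold=0.5):
    """Return the ids of images whose label sentence is similar enough
    to the query sentence; database is a list of (image_id, label_words)."""
    hits = []
    for image_id, label_words in database:
        if cosine_sim(query_words, label_words) > threshold:
            hits.append(image_id)
    return hits
```

A query such as "dog runs on grass" then matches an image labeled "a dog runs on grass" but not one labeled "city night skyline".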
An embodiment of the present application provides an image search system, the system comprising:
a matching module configured to, upon obtaining a search instruction, match the search sentence and/or search words of the search instruction against a database, where the database stores target images and labels corresponding to the target images;
an image output module configured to output the target image corresponding to the matched label.
An embodiment of the present application provides a computing device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, where the processor, when executing the instructions, implements the steps of the image search method described above.
An embodiment of the present application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the image search method described above.
An embodiment of the present application provides a chip storing computer instructions that, when executed by the chip, implement the steps of the image search method described above.
The method and system for image search provided by the present application store target images, together with labels generated from those target images, in a database; upon obtaining a search instruction, the search sentence and/or search words of the search instruction are matched against the database, and the target image corresponding to the matched label is output. Because the database contains labels that are descriptive sentences of the target images, and a descriptive sentence carries a relatively complete semantic description of the image scene, a user can likewise find a target image through a semantically similar descriptive sentence. By supporting sentence-based retrieval, the method not only enriches the ways of searching for images, but also improves search efficiency and quality, and enhances the user's image-search experience.
In addition, the application encodes and pools the target image with a convolutional neural network model to obtain the corresponding encoded features and global pooled feature, then feeds them into a decoding layer comprising a first language model, a second language model, and a grid selector, finally obtaining the label corresponding to the target image. In this way, not only can images already in the database be labeled, but newly collected images, including user-uploaded images and large numbers of online images, can be labeled promptly, stored in the database, and made available for retrieval. This accelerates database expansion, saves manual annotation cost and enterprise cost, and increases the probability that the information users want is found.
Detailed description of the invention
Fig. 1 is a structural schematic diagram of a computing device according to an embodiment of the present application;
Fig. 2 is a flow diagram of an image search method according to an embodiment of the present application;
Fig. 3 is a flow diagram of an image search method according to an embodiment of the present application;
Fig. 4 is a flow diagram of an image search method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a concrete application of an image search system according to an embodiment of the present application;
Fig. 6 is a structural schematic diagram of an image search system according to an embodiment of the present application.
Specific embodiment
Many specific details are set forth in the following description to facilitate a full understanding of the application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the application; the application is therefore not limited by the specific implementations disclosed below.
The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the one or more embodiments of this specification. The singular forms "a", "the", and "said" used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of this specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms "first", "second", and so on may be used in one or more embodiments of this specification to describe various pieces of information, the information should not be limited by these terms. These terms are only used to distinguish pieces of information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, "first" may also be called "second", and similarly "second" may also be called "first". Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
First, the terminology involved in one or more embodiments of the invention is explained:
Region of interest (ROI): in machine vision and image processing, the region to be processed, outlined on the processed image as a box, circle, ellipse, irregular polygon, or the like. In the field of image processing, the region of interest is an image region selected from the image for further processing; it is the focus of interest of the image analysis. Delineating this region can reduce processing time and increase precision.
Image caption: a comprehensive problem fusing computer vision, natural language processing, and machine learning. Given an image, it produces a natural-language sentence describing the image content; put simply, it translates a picture into a piece of descriptive text.
Affine transformation: in geometry, a linear transformation of a vector space followed by a translation, mapping it into another vector space.
Encoded features (image feats): the features obtained by inputting the target image into a convolutional neural network model for encoding.
Global pooled feature (global feats): the feature obtained by passing the encoded features through a pooling layer. A pooling layer effectively reduces the size of the parameter matrices and thus the number of parameters.
Local feature (local feats): the feature obtained at the current time step by inputting the global pooled feature, the encoded features, and the reference feature of the first language model into the grid selector for ROI processing.
Aggregated feature: the feature generated by aggregating the local feature output by the grid selector at the current time step with the reference feature output by the first language model.
Reference feature: a feature output by the first language model or the second language model.
Term frequency-inverse document frequency (TF-IDF): a common weighting technique for information retrieval and data mining, where TF denotes term frequency and IDF denotes inverse document frequency. The TF-IDF algorithm yields a score for each word or phrase that characterizes its frequency of occurrence. If a word or phrase appears frequently in one article but rarely in others, it is considered to have good class-discrimination ability and to be suitable for classification.
This application provides a method and system for image search, a computing device, a storage medium, and a chip, each described in detail in the following embodiments.
Fig. 1 shows a structural block diagram of a computing device 100 according to an embodiment of this specification. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 through a bus 130, and a database 150 is used for saving data.
The computing device 100 further includes an access device 140 that enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any kind of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near-field communication (NFC) interface, and so on.
In an embodiment of this specification, the above components of the computing device 100, as well as other components not shown in Fig. 1, may also be connected to each other, for example through a bus. It should be understood that the structural block diagram of the computing device shown in Fig. 1 is for exemplary purposes only and does not limit the scope of this specification. Those skilled in the art may add or replace other components as needed.
The computing device 100 can be any type of static or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, and so on), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smartwatch, smart glasses, and so on), another type of mobile device, or a static computing device such as a desktop computer or a PC. The computing device 100 can also be a mobile or stationary server.
The processor 120 can execute the steps of the method shown in Fig. 2. Fig. 2 is a schematic flow chart of the image search method according to an embodiment of the application, comprising the following steps 201 to 202.
201: upon obtaining a search instruction, matching the search sentence and/or search words of the search instruction against a database, where the database stores target images and labels generated from the target images.
Upon obtaining the search instruction, the method further includes: parsing the search instruction to obtain the search sentence and/or search words in the search instruction.
It should be understood that the search sentence and/or search words in the search instruction can be generated by the user through various input methods, for example through keyboard input commands, or by recognizing the user's voice input.
Specifically, matching the search sentence and/or search words of the search instruction against the database comprises: performing similarity matching between the search sentence and/or search words in the search instruction and the descriptive sentences and/or keywords in the database.
Specifically, in this embodiment, storing the target image and the label generated from the target image in the database includes the following steps 301 to 303, referring to Fig. 3:
301: generating a descriptive sentence corresponding to the target image.
Here, the target image refers to any image resource the enterprise can obtain, including user-uploaded images, the enterprise's own images, crawled images, and so on.
Specifically, step 301 includes the following steps S301 to S304:
S301: encoding the target image to obtain the corresponding encoded features and global pooled feature.
Specifically, step S301 includes: encoding the target image with a convolutional neural network model to obtain the corresponding encoded features; and pooling the encoded features with a pooling layer of the convolutional neural network model to obtain the corresponding global pooled feature.
In this embodiment, a CNN (convolutional neural network) model can be used to encode the target image, obtaining encoded features corresponding to the entire target image. For the specific structure, pretrained network models such as ResNet (residual network) or VGG (Visual Geometry Group network) can be used.
Pooling can take several forms; common pooling operations are max pooling and average pooling. Through the pooling operation, the global pooled feature (global feats) of the target image is obtained.
In this embodiment, after the target image is encoded by the convolutional neural network model, not only are the encoded features input to the subsequent decoding layer for decoding, but the global pooled feature obtained by further pooling is also input to the decoding layer together with the encoded features. This ensures that image information is used more effectively during decoding, and that the result chosen when selecting the region of interest (ROI) is more accurate.
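As a minimal illustration of the pooling step, global average pooling over an already-encoded C×H×W feature map can be sketched in plain Python; the feature values here are arbitrary stand-ins for what a real encoder such as ResNet or VGG would produce:

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W encoded feature map into a length-C global vector
    by averaging over all spatial positions of each channel."""
    pooled = []
    for channel in feature_map:                       # one H x W grid per channel
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled
```

Max pooling would take the maximum per channel instead of the mean; either way the spatial dimensions collapse, which is how the pooling layer reduces the number of parameters downstream.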
S302: obtaining an initial aggregated feature according to the encoded features, the global pooled feature, and the initial reference feature of the first language model; inputting the initial aggregated feature into the second language model to generate the initial reference feature of the second language model; and generating the 1st output word according to the initial reference feature of the second language model.
It should be understood that the initial reference feature of the first language model is generated as follows: an initialization word is input to the first language model, and the 1st output feature of the first language model is taken as the initial reference feature.
The initialization word can be a manually set initial value.
Specifically, step S302 includes the following steps S3021 to S3022:
S3021: processing the encoded features according to the global pooled feature and the initial reference feature of the first language model to obtain an initial local feature.
Specifically, processing the encoded features in step S3021 to obtain the initial local feature comprises: obtaining an initial affine transformation matrix according to the global pooled feature and the initial reference feature of the first language model; and performing an affine transformation on the encoded features according to the initial affine transformation matrix to obtain the initial local feature.
An affine transformation means that, in geometry, a vector space undergoes a linear transformation followed by a translation, transforming it into another vector space.
Specifically, the initial local feature is output by the grid selector (Grid selector). For example, an initial 2×3 affine transformation matrix is generated in step S3021, and the encoded features are then selected using this 2×3 initial affine transformation matrix, obtaining the corresponding initial local feature and thereby realizing the selection of the region of interest (ROI) of the image.
The grid selector, acting as an underlying component, implements the selection of the region of interest (ROI).
S3022: aggregating the initial local feature and the initial reference feature to obtain the initial aggregated feature.
Specifically, step S3022 includes: performing an association-degree calculation on the initial local feature to obtain the processed associated initial local feature; and concatenating the associated initial local feature with the initial reference feature to obtain the initial aggregated feature.
More specifically, the initial local feature and the initial reference feature of the first language model are each multiplied by a corresponding weight coefficient and then added, obtaining an initial intermediate vector matrix; the hyperbolic tangent of the initial intermediate vector matrix is multiplied by a corresponding weight coefficient to obtain the attention initial weight coefficients; and the associated initial local feature is obtained according to the attention initial weight coefficients and the initial local features.
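A 2×3 affine matrix of the kind the grid selector produces acts on 2-D coordinates as a linear map plus a translation. The following sketch applies such a matrix to sample points; the matrix values are illustrative, since in the application the matrix is predicted from the features rather than fixed:

```python
def apply_affine(matrix_2x3, points):
    """Apply a 2x3 affine transform [[a, b, tx], [c, d, ty]] to (x, y) points:
    a linear transformation followed by a translation, as used to warp
    sampling coordinates when selecting a region of interest."""
    (a, b, tx), (c, d, ty) = matrix_2x3
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]
```

For example, the identity map with a translation of (2, 3) shifts every sampling point by that offset, which corresponds to moving the selected region within the encoded feature grid.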
Here, the hyperbolic tangent is, computationally, the ratio of the hyperbolic sine to the hyperbolic cosine, i.e. tanh(x) = sinh(x)/cosh(x).
Since sinh(x) = (e^x − e^(−x))/2 and cosh(x) = (e^x + e^(−x))/2, the hyperbolic tangent is defined as tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)).
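The definition above can be checked numerically against the library implementation:

```python
import math

def tanh_from_definition(x):
    """tanh computed directly from its definition:
    (e^x - e^-x) / (e^x + e^-x), i.e. sinh(x) / cosh(x)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))
```

For moderate inputs this agrees with math.tanh to machine precision (for very large |x| the naive form would overflow, which is why library implementations compute it differently).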
Specifically, the attention initial weight coefficients can be obtained by the following formula (1):
α_{i,1} = w_a^T · tanh(W_{va} v_i + W_{ha} h_1^1)    (1)
where α_{i,1} is the attention initial weight coefficient;
W_{va}, W_{ha}, and w_a are weight parameters, with W_{va} ∈ R^{H×V}, W_{ha} ∈ R^{H×M}, and w_a ∈ R^H;
h_1^1 is the initial reference feature of the first language model;
v_i is the i-th initial local feature.
Specifically, the associated initial local feature can be obtained by the following formula (2):
v̂_1 = Σ_{i=1}^{k} α_{i,1} · v_i    (2)
where α_{i,1} is the attention initial weight coefficient;
v_i is the i-th initial local feature, i = [1, k];
v̂_1 is the associated initial local feature.
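The two formulas can be sketched in plain Python as a scoring pass followed by a weighted sum. Note one assumption: the scores are used raw here, exactly as formula (1) prints them, whereas attention implementations in practice usually normalize them with a softmax first; the tiny matrices in the example are arbitrary:

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def attention_weights(W_va, W_ha, w_a, local_feats, h_ref):
    """Formula (1): alpha_i = w_a^T tanh(W_va v_i + W_ha h) per local feature."""
    scores = []
    for v in local_feats:
        hidden = [math.tanh(a + b)
                  for a, b in zip(matvec(W_va, v), matvec(W_ha, h_ref))]
        scores.append(sum(w * h for w, h in zip(w_a, hidden)))
    return scores

def attended_local(alphas, local_feats):
    """Formula (2): weighted sum of the local features."""
    dim = len(local_feats[0])
    return [sum(a * v[j] for a, v in zip(alphas, local_feats))
            for j in range(dim)]
```

With one-dimensional toy weights, a local feature that drives the tanh toward 1 dominates the weighted sum, which is the intended effect: the reference feature steers which local (ROI) features are emphasized.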
It should be noted that the initial local feature is the feature vector output by the grid selector (Grid selector) when the global pooled feature, the initial reference feature of the first language model, and the encoded features are input to it, while the initial reference feature is the feature vector output by the first language model when the initialization word is input to it. For the initial local feature and the initial reference feature of the first language model to be aggregated, the two feature vectors must have the same dimensionality. The initial local feature is therefore converted into the associated initial local feature, yielding a one-dimensional vector; the two one-dimensional vectors can then be concatenated directly to obtain the corresponding initial aggregated feature.
For example, concatenating two one-dimensional vectors a and b yields the vector A = [a, b].
Through this step S3022, the image information and the text information are fused; the result is then input to the second language model to generate the initial reference feature of the second language model, from which the 1st output word is obtained.
S303: obtaining the t-th aggregated feature according to the encoded features, the global pooled feature, and the t-th output word; inputting the t-th aggregated feature into the second language model to generate the t-th reference feature of the second language model; and obtaining the (t+1)-th output word, until the iteration termination condition is met, where t ≥ 1 and t is a positive integer.
Specifically, referring to Fig. 4, step S303 includes the following steps 401 to 406:
401: inputting the t-th output word into the first language model to obtain the t-th non-initial reference feature of the first language model.
Specifically, the first language model can be an LSTM (long short-term memory) model.
An LSTM (long short-term memory) model is a kind of time-recurrent neural network, suited to processing and predicting important events with relatively long intervals and delays in a time series. An LSTM model can be used to connect previous information to the current task, for example using past sentences to inform the understanding of the current sentence.
Upon receiving the t-th output word, the LSTM model obtains the t-th non-initial reference feature of the first language model according to the t-th output word and the (t−1)-th non-initial reference feature obtained in the previous step.
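The recurrence just described, new state from current input plus previous state, can be sketched as a minimal single-unit LSTM cell. This is a toy stand-in for the first language model, not the application's actual network: scalar input and state, with hypothetical per-gate (input-weight, hidden-weight, bias) triples in W:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a minimal single-unit LSTM cell: the new hidden state h
    (the 'reference feature') depends on the current input x (the t-th
    output word) and the previous states h_prev, c_prev."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g                                         # new cell state
    h = o * math.tanh(c)                                           # new hidden state
    return h, c
```

The gates let the cell carry information across long gaps in the sequence, which is why LSTMs suit events with long intervals and delays.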
402: process the coding feature according to the global pooling feature and the t-th non-initial reference feature to obtain the t-th local feature.
Specifically, the local feature is obtained through the grid selector (Grid selector), which characterizes the selection of a region of interest (ROI) of the image. Compared with the prior art, the present embodiment selects the ROI in the decoding layer, and the selection range of the ROI can change according to the non-initial reference feature input each time, so that the image information can be chosen more flexibly.
Specifically, step 402 includes: obtaining the t-th affine transformation matrix according to the global pooling feature and the t-th non-initial reference feature; and performing an affine transformation on the coding feature according to the t-th affine transformation matrix to obtain the t-th local feature.
Specifically, the t-th local feature is obtained through the grid selector. For example, a 2×3 t-th affine transformation matrix is generated, and the coding feature is then selected using this 2×3 affine transformation matrix to obtain the corresponding t-th local feature, thereby realizing the selection of the ROI of the image.
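The affine-transformation-based selection can be sketched as a spatial-transformer-style grid sampling in NumPy. The nearest-neighbor sampling scheme, the output size and the example matrix are assumptions for illustration, since the patent only states that a 2×3 affine matrix is applied to the coding feature:

```python
import numpy as np

def select_roi(feat, theta, out_h=4, out_w=4):
    """Apply a 2x3 affine matrix `theta` to a regular sampling grid in
    normalized [-1, 1] coordinates and sample the feature map `feat`
    (H, W, C) with nearest-neighbor lookup. The sampling scheme is an
    assumption; the patent only says the coding feature is "selected"
    via the affine matrix."""
    H, W, _ = feat.shape
    ys = np.linspace(-1, 1, out_h)
    xs = np.linspace(-1, 1, out_w)
    out = np.zeros((out_h, out_w, feat.shape[2]))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            sx, sy = theta @ np.array([x, y, 1.0])  # warped source coords
            # Map from [-1, 1] back to pixel indices, clamped to bounds.
            px = int(round((sx + 1) / 2 * (W - 1)))
            py = int(round((sy + 1) / 2 * (H - 1)))
            px, py = np.clip(px, 0, W - 1), np.clip(py, 0, H - 1)
            out[i, j] = feat[py, px]
    return out

feat = np.arange(8 * 8 * 2, dtype=float).reshape(8, 8, 2)  # toy coding feature
# Zoom into the central half of the feature map: scale 0.5, no translation.
theta = np.array([[0.5, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])
local = select_roi(feat, theta)  # the "local feature" for this step
```

Because the 2×3 matrix changes at every decoding step (it is regenerated from the current reference feature), the sampled region can move and rescale from word to word, which is what allows the ROI to change during generation.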
403: aggregate the t-th local feature and the t-th non-initial reference feature to obtain the t-th aggregation feature.
Specifically, step 403 includes: performing an association-degree calculation on the t-th local feature to obtain the processed t-th associated local feature; and splicing the t-th associated local feature with the t-th non-initial reference feature to obtain the t-th aggregation feature.
Specifically, the t-th local feature and the t-th non-initial reference feature of the first language model are each multiplied by the corresponding weight parameter and then added to obtain an intermediate vector matrix; the tanh value of the intermediate vector matrix is multiplied by the corresponding weight parameter to obtain the attention weight coefficient; and the t-th associated local feature is obtained according to the attention weight coefficient and the t-th local feature.
Specifically, the attention weight coefficient can be obtained by the following formula (3):

α_{i,t} = W_a^T tanh(W_{va} v_i + W_{ha} h_t^1)  (3)

where α_{i,t} represents the attention weight coefficient;
W_{va}, W_{ha} and W_a are weight parameters, with W_{va} ∈ R^{H×V}, W_{ha} ∈ R^{H×M} and W_a ∈ R^H;
h_t^1 represents the t-th non-initial reference feature of the first language model;
v_i represents the t-th local feature.
The t-th associated local feature is obtained by the following formula (4):

v̂_t = Σ_i α_{i,t} v_i  (4)

where α_{i,t} represents the attention weight coefficient;
v_i represents the t-th local feature;
v̂_t represents the t-th associated local feature.
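Formulas (3) and (4) can be sketched numerically as follows; the feature dimensions and random values are invented for the example, and the softmax normalization over regions is a standard choice that the patent text does not spell out explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
N, V, M, Hd = 6, 5, 4, 3       # regions and feature dims (assumed sizes)
v = rng.normal(size=(N, V))    # v_i: the local features, one per region
h = rng.normal(size=M)         # h_t^1: non-initial reference feature of LSTM1
W_va = rng.normal(size=(Hd, V))
W_ha = rng.normal(size=(Hd, M))
W_a = rng.normal(size=Hd)

# Formula (3): alpha_{i,t} = W_a^T tanh(W_va v_i + W_ha h_t^1)
scores = np.array([W_a @ np.tanh(W_va @ v[i] + W_ha @ h) for i in range(N)])
# Normalize with a softmax over the regions (a standard choice; the
# patent does not state the normalization explicitly).
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Formula (4): the associated local feature is the attention-weighted
# sum of the local features, v_hat_t = sum_i alpha_{i,t} v_i.
v_hat = alpha @ v
```

The vector v_hat is then spliced with the reference feature of the first language model to form the aggregation feature, as described in step 403.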
It should be noted that the t-th local feature (local feats) is the feature vector generated by the grid selector, and the t-th non-initial reference feature is the feature vector generated by the first language model. For the two to be aggregated, the two feature vectors must have the same dimension. The local feature is therefore converted into the associated local feature, so as to generate a one-dimensional vector. The two one-dimensional vectors can then be spliced directly to obtain the corresponding t-th aggregation feature.
For example, for two one-dimensional vectors a and b, splicing them together generates the vector A = [a, b].
Through the processing of step 403, the image information and the text information can be fused, after which the subsequent steps are executed to predict the next output word.
404: input the t-th aggregation feature to the second language model to generate the t-th non-initial reference feature of the second language model, and generate the (t+1)-th output word according to the t-th non-initial reference feature of the second language model.
In the present embodiment, the second language model can be an LSTM model.
When the LSTM model receives the t-th aggregation feature, it obtains the t-th non-initial reference feature of the second language model according to the t-th aggregation feature and the (t-1)-th output word obtained the previous time.
In the present embodiment, the initial reference feature output by the first language model and the global pooling feature are processed to further generate the initial affine transformation matrix; the coding feature is then processed by the initial affine transformation matrix to obtain the initial local feature; the initial local feature and the initial reference feature output by the first language model are then used to generate the aggregation feature, and the aggregation feature is input to the second language model, so as to predict the next output word.
In step 404, generating the (t+1)-th output word according to the t-th non-initial reference feature of the second language model includes: performing classification processing on the t-th non-initial reference feature of the second language model to obtain the corresponding (t+1)-th output word.
Specifically, a classifier can use the beam search method to output the word with the maximum probability at the current moment.
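Beam search can be illustrated with a toy sketch. Here a fixed table of per-step log-probabilities stands in for the classifier output, which is a simplification: a real decoder rescores each candidate word conditioned on the partial sequence generated so far.

```python
import math

def beam_search(step_logprobs, beam_width=2):
    """Toy beam search over a fixed table step_logprobs[t][w] of
    log-probabilities (T steps, vocabulary indexed by w). Keeps the
    `beam_width` best partial sequences at each step and returns the
    highest-scoring full (sequence, score) pair."""
    beams = [([], 0.0)]
    for dist in step_logprobs:
        candidates = []
        for seq, score in beams:
            for w, lp in enumerate(dist):
                candidates.append((seq + [w], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]   # prune to the best hypotheses
    return beams[0]

# Two decoding steps over a 3-word vocabulary (probabilities invented).
table = [
    [math.log(0.5), math.log(0.3), math.log(0.2)],
    [math.log(0.1), math.log(0.2), math.log(0.7)],
]
best_seq, best_score = beam_search(table)
```

With beam_width = 1 this reduces to greedy decoding, i.e. always emitting the single word with maximum probability at the current moment.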
405: judge whether the iteration termination condition is reached; if not, execute step 406; if so, terminate.
406: increment t by 1 and return to step 401.
Through the above steps 401 to 406, the output words other than the 1st output word are obtained.
S304: generate the descriptive statement corresponding to the target image according to the 1st to the (t+1)-th output words.
Taking the generated descriptive statement "an apple" as an example, the descriptive statement includes 3 output words: "one", "a" and "apple".
According to the initialization word, the initial reference feature of the first language model is obtained; the grid selector (Grid selector) then obtains, according to the coding feature (image feats), the global pooling feature (global feats) and the initial reference feature of the first language model, the aggregation feature input to the second language model; and the 1st output word "one" is obtained according to the initial reference feature output by the second language model.
The 1st output word "one" is then input to the first language model to obtain the 1st non-initial reference feature output by the first language model; the grid selector then obtains, according to the coding feature (image feats), the global pooling feature (global feats) and the 1st non-initial reference feature of the first language model, the aggregation feature input to the second language model; and the 2nd output word "a" is obtained according to the reference feature output by the second language model.
The 2nd output word "a" is then input to the first language model to obtain the 2nd non-initial reference feature output by the first language model; the grid selector then obtains, according to the coding feature (image feats), the global pooling feature (global feats) and the 2nd non-initial reference feature of the first language model, the aggregation feature input to the second language model; and the 3rd output word "apple" is obtained according to the reference feature output by the second language model.
In the present embodiment, the initial aggregation feature of the second language model is obtained according to the coding feature, the global pooling feature and the initial reference feature of the first language model, and the 1st output word is then obtained according to the initial aggregation feature of the second language model; the t-th aggregation feature of the second language model is obtained according to the coding feature, the global pooling feature and the t-th reference feature of the first language model, and the next output word is then obtained according to the t-th aggregation feature of the second language model, so as to generate the descriptive statement corresponding to the target image. Since the aggregation feature is regenerated at each step, flexible selection of the area of interest of the image is realized.
In the image description task of the prior art, a region of interest (ROI) needs to be selected first, and the ROI is then described. The ROI is generated during the encoding of the image; once encoding is completed, these regions are fixed and cannot be changed later. This limits the ability, during sentence generation, to attend to the corresponding region according to the context and semantic information. In contrast, the method of the present embodiment retains the local information of the image more completely and chooses the image information more flexibly.
302: obtain keywords according to the descriptive statement.
There are many ways to obtain keywords from the descriptive statement.
For example, the descriptive statement can be filtered by a text filtering algorithm to obtain the keywords. For instance, a tendency text filtering algorithm calculates the tendency index of the descriptive statement, then generates a corresponding weight for each word in the descriptive statement, and finally obtains the keywords.
As another example, the words in the descriptive statement can be compared against a pre-saved database by the TF-IDF (term frequency-inverse document frequency) algorithm, and the words whose score is greater than a scoring threshold are taken as keywords.
If a word rarely appears in the tag database but appears frequently in the current descriptive statement, the word is considered to have good discriminative power and to be suitable for distinguishing the target image corresponding to the current descriptive statement from other images; the word is then taken as a keyword of the target image.
In specific use, the keywords can be determined by setting an occurrence frequency threshold. For example, if the occurrence frequency of a word is lower than the set occurrence frequency threshold, the word is taken as a keyword.
Taking the descriptive statement "a child is skating" generated for the target image as an example, the words of the statement are looked up in the database by the TF-IDF algorithm; if the occurrence frequency of "skating" is finally determined to be less than the occurrence frequency threshold, "skating" in the descriptive statement is extracted as a keyword of the target image.
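The TF-IDF style of keyword extraction above can be sketched as follows; the scoring formula, the threshold value 0.3 and the toy label corpus are assumptions for the example rather than details from the patent:

```python
import math

def tfidf_keywords(statement, database, threshold=0.3):
    """Score each word of `statement` by TF-IDF against `database`
    (a list of label documents, each a list of words) and return the
    words whose score exceeds `threshold`. Rare-in-database but
    frequent-in-statement words get high IDF, hence high scores."""
    words = statement.split()
    n_docs = len(database)
    keywords = []
    for w in set(words):
        tf = words.count(w) / len(words)                 # term frequency
        df = sum(1 for doc in database if w in doc)      # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1      # smoothed IDF
        if tf * idf > threshold:
            keywords.append(w)
    return sorted(keywords)

# Toy tag database of previously stored descriptive statements.
db = [["a", "child", "is", "running"],
      ["a", "dog", "is", "running"],
      ["a", "cat", "is", "sleeping"]]
kws = tfidf_keywords("a child is skating", db)
```

Common function words such as "a" and "is" appear in every document, so their IDF is low and they fall below the threshold, while "skating", absent from the database, is kept as a keyword.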
303: take the keywords and/or the descriptive statement as the labels of the target image, and store the target image and the labels in the database.
It is to be understood that each target image may correspond to more than one label; in general it corresponds to multiple labels, including the sentence and the keywords.
Further, in the database, each label may correspond to multiple target images. For example, the keyword "skating" may correspond to multiple images, so that when this keyword is searched, multiple images can be obtained for the user to select from.
In one case, the labels and the images corresponding to the labels are stored together in one database.
In another case, the labels and the images can be stored in different databases. The tag database stores each label and the image attribute information corresponding to the label, such as an image link or an image number; the image database can then be searched according to the image attribute information corresponding to the label.
202: output the target image corresponding to the matched label.
Specifically, outputting the target image corresponding to the matched label includes: determining the descriptive statements and/or keywords whose similarity with the search statement and/or search term is greater than a threshold, and outputting the target images corresponding to the determined descriptive statements and/or keywords.
Specifically, when the search instruction contains a search statement, similarity matching is performed between the search statement and the descriptive statements in the tag database. Similarity matching between two sentences can be realized by natural language processing models, such as a convolutional neural network (CNN) model or a vector space model (VSM).
Compared with searching by keyword, directly performing sentence similarity detection between the search statement and the descriptive statements serving as labels makes the retrieval mode more intelligent and the search results more accurate.
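As an illustration of the vector space model option, the following sketch computes a bag-of-words cosine similarity between a search statement and a descriptive statement; a CNN-based matcher would instead learn the sentence representations, and the helper function here is hypothetical rather than code from the patent:

```python
import math
from collections import Counter

def cosine_similarity(s1, s2):
    """Vector space model similarity: represent each sentence as a
    bag-of-words term-frequency vector and take the cosine of the two
    vectors. Returns a value in [0, 1] for non-negative counts."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

sim = cosine_similarity("drive on road",
                        "motorcycle driver driving on the road")
matched = sim > 0.0   # in practice, compare against the preset threshold
```

Note that a pure bag-of-words match misses "drive" vs "driving"; this is one motivation for the word-vectorization and learned-model matching options mentioned in the text.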
Specifically, when the search instruction contains a search term, the search term is matched against the keywords in the tag database. Two words can be matched in various ways, for example by knowledge-graph-based matching or by a word vectorization tool such as Word2vec.
In the method of picture search of the embodiment of the present application, the descriptive statement and keywords corresponding to the target image are generated and stored in the database as the labels of the target image; when a search instruction is obtained, matching is performed in the database according to the search statement and/or search term of the search instruction, and the target image corresponding to the matched label is output. Since the labels of the target image in the database include the descriptive statement, and the descriptive statement contains a relatively complete semantic description of the image scene, the user can also find the target image through a descriptive statement with similar semantics. By supporting sentence retrieval, the method of the present application not only enriches the image search approaches, but also improves the efficiency and quality of picture search and enhances the user's image search experience.
In addition, in the method of the present embodiment, the target image is encoded and pooled by the convolutional neural network model to obtain the corresponding coding feature and global pooling feature, which are then input to the decoding layer comprising the first language model, the second language model and the grid selector for decoding, finally obtaining the labels corresponding to the image. In this way, not only can the existing images in the database be labeled, but newly collected images, including user-uploaded images and large numbers of online images, can also be labeled in time, stored in the database and made available for retrieval. This accelerates the expansion of the database, saves manual labeling cost and enterprise cost, and increases the probability that the information a user searches for is found.
For ease of understanding, the embodiment of the present application is schematically illustrated with a specific example. Referring to Fig. 5, the example is a motorcycle rider riding on the road. The image description system in Fig. 5 includes a coding layer and a decoding layer. The coding layer uses the hidden-layer output of a CNN model to obtain the coding feature (image feats) and the global pooling feature (global feats) of the target image.
The decoding layer uses 4 modules or models, namely the grid selector (Grid selector), the first language model LSTM1, the second language model LSTM2 and the classifier.
The method of picture search includes:
1) The target image is input to the CNN model, and the coding feature is obtained from the hidden-layer output of the CNN model. The global pooling feature (global feats) is obtained by pooling the coding feature.
2) The coding feature (image feats) and the global pooling feature (global feats) are input to the grid selector of the decoding layer. Then, according to the initialization word, the initial reference feature h_1^1 of LSTM1 is obtained.
3) The grid selector obtains the initial affine transformation matrix according to the global pooling feature (global feats) and the initial reference feature h_1^1, and performs an affine transformation on the coding feature (image feats) according to the initial affine transformation matrix to obtain the initial local feature. An association-degree calculation is performed on the initial local feature (local feats) to obtain the processed associated initial local feature, and the associated initial local feature is spliced with the initial reference feature to obtain the initial aggregation feature. The obtained initial aggregation feature is input to LSTM2, and LSTM2 outputs the initial reference feature h_2^1. The initial reference feature h_2^1 is input to the classifier to obtain the 1st output word "motorcycle".
4) The 1st output word "motorcycle" is input to LSTM1, which outputs the non-initial reference feature h_1^2. The grid selector obtains the affine transformation matrix according to the global pooling feature (global feats) and the non-initial reference feature h_1^2, and performs an affine transformation on the coding feature according to the affine transformation matrix to obtain the local feature. An association-degree calculation is performed on the local feature (local feats) to obtain the processed associated local feature, and the associated local feature is spliced with the non-initial reference feature to obtain the aggregation feature. The obtained aggregation feature is input to LSTM2, and LSTM2 outputs the non-initial reference feature h_2^2. The non-initial reference feature h_2^2 is input to the classifier to obtain the 2nd output word "driver".
5) In the same way, the 3rd output word "driving", the 4th output word "on", the 5th output word "the" and the 6th output word "road" are obtained.
6) According to the output words, the descriptive statement of the target image is obtained: "the motorcycle driver driving on the road".
7) The descriptive statement is compared with the database by the TF-IDF algorithm to determine the keywords corresponding to the descriptive statement, including "driving" and "motorcycle driver".
8) The descriptive statement and the keywords are taken as the labels of the target image and stored in the database together with the target image.
9) When a search instruction is obtained, the search statement and/or search term in the search instruction is parsed, the search term is matched in the database, and the target image corresponding to the matched label is output.
For example, if the search instruction contains the search term "drive", the label "driving" corresponding to the search term is found in the database, and the target image corresponding to the label "driving" is output.
As another example, suppose the preset similarity threshold between a search statement and the descriptive statement of an image label is 0.7. If the search instruction includes the search statement "drive on road", descriptive statements or keywords whose similarity with "drive on road" is greater than 0.7 are searched in the database. Optionally, a convolutional neural network (CNN) is used to calculate the similarity between "drive on road" and the descriptive statement "motorcycle driver driving on the road"; if the calculated similarity result is greater than 0.7, the target image corresponding to the descriptive statement is output.
An embodiment of the present application also provides a system of picture search. Referring to Fig. 6, the system includes:
a matching module 601, configured to, when a search instruction is obtained, perform matching in the database according to the search statement and/or search term of the search instruction, wherein the database stores target images and the labels generated according to the target images; and
an image output module 602, configured to output the target image corresponding to the matched label.
Optionally, the system further includes:
a descriptive statement generation module, configured to generate the descriptive statement corresponding to the target image;
a keyword generation module, configured to obtain keywords according to the descriptive statement; and
a storage module, configured to take the keywords and/or the descriptive statement as the labels of the target image, and store the target image and the labels in the database.
Optionally, the descriptive statement generation module specifically includes:
a coding module, configured to encode the target image to obtain the corresponding coding feature and global pooling feature;
a first output word generation module, configured to obtain the initial aggregation feature according to the coding feature, the global pooling feature and the initial reference feature of the first language model, input the initial aggregation feature to the second language model to generate the initial reference feature of the second language model, and generate the 1st output word according to the initial reference feature of the second language model;
a second output word generation module, configured to obtain the t-th aggregation feature according to the coding feature, the global pooling feature and the t-th output word, and input the t-th aggregation feature to the second language model to generate the t-th reference feature of the second language model, until the iteration termination condition is met and the (t+1)-th output word is obtained, where t ≥ 1 and t is a positive integer; and
a descriptive statement generation submodule, configured to generate the descriptive statement corresponding to the target image according to the 1st to the (t+1)-th output words.
Optionally, the coding module is specifically configured to: encode the target image by the convolutional neural network model to obtain the corresponding coding feature; and perform pooling processing on the coding feature by the pooling layer of the convolutional neural network model to obtain the corresponding global pooling feature.
Optionally, the first output word generation module is specifically configured to: process the coding feature according to the global pooling feature and the initial reference feature of the first language model to obtain the initial local feature; and aggregate the initial local feature and the initial reference feature to obtain the initial aggregation feature.
Optionally, the first output word generation module is specifically configured to: obtain the initial affine transformation matrix according to the global pooling feature and the initial reference feature of the first language model; and perform an affine transformation on the coding feature according to the initial affine transformation matrix to obtain the initial local feature.
Optionally, the first output word generation module is specifically configured to: perform an association-degree calculation on the initial local feature to obtain the processed associated initial local feature; and splice the associated initial local feature with the initial reference feature to obtain the initial aggregation feature.
Optionally, the first output word generation module is specifically configured to: multiply the initial local feature and the initial reference feature of the first language model each by the corresponding weight parameter and add them to obtain the initial intermediate vector matrix; multiply the tanh value of the initial intermediate vector matrix by the corresponding weight parameter to obtain the initial attention weight coefficient; and obtain the associated initial local feature according to the initial attention weight coefficient and the initial local feature.
Optionally, the initial reference feature of the first language model is generated by the following method: the initialization word is input to the first language model, and the 1st output feature of the first language model is obtained as the initial reference feature.
Optionally, the second output word generation module specifically includes:
a first non-initial reference feature generation module, configured to input the t-th output word to the first language model to obtain the t-th non-initial reference feature of the first language model;
a local feature generation module, configured to process the coding feature according to the global pooling feature and the t-th non-initial reference feature to obtain the t-th local feature;
an aggregation feature generation module, configured to aggregate the t-th local feature and the t-th non-initial reference feature to obtain the t-th aggregation feature;
a second non-initial reference feature generation module, configured to input the t-th aggregation feature to the second language model to generate the t-th non-initial reference feature of the second language model, and generate the (t+1)-th output word according to the t-th non-initial reference feature of the second language model;
a judgment module, configured to judge whether the iteration termination condition is reached, and if not, execute the incrementing module, and if so, terminate; and
an incrementing module, configured to increment t by 1 and return to execute the first non-initial reference feature generation module.
Optionally, the second output word generation module is specifically configured to: obtain the t-th affine transformation matrix according to the global pooling feature and the t-th non-initial reference feature; and perform an affine transformation on the coding feature according to the t-th affine transformation matrix to obtain the t-th local feature.
Optionally, the second output word generation module is specifically configured to: perform an association-degree calculation on the t-th local feature to obtain the processed t-th associated local feature; and splice the t-th associated local feature with the t-th non-initial reference feature to obtain the t-th aggregation feature.
Optionally, the second output word generation module is specifically configured to: multiply the t-th local feature and the t-th non-initial reference feature of the first language model each by the corresponding weight parameter and add them to obtain the intermediate vector matrix; multiply the tanh value of the intermediate vector matrix by the corresponding weight parameter to obtain the attention weight coefficient; and obtain the t-th associated local feature according to the attention weight coefficient and the t-th local feature.
Optionally, the second output word generation module is specifically configured to: perform classification processing on the t-th non-initial reference feature of the second language model to obtain the corresponding (t+1)-th output word.
Optionally, the keyword generation module is specifically configured to: compare the words in the descriptive statement against the database by the term frequency-inverse document frequency algorithm, and take the words whose score is greater than the scoring threshold as keywords.
Optionally, the matching module 601 is configured to: perform similarity matching between the search statement and/or search term in the search instruction and the descriptive statements and/or keywords in the database; and
the image output module 602 is configured to: determine the descriptive statements and/or keywords whose similarity with the search statement and/or search term is greater than the threshold, and output the target images corresponding to the determined descriptive statements and/or keywords.
The above is an exemplary scheme of the system of picture search of the present embodiment. It should be noted that the technical solution of the system and the technical solution of the above method of picture search belong to the same concept; for details not described in the technical solution of the system, reference may be made to the description of the technical solution of the above method of picture search.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of picture search as described above.
The above is an exemplary scheme of the computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the above method of picture search belong to the same concept; for details not described in the technical solution of the storage medium, reference may be made to the description of the technical solution of the above method of picture search.
The computer instructions include computer program code, which can be in source code form, object code form, an executable file, certain intermediate forms, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
An embodiment of the present application also provides a chip storing computer instructions which, when executed by the chip, implement the steps of the method of picture search as described above.
It should be noted that, for the sake of simple description, the foregoing method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to help illustrate the present application. The alternative embodiments do not describe all the details exhaustively, nor do they limit the invention to the specific embodiments described. Obviously, many modifications and variations can be made according to the content of this specification. These embodiments are selected and specifically described in this specification in order to better explain the principles and practical applications of the present application, so that those skilled in the art can better understand and use the present application. The present application is limited only by the claims and their full scope and equivalents.

Claims (12)

1. A method of picture search, characterized in that the method comprises:
when a search instruction is obtained, performing matching in a database according to the search statement and/or search term of the search instruction, wherein the database stores target images and labels generated according to the target images; and
outputting the target image corresponding to the matched label.
2. The method of picture search according to claim 1, characterized in that the method further comprises:
generating a descriptive statement corresponding to the target image;
obtaining keywords according to the descriptive statement; and
taking the keywords and/or the descriptive statement as the labels of the target image, and storing the target image and the labels in the database.
3. the method for picture search as claimed in claim 2, which is characterized in that the corresponding description language of the generation target image Sentence, comprising:
Target image is encoded, corresponding coding characteristic and global pool feature are obtained;
According to the initial reference feature of coding characteristic, global pool feature and first language model, initial polymerization feature is obtained, it will The initial polymerization feature is input to the initial reference feature that second language model generates second language model, and according to the second language The initial reference feature of speech model generates the 1st output word;
T-th of aggregation features is obtained according to coding characteristic, global pool feature and t-th of output word, described t-th is polymerize Feature is input to second language model and generates t-th of fixed reference feature of second language model, until meeting stopping criterion for iteration, obtains To the t+1 output word, wherein t >=1 and t are positive integer;
The corresponding descriptive statement of the target image is generated according to the 1st to the t+1 output word.
4. the method for iamge description as claimed in claim 3, which is characterized in that encode target image, corresponded to Coding characteristic and global pool feature, comprising:
Target image is encoded by convolutional neural networks model, obtains corresponding coding characteristic;
Coding characteristic is subjected to pond processing by the pond layer of convolutional neural networks model, it is special to obtain corresponding global poolization Sign.
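The pooling step of claim 4 can be sketched in plain Python: the global pooling feature is, in the common "global average pooling" reading, a per-channel mean of the encoding feature. The toy `[channels][height][width]` feature map below is a stand-in for the output of a real convolutional neural network; the claim does not fix the pooling operator, so average pooling is an illustrative assumption.

```python
def global_average_pool(feature_map):
    """Reduce a [channels][height][width] feature map to one value per channel."""
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)   # sum over all spatial positions
        count = sum(len(row) for row in channel)   # number of spatial positions
        pooled.append(total / count)
    return pooled

# A 2-channel, 2x2 toy "encoding feature":
encoding_feature = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0 -> mean 2.5
    [[0.0, 0.0], [2.0, 2.0]],   # channel 1 -> mean 1.0
]
print(global_average_pool(encoding_feature))  # [2.5, 1.0]
```

In a real implementation this reduction would be a pooling layer of the network itself (e.g. an adaptive average-pooling layer), not a separate Python loop.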
5. the method for iamge description as claimed in claim 3, which is characterized in that according to coding characteristic, global pool feature and The initial reference feature of first language model, obtains initial polymerization feature, comprising:
The coding characteristic is handled according to the initial reference feature of the global pool feature and first language model, is obtained To initial local feature;
Initial local feature and initial reference feature are carried out polymerization to handle to obtain initial polymerization feature.
6. the method for iamge description as claimed in claim 3, which is characterized in that according to coding characteristic, global pool feature with And t-th of output word obtains t-th of aggregation features, and t-th of aggregation features are input to second language model and generate second T-th of fixed reference feature of language model obtains the t+1 output word until meeting stopping criterion for iteration, comprising:
S1, t-th of output word is input to first language model, obtains t-th of non-initial fixed reference feature of first language model;
S2, the coding characteristic is handled according to the global pool feature and t-th of non-initial fixed reference feature, obtains T local feature;
S3, it t-th of local feature and t-th of non-initial fixed reference feature is subjected to polymerization handles to obtain t-th of aggregation features;
S4, t-th of aggregation features is input to t-th of non-initial reference spy that second language model generates second language model Sign generates the t+1 output word according to t-th of non-initial fixed reference feature of second language model;
S5, judge whether the termination condition for reaching iteration, if it is not, step S6 is executed, if so, terminating;
S6, t is added 1 certainly, returns to step S1.
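The steps S1–S6 above form a word-by-word decoding loop. The control flow can be sketched as follows; the two "language models", the local-feature computation, and the aggregation are stand-ins (simple integer arithmetic), not the patented models, so only the loop structure of claims 3 and 6 is illustrated.

```python
def decode(encoding_feature, global_pool_feature, first_word, max_steps=5):
    """Skeleton of the S1-S6 loop: features are toy integers, words are strings."""
    output_words = [first_word]  # the 1st output word, from the initial step of claim 3
    t = 1
    while True:
        # S1: feed the t-th output word to the (stand-in) first language model.
        ref_lm1 = sum(ord(c) for c in output_words[-1]) % 100
        # S2: process the encoding feature with the global pooling feature
        #     and the reference feature to get a local feature.
        local = encoding_feature + global_pool_feature + ref_lm1
        # S3: aggregate the local feature with the reference feature.
        aggregated = local + ref_lm1
        # S4: the (stand-in) second language model emits the (t+1)-th output word.
        output_words.append("w" + str(aggregated % 10))
        # S5: check the iteration stop condition (here: a simple step budget).
        if t >= max_steps:
            break
        # S6: increment t and return to S1.
        t += 1
    return output_words

print(decode(3, 2, "a", max_steps=3))  # ['a', 'w9', 'w7', 'w3']
```

In the patented scheme the stop condition would typically be an end-of-sentence token or a length limit, and the output words would be joined to form the description statement of claim 3.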
7. the method for picture search as claimed in claim 2, which is characterized in that obtain keyword according to descriptive statement, comprising: The word in descriptive statement is compared in the database by word frequency-inverse document frequency algorithm, and it is big to score In scoring threshold value word as keyword.
8. the method for picture search as claimed in claim 2, which is characterized in that according to the search statement of search instruction and/or Search term is matched in the database, comprising: by search statement and/or search term in search instruction and retouching in database Predicate sentence and/or keyword carry out similarity mode;
The corresponding target image output of the label that matching is obtained, comprising: the determining phase with described search sentence and/or search term It is greater than the descriptive statement and/or keyword of threshold value like degree, and by the descriptive statement of the determination and/or the corresponding target of keyword Image output.
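The similarity matching of claim 8 can be sketched with a simple word-overlap (Jaccard) score. The claim does not fix a particular similarity measure, so the measure, the in-memory database layout, and the threshold below are assumptions for illustration.

```python
def search(query, database, threshold):
    """Return image ids whose label's word-overlap with `query` exceeds `threshold`."""
    query_words = set(query.split())
    results = []
    for image_id, label in database.items():
        label_words = set(label.split())
        union = query_words | label_words
        # Jaccard similarity: shared words / all distinct words.
        similarity = len(query_words & label_words) / len(union) if union else 0.0
        if similarity > threshold:
            results.append(image_id)
    return results

database = {
    "img1.jpg": "a dog runs on grass",
    "img2.jpg": "a red car on a road",
}
print(search("dog runs grass", database, 0.3))  # ['img1.jpg']
```

A production system would more likely embed the query and the labels (or use an inverted index) and rank by score rather than filter by a hard threshold, but the threshold-then-output flow mirrors the two steps of claim 8.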
9. A system of image search, characterized in that the system comprises:
a matching module, configured to, when a search instruction is obtained, perform matching in a database according to a search statement and/or a search term of the search instruction, wherein the database stores target images and labels corresponding to the target images;
an image output module, configured to output the target image corresponding to the matched label.
10. A computing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, characterized in that the processor, when executing the instructions, implements the steps of the method of any one of claims 1-8.
11. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method of any one of claims 1-8.
12. A chip storing computer instructions, characterized in that the instructions, when executed by the chip, implement the steps of the method of any one of claims 1-8.
CN201910345750.5A 2019-04-26 2019-04-26 Image searching method and system Active CN110083729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910345750.5A CN110083729B (en) 2019-04-26 2019-04-26 Image searching method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910345750.5A CN110083729B (en) 2019-04-26 2019-04-26 Image searching method and system

Publications (2)

Publication Number Publication Date
CN110083729A true CN110083729A (en) 2019-08-02
CN110083729B CN110083729B (en) 2023-10-27

Family

ID=67417107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910345750.5A Active CN110083729B (en) 2019-04-26 2019-04-26 Image searching method and system

Country Status (1)

Country Link
CN (1) CN110083729B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502650A (en) * 2019-08-12 2019-11-26 深圳智能思创科技有限公司 A kind of image indexing system and method based on natural language description
CN111027622A (en) * 2019-12-09 2020-04-17 Oppo广东移动通信有限公司 Picture label generation method and device, computer equipment and storage medium
CN111754480A (en) * 2020-06-22 2020-10-09 上海华力微电子有限公司 Method for retrieving and early warning wafer back defect map, storage medium and computer equipment
CN112395441A (en) * 2019-08-14 2021-02-23 杭州海康威视数字技术股份有限公司 Object retrieval method and device
CN112632236A (en) * 2020-12-02 2021-04-09 中山大学 Improved sequence matching network-based multi-turn dialogue model
CN112651413A (en) * 2019-10-10 2021-04-13 百度在线网络技术(北京)有限公司 Integrated learning classification method, device, equipment and storage medium for vulgar graphs
CN114186023A (en) * 2021-12-07 2022-03-15 北京金堤科技有限公司 Search processing method, device, equipment and medium for specific search scene

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160155011A1 (en) * 2014-12-02 2016-06-02 Xerox Corporation System and method for product identification
CN105808757A (en) * 2016-03-15 2016-07-27 浙江大学 Chinese herbal medicine plant picture retrieval method based on multi-feature fusion BOW model
CN105956610A (en) * 2016-04-22 2016-09-21 中国人民解放军军事医学科学院卫生装备研究所 Remote sensing image landform classification method based on multi-layer coding structure
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN107292875A (en) * 2017-06-29 2017-10-24 西安建筑科技大学 A kind of conspicuousness detection method based on global Local Feature Fusion
CN107491534A (en) * 2017-08-22 2017-12-19 北京百度网讯科技有限公司 Information processing method and device
CN107766894A (en) * 2017-11-03 2018-03-06 吉林大学 Remote sensing images spatial term method based on notice mechanism and deep learning
CN107967481A (en) * 2017-07-31 2018-04-27 北京联合大学 A kind of image classification method based on locality constraint and conspicuousness
CN108427738A (en) * 2018-03-01 2018-08-21 中山大学 A kind of fast image retrieval method based on deep learning
CN108694225A (en) * 2017-03-31 2018-10-23 阿里巴巴集团控股有限公司 A kind of image search method, the generation method of feature vector, device and electronic equipment
CN108805260A (en) * 2017-04-26 2018-11-13 上海荆虹电子科技有限公司 Image caption generation method and device
CN109190619A (en) * 2018-08-23 2019-01-11 重庆大学 A kind of Image Description Methods based on target exposure mask
CN109635135A (en) * 2018-11-30 2019-04-16 Oppo广东移动通信有限公司 Image index generation method, device, terminal and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUDA A. AL-MUZAINI et al.: "Automatic Arabic image captioning using RNN-LSTM-based language model and CNN", International Journal of Advanced Computer Science and Applications, vol. 09, no. 06, 31 January 2018 (2018-01-31), pages 67-73 *
DONG Suhui: "Zero-shot learning based on a sparse-coding spatial pyramid model", Journal of Jiangsu University (Natural Science Edition), pages 696-702 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502650A (en) * 2019-08-12 2019-11-26 深圳智能思创科技有限公司 A kind of image indexing system and method based on natural language description
CN112395441A (en) * 2019-08-14 2021-02-23 杭州海康威视数字技术股份有限公司 Object retrieval method and device
CN112651413A (en) * 2019-10-10 2021-04-13 百度在线网络技术(北京)有限公司 Integrated learning classification method, device, equipment and storage medium for vulgar graphs
CN112651413B (en) * 2019-10-10 2023-10-17 百度在线网络技术(北京)有限公司 Integrated learning classification method, device, equipment and storage medium for hypo-custom graph
CN111027622A (en) * 2019-12-09 2020-04-17 Oppo广东移动通信有限公司 Picture label generation method and device, computer equipment and storage medium
CN111027622B (en) * 2019-12-09 2023-12-08 Oppo广东移动通信有限公司 Picture label generation method, device, computer equipment and storage medium
CN111754480A (en) * 2020-06-22 2020-10-09 上海华力微电子有限公司 Method for retrieving and early warning wafer back defect map, storage medium and computer equipment
CN111754480B (en) * 2020-06-22 2024-04-16 上海华力微电子有限公司 Crystal back defect map retrieval and early warning method, storage medium and computer equipment
CN112632236A (en) * 2020-12-02 2021-04-09 中山大学 Improved sequence matching network-based multi-turn dialogue model
CN114186023A (en) * 2021-12-07 2022-03-15 北京金堤科技有限公司 Search processing method, device, equipment and medium for specific search scene
CN114186023B (en) * 2021-12-07 2023-05-26 北京金堤科技有限公司 Search processing method, device, equipment and medium for specific search scene

Also Published As

Publication number Publication date
CN110083729B (en) 2023-10-27

Similar Documents

Publication Publication Date Title
CN110083729A Image search method and system
CN110083705B (en) Multi-hop attention depth model, method, storage medium and terminal for target emotion classification
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN108959396B (en) Machine reading model training method and device and question and answer method and device
CN110309839B Image description method and device
CN110046656B (en) Multi-mode scene recognition method based on deep learning
CN111914085B (en) Text fine granularity emotion classification method, system, device and storage medium
CN107783960A (en) Method, apparatus and equipment for Extracting Information
CN104598611B Method and system for ranking search entries
CN107832432A (en) A kind of search result ordering method, device, server and storage medium
CN111242033B (en) Video feature learning method based on discriminant analysis of video and text pairs
CN110457514A (en) A kind of multi-tag image search method based on depth Hash
CN110084250A Image description method and system
CN112101437A (en) Fine-grained classification model processing method based on image detection and related equipment thereof
CN109977250A (en) Merge the depth hashing image search method of semantic information and multistage similitude
CN112084307B (en) Data processing method, device, server and computer readable storage medium
CN114693397A (en) Multi-view multi-modal commodity recommendation method based on attention neural network
CN113806580B (en) Cross-modal hash retrieval method based on hierarchical semantic structure
CN112380319A (en) Model training method and related device
CN110134967A Text processing method and device, computing device, and computer-readable storage medium
CN116304066A (en) Heterogeneous information network node classification method based on prompt learning
CN113656700A (en) Hash retrieval method based on multi-similarity consistent matrix decomposition
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN107562729B (en) Party building text representation method based on neural network and theme enhancement
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant