CN110083729B - Image searching method and system - Google Patents


Info

Publication number
CN110083729B
CN110083729B
Authority
CN
China
Prior art keywords
feature
features
language model
initial reference
initial
Prior art date
Legal status
Active
Application number
CN201910345750.5A
Other languages
Chinese (zh)
Other versions
CN110083729A (en)
Inventor
李长亮
廖敏鹏
宋振旗
唐剑波
Current Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Chengdu Kingsoft Interactive Entertainment Technology Co ltd
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Kingsoft Interactive Entertainment Technology Co ltd, Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN201910345750.5A
Publication of CN110083729A
Application granted
Publication of CN110083729B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an image searching method and system. The method comprises: when a search instruction is obtained, matching a search sentence and/or a search word of the search instruction in a database, wherein the database stores target images and labels generated from the target images; and outputting the target image corresponding to the matched label. Because the database contains labels consisting of description sentences of the target images, and the description sentences carry relatively complete semantic descriptions of the image scenes, a user can find a target image through a description sentence with similar semantics. The method supports sentence-based search, which not only enriches the ways of searching for images but also improves the efficiency and quality of image search, thereby enhancing the user's image-search experience.

Description

Image searching method and system
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and system for searching images, a computing device, a storage medium, and a chip.
Background
Image searching refers to searching by inputting words or sentences similar to the names or contents of images, and outputting the matched images to the user.
With the popularization of internet applications, users' demand for images keeps growing. For example, users may upload images over the network, and vendors may crawl images from the network. In most cases, however, these images carry no tags, which makes them difficult to find through search and causes image resources to be wasted.
In the prior art, a single picture contains complex semantic information. If a user wants a more accurate result, a description sentence has to be used to search for the image, which requires the vendor to manually label the corresponding sentences on the images in the database in advance. Manual sentence labeling, however, is tedious, error-prone, and inefficient when large-scale image collections need to be labeled.
Disclosure of Invention
In view of the above, the embodiments of the present application provide a method and a system for searching images, a computing device, a storage medium and a chip, so as to solve the technical defects existing in the prior art.
The embodiment of the application provides a method for searching images, which comprises the following steps:
under the condition that a search instruction is obtained, matching is carried out in a database according to a search statement and/or a search word of the search instruction, wherein the database stores a target image and a label generated according to the target image;
And outputting the target image corresponding to the label obtained by matching.
Optionally, the method further comprises:
generating a description sentence corresponding to the target image;
obtaining keywords according to the description sentences;
and taking the keywords and/or the descriptive sentences as labels of the target images, and storing the target images and the labels into a database.
Optionally, the generating the description sentence corresponding to the target image includes:
coding the target image to obtain corresponding coding features and global pooling features;
obtaining initial aggregation features according to the coding features, the global pooling features and the initial reference features of the first language model, inputting the initial aggregation features into the second language model to generate initial reference features of the second language model, and generating a 1 st output word according to the initial reference features of the second language model;
obtaining a t aggregation feature according to the coding feature, the global pooling feature and the t output word, inputting the t aggregation feature into a second language model to generate a t reference feature of the second language model until an iteration termination condition is met, and obtaining a t+1th output word, wherein t is more than or equal to 1 and is a positive integer;
and generating a description sentence corresponding to the target image according to the 1 st to t+1st output words.
Optionally, encoding the target image to obtain the corresponding encoding feature and the global pooling feature includes:
coding the target image through a convolutional neural network model to obtain corresponding coding characteristics;
and carrying out pooling treatment on the coding features through a pooling layer of the convolutional neural network model to obtain corresponding global pooling features.
Optionally, obtaining an initial aggregate feature according to the coding feature, the global pooling feature and the initial reference feature of the first language model, including:
processing the coding features according to the global pooling features and initial reference features of the first language model to obtain initial local features;
and carrying out aggregation treatment on the initial local features and the initial reference features to obtain initial aggregation features.
Optionally, obtaining a t aggregation feature according to the coding feature, the global pooling feature and the t output word, inputting the t aggregation feature to the second language model to generate a t reference feature of the second language model until an iteration termination condition is met, and obtaining a t+1th output word, including:
S1, inputting a t output word into a first language model to obtain a t non-initial reference feature of the first language model;
S2, processing the coding feature according to the global pooling feature and the t non-initial reference feature to obtain a t local feature;
S3, carrying out aggregation treatment on the t local feature and the t non-initial reference feature to obtain a t aggregation feature;
S4, inputting the t aggregation feature into the second language model to generate a t non-initial reference feature of the second language model, and generating a t+1th output word according to the t non-initial reference feature of the second language model;
S5, judging whether the iteration termination condition is met, if not, executing the step S6, and if yes, ending;
S6, adding 1 to t, and returning to the step S1.
Optionally, obtaining the keyword according to the description sentence includes: and comparing the words in the descriptive sentences in the database through a word frequency-inverse text frequency index algorithm, and taking the words with scores larger than a scoring threshold value as keywords.
Optionally, matching in the database according to the search statement and/or the search word of the search instruction includes: performing similarity matching on search sentences and/or search words in the search instructions and description sentences and/or keywords in the database;
outputting the target image corresponding to the label obtained by matching, wherein the target image comprises: and determining the description sentences and/or keywords with the similarity larger than a threshold value with the search sentences and/or the search words, and outputting target images corresponding to the determined description sentences and/or keywords.
The embodiment of the application provides a system for searching images, which comprises:
the matching module is configured to match in a database according to a search statement and/or a search word of the search instruction under the condition that the search instruction is obtained, wherein the database stores target images and labels corresponding to the target images;
and the image output module is configured to output the target image corresponding to the label obtained by matching.
Embodiments of the present application provide a computing device comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, which when executed, implement the steps of the method of image searching as described above.
Embodiments of the present application provide a computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of a method of image searching as described above.
The present application provides a chip storing computer instructions which, when executed by the chip, implement the steps of the method of image searching as described above.
According to the image searching method and system provided by the application, the target image and the label generated from the target image are stored in the database; when a search instruction is obtained, matching is carried out in the database according to the search sentence and/or search word of the search instruction, and the target image corresponding to the matched label is output. Because the database contains labels consisting of description sentences of the target images, and the description sentences carry relatively complete semantic descriptions of the image scenes, a user can find a target image through a description sentence with similar semantics. The method supports sentence-based search, which not only enriches the ways of searching for images but also improves the efficiency and quality of image search, thereby enhancing the user's image-search experience.
In addition, in the present application, the target image is encoded and pooled through the convolutional neural network model to obtain the corresponding coding features and global pooling features, which are then input into a decoding layer comprising the first language model, the second language model and the grid selector for decoding, finally obtaining the label corresponding to the target image. In this way, not only can the existing images in the database be labeled, but newly collected images, including images uploaded by users and the massive images available online, can also be labeled, stored in the database in time and made retrievable. This speeds up the expansion of the database, saves manual labeling cost for the enterprise, and increases the probability that users find the information they are searching for.
Drawings
FIG. 1 is a schematic diagram of a computing device in accordance with an embodiment of the application;
FIG. 2 is a flow chart of a method of image searching according to an embodiment of the present application;
FIG. 3 is a flow chart of a method of image searching according to an embodiment of the present application;
FIG. 4 is a flow chart of a method of image searching according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a specific application of a system for image searching according to an embodiment of the present application;
fig. 6 is a schematic diagram of a system for image search according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present application may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present application is not limited to the specific embodiments disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining".
First, terms related to one or more embodiments of the present invention will be explained.
Region of interest (region of interest, ROI): in machine vision and image processing, a region to be processed that is outlined from the image being processed with a box, circle, ellipse, irregular polygon or the like is called a region of interest. In the field of image processing, a region of interest (ROI) is a region selected from an image for further processing. This region is the focus of image analysis; delineating it can reduce processing time and increase accuracy.
Image description (image caption): a comprehensive problem integrating computer vision, natural language processing and machine learning. Given an image, a natural language sentence that can describe the content of the image is generated; in plain terms, a picture is translated into a piece of descriptive text.
Affine transformation: in geometry, a vector space is linearly transformed and translated to another vector space.
Coding features (image features): the features obtained by inputting the target image into a convolutional neural network model for encoding.
Global pooling features (global features): the features obtained by inputting the coding features into a pooling layer for pooling. The pooling layer can very effectively reduce the size of the parameter matrix and thus the number of parameters.
Local features (local features): the features of the current moment obtained by inputting the global pooling features, the coding features and the reference feature of the first language model into the grid selector for ROI processing.
Aggregation features: the features generated by aggregating the local features output by the grid selector at the current moment with the reference feature output by the first language model.
Reference features: the features output by the first language model and the second language model.
TF-IDF (term frequency-inverse text frequency index): a common weighting technique for information retrieval and data mining. TF refers to term frequency (Term Frequency) and IDF refers to inverse document frequency (Inverse Document Frequency). Through the TF-IDF algorithm, a score can be obtained for each word or phrase that characterizes its frequency of occurrence. If a word or phrase appears frequently in one article but rarely in other articles, the word or phrase is considered to have good category discrimination and is suitable for classification.
In the present application, a method and a system for searching an image, a computing device, a storage medium and a chip are provided, and detailed descriptions are given one by one in the following embodiments.
Fig. 1 is a block diagram illustrating a configuration of a computing device 100 according to an embodiment of the present description. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. Processor 120 is coupled to memory 110 via bus 130 and database 150 is used to store data.
Computing device 100 also includes access device 140, access device 140 enabling computing device 100 to communicate via one or more networks 160. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 140 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 100, as well as other components not shown in FIG. 1, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device shown in FIG. 1 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 100 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 100 may also be a mobile or stationary server.
The processor 120 may perform the steps of the method shown in fig. 2. Fig. 2 is a schematic flow chart illustrating a method of image searching according to an embodiment of the application, which includes the following steps 201 to 202.
201. And under the condition that the search instruction is obtained, matching is carried out in a database according to the search statement and/or the search word of the search instruction, wherein the database stores target images and labels generated according to the target images.
In the case of obtaining the search instruction, the method further includes: analyzing the search instruction to obtain search sentences and/or search words in the search instruction.
It should be explained that the search sentence and/or search word in the search instruction may be generated by the user through various input modes, for example, by typing a command on a keyboard or by recognizing input speech.
Specifically, matching in a database according to a search statement and/or a search word of a search instruction includes: and carrying out similarity matching on the search sentences and/or search words in the search instruction and the description sentences and/or keywords in the database.
Specifically, in this embodiment, the target image and the tag generated from the target image are stored in the database through the following steps 301 to 303 (see fig. 3):
301. and generating a description sentence corresponding to the target image.
The target image refers to any image resource that an enterprise can acquire, including images uploaded by users, the enterprise's own images, crawled images and the like.
Specifically, step 301 includes the following steps S301 to S304:
s301, encoding the target image to obtain corresponding encoding features and global pooling features.
Specifically, step S301 includes: coding the target image through a convolutional neural network model to obtain corresponding coding characteristics; and carrying out pooling treatment on the coding features through a pooling layer of the convolutional neural network model to obtain corresponding global pooling features.
In this embodiment, the target image may be encoded using a CNN (Convolutional Neural Network) model, and the obtained coding features correspond to the entire target image. The specific structure may adopt a pre-trained network model such as ResNet (residual network) or VGG (Visual Geometry Group network).
The pooling process may adopt various conventional pooling operations, such as max pooling or average pooling. The global pooling features (global features) of the target image are obtained through the pooling operation.
In this embodiment, after the target image is encoded by the convolutional neural network model to obtain the coding features, the coding features are not simply passed to the subsequent decoding layer for decoding; instead, they are further pooled to obtain the global pooling features, and the coding features and the global pooling features are then input into the decoding layer together. This ensures that the image information can be used more effectively during decoding, and that the selection of the region of interest (ROI) is more accurate.
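For illustration only, the following Python sketch shows how coding features and a global pooling feature could be extracted; the choice of a torchvision ResNet-50 backbone, the input size and the use of average pooling are assumptions for this example, not requirements of the method.

```python
import torch
import torchvision.models as models

# Minimal sketch: a ResNet-50 backbone (assumption) stands in for the convolutional neural network model.
resnet = models.resnet50()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop the final avgpool and fc layers

image = torch.randn(1, 3, 224, 224)                          # placeholder target image
coding_features = backbone(image)                            # coding features, shape (1, 2048, 7, 7)
global_pooling_features = coding_features.mean(dim=(2, 3))   # average pooling over space -> (1, 2048)
```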
S302, obtaining initial aggregation features according to the coding features, the global pooling features and the initial reference features of the first language model, inputting the initial aggregation features into the second language model to generate initial reference features of the second language model, and generating the 1 st output word according to the initial reference features of the second language model.
It should be explained that the initial reference features of the first language model are generated by the following method: inputting the initialization word into the first language model to obtain the 1 st output feature of the first language model as an initial reference feature.
The initialization word can be set manually as an initial value.
Specifically, step S302 includes the following steps S3021 to S3022:
s3021, processing the coding features according to the global pooling features and initial reference features of the first language model to obtain initial local features.
Specifically, in step S3021, processing the encoded feature to obtain an initial local feature includes: obtaining an initial affine transformation matrix according to the global pooling feature and the initial reference feature of the first language model; and carrying out affine transformation on the coding features according to the initial affine transformation matrix to obtain initial local features.
An affine transformation is, in geometry, a linear transformation of one vector space followed by a translation, mapping it into another vector space.
Specifically, the initial local features are output through the Grid selector. For example, in step S3021 a 2×3 initial affine transformation matrix is generated, and this 2×3 initial affine transformation matrix is then used to select from the coding features, so as to obtain the corresponding initial local features, thereby implementing the selection of the region of interest (ROI) of the image.
The grid selector (Grid selector), as an underlying component, implements the selection of the region of interest (ROI).
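The sketch below shows one plausible way to realize such a grid selector, predicting a 2×3 affine transformation matrix from the global pooling feature and the reference feature and using it to sample a region from the coding features; the layer sizes and the use of PyTorch's affine_grid/grid_sample are assumptions made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridSelector(nn.Module):
    """Illustrative grid selector: predicts a 2x3 affine matrix and samples an ROI from the coding features."""
    def __init__(self, global_dim=2048, ref_dim=512):
        super().__init__()
        self.to_theta = nn.Linear(global_dim + ref_dim, 6)  # 6 values -> 2x3 affine transformation matrix

    def forward(self, coding_features, global_feat, ref_feat):
        # coding_features: (B, C, H, W); global_feat: (B, global_dim); ref_feat: (B, ref_dim)
        theta = self.to_theta(torch.cat([global_feat, ref_feat], dim=1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, coding_features.size(), align_corners=False)
        return F.grid_sample(coding_features, grid, align_corners=False)  # local features (selected ROI)

selector = GridSelector()
local_features = selector(torch.randn(1, 2048, 7, 7), torch.randn(1, 2048), torch.randn(1, 512))
```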
S3022, performing aggregation treatment on the initial local features and the initial reference features to obtain initial aggregated features.
Specifically, step S3022 includes: performing association degree calculation on the initial local features to obtain the processed associated initial local features; and splicing the associated initial local features and the initial reference features to obtain the initial aggregation features.
Specifically, step S3022 includes: multiplying the initial local feature and the initial reference feature of the first language model by corresponding weight coefficients respectively, and then adding to obtain an initial intermediate vector matrix; multiplying the hyperbolic tangent value of the initial intermediate vector matrix by a corresponding weight coefficient to obtain an attention initial weight coefficient; and obtaining the associated initial local characteristic according to the attention initial weight coefficient and the initial local characteristic.
Wherein the hyperbolic tangent function is computationally equal to the ratio of hyperbolic sine to hyperbolic cosine, i.e., tanh (x) =sinh (x)/cosh (x).
The hyperbolic tangent function is defined as: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
Specifically, the attention initial weight coefficient can be obtained by the following formula (1):
α_{i,1} = w_a^T tanh(W_{va} v_i + W_{ha} h^1_1)    (1)
wherein α_{i,1} represents the attention initial weight coefficient; W_{va}, W_{ha} and w_a are all weight parameters, with W_{va} ∈ R^{H×V}, W_{ha} ∈ R^{H×M} and w_a ∈ R^{H}; h^1_1 represents the initial reference feature of the first language model; and v_i represents the i-th initial local feature.
Specifically, the associated initial local feature can be obtained by the following formula (2):
v̂_1 = Σ_{i=1}^{k} α_{i,1} v_i    (2)
wherein α_{i,1} represents the attention initial weight coefficient; v_i represents the i-th initial local feature, i = 1, …, k; and v̂_1 represents the associated initial local feature.
It should be noted that the initial local features are the feature vectors output by inputting the global pooling feature, the initial reference feature of the first language model and the coding features into the Grid selector, and the initial reference feature is the feature vector output by inputting the initialization word into the first language model; aggregating the initial local features with the initial reference feature of the first language model requires that the two feature vectors have the same dimensionality. Therefore, the initial local features are converted into the associated initial local feature, a one-dimensional vector. The two one-dimensional vectors are then directly spliced, giving the corresponding initial aggregation feature.
For example, splicing two one-dimensional vectors a and b generates a vector A = [a, b].
Through the process of this step S3022, the image information and the text information may be combined, and then input to the second language model to generate initial reference features of the second language model, and then the 1 st output word is obtained.
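A minimal Python sketch of formulas (1) and (2) followed by the splicing step is given below; the dimensions are illustrative, and normalizing the attention coefficients with a softmax is an assumption, since the formulas above only define the unnormalized weights.

```python
import torch

def aggregate(local_feats, ref_feat, W_va, W_ha, w_a):
    """Sketch of the aggregation step: attention over local features (formulas (1)-(2)), then splicing."""
    # local_feats: (k, V) initial local features; ref_feat: (M,) reference feature of the first language model
    scores = torch.tanh(local_feats @ W_va.T + W_ha @ ref_feat) @ w_a  # formula (1): one score per local feature
    alpha = torch.softmax(scores, dim=0)                               # softmax normalization is an assumption
    associated = alpha @ local_feats                                   # formula (2): weighted sum -> (V,)
    return torch.cat([associated, ref_feat])                           # splicing -> the aggregation feature

k, H, V, M = 49, 256, 2048, 512                                        # illustrative dimensions
agg = aggregate(torch.randn(k, V), torch.randn(M),
                torch.randn(H, V), torch.randn(H, M), torch.randn(H))
```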
S303, obtaining a t aggregation feature according to the coding feature, the global pooling feature and the t output word, inputting the t aggregation feature into a second language model to generate a t reference feature of the second language model until an iteration termination condition is met, and obtaining a t+1th output word, wherein t is more than or equal to 1 and t is a positive integer.
Specifically, referring to fig. 4, step S303 includes the following steps 401 to 406:
401. inputting the t output word into the first language model to obtain the t non-initial reference feature of the first language model.
Specifically, the first language model may be an LSTM (Long Short-Term Memory network) model.
LSTM (Long Short-Term Memory) model: is a time recurrent neural network suitable for processing and predicting important events with relatively long intervals and delays in a time series. The LSTM model may be used to link previous information to a current task, such as using past statements to infer an understanding of the current statement.
Under the condition that the LSTM model receives the t output word, the t non-initial reference feature of the first language model is obtained according to the t output word and the t-1 non-initial reference feature obtained last time.
402. And processing the coding feature according to the global pooling feature and the t non-initial reference feature to obtain the t local feature.
Specifically, local feature acquisition is achieved through the Grid selector, which realizes the selection of the region of interest (ROI) of the image. Compared with the prior art, the method and the device select the region of interest (ROI) in the decoding layer, and the selection range of the region of interest (ROI) can change each time according to the input non-initial reference feature, so that the image information can be selected more flexibly.
Specifically, the step 402 includes: obtaining a t affine transformation matrix according to the global pooling feature and the t non-initial reference feature; and carrying out affine transformation on the coding feature according to the t affine transformation matrix to obtain the t local feature.
Specifically, the acquisition of the t-th local feature is realized by the Grid selector. For example, a t-th affine transformation matrix of size 2×3 is generated, and the coding features are then selected using this matrix to obtain the corresponding t-th local feature, thereby realizing the selection of the region of interest (ROI) of the image.
403. And carrying out aggregation treatment on the t local feature and the t non-initial reference feature to obtain a t aggregation feature.
Specifically, step 403 includes: carrying out association degree calculation on the t-th local feature to obtain the processed t-th associated local feature; and splicing the t-th associated local feature and the t-th non-initial reference feature to obtain a t-th aggregation feature.
Specifically, multiplying the t local feature and the t non-initial reference feature of the first language model by corresponding weight coefficients respectively, and then adding to obtain an intermediate vector matrix; multiplying the hyperbolic tangent value of the intermediate vector matrix by a corresponding weight coefficient to obtain an attention weight coefficient; and obtaining the t-th associated local feature according to the attention weight coefficient and the t-th local feature.
Specifically, the attention weight coefficient can be obtained by the following formula (3):
α_{i,t} = w_a^T tanh(W_{va} v_i + W_{ha} h^1_t)    (3)
wherein α_{i,t} represents the attention weight coefficient; W_{va}, W_{ha} and w_a are all weight parameters, with W_{va} ∈ R^{H×V}, W_{ha} ∈ R^{H×M} and w_a ∈ R^{H}; h^1_t represents the t-th non-initial reference feature of the first language model; and v_i represents the t-th local feature.
The t-th associated local feature is obtained by the following formula (4):
v̂_t = Σ_{i=1}^{k} α_{i,t} v_i    (4)
wherein α_{i,t} represents the attention weight coefficient; v_i represents the t-th local feature, i = 1, …, k; and v̂_t represents the t-th associated local feature.
It should be noted that the t-th local features are feature vectors generated by the Grid selector, and the t-th non-initial reference feature is a feature vector generated by the first language model; aggregating them requires that the two feature vectors have the same dimensionality. Therefore, the local features are converted into the t-th associated local feature, a one-dimensional vector. The two one-dimensional vectors are then directly spliced, giving the corresponding t-th aggregation feature.
For example, splicing two one-dimensional vectors a and b generates a vector A = [a, b].
Through the process of this step 403, the image information and the text information can be combined, and then the subsequent step is performed to predict the next output word.
404. Inputting the t aggregation feature into the second language model to generate the t non-initial reference feature of the second language model, and generating the t+1 output word according to the t non-initial reference feature of the second language model.
In this embodiment, the second language model may be an LSTM model.
Under the condition that the LSTM model receives the t aggregation feature, according to the t aggregation feature and the t-1 output word obtained last time, the t non-initial reference feature of the second language model is obtained.
In this embodiment, the initial reference feature output by the first language model and the global pooling feature are processed to generate an initial affine transformation matrix; the coding features are then processed through the initial affine transformation matrix to obtain the initial local features; the initial local features and the initial reference feature output by the first language model are then used to generate the aggregation feature, and the aggregation feature is input into the second language model to realize the prediction of the next output word.
In step 404, generating a t+1st output word according to a t non-initial reference feature of the second language model, including: and classifying the t non-initial reference feature of the second language model to obtain a corresponding t+1th output word.
Specifically, the word with the highest probability at the current time may be output by a classifier using a beam search method (a toy beam-search sketch is given after step 406 below).
405. Whether the iteration termination condition is reached is judged, if not, step 406 is executed, and if yes, the process is ended.
406. t is incremented by 1, and the process returns to step 401.
Through the above steps 401 to 406, other output words than the 1 st output word are obtained.
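To make the flow of steps 401 to 406 concrete, the following minimal Python sketch runs the two language models in a loop. For brevity the grid selector and attention aggregation shown in the earlier sketches are replaced here by mean pooling, greedy argmax decoding stands in for beam search, and all module names, sizes and token ids are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TwoLSTMDecoder(nn.Module):
    """Minimal sketch of the decoding loop in steps 401-406; not the patented implementation."""
    def __init__(self, vocab_size=1000, feat_dim=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm1 = nn.LSTMCell(hidden, hidden)              # first language model
        self.lstm2 = nn.LSTMCell(feat_dim + hidden, hidden)   # second language model
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, coding_features, start_id=1, end_id=2, max_len=20):
        b = coding_features.size(0)
        local = coding_features.mean(dim=(2, 3))              # stand-in for the grid selector output
        h1 = c1 = h2 = c2 = coding_features.new_zeros(b, self.lstm1.hidden_size)
        word = torch.full((b,), start_id, dtype=torch.long)   # initialization word
        words = []
        for _ in range(max_len):
            h1, c1 = self.lstm1(self.embed(word), (h1, c1))   # step 401: reference feature of LSTM1
            agg = torch.cat([local, h1], dim=1)               # steps 402-403: aggregation feature
            h2, c2 = self.lstm2(agg, (h2, c2))                # step 404: reference feature of LSTM2
            word = self.classifier(h2).argmax(dim=1)          # t+1-th output word (greedy decoding)
            if (word == end_id).all():                        # step 405: iteration termination condition
                break
            words.append(word)                                # step 406: continue with the next t
        return words

decoder = TwoLSTMDecoder()
caption_ids = decoder(torch.randn(2, 2048, 7, 7))
```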
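For the beam search mentioned in step 404, a toy sketch is shown below; the `step_fn` interface (returning a dictionary of word log-probabilities and a new decoder state) is an assumption made purely for this example.

```python
import heapq

def beam_search(step_fn, start_state, start_id, end_id, beam_size=3, max_len=20):
    """Toy beam search. step_fn(state, word_id) -> ({word_id: log_prob}, new_state) is an assumed interface."""
    beams = [(0.0, [start_id], start_state)]
    for _ in range(max_len):
        candidates = []
        for score, words, state in beams:
            if words[-1] == end_id:                       # finished hypotheses are carried over unchanged
                candidates.append((score, words, state))
                continue
            log_probs, new_state = step_fn(state, words[-1])
            for word_id, lp in heapq.nlargest(beam_size, log_probs.items(), key=lambda kv: kv[1]):
                candidates.append((score + lp, words + [word_id], new_state))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        if all(b[1][-1] == end_id for b in beams):
            break
    return max(beams, key=lambda c: c[0])[1]              # word ids of the best-scoring sentence
```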
S304, generating a description sentence corresponding to the target image according to the 1 st to t+1st output words.
Take the generated description sentence "一个苹果" ("an apple") as an example; it consists of 3 output words: "一" ("one"), "个" (a measure word), and "苹果" ("apple").
According to the initialization word, the initial reference feature of the first language model is obtained; the Grid selector then obtains the aggregation feature to be input into the second language model from the coding features (image features), the global pooling feature (global features) and the initial reference feature of the first language model, and the 1st output word "一" is obtained from the initial reference feature output by the second language model.
The 1st output word "一" is then input into the first language model to obtain its 1st non-initial reference feature; the Grid selector obtains the next aggregation feature from the coding features (image features), the global pooling feature (global features) and the 1st non-initial reference feature of the first language model, and the 2nd output word "个" is obtained from the 1st non-initial reference feature output by the second language model.
The 2nd output word "个" is then input into the first language model to obtain its 2nd non-initial reference feature; the Grid selector obtains the next aggregation feature from the coding features (image features), the global pooling feature (global features) and the 2nd non-initial reference feature of the first language model, and the 3rd output word "苹果" is obtained from the 2nd non-initial reference feature output by the second language model.
In this embodiment, the initial aggregation feature of the second language model is obtained from the coding features, the global pooling feature and the initial reference feature of the first language model, and the 1st output word is then obtained from it; the t-th aggregation feature of the second language model is obtained from the coding features, the global pooling feature and the t-th reference feature of the first language model, and the (t+1)-th output word is then obtained from it, so that the description sentence corresponding to the target image is generated. Because the aggregation features are generated in this way, the region of interest of the image can be selected flexibly.
In the image description task of the related art, a region of interest (ROI, region of interest) needs to be selected first, and the ROI region is then described. The ROI regions are generated during the encoding of the image; once encoding is completed, the regions are fixed and cannot be changed at a later stage. This limits the ability to focus on the corresponding regions based on context and semantic information during caption generation. The method of this embodiment can retain the local information of the image more completely and select the image information more flexibly.
302. And obtaining keywords according to the description sentences.
The method for obtaining the keywords according to the descriptive sentences includes:
The description sentence is filtered, for example by a text filtering algorithm, to obtain the keywords. For example, a trend text filtering algorithm may be used: a trend index of the description sentence is calculated, a corresponding weight is then generated for each word in the description sentence, and the keywords are finally obtained.
For example, the words in the description sentence are compared in a pre-stored database by a TF-IDF (term frequency-inverse text frequency index) algorithm, and the words with scores greater than a scoring threshold value are used as keywords.
If a word rarely appears in the tag database, but the frequency of the word appearing in the current description sentence is high, the word is considered to have good distinguishing capability and is suitable for distinguishing the target image corresponding to the current description sentence from other images, and the word is used as a keyword of the target image.
In particular use, keywords may be determined by setting a frequency of occurrence threshold. For example, if the frequency of occurrence of a word is lower than a set frequency threshold, the word is used as a keyword.
Take a target image whose description sentence is "children are skating" as an example: the words "children" and "skating" are looked up in the database through the TF-IDF algorithm, and it is finally determined that the occurrence frequency of "skating" is smaller than the occurrence-frequency threshold, so "skating" in the description sentence is extracted as a keyword of the target image.
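A pure-Python sketch of this keyword-extraction idea is shown below; the whitespace tokenization, the smoothed IDF formula, the example corpus and the score threshold are all assumptions made for illustration.

```python
import math
from collections import Counter

def extract_keywords(description, corpus, threshold=0.1):
    """TF-IDF keyword sketch: words frequent in the description but rare in the corpus score high."""
    words = description.lower().split()
    tf = Counter(words)
    keywords = []
    for word, count in tf.items():
        doc_freq = sum(1 for doc in corpus if word in doc.lower().split())
        idf = math.log((len(corpus) + 1) / (doc_freq + 1)) + 1   # smoothed inverse document frequency
        score = (count / len(words)) * idf
        if score > threshold:                                    # scoring threshold (illustrative value)
            keywords.append((word, score))
    return sorted(keywords, key=lambda kv: kv[1], reverse=True)

corpus = ["children are playing football", "a dog is running", "a man is riding a horse"]
print(extract_keywords("children are skating", corpus))
```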
303. And taking the keywords and/or the descriptive sentences as labels of the target images, and storing the target images and the labels into a database.
It should be explained that each target image may correspond to more than one label; generally, each target image corresponds to a plurality of labels, including description sentences and keywords.
Further, each label in the database may also correspond to multiple target images. For example, the keyword "skating" may correspond to multiple images, so that a user searching for this keyword obtains multiple images to choose from.
In one case, the tag and the image to which the tag corresponds are stored together in a database.
In another case, the tags and the images may be stored in separate databases: each tag is stored in the tag database together with the attribute information of its corresponding image, such as an image link or an image number, and the image database can then be searched according to the image attribute information associated with the matched tag.
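A minimal sketch of this second storage scheme is shown below, using an in-memory SQLite database; the table layout, the example label values and the image link are assumptions made purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative in-memory database
conn.executescript("""
CREATE TABLE images (image_id INTEGER PRIMARY KEY, image_link TEXT);
CREATE TABLE labels (label TEXT, label_type TEXT, image_id INTEGER REFERENCES images(image_id));
""")
conn.execute("INSERT INTO images VALUES (1, 'https://example.com/images/1.jpg')")  # hypothetical link
conn.executemany("INSERT INTO labels VALUES (?, ?, ?)",
                 [("children are skating", "sentence", 1), ("skating", "keyword", 1)])

# Searching: find the image links whose labels match a search word.
rows = conn.execute("SELECT i.image_link FROM labels l JOIN images i ON l.image_id = i.image_id "
                    "WHERE l.label = ?", ("skating",)).fetchall()
print(rows)
```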
202. And outputting the target image corresponding to the label obtained by matching.
Specifically, outputting the target image corresponding to the label obtained by matching, including: and determining the description sentences and/or keywords with the similarity larger than a threshold value with the search sentences and/or the search words, and outputting target images corresponding to the determined description sentences and/or keywords.
Specifically, in the case where a search term is included in the search instruction, the search term is subjected to similarity matching with the description term in the tag database. The similarity matching between two sentences may be achieved by natural language processing models, such as convolutional neural network (Convolutional Neural Network, CNN) models, vector space models (Vector Space Model, VSM), etc.
Compared with searching through keywords, performing sentence-similarity detection directly between the search sentence and the description sentences serving as labels makes the search more intelligent and the results more accurate.
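A vector-space-model sketch of this similarity matching is given below, using TF-IDF vectors and cosine similarity from scikit-learn; it merely stands in for the CNN/VSM models mentioned above, and the 0.7 threshold echoes the worked example later in this description. Whether a given pair clears the threshold depends on the corpus and the threshold chosen.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_descriptions(search_sentence, descriptions, threshold=0.7):
    """Return the description sentences whose cosine similarity to the search sentence exceeds the threshold."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(descriptions + [search_sentence])
    query_vec, desc_mat = matrix[len(descriptions)], matrix[:len(descriptions)]
    similarities = cosine_similarity(query_vec, desc_mat).ravel()
    return [(d, float(s)) for d, s in zip(descriptions, similarities) if s > threshold]

descriptions = ["motorcycle driver driving on the road", "children are skating"]
matches = match_descriptions("drive on road", descriptions)   # example call; result depends on the data
```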
Specifically, in the case where the search term is included in the search instruction, the search term is matched with the keyword in the tag database. The matching modes of the two words comprise various matching modes, such as matching based on a knowledge graph or matching based on Word2vec Word vector tools and the like.
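For the word-vector route, a minimal sketch with gensim is shown below; "word_vectors.bin" is a hypothetical path, and any pretrained vectors in word2vec format could be loaded in practice. The 0.6 threshold is likewise an illustrative assumption.

```python
from gensim.models import KeyedVectors

# "word_vectors.bin" is a hypothetical path to pretrained vectors in word2vec binary format.
vectors = KeyedVectors.load_word2vec_format("word_vectors.bin", binary=True)
score = vectors.similarity("drive", "driving")   # cosine similarity between the two word vectors
if score > 0.6:                                  # illustrative matching threshold
    print("search word matches the keyword")
```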
According to the image searching method provided by the application, description sentences and keywords corresponding to the target images are generated and stored in the database as labels of the target images. When a search instruction is obtained, the search sentence and/or search word of the search instruction is matched in the database, and the target image corresponding to the matched label is output. Because the database contains labels consisting of description sentences of the target images, and the description sentences carry relatively complete semantic descriptions of the image scenes, a user can find a target image through a description sentence with similar semantics. The method supports sentence-based search, which not only enriches the ways of searching for images but also improves the efficiency and quality of image search, thereby enhancing the user's image-search experience.
In addition, the method of this embodiment encodes and pools the target image through the convolutional neural network model to obtain the corresponding coding features and global pooling features, which are then input into a decoding layer comprising the first language model, the second language model and the grid selector for decoding, finally obtaining the labels corresponding to the image. In this way, not only can the existing images in the database be labeled, but newly collected images, including images uploaded by users and the massive images available online, can also be labeled, stored in the database in time and made retrievable. This speeds up the expansion of the database, saves manual labeling cost for the enterprise, and increases the probability that users find the information they are searching for.
For ease of understanding, an embodiment of the present application is schematically illustrated with a specific example. Referring to fig. 5, fig. 5 illustrates a motorcyclist riding a motorcycle on a road. The system depicted in fig. 5 includes an encoding layer and a decoding layer. The encoding layer uses the hidden-layer output of a CNN model to obtain the coding features (image features) and global pooling features (global features) of the target image.
The decoding layer uses 4 modules or models in sequence, namely a Grid selector, a first language model LSTM1, a second language model LSTM2 and a classifier.
The image searching method comprises the following steps:
1) The target image is input into the CNN model, and the coding features are obtained from the hidden-layer output of the CNN model. The global pooling features (global features) are obtained by pooling the coding features.
2) The coding features (image features) and the global pooling features (global features) are input into the Grid selector on the decoding-layer side. According to the initialization word, the initial reference feature h^1_1 of LSTM1 is obtained.
3) The Grid selector obtains an initial affine transformation matrix from the global pooling features (global features) and the initial reference feature h^1_1, performs affine transformation on the coding features (image features) according to the initial affine transformation matrix to obtain the initial local features, performs association degree calculation on the initial local features (local features) to obtain the processed associated initial local features, and splices the associated initial local features with the initial reference feature to obtain the initial aggregation feature. The initial aggregation feature is input into LSTM2, and LSTM2 outputs the initial reference feature h^2_1. The feature h^2_1 is input into the classifier, and the 1st output word "motorcycle" is obtained.
4) The 1st output word "motorcycle" is input into LSTM1 to obtain the non-initial reference feature h^1_2. The Grid selector obtains an affine transformation matrix from the global pooling features (global features) and the non-initial reference feature h^1_2, performs affine transformation on the coding features according to the affine transformation matrix to obtain the local features, performs association degree calculation on the local features (local features) to obtain the processed associated local features, and splices the associated local features with the non-initial reference feature to obtain the aggregation feature. The aggregation feature is input into LSTM2, and LSTM2 outputs the non-initial reference feature h^2_2. The feature h^2_2 is input into the classifier, and the 2nd output word "driver" is obtained.
5) And so on, the 3rd output word "driving", the 4th output word "on", the 5th output word "the" and the 6th output word "road" are obtained.
6) From the output words, the description sentence "motorcycle driver driving on the road" of the target image is obtained.
7) The description sentence is compared against the database through the TF-IDF algorithm to determine the keywords corresponding to the description sentence, which include "driving" and "motorcycle driver".
8) The description sentences and the keywords are used as labels of the target images and are stored in a database together with the target images.
9) When a search instruction is obtained, the search sentence and/or search word in the search instruction is parsed and matched in the database, and the target images corresponding to the matched labels are output.
For example, the search instruction includes a search word "drive", and then a tag "driving" corresponding to the search word is searched in the database, and a target image corresponding to the tag "driving" is output.
For another example, suppose the preset similarity threshold between the search sentence and the description sentences in the image tags is 0.7. If the search instruction includes the search sentence "drive on road", then description sentences or keywords whose similarity to "drive on road" is greater than 0.7 are searched in the database. Optionally, the similarity between "drive on road" and the description sentence "motorcycle driver driving on the road" is calculated using a convolutional neural network (CNN), and if the calculated similarity is greater than 0.7, the target image corresponding to the description sentence is output.
An embodiment of the present application further provides a system for searching an image, referring to fig. 6, including:
the matching module 601 is configured to match in a database according to a search sentence and/or a search word of a search instruction, where the database stores a target image and a tag generated according to the target image;
the image output module 602 is configured to output the target image corresponding to the label obtained by matching.
Optionally, the apparatus further comprises:
the descriptive statement generation module is configured to generate descriptive statements corresponding to the target image;
the keyword generation module is configured to obtain keywords according to the description sentences;
and the storage module is configured to take the keywords and/or the description sentences as labels of target images and store the target images and the labels into a database.
Optionally, the descriptive statement generation module is specifically configured to:
the coding module is configured to code the target image to obtain corresponding coding features and global pooling features;
the first output word generation module is configured to obtain initial aggregation features according to the coding features, the global pooling features and the initial reference features of the first language model, input the initial aggregation features into the second language model to generate initial reference features of the second language model, and generate a 1 st output word according to the initial reference features of the second language model;
The second output word generation module is configured to obtain a t aggregation feature according to the coding feature, the global pooling feature and the t output word, input the t aggregation feature into the second language model to generate a t reference feature of the second language model until an iteration termination condition is met, and obtain a t+1th output word, wherein t is more than or equal to 1 and t is a positive integer;
and the descriptive statement generation module is configured to generate descriptive statements corresponding to the target image according to the 1 st to t+1st output words.
Optionally, the encoding module is specifically configured to: coding the target image through a convolutional neural network model to obtain corresponding coding characteristics; and carrying out pooling treatment on the coding features through a pooling layer of the convolutional neural network model to obtain corresponding global pooling features.
Optionally, the first output word generation module is specifically configured to: processing the coding features according to the global pooling features and initial reference features of the first language model to obtain initial local features; and carrying out aggregation treatment on the initial local features and the initial reference features to obtain initial aggregation features.
Optionally, the first output word generation module is specifically configured to: obtaining an initial affine transformation matrix according to the global pooling feature and the initial reference feature of the first language model;
And carrying out affine transformation on the coding features according to the initial affine transformation matrix to obtain initial local features.
Optionally, the first output word generation module is specifically configured to: performing association degree calculation on the initial local features to obtain processed association initial local features; and splicing the associated initial local features and the initial reference features to obtain initial aggregation features.
Optionally, the first output word generation module is specifically configured to: multiplying the initial local feature and the initial reference feature of the first language model by corresponding weight coefficients respectively, and then adding to obtain an initial intermediate vector matrix;
multiplying the hyperbolic tangent value of the initial intermediate vector matrix by a corresponding weight coefficient to obtain an attention initial weight coefficient;
and obtaining the associated initial local characteristic according to the attention initial weight coefficient and the initial local characteristic.
Optionally, the initial reference feature of the first language model is generated by: inputting the initialized words into the first language model to obtain the 1 st output feature of the first language model as an initial reference feature.
Optionally, the second output word generation module is specifically configured to:
the first non-initial reference feature generation module is configured to input the t-th output word into the first language model to obtain the t-th non-initial reference feature of the first language model;
the local feature generation module is configured to process the coding features according to the global pooling features and the t-th non-initial reference feature to obtain the t-th local feature;
the aggregation feature generation module is configured to aggregate the t-th local feature and the t-th non-initial reference feature to obtain the t-th aggregation feature;
the second non-initial reference feature generation module is configured to input the t-th aggregation feature into the second language model to generate the t-th non-initial reference feature of the second language model, and generate the (t+1)-th output word according to the t-th non-initial reference feature of the second language model;
the judging module is configured to judge whether the iteration termination condition is met; if not, the increment module is executed, and if so, the process ends;
and the increment module is configured to increase t by 1 and return to the first non-initial reference feature generation module (this iterative loop is sketched in the example below).
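Tying these sub-modules together, the word-by-word decoding loop could look roughly like the sketch below. The stand-in grid_select and aggregate functions, the greedy argmax classification, and the use of an end-of-sentence token or length limit as the iteration termination condition are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

vocab_size, dim, bos_id, eos_id, max_len = 10000, 512, 1, 2, 20   # assumed hyper-parameters
embedding = nn.Embedding(vocab_size, dim)
lm1 = nn.LSTMCell(dim, dim)              # first language model
lm2 = nn.LSTMCell(dim + 2048, dim)       # second language model, fed with aggregation features
classifier = nn.Linear(dim, vocab_size)  # classifies a reference feature into an output word

def grid_select(coding, pooling, reference):    # stand-in for the grid selector sketch above
    return coding

def aggregate(local, reference):                # stand-in for the attention aggregation sketch above
    return torch.cat([local.mean(dim=(2, 3)), reference], dim=1)

coding = torch.randn(1, 2048, 7, 7)             # dummy coding features
pooling = coding.mean(dim=(2, 3))               # dummy global pooling features

h1, c1 = lm1(embedding(torch.tensor([bos_id])))                  # initial reference feature of the first language model
h2, c2 = lm2(aggregate(grid_select(coding, pooling, h1), h1))    # initial reference feature of the second language model
words = [classifier(h2).argmax(dim=1)]                           # 1st output word

t = 1
while True:
    h1, c1 = lm1(embedding(words[-1]), (h1, c1))    # t-th non-initial reference feature of the first language model
    local = grid_select(coding, pooling, h1)        # t-th local feature
    agg = aggregate(local, h1)                      # t-th aggregation feature
    h2, c2 = lm2(agg, (h2, c2))                     # t-th non-initial reference feature of the second language model
    words.append(classifier(h2).argmax(dim=1))      # (t+1)-th output word
    if words[-1].item() == eos_id or t >= max_len:  # assumed iteration termination condition
        break
    t += 1                                          # increment module: increase t by 1 and loop again

# the 1st to (t+1)-th output words form the description sentence
```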
Optionally, the second output word generation module is specifically configured to:
obtain a t-th affine transformation matrix according to the global pooling features and the t-th non-initial reference feature;
and perform an affine transformation on the coding features according to the t-th affine transformation matrix to obtain the t-th local feature.
Optionally, the second output word generation module is specifically configured to: perform an association degree calculation on the t-th local feature to obtain the t-th associated local feature; and splice the t-th associated local feature and the t-th non-initial reference feature to obtain the t-th aggregation feature.
Optionally, the second output word generation module is specifically configured to: multiply the t-th local feature and the t-th non-initial reference feature of the first language model by corresponding weight coefficients respectively and add the results to obtain an intermediate vector matrix;
multiply the hyperbolic tangent value of the intermediate vector matrix by a corresponding weight coefficient to obtain attention weight coefficients;
and obtain the t-th associated local feature according to the attention weight coefficients and the t-th local feature.
Optionally, the second output word generation module is specifically configured to: classify the t-th non-initial reference feature of the second language model to obtain the corresponding (t+1)-th output word.
Optionally, the matching module 601 is specifically configured to: score the words in the description sentences in the database with a term frequency-inverse document frequency (TF-IDF) algorithm, and take the words whose scores exceed a scoring threshold as keywords.
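A small sketch of such TF-IDF keyword extraction using scikit-learn; the example description sentences and the 0.3 scoring threshold are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "a brown dog is running on the grass",
    "two children are playing football on a field",
]
vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(descriptions).toarray()   # (num_sentences, num_terms) TF-IDF scores
vocab = vectorizer.get_feature_names_out()

threshold = 0.3                                              # assumed scoring threshold
keywords = [
    [vocab[j] for j, s in enumerate(row) if s > threshold]   # words whose score exceeds the threshold
    for row in scores
]
print(keywords)
```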
Optionally, the matching module 601 is configured to: perform similarity matching between the search sentence and/or search word in the search instruction and the description sentences and/or keywords in the database;
and the image output module 602 is configured to: determine the description sentences and/or keywords whose similarity to the search sentence and/or search word is greater than a threshold, and output the target images corresponding to the determined description sentences and/or keywords.
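One simple way to realize this similarity matching, sketched here under stated assumptions, is cosine similarity between TF-IDF vectors of the search sentence and the stored description sentences; the example database, query and 0.2 similarity threshold are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

database = {                                    # target image -> description sentence used as its tag
    "img_001.jpg": "a brown dog is running on the grass",
    "img_002.jpg": "two children are playing football on a field",
}
vectorizer = TfidfVectorizer()
tag_vectors = vectorizer.fit_transform(database.values())

query = "dog running in a park"                 # search sentence from the search instruction
query_vector = vectorizer.transform([query])

similarity = cosine_similarity(query_vector, tag_vectors)[0]   # one score per stored description
threshold = 0.2
matches = [img for img, s in zip(database, similarity) if s > threshold]
print(matches)                                  # target images whose tag matches the search sentence
```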
The above is a schematic solution of the system for image searching of this embodiment. It should be noted that the technical solution of the system and the technical solution of the image searching method described above belong to the same concept; for details of the technical solution of the system that are not described here, reference may be made to the description of the technical solution of the image searching method.
An embodiment of the application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the steps of a method of image searching as described above.
The above is an exemplary solution of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the image searching method described above belong to the same concept; for details of the technical solution of the storage medium that are not described here, reference may be made to the description of the technical solution of the image searching method.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be added or removed as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
An embodiment of the application also provides a chip storing computer instructions which, when executed by the chip, implement the steps of the method of image searching as described above.
It should be noted that, for the sake of simple description, the foregoing method embodiments are all described as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in another order or simultaneously in accordance with the present application. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules referred to are not necessarily all required by the present application.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts of an embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the application disclosed above are intended only to assist in the explanation of the application. Alternative embodiments are not intended to be exhaustive or to limit the application to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and the practical application, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and the full scope and equivalents thereof.

Claims (9)

1. A method of image searching, the method comprising:
in a case where a search instruction is obtained, performing matching in a database according to a search sentence and/or a search word of the search instruction, wherein the database stores a target image and a tag generated according to the target image;
outputting a target image corresponding to the tag obtained by the matching;
wherein the tag is obtained according to a description sentence corresponding to the target image, and the step of generating the description sentence comprises:
encoding the target image to obtain corresponding coding features and global pooling features;
inputting the coding features, the global pooling features and initial reference features of a first language model into a grid selector to obtain initial local features;
performing aggregation processing on the initial local features and the initial reference features to obtain initial aggregation features;
inputting the initial aggregation features into a second language model to generate initial reference features of the second language model, and generating a 1st output word according to the initial reference features of the second language model;
inputting a t-th output word into the first language model to obtain a t-th non-initial reference feature of the first language model, wherein t is greater than or equal to 1 and t is a positive integer;
inputting the coding features, the global pooling features and the t-th non-initial reference feature into the grid selector to obtain a t-th local feature;
performing aggregation processing on the t-th local feature and the t-th non-initial reference feature to obtain a t-th aggregation feature;
inputting the t-th aggregation feature into the second language model to generate a t-th non-initial reference feature of the second language model, and generating a (t+1)-th output word according to the t-th non-initial reference feature of the second language model;
judging whether an iteration termination condition is met; if not, adding 1 to t, and returning to the step of inputting the t-th output word into the first language model to obtain the t-th non-initial reference feature of the first language model;
if yes, ending; and
generating a description sentence corresponding to the target image according to the 1st to (t+1)-th output words.
2. The method of image searching of claim 1, wherein the method further comprises:
generating a description sentence corresponding to the target image;
obtaining keywords according to the description sentence;
and taking the keywords and/or the description sentence as a tag of the target image, and storing the target image and the tag into the database.
3. The method of claim 1, wherein encoding the target image to obtain the corresponding coding features and global pooling features comprises:
encoding the target image through a convolutional neural network model to obtain the corresponding coding features;
and performing pooling processing on the coding features through a pooling layer of the convolutional neural network model to obtain the corresponding global pooling features.
4. The method of image searching according to claim 2, wherein obtaining keywords according to the description sentence comprises: scoring the words in the description sentences in the database with a term frequency-inverse document frequency (TF-IDF) algorithm, and taking the words whose scores exceed a scoring threshold as keywords.
5. The method of image searching according to claim 2, wherein performing matching in the database according to the search sentence and/or search word of the search instruction comprises: performing similarity matching between the search sentence and/or search word in the search instruction and the description sentences and/or keywords in the database;
and outputting the target image corresponding to the tag obtained by the matching comprises: determining the description sentences and/or keywords whose similarity to the search sentence and/or search word is greater than a threshold, and outputting the target images corresponding to the determined description sentences and/or keywords.
6. A system for image searching, the system comprising:
the matching module is configured to, in a case where a search instruction is obtained, perform matching in a database according to a search sentence and/or a search word of the search instruction, wherein the database stores target images and tags corresponding to the target images;
the image output module is configured to output a target image corresponding to the tag obtained by the matching;
wherein the tag is obtained according to a description sentence corresponding to the target image, and the step of generating the description sentence comprises:
encoding the target image to obtain corresponding coding features and global pooling features;
inputting the coding features, the global pooling features and initial reference features of a first language model into a grid selector to obtain initial local features;
performing aggregation processing on the initial local features and the initial reference features to obtain initial aggregation features;
inputting the initial aggregation features into a second language model to generate initial reference features of the second language model, and generating a 1st output word according to the initial reference features of the second language model;
inputting a t-th output word into the first language model to obtain a t-th non-initial reference feature of the first language model, wherein t is greater than or equal to 1 and t is a positive integer;
inputting the coding features, the global pooling features and the t-th non-initial reference feature into the grid selector to obtain a t-th local feature;
performing aggregation processing on the t-th local feature and the t-th non-initial reference feature to obtain a t-th aggregation feature;
inputting the t-th aggregation feature into the second language model to generate a t-th non-initial reference feature of the second language model, and generating a (t+1)-th output word according to the t-th non-initial reference feature of the second language model;
judging whether an iteration termination condition is met; if not, adding 1 to t, and returning to the step of inputting the t-th output word into the first language model to obtain the t-th non-initial reference feature of the first language model;
if yes, ending; and
generating a description sentence corresponding to the target image according to the 1st to (t+1)-th output words.
7. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, implements the steps of the method of any of claims 1-5.
8. A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 5.
9. A chip storing computer instructions, which when executed by the chip, implement the steps of the method of any one of claims 1-5.
CN201910345750.5A 2019-04-26 2019-04-26 Image searching method and system Active CN110083729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910345750.5A CN110083729B (en) 2019-04-26 2019-04-26 Image searching method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910345750.5A CN110083729B (en) 2019-04-26 2019-04-26 Image searching method and system

Publications (2)

Publication Number Publication Date
CN110083729A CN110083729A (en) 2019-08-02
CN110083729B CN110083729B (en) 2023-10-27

Family

ID=67417107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910345750.5A Active CN110083729B (en) 2019-04-26 2019-04-26 Image searching method and system

Country Status (1)

Country Link
CN (1) CN110083729B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502650A (en) * 2019-08-12 2019-11-26 深圳智能思创科技有限公司 A kind of image indexing system and method based on natural language description
CN112395441A (en) * 2019-08-14 2021-02-23 杭州海康威视数字技术股份有限公司 Object retrieval method and device
CN112651413B (en) * 2019-10-10 2023-10-17 百度在线网络技术(北京)有限公司 Integrated learning classification method, device, equipment and storage medium for hypo-custom graph
CN111027622B (en) * 2019-12-09 2023-12-08 Oppo广东移动通信有限公司 Picture label generation method, device, computer equipment and storage medium
CN111754480B (en) * 2020-06-22 2024-04-16 上海华力微电子有限公司 Crystal back defect map retrieval and early warning method, storage medium and computer equipment
CN112632236A (en) * 2020-12-02 2021-04-09 中山大学 Improved sequence matching network-based multi-turn dialogue model
CN112765387A (en) * 2020-12-31 2021-05-07 中国工商银行股份有限公司 Image retrieval method, image retrieval device and electronic equipment
CN114186023B (en) * 2021-12-07 2023-05-26 北京金堤科技有限公司 Search processing method, device, equipment and medium for specific search scene


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9443164B2 (en) * 2014-12-02 2016-09-13 Xerox Corporation System and method for product identification

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808757A (en) * 2016-03-15 2016-07-27 浙江大学 Chinese herbal medicine plant picture retrieval method based on multi-feature fusion BOW model
CN105956610A (en) * 2016-04-22 2016-09-21 中国人民解放军军事医学科学院卫生装备研究所 Remote sensing image landform classification method based on multi-layer coding structure
CN108694225A (en) * 2017-03-31 2018-10-23 阿里巴巴集团控股有限公司 A kind of image search method, the generation method of feature vector, device and electronic equipment
CN107256221A (en) * 2017-04-26 2017-10-17 苏州大学 Video presentation method based on multi-feature fusion
CN108805260A (en) * 2017-04-26 2018-11-13 上海荆虹电子科技有限公司 A kind of figure says generation method and device
CN107292875A (en) * 2017-06-29 2017-10-24 西安建筑科技大学 A kind of conspicuousness detection method based on global Local Feature Fusion
CN107967481A (en) * 2017-07-31 2018-04-27 北京联合大学 A kind of image classification method based on locality constraint and conspicuousness
CN107491534A (en) * 2017-08-22 2017-12-19 北京百度网讯科技有限公司 Information processing method and device
CN107766894A (en) * 2017-11-03 2018-03-06 吉林大学 Remote sensing images spatial term method based on notice mechanism and deep learning
CN108427738A (en) * 2018-03-01 2018-08-21 中山大学 A kind of fast image retrieval method based on deep learning
CN109190619A (en) * 2018-08-23 2019-01-11 重庆大学 A kind of Image Description Methods based on target exposure mask
CN109635135A (en) * 2018-11-30 2019-04-16 Oppo广东移动通信有限公司 Image index generation method, device, terminal and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Automatic Arabic image captioning using RNN-LSTM-based language modeland CNN;Huda A. Al-muzaini等;《International Journal of Advanced Computer Science and Applications》;20180131;第09卷(第06期);67-73 *
基于稀疏编码空间金字塔模型的零样本学习;董夙慧;《江苏大学学报(自然科学版)》;696-702 *

Also Published As

Publication number Publication date
CN110083729A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN110083729B (en) Image searching method and system
CN110837579B (en) Video classification method, apparatus, computer and readable storage medium
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
US20190108242A1 (en) Search method and processing device
CN111581510A (en) Shared content processing method and device, computer equipment and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
US11860928B2 (en) Dialog-based image retrieval with contextual information
CN113312500A (en) Method for constructing event map for safe operation of dam
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN113094552A (en) Video template searching method and device, server and readable storage medium
CN114495129B (en) Character detection model pre-training method and device
CN111090763A (en) Automatic picture labeling method and device
CN111242033A (en) Video feature learning method based on discriminant analysis of video and character pairs
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN113704507B (en) Data processing method, computer device and readable storage medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN115269913A (en) Video retrieval method based on attention fragment prompt
CN113806588A (en) Method and device for searching video
CN115964560A (en) Information recommendation method and equipment based on multi-mode pre-training model
CN116340502A (en) Information retrieval method and device based on semantic understanding
CN108717436B (en) Commodity target rapid retrieval method based on significance detection
CN117635275A (en) Intelligent electronic commerce operation commodity management platform and method based on big data
CN117494815A (en) File-oriented credible large language model training and reasoning method and device
CN112784156A (en) Search feedback method, system, device and storage medium based on intention recognition
Gayathri et al. An efficient video indexing and retrieval algorithm using ensemble classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant