WO2020108234A1 - Image index generation method, image search method and apparatus, and terminal, and medium

Info

Publication number: WO2020108234A1
Application number: PCT/CN2019/115411
Authority: WO
Inventors: 侯允; 刘耀勇; 陈岩
Original assignee: Oppo广东移动通信有限公司
Priority date: 2018-11-30
Filing date: 2019-11-04
Publication date: 2020-06-04
Also published as: CN109635135A

Abstract

Disclosed are an image index generation method, an image search method and apparatus, a terminal and a medium. The method comprises: acquiring a first image (101); performing image recognition on the first image to obtain the recognition result corresponding to the first image (102); according to the recognition result, generating a description sentence (103); and determining the description sentence as an index of the first image and storing the index and the first image correspondingly (104). The method recognizes the recognition results corresponding to various objects comprised in an image and generates a description sentence describing the image according to the recognition results, and determines the above-mentioned description sentence as an index of the image; later, when needing to search for the image, a user can input a word comprised in the index or a word with a meaning close to the word comprised in the index; and a terminal can accurately search for the image according to the word input by the user, improving searching efficiency of image searching in a photo album.

Description

Image index generation method, image search method, device, terminal and medium

This application claims priority to Chinese patent application No. 201811457455.0, titled "Image Index Generation Method, Device, Terminal, and Storage Medium" and filed on November 30, 2018.

Technical field

This application relates to the field of search technology, in particular to an image index generation method, image search method, device, terminal and medium.

Background

At present, a photo album application is usually installed in the terminal, and the photo album application is generally used to store captured images, images saved from the network, and the like.

When many images are saved in the album, a user looking for a particular image has to open each album directory on the terminal and browse it to find the image.

Summary of the invention

Embodiments of the present application provide an image index generation method, image search method, device, terminal, and medium. The technical solution is as follows:

In one aspect, an image index generation method is provided. The method includes:

Acquiring a first image;

Performing image recognition on the first image to obtain a recognition result corresponding to the first image;

Generating a description sentence according to the recognition result, where the description sentence is used to describe the first image;

The description sentence is determined as an index of the first image, and the index is stored in correspondence with the first image.

In another aspect, an image search method is provided, the method including:

Display the search box;

Receiving the first keyword entered in the search box;

Searching an album for a second image matching the first keyword, where an index corresponding to the second image includes a first target keyword, the first target keyword matches the first keyword, and the index corresponding to the second image is a description sentence generated according to a recognition result of the second image;

A search result is displayed, the search result including the second image.

In another aspect, an image index generation device is provided, the device comprising:

The image acquisition module is used to acquire the first image;

An image recognition module, configured to perform image recognition on the first image to obtain a recognition result corresponding to the first image;

A sentence generating module, configured to generate a description sentence according to the recognition result, and the description sentence is used to describe the first image;

The index generation module is configured to determine the description sentence as an index of the first image, and store the index in correspondence with the first image.

In yet another aspect, an image search device is provided, the device including:

Search box display module, used to display the search box;

A keyword receiving module, configured to receive the first keyword input in the search box;

An image search module, configured to search an album for a second image matching the first keyword, where an index corresponding to the second image includes a first target keyword, the first target keyword matches the first keyword, and the index corresponding to the second image is a description sentence generated according to a recognition result of the second image;

The result display module is used to display search results, and the search results include the second image.

In still another aspect, an embodiment of the present application provides a terminal. The terminal includes a processor and a memory, the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the above image index generation method or the above image search method.

In still another aspect, an embodiment of the present application provides a computer-readable storage medium in which a computer program is stored, and the computer program is loaded and executed by a processor to implement the above image index generation method or the above image search method.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of an image index generation method provided by an embodiment of the present application;

FIG. 2 is a flowchart of an image index generation method provided by another embodiment of the present application;

FIG. 3 is a flowchart of an image search method provided by an embodiment of the present application;

FIG. 4 is a flowchart of an image search method provided by another embodiment of the present application;

FIG. 5 is a block diagram of an image index generation device provided by an embodiment of the present application;

FIG. 6 is a block diagram of an image search device provided by an embodiment of the present application;

FIG. 7 is a block diagram of a terminal provided by an embodiment of the present application.

Detailed description

To make the objectives, technical solutions, and advantages of the present application clearer, the following describes the embodiments of the present application in further detail with reference to the accompanying drawings.

Embodiments of the present application provide an image index generation method, device, terminal, and storage medium. Recognition results corresponding to the objects included in an image are obtained, a language description model generates from those recognition results a description sentence describing the image, and the description sentence is determined as the index of the image. Later, when the user needs to search for the image, the user can input a word included in the index, or a word with a meaning similar to a word included in the index, and the terminal can accurately find the image according to the input word, which improves the efficiency of searching for images in the album.

In the technical solution provided by the embodiments of the present application, the execution subject of each step is a terminal. Optionally, a photo album application is installed in the terminal, and the photo album application refers to an application for storing images. The image may be an image (including photos and videos) taken by the user, or an image (including photos and videos) saved by the user from other applications. The terminal may be a mobile phone, a tablet computer, a personal computer, a smart wearable device, a camera, a smart playback device, and so on.

An embodiment of the present application provides an image index generation method. The method includes:

Acquiring a first image;

Optionally, the generating a description sentence based on the recognition result includes:

Convert the recognition result into a first word vector;

The first word vector is processed through a language description model to obtain the description sentence.

Optionally, after acquiring the first image, the method further includes:

Acquiring associated information of the first image, the associated information including at least one of the following: location information, time information, and scene information;

The generating a description sentence based on the recognition result includes:

Convert the recognition result into a first word vector;

Convert the related information into a second word vector;

The first word vector and the second word vector are processed through a language description model to obtain the description sentence.

Optionally, before determining the description sentence as an index of the first image and storing the index corresponding to the first image, the method further includes:

Displaying inquiry information, the inquiry information is used to inquire whether to determine the description sentence as the index;

When receiving the confirmation instruction corresponding to the inquiry information, performing the step of determining the description sentence as an index of the first image, and storing the index in correspondence with the first image.

Optionally, after displaying the inquiry information, the method further includes:

When the confirmation instruction is not received, an input box is displayed;

Receiving the sentence input in the input box;

Determining the input sentence as an index of the first image, and storing the index corresponding to the first image.

Optionally, performing image recognition on the first image to obtain a recognition result corresponding to the first image includes:

Performing image recognition on the first image through an image recognition model to obtain recognition results corresponding to at least one object in the first image respectively;

Wherein, the image recognition model is a neural network model trained by using multiple sample images, and the object in each sample image of the multiple sample images corresponds to a classification label.

Optionally, before generating the description sentence based on the recognition result, the method further includes:

Acquiring a training sample set, the training sample set including a plurality of sample images, each sample image corresponding to a recognition result and an expected description sentence;

For each sample image, processing the recognition result through a language description model and outputting an actual description sentence;

Calculating the error between the actual description sentence and the expected description sentence;

When the error is greater than a preset threshold, adjusting the parameters of the language description model and returning to the step of processing the recognition result of each sample image through the language description model and outputting the actual description sentence; when the error is less than or equal to the preset threshold, stopping the training to obtain the trained language description model, the language description model being used to generate the description sentence according to the recognition result.

An embodiment of the present application also provides an image search method. The method includes:

Display the search box;

Receiving the first keyword entered in the search box;

A search result is displayed, the search result including the second image.

Optionally, before displaying the search results, the method further includes:

When the number of second images is greater than a preset number, displaying prompt information, the prompt information being used to prompt input of a second keyword;

Acquiring the second keyword;

Searching the second images for a third image matching the second keyword, where an index corresponding to the third image includes a second target keyword, and the second target keyword matches the second keyword;

Wherein, the search result includes the third image.
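The optional narrowing step above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the album is modeled as a plain dictionary from image name to index, matching is simplified to an exact word match, and the names `PRESET_NUMBER`, `matches`, `search`, and `refine` are hypothetical.

```python
from typing import Dict, List

PRESET_NUMBER = 2  # hypothetical preset number of results above which a second keyword is requested


def matches(index: str, keyword: str) -> bool:
    """Stand-in matcher: exact, case-insensitive word match against the index."""
    return keyword.lower() in index.lower().split()


def search(album: Dict[str, str], keyword: str) -> List[str]:
    """Return the images whose index (description sentence) matches the keyword."""
    return [image for image, index in album.items() if matches(index, keyword)]


def refine(album: Dict[str, str], second_images: List[str], second_keyword: str) -> List[str]:
    """Search only within the second images for those matching the second keyword."""
    return [image for image in second_images if matches(album[image], second_keyword)]


album = {
    "a.jpg": "dog running on the grass in the park",
    "b.jpg": "dog sleeping on the sofa",
    "c.jpg": "dog playing on the beach",
    "d.jpg": "cat sitting on the grass",
}

second_images = search(album, "dog")
# Too many hits: the terminal would prompt for a second keyword here.
needs_second_keyword = len(second_images) > PRESET_NUMBER
third_images = refine(album, second_images, "beach") if needs_second_keyword else second_images
```

Because the second search runs only over the second images, the second keyword can only narrow the result set, never widen it.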

Please refer to FIG. 1, which shows a flowchart of an image index generation method provided by an embodiment of the present application. The method may include the following steps:

Step 101: Acquire a first image.

In a possible implementation manner, the first image may be an image collected by a camera on the terminal. Optionally, a camera is provided on the terminal and a shooting application is installed. The shooting application refers to an application used to capture an image, for example, a camera application, a beauty application, or other applications. When the shooting application is running, the terminal receives the trigger signal of the shooting control acting on the current shooting interface, and acquires the image collected by the camera as the first image.

In another possible implementation manner, the first image may not be an image collected by a camera on the terminal, but an image saved by the user from other application programs. Optionally, the first image is an image obtained from the network or a screenshot. Optionally, when an image is displayed on the display interface of the terminal, when the terminal receives a save instruction corresponding to the image, the image is acquired from the network as the first image according to the save instruction.

In addition, the embodiment of the present application does not limit the acquisition method and timing of the first image.

Step 102: Perform image recognition on the first image to obtain a recognition result corresponding to the first image.

The recognition result corresponding to the first image indicates the objects included in the first image. The first image may include one or more objects, such as people, animals, buildings, and landscapes. In the embodiment of the present application, the terminal determines the category to which each object belongs (for example, cat, dog, grass, or person) as follows: the terminal performs image recognition on the first image through an image recognition model to obtain recognition results respectively corresponding to at least one object in the first image.

The image recognition model is a neural network model trained using multiple sample images. For example, the image recognition model may be obtained by training a deep learning network using multiple sample images. The object in each of the sample images corresponds to a classification label, and the classification label characterizes the category to which the object belongs. In some embodiments of the present application, the image recognition model includes an input layer, at least one convolutional layer (for example, three convolutional layers: a first, a second, and a third convolutional layer), at least one fully connected layer (for example, two fully connected layers: a first and a second fully connected layer), and an output layer. The input data of the input layer is the first image, and the output of the output layer is the category to which each of the at least one object included in the first image belongs. The image recognition process is as follows: the first image is input to the input layer of the image recognition model; the convolutional layers extract features of the first image; the fully connected layers combine and abstract those features to obtain data suitable for classification by the output layer; and finally the output layer outputs the recognition results respectively corresponding to the at least one object included in the first image.

In the embodiments of the present application, the specific structures of the convolution layer and the fully connected layer of the image recognition model are not limited. The image recognition model shown in the above embodiment is only exemplary and explanatory, and is not used to limit the present application. In general, the more layers of the convolutional neural network, the better the effect but the longer the calculation time. In practical applications, the convolutional neural network with the appropriate number of layers can be designed in conjunction with the requirements for recognition accuracy and efficiency.

The sample image refers to an image selected in advance for training the image recognition model. The sample image has a classification label. The classification label of the sample image is usually determined manually, and is used to describe the scene, item, person, etc. corresponding to the sample image.

Optionally, the neural network may be a deep learning network, such as an AlexNet network, a VGG-16 network, a GoogLeNet network, or a deep residual learning network, which is not limited in the embodiments of the present application. In addition, the algorithm used to train the deep learning network may be the BP (back-propagation) algorithm, the Faster RCNN (Regions with Convolutional Neural Network) algorithm, or the like, which is likewise not limited in the embodiments of the present application.

The following takes training the deep learning network with the BP algorithm as an example to explain the training process of the image recognition model. First, the parameters of each layer in the deep learning network are initialized. Second, a sample image is input into the deep learning network to obtain the recognition result corresponding to the sample image. Then the recognition result is compared with the classification label to obtain the error between them. Finally, the parameters of each layer in the deep learning network are adjusted based on the error. These steps are repeated until the error between the recognition result and the classification label is less than a preset value, at which point the trained deep learning network, that is, the image recognition model, is obtained.
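The training loop described above (initialize, forward pass, compare with labels, adjust parameters, repeat until the error falls below a preset value) can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions: a single-layer logistic classifier over small feature vectors stands in for the deep learning network, the gradient is obtained by the chain rule as in back-propagation, and the names `predict`, `train`, `preset_error`, and the toy data are all hypothetical.

```python
import math
import random


def predict(weights, bias, features):
    """Forward pass: probability that the sample belongs to the positive class."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, preset_error=0.1, learning_rate=0.5, max_epochs=10000, seed=0):
    """Repeat forward pass, error computation, and parameter update until error < preset_error."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in samples[0]]  # step 1: initialize parameters
    bias = 0.0
    error = float("inf")
    for _ in range(max_epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        error = 0.0
        for features, label in zip(samples, labels):
            p = predict(weights, bias, features)  # step 2: recognition result
            p = min(max(p, 1e-12), 1.0 - 1e-12)   # keep log() finite
            # step 3: cross-entropy error between the result and the classification label
            error -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
            delta = p - label                     # gradient of the error w.r.t. the pre-activation
            grad_w = [g + delta * x for g, x in zip(grad_w, features)]
            grad_b += delta
        error /= len(samples)
        if error < preset_error:                  # stop once the error is below the preset value
            break
        # step 4: adjust parameters along the negative gradient
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias, error


# Toy, linearly separable data standing in for labeled sample images.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 1, 1, 1]
weights, bias, final_error = train(samples, labels)
```

A real image recognition model would have many layers and would propagate the error backwards through each of them; the stopping rule, however, has the same shape as in this sketch.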

Step 103: Generate a description sentence according to the recognition result.

The description sentence is used to describe the first image. The description sentence includes the recognition results respectively corresponding to at least one object. Optionally, the description sentence also includes other words, which can describe at least one of the following: the positional relationship between at least two objects, the action being performed by an object, the state of an object, and so on. Exemplarily, image recognition on the first image finds that the objects in the first image include a dog and grass and that the dog is running on the grass; the recognition result is input into a language description model, and the description sentence obtained for the first image is "dog running on the grass".

In some embodiments of the present application, the language description model includes an input layer, at least one convolutional layer (for example, three convolutional layers: a first, a second, and a third convolutional layer), at least one fully connected layer (for example, two fully connected layers: a first and a second fully connected layer), and an output layer. The input data of the input layer is the first image and the recognition results of the objects in the first image. The output of the output layer is the description sentence corresponding to the first image. The description sentence is generated as follows: the first image and the recognition results of the objects in the first image are input to the input layer of the language description model; the convolutional layers extract features of the input; the fully connected layers combine and abstract those features; and finally the output layer outputs the description sentence corresponding to the first image.

In the embodiments of the present application, the specific structures of the convolutional layer and the fully connected layer of the language description model are not limited. The language description model shown in the above embodiment is only exemplary and explanatory, and is not intended to limit the application. Generally speaking, the more layers of the convolutional neural network, the better the effect but the longer the calculation time. In practical applications, it is possible to design a convolutional neural network with an appropriate number of layers in accordance with the requirements for calculation accuracy and efficiency.

Optionally, in one example, step 103 includes the following sub-steps:

Step 103a, converting the recognition result into a first word vector;

Step 103b: Process the first word vector through the language description model to obtain a description sentence.

In the embodiment of the present application, the terminal converts the recognition result into a corresponding word vector through a word vector model, inputs the word vector into the language description model, and the language description model outputs the description sentence. A word vector is a vector that represents a word, and a word vector model is a model that converts words into word vectors. The word vector model may be a word2vec model.

In another example, the terminal may also obtain the association information of the first image. At this time, step 103 can also be implemented as:

1. Convert the recognition result into the first word vector;

2. Convert the related information into the second word vector;

In the embodiment of the present application, the associated information includes at least one of the following: location information, time information, and scene information. The location information indicates the geographic location where the first image was taken, for example, Shanghai, Beijing, or Canada. The time information indicates the time when the first image was acquired, for example, spring, summer, autumn, winter, early morning, or evening. The scene information indicates the scene corresponding to the first image, for example, a park, a beach, a shopping mall, or a school. The terminal can convert the associated information into a corresponding word vector through the word vector model.

3. Process the first word vector and the second word vector through the language description model to obtain a description sentence.

The terminal inputs the first word vector and the second word vector into the language description model, so that the final description sentence is more abundant.

Exemplarily, the following uses the associated information as location information as an example for description.

First, obtain the location information of the first image;

Second, convert the location information into a second word vector;

Third, the first word vector and the second word vector are processed through the language description model to obtain a description sentence.

The location information indicates the geographic location where the first image was taken. When the first image is an image collected by the terminal through the camera, the location information can be obtained by a positioning component in the terminal, for example, a GPS (Global Positioning System) component. In other possible implementations, the terminal may also obtain the location information of the first image by performing image recognition on the first image. For the method of converting the location information into a word vector, refer to step 103a; details are not repeated here. In the embodiment of the present application, generating the description sentence in combination with the geographic location where the first image was taken describes the first image more completely, so that the user can later search for the first image with several different keywords, which makes searching more convenient.

Exemplarily, image recognition on the first image finds that the objects in the first image include a dog and grass and that the dog is running on the grass; in addition, the geographic location where the first image was taken is XX Park. The description sentence corresponding to the first image is then "dog running on the grass in XX Park".
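The data flow of this example (recognition result to first word vectors, associated information to second word vectors, both fed to the language description model) can be traced with a toy Python sketch. Everything in it is an assumption made for illustration: `TOY_VECTORS` is a hand-written embedding table rather than a trained word2vec model, and `describe` is a fixed-template stand-in for the language description model, not a neural network.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]

# Hypothetical toy embedding table; a real system would use a trained word vector model.
TOY_VECTORS: Dict[str, Vector] = {
    "dog": (1.0, 0.0, 0.0, 0.0),
    "running": (0.0, 1.0, 0.0, 0.0),
    "grass": (0.0, 0.0, 1.0, 0.0),
    "park": (0.0, 0.0, 0.0, 1.0),
}


def to_word_vectors(words: Sequence[str]) -> List[Vector]:
    """Convert words (recognition results or associated information) into word vectors."""
    return [TOY_VECTORS[word] for word in words]


def decode(vector: Vector) -> str:
    """Return the vocabulary word whose toy vector is closest (squared Euclidean distance)."""
    return min(
        TOY_VECTORS,
        key=lambda word: sum((a - b) ** 2 for a, b in zip(TOY_VECTORS[word], vector)),
    )


def describe(first_vectors: Sequence[Vector], second_vectors: Sequence[Vector]) -> str:
    """Stand-in for the language description model: fills a fixed template from the vectors."""
    subject, action, surface = (decode(v) for v in first_vectors)
    sentence = f"{subject} {action} on the {surface}"
    if second_vectors:  # associated information is optional
        sentence += f" in the {decode(second_vectors[0])}"
    return sentence


first_vectors = to_word_vectors(["dog", "running", "grass"])  # from the recognition result
second_vectors = to_word_vectors(["park"])                    # from the location information
sentence = describe(first_vectors, second_vectors)
```

Passing an empty list as the second argument corresponds to the case without associated information and yields the shorter sentence.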

Step 104: Determine the description sentence as the index of the first image, and store the index in correspondence with the first image.

The terminal determines the description sentence as the index of the first image, and stores the index in correspondence with the first image. Subsequently, to search for the first image, the user only needs to input at least one word included in the description sentence, or a word matching a word in the description sentence (for example, a word whose similarity to a word in the description sentence is greater than a preset threshold). The terminal can then find the first image according to the input word and display it to the user.

In addition, the embodiment of the present application does not limit the path for storing the description sentence and the first image, which may be preset by the terminal or may be set by the user.
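One possible way to store the index in correspondence with the image is a small local table keyed by image path. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table name `image_index`, the column names, and the example path are hypothetical, and the application itself does not prescribe any storage format.

```python
import sqlite3
from typing import Optional


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store that keeps each image path next to its index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_index ("
        "image_path TEXT PRIMARY KEY, description TEXT NOT NULL)"
    )
    return conn


def store_index(conn: sqlite3.Connection, image_path: str, description: str) -> None:
    """Store the description sentence as the index of the image, replacing any earlier index."""
    conn.execute(
        "INSERT OR REPLACE INTO image_index (image_path, description) VALUES (?, ?)",
        (image_path, description),
    )
    conn.commit()


def load_index(conn: sqlite3.Connection, image_path: str) -> Optional[str]:
    """Return the index stored for the image, or None if the image has no index."""
    row = conn.execute(
        "SELECT description FROM image_index WHERE image_path = ?", (image_path,)
    ).fetchone()
    return row[0] if row else None


conn = open_index()
store_index(conn, "DCIM/IMG_0001.jpg", "dog running on the grass in XX Park")
```

Using the image path as the primary key keeps exactly one index per image, so storing a new sentence for the same image overwrites the old one.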

In summary, the technical solution provided by the embodiments of the present application obtains the recognition results corresponding to the objects included in an image, generates a description sentence describing the image according to the recognition results, and determines the description sentence as the index of the image. Later, when the user needs to search for the image, the user can input a word included in the index, or a word with a meaning similar to a word included in the index, and the terminal can accurately find the image according to the input word, which improves the efficiency of searching for images in the album.

In addition, by generating a description sentence for describing the image according to the recognition result of the image, and determining the description sentence as the index of the image, the generated index is accurate.

Please refer to FIG. 2, which shows a flowchart of an image index generation method provided by another embodiment of the present application. The method may include the following steps:

Step 201: Acquire a first image.

Step 202: Perform image recognition on the first image to obtain a recognition result corresponding to the first image.

Step 203: Generate a description sentence according to the recognition result.

Step 204: Display inquiry information.

In the embodiment of the present application, the inquiry information is used to ask whether to determine the description sentence as the index. Exemplarily, the inquiry information is "The description sentence corresponding to the image is 'watching a concert at the Bird's Nest'. Confirm?".

In the embodiment of the present application, the user can preview the description sentence generated by the language description model, and decide whether to determine the description sentence generated above as the index of the first image.

Step 205: When a confirmation instruction corresponding to the inquiry information is received, determine the description sentence as the index of the first image, and store the index in correspondence with the first image.

If the user agrees to use the generated description sentence as the index of the image, the user can issue a confirmation instruction in response to the inquiry information. The confirmation instruction corresponding to the inquiry information confirms that the generated description sentence is to be determined as the index of the image. Optionally, a confirmation control is displayed beside the inquiry information, and when the terminal receives a trigger signal acting on the confirmation control, the terminal receives the confirmation instruction corresponding to the inquiry information.

Step 206: When the confirmation instruction is not received, display an input box.

The input box is used to receive a description sentence for the first image input by the user. Optionally, when the terminal does not receive a trigger signal acting on the confirmation control within a preset time, the terminal determines that the confirmation instruction is not received. Optionally, a denial control is also displayed beside the inquiry information. When the terminal receives a trigger signal acting on the denial control, the terminal determines that the confirmation instruction is not received, and the terminal may then display the input box.

Step 207: Receive the sentence input in the input box.

In the embodiment of the present application, when the user is not satisfied with the generated description sentence, the user can input a description sentence for the first image.

Step 208: Determine the input sentence as the index of the first image, and store the index in correspondence with the first image.
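The branch between steps 205 and 206 to 208 can be summarized in a short Python sketch. It is a simplification: the confirmation instruction is reduced to a boolean, the input box to a callback, and the album to a dictionary; the names `choose_index`, `index_image`, and `read_user_sentence` are hypothetical.

```python
from typing import Callable, Dict


def choose_index(generated: str, confirmed: bool, read_user_sentence: Callable[[], str]) -> str:
    """Pick the index: the generated sentence if confirmed, otherwise the user's own sentence."""
    if confirmed:
        return generated         # step 205: confirmation instruction received
    return read_user_sentence()  # steps 206-207: show the input box and receive the sentence


def index_image(
    album: Dict[str, str],
    image: str,
    generated: str,
    confirmed: bool,
    read_user_sentence: Callable[[], str],
) -> None:
    """Store the chosen sentence as the index in correspondence with the image."""
    album[image] = choose_index(generated, confirmed, read_user_sentence)


album: Dict[str, str] = {}
index_image(album, "a.jpg", "dog running on the grass", True, lambda: "unused")
index_image(album, "b.jpg", "people in a building", False, lambda: "watching a concert at the Bird's Nest")
```

The callback is only invoked when confirmation is missing, which mirrors the input box being displayed only in that case.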

In summary, in the technical solution provided by the embodiments of the present application, the user decides whether to confirm the generated description sentence as the index of the image. If the user is not satisfied with the description sentence generated by the terminal, the user inputs a description sentence for the image, so that the image can later be searched for according to the sentence the user entered. This improves the accuracy of the index and further improves the final image indexing efficiency.

After the index of the first image is generated, the user can search for the first image in the album according to the index. The search process is described below.

In an optional embodiment based on the embodiment shown in FIG. 1 or FIG. 2, after step 104 or step 208, an embodiment of the present application further provides an image search method, as shown in FIG. 3. The image search method may include the following steps:

Step 301: Display a search box.

The search box is used for the user to input a search keyword, so that the terminal can find an image matching the search keyword. In a possible implementation, the search box is displayed on the main interface of the album application. In another possible implementation, the main interface of the album application program displays a search control. When the user triggers the search control, the terminal receives a trigger signal corresponding to the search control, and displays a search box according to the trigger signal. The embodiment of the present application does not limit the display manner of the search box.

Step 302: Receive the first keyword entered in the search box.

The first keyword is input by the user, and it may be "Forbidden City", "Cat", "Rose Flower", etc., which is not limited in this embodiment of the present application.

Step 303: Search the album for the second image that matches the first keyword.

The number of second images may be one, or multiple. The index corresponding to the second image is used to describe the second image. The index corresponding to the second image is a description sentence generated according to the recognition result of the second image. The index corresponding to the second image includes the first target keyword. The first target keyword may be a recognition result corresponding to the object included in the second image, or may be other words in the description sentence other than the recognition result, which is not limited in this embodiment of the present application. In this way, users can search the same image with different keywords, reducing the difficulty of searching for images.

Exemplarily, the first target keyword matches the first keyword, for example, the similarity between the first target keyword and the first keyword meets a preset condition. The preset condition may be that the similarity between the first target keyword and the first keyword is greater than a preset threshold, and the preset threshold may be set according to actual requirements, which is not limited in this embodiment of the present application.

Optionally, the terminal first calculates the similarity between the first keyword and the words included in each description sentence stored in the terminal, then determines the words whose similarity with the first keyword meets the preset condition as the first target keyword, and finally takes the image corresponding to the description sentence containing the first target keyword as the second image matching the first keyword.

In addition, the similarity between the first keyword and a word included in a description sentence can be calculated as follows: the terminal uses a word vector model to represent the first keyword as a first vector and the word included in the description sentence as a second vector, and then calculates the cosine distance between the first vector and the second vector. The greater the cosine distance, the lower the similarity between the first keyword and the word included in the description sentence; conversely, the smaller the cosine distance, the higher the similarity. The terminal may then determine the words whose cosine distance satisfies the preset condition as the first target keyword.
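The calculation described above may be sketched as follows. This is only an illustration under stated assumptions: the cosine distance is taken as one minus the cosine similarity, the three-dimensional vectors in `TOY_VECTORS` are invented stand-ins for the output of a word vector model, and the function names and the `max_distance` value are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumed convention for zero vectors
    return 1.0 - dot / (norm_a * norm_b)


# Invented toy vectors; a real terminal would query a trained word vector model.
TOY_VECTORS: Dict[str, Sequence[float]] = {
    "cat": (1.0, 0.0, 0.0),
    "kitten": (0.9, 0.1, 0.0),
    "rose": (0.0, 1.0, 0.0),
}


def find_target_keywords(keyword: str, sentence_words: List[str],
                         vectors: Dict[str, Sequence[float]],
                         max_distance: float = 0.2) -> List[str]:
    """Return the words whose cosine distance to the keyword is small enough."""
    if keyword not in vectors:
        return []
    first_vector = vectors[keyword]
    targets = []
    for word in sentence_words:
        second_vector = vectors.get(word)
        if second_vector is None:
            continue  # word unknown to the toy model is skipped
        if cosine_distance(first_vector, second_vector) <= max_distance:
            targets.append(word)
    return targets


targets = find_target_keywords("cat", ["a", "kitten", "beside", "a", "rose"],
                               TOY_VECTORS)
```

With these toy vectors, "kitten" lies close to "cat" and is selected as a target keyword, while "rose" is not.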

Step 304: Display the search results.

The terminal displays the search result on the search result page, and the search result includes the above-mentioned second image. When there are multiple second images, the terminal may sort the second images according to the similarity between the first target keyword and the first keyword. Optionally, the greater the similarity between the first target keyword and the first keyword, the higher the second image corresponding to the description sentence containing the first target keyword is ranked on the search result page; the smaller the similarity, the lower that second image is ranked.
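The optional ordering described above may be sketched as a descending sort on similarity. The representation of each match as an (image identifier, similarity) pair and the function name `rank_results` are assumptions made for illustration.

```python
from typing import List, Tuple


def rank_results(matches: List[Tuple[str, float]]) -> List[str]:
    """Order image identifiers so that higher similarity is displayed first."""
    ordered = sorted(matches, key=lambda match: match[1], reverse=True)
    return [image_id for image_id, _ in ordered]


ranked = rank_results([("img_a", 0.70), ("img_b", 0.95), ("img_c", 0.80)])
```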

In summary, the technical solution provided by the embodiments of the present application performs image search through the image index generated according to the above embodiments. The user only needs to input a word included in the index, or a word with a meaning similar to a word included in the index, and the terminal can accurately find the image according to the word entered by the user, which improves the efficiency of searching for images in the album.

When the user enters the first keyword and the terminal finds a large number of second images based on the first keyword, the user still has to pick out the desired image from among those second images, so the search efficiency remains relatively low.

Please refer to FIG. 4, which shows a flowchart of an image search method provided by another embodiment of the present application. The image search method can be used to solve the problem of low search efficiency when there are many second images searched according to the first keyword. The method includes the following steps:

Step 401: Display a search box.

Step 402: Receive the first keyword entered in the search box.

Step 403: Search the album for the second image that matches the first keyword.

Step 404: When the number of second images is greater than the preset number, display prompt information.

The preset number can be set according to actual needs, which is not limited in the embodiments of the present application. Exemplarily, the preset number is 10. The prompt information is used to prompt the input of the second keyword. Optionally, the second keyword is different from the first keyword.

In the embodiment of the present application, after finding the second images matching the first keyword, the terminal first detects whether the number of second images is greater than the preset number. If the number of second images is less than or equal to the preset number, the second images are displayed directly. If the number of second images is greater than the preset number, the user is prompted to enter a further keyword, so that the terminal can continue to filter, from the second images matching the first keyword, a third image that matches both the first keyword and the second keyword.
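The branch described above may be sketched as follows, using the exemplary preset number of 10. The function name `next_action` and the returned strings are hypothetical labels for the two outcomes.

```python
from typing import List

PRESET_NUMBER = 10  # exemplary value; configurable in practice


def next_action(second_images: List[str],
                preset_number: int = PRESET_NUMBER) -> str:
    """Decide whether to display results or prompt for a second keyword."""
    if len(second_images) > preset_number:
        return "prompt_for_second_keyword"
    # Less than or equal to the preset number: display directly.
    return "display_results"


few = next_action([f"img{i}" for i in range(10)])
many = next_action([f"img{i}" for i in range(11)])
```

Note that exactly 10 second images are still displayed directly; the prompt appears only when the count exceeds the preset number.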

Step 405: Obtain the second keyword.

The second keyword is also input by the user, which is different from the first keyword. Exemplarily, the above prompt information includes an input box for the user to input the second keyword, and the user can input the second keyword in the input box, so that the terminal obtains the second keyword.

Step 406: Search for a third image matching the second keyword in the second image.

The index corresponding to the third image includes the second target keyword. The second target keyword matches the second keyword. Exemplarily, the similarity between the second target keyword and the second keyword meets the second preset condition. The second preset condition may be that the similarity between the second target keyword and the second keyword is greater than a preset threshold, and the preset threshold may be set according to actual requirements, which is not limited in this embodiment of the present application.

In one example, the terminal first calculates the similarity between the first keyword and the words included in each description sentence stored by the terminal, and the similarity between the second keyword and the words included in each description sentence stored by the terminal. The terminal then determines the words whose similarity with the first keyword meets the first preset condition as the first target keyword, and the words whose similarity with the second keyword meets the second preset condition as the second target keyword. Finally, the image corresponding to the description sentence containing both the first target keyword and the second target keyword is taken as the third image, which matches both the first keyword and the second keyword. For the method of calculating the similarity between the second keyword and the words included in the description sentence, reference may be made to step 303, and details are not repeated here.

In another example, the terminal calculates the similarity between the second keyword and the words included in the description sentences corresponding to the second images, determines the words whose similarity with the second keyword meets the second preset condition as the second target keyword, and determines the images among the second images whose description sentences include the second target keyword as the third image.
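The second example above, in which only the second images are re-examined, may be sketched as follows. The similarity function is passed in as a parameter; `toy_similarity` with its single synonym pair is an invented stand-in for a word-vector-based similarity, and the names `refine_results` and `threshold` are assumptions.

```python
from typing import Callable, Dict, List


def refine_results(second_images: Dict[str, str], second_keyword: str,
                   similarity: Callable[[str, str], float],
                   threshold: float) -> List[str]:
    """Keep the second images whose description sentence contains a word
    sufficiently similar to the second keyword."""
    third_images = []
    for image_id, sentence in second_images.items():
        words = sentence.lower().split()
        if any(similarity(word, second_keyword) > threshold for word in words):
            third_images.append(image_id)
    return third_images


def toy_similarity(a: str, b: str) -> float:
    """Invented similarity: 1.0 for identical words, 0.9 for a known synonym."""
    synonyms = {frozenset(("sofa", "couch"))}
    if a == b:
        return 1.0
    if frozenset((a, b)) in synonyms:
        return 0.9
    return 0.0


second_images = {
    "img1": "a cat sleeping on a sofa",
    "img2": "a cat in the garden",
}
third_images = refine_results(second_images, "couch", toy_similarity, 0.8)
```

Because only the already matched second images are scanned, this variant avoids recomputing similarities for the whole album.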

Step 407: Display the search results.

In the embodiment of the present application, the search result includes the above-mentioned third image.

In summary, the technical solution provided by the embodiments of the present application can prompt the user to input more keywords when there are too many search results, so that the terminal can perform the image search based on the keywords entered in both rounds, thereby improving the accuracy of the image search.

It is mentioned in the embodiment of FIG. 1 that the language description model is pre-trained, and is a model for encoding at least two words into a complete sentence. The following describes the training process of the language description model.

Step 501: Obtain a training sample set.

The training sample set includes multiple sample images, and each sample image corresponds to a recognition result and an expected description sentence. The recognition result corresponding to a sample image can be annotated manually or obtained through the image recognition model. The expected description sentence may be annotated manually.

Step 502: For the sample image, process the recognition result through the language description model, and output the actual description sentence.

The language description model may be a deep learning network, such as an AlexNet network, a VGG-16 network, a GoogLeNet network, or a deep residual learning (Deep Residual Learning) network. The parameters of the language description model are initialized before training. Optionally, the parameters may be set randomly, or may be set by relevant technical personnel based on experience. In the embodiment of the present application, the recognition result corresponding to each sample image is input into the language description model, and the language description model outputs an actual description sentence.

Step 503: Calculate the error between the actual description sentence and the expected description sentence.

Optionally, the terminal determines the distance between the actual description sentence and the expected description sentence as an error.

After calculating the error between the actual description sentence and the expected description sentence, the terminal detects whether the error is greater than a preset threshold. If the error is greater than the preset threshold, the parameters of the language description model are adjusted, and execution resumes from the step of processing the recognition result through the language description model for each sample image and outputting the actual description sentence; that is, steps 502 and 503 are repeated. When the error is less than or equal to the preset threshold, the training is stopped, and the trained language description model is obtained.
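The control flow of steps 502 and 503 (output, measure the error, adjust, repeat until the error is at or below the threshold) may be illustrated with a deliberately simplified stand-in. The sketch below does not implement a language description model: it fits a single scalar parameter by gradient descent, uses the mean squared distance as the error, and adds a `max_rounds` guard. All of these choices and names are assumptions made for illustration.

```python
from typing import List, Tuple


def train_until_threshold(samples: List[Tuple[float, float]], threshold: float,
                          learning_rate: float = 0.1,
                          max_rounds: int = 10_000) -> Tuple[float, float]:
    """Toy training loop: adjust parameter w until the error <= threshold."""
    w = 0.0  # initialized parameter of the stand-in model
    for _ in range(max_rounds):
        # Analogue of step 502: produce the actual output for every sample.
        actual = [w * x for x, _ in samples]
        # Analogue of step 503: error as the mean squared distance between
        # actual and expected outputs.
        error = sum((a - y) ** 2
                    for a, (_, y) in zip(actual, samples)) / len(samples)
        if error <= threshold:
            return w, error  # training stops, model is considered trained
        # Error above the threshold: adjust the parameter and repeat.
        gradient = sum(2 * (a - y) * x
                       for a, (x, y) in zip(actual, samples)) / len(samples)
        w -= learning_rate * gradient
    raise RuntimeError("error did not reach the threshold")  # assumed guard


w, error = train_until_threshold([(1.0, 2.0), (2.0, 4.0)], threshold=1e-6)
```

In a real system the scalar update would be replaced by the parameter adjustment of the chosen network, and the error by a distance between the actual and expected description sentences.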

The following is an embodiment of the device of the present application, which can be used to execute the method embodiment of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Please refer to FIG. 5, which shows a block diagram of an image index generation device provided by an embodiment of the present application. The device has the function of implementing the above method, and the function can be realized by hardware, or can be realized by hardware executing corresponding software. The device may be a terminal or may be provided on the terminal. The device includes:

The image acquisition module 601 is used to acquire the first image.

The image recognition module 602 is configured to perform image recognition on the first image to obtain a recognition result corresponding to the first image.

The sentence generating module 603 is configured to generate a description sentence according to the recognition result, and the description sentence is used to describe the first image.

The index generation module 604 is configured to determine the description sentence as an index of the first image, and store the index corresponding to the first image.
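The cooperation of modules 601 to 604 may be sketched as a small class in which recognition and sentence generation are injected as functions. The class name `ImageIndexGenerator`, the method `add_image`, and the stub functions are hypothetical; the stubs merely stand in for the image recognition model and the language description model.

```python
from typing import Callable, Dict, List


class ImageIndexGenerator:
    """Illustrative wiring of the acquisition, recognition, sentence
    generation and index generation modules."""

    def __init__(self, recognize: Callable[[bytes], List[str]],
                 describe: Callable[[List[str]], str]) -> None:
        self.recognize = recognize      # role of image recognition module 602
        self.describe = describe        # role of sentence generating module 603
        self.index: Dict[str, str] = {}  # image identifier -> description sentence

    def add_image(self, image_id: str, image: bytes) -> str:
        """Acquire an image, build its description sentence, store the index."""
        recognition_result = self.recognize(image)
        sentence = self.describe(recognition_result)
        self.index[image_id] = sentence  # role of index generation module 604
        return sentence


generator = ImageIndexGenerator(
    recognize=lambda image: ["cat", "sofa"],               # stub recognizer
    describe=lambda labels: "a " + " on a ".join(labels),  # stub sentence builder
)
sentence = generator.add_image("IMG_0001", b"raw image bytes")
```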

In an optional embodiment provided based on the embodiment shown in FIG. 5, the sentence generation module 603 is used to:

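Convert the recognition result into a first word vector;

Process the first word vector through a language description model to obtain the description sentence.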

Optionally, the device further includes: an information acquisition module (not shown in the figure).

The information acquisition module is used to acquire the associated information of the first image, the associated information includes at least one of the following: location information, time information, scene information;

The sentence generation module 603 is used to:

Convert the recognition result into a first word vector;

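Convert the associated information into a second word vector;

Process the first word vector and the second word vector through a language description model to obtain the description sentence.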

In an optional embodiment provided based on the embodiment shown in FIG. 5, the device further includes: an information display module (not shown in the figure).

The information display module is used to display inquiry information, and the inquiry information is used to inquire whether to determine the description sentence as the index;

The index generation module 604 is further configured to, when receiving the confirmation instruction corresponding to the inquiry information, execute the step of determining the description sentence as the index of the first image and storing the index in correspondence with the first image.

Optionally, the device further includes an input box display module and a sentence receiving module (not shown in the figure).

The input box display module is used to display the input box when the confirmation instruction is not received;

A sentence receiving module, configured to receive a sentence input in the input box;

The index generation module 604 is further configured to determine the input sentence as an index of the first image, and store the index in correspondence with the first image.

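In an optional embodiment provided based on the embodiment shown in FIG. 5, the image recognition module 602 is configured to:

Perform image recognition on the first image through an image recognition model to obtain recognition results respectively corresponding to at least one object in the first image;

Wherein, the image recognition model is a neural network model trained by using multiple sample images, and the object in each sample image of the multiple sample images corresponds to a classification label.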

Optionally, the device further includes: a sample set acquisition module, a sentence output module, an error calculation module, and a model training module (not shown in the figure).

A sample set acquisition module, for acquiring a training sample set, the training sample set including a plurality of sample images, each sample image corresponding to a recognition result and an expected description sentence;

The sentence output module is used to process the recognition result through the language description model for the sample image and output the actual description sentence;

An error calculation module, used to calculate the error between the actual description sentence and the expected description sentence;

The model training module is used to adjust the parameters of the language description model when the error is greater than a preset threshold, and to resume execution from the step of processing the recognition result through the language description model for each sample image and outputting the actual description sentence; when the error is less than or equal to the preset threshold, the training is stopped and the trained language description model is obtained, the language description model being used to generate the description sentence according to the recognition result.

Please refer to FIG. 6, which shows a block diagram of an image search apparatus provided by an embodiment of the present application. The device has the function of implementing the above method, and the function can be realized by hardware, or can be realized by hardware executing corresponding software. The device may be a terminal or may be provided on the terminal. The device includes:

The search box display module 710 is used to display the search box.

The keyword receiving module 720 is configured to receive the first keyword input in the search box.

The image search module 730 is configured to search an album for a second image matching the first keyword, wherein the index corresponding to the second image includes a first target keyword, the first target keyword matches the first keyword, and the index corresponding to the second image is a description sentence generated according to the recognition result of the second image.

The result display module 740 is configured to display search results, and the search results include the second image.

Optionally, the device further includes: an information display module and a keyword acquisition module (not shown in the figure).

The information display module is configured to display prompt information when the number of the second images is greater than a preset number, and the prompt information is used to prompt the input of the second keyword.

The keyword acquisition module is used to acquire the second keyword.

The image search module is further configured to search for a third image matching the second keyword in the second image, and an index corresponding to the third image includes a second target keyword, the second The target keyword matches the second keyword;

Wherein, the search result includes the third image.

It should be noted that, when the device provided in the above embodiments implements its functions, the division into the above functional modules is used only as an example. In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiments and the method embodiments provided above belong to the same concept. For the specific implementation process, see the method embodiments; details are not repeated here.

Referring to FIG. 7, it shows a structural block diagram of a terminal provided by an exemplary embodiment of the present application. The terminal in this application may include one or more of the following components: a processor 610 and a memory 620.

The processor 610 may include one or more processing cores. The processor 610 connects the various parts of the entire terminal by using various interfaces and lines, and performs the various functions of the terminal and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 620 and by calling data stored in the memory 620. Optionally, the processor 610 may be implemented in at least one hardware form among digital signal processing (Digital Signal Processing, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), and a programmable logic array (Programmable Logic Array, PLA). The processor 610 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU) and a modem. The CPU mainly handles the operating system, application programs, and the like; the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 610 and may instead be implemented by a separate chip.

Optionally, when the processor 610 executes the program instructions in the memory 620, the image index generation method or the image search method provided by the foregoing method embodiments are implemented.

The memory 620 may include random access memory (Random Access Memory, RAM) or read-only memory (Read-Only Memory, ROM). Optionally, the memory 620 includes a non-transitory computer-readable storage medium. The memory 620 may be used to store instructions, programs, codes, code sets, or instruction sets. The memory 620 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for at least one function, instructions for implementing the various method embodiments described above, and the like; the data storage area may store data created according to the use of the terminal.

The structure of the above terminal is only schematic. In actual implementation, the terminal may include more or fewer components, such as a display screen, etc., which is not limited in this embodiment.

A person skilled in the art may understand that the structure shown in FIG. 7 does not constitute a limitation on the terminal 600, which may include more or fewer components than illustrated, combine certain components, or adopt a different arrangement of components.

An exemplary embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, and the computer program, when loaded and executed by a processor, implements the image index generation method or the image search method provided by the above method embodiments.

An exemplary embodiment of the present application also provides a computer program product containing instructions, which when executed on a computer, causes the computer to execute the image index generation method or the image search method described in the above embodiments.

It should be understood that "plurality" as used herein refers to two or more. "And/or" describes an association between related objects and indicates that three relationships are possible. For example, "A and/or B" can indicate three cases: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates an "or" relationship between the related objects.

A person of ordinary skill in the art may understand that all or part of the steps for implementing the above-described embodiments may be completed by hardware, or may be completed by a program instructing related hardware. The program may be stored in a computer-readable storage medium. The mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above are only optional embodiments of this application and are not intended to limit this application. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims

An image index generation method, characterized in that the method includes:

Acquiring a first image;

Performing image recognition on the first image to obtain a recognition result corresponding to the first image;

Generating a description sentence according to the recognition result, where the description sentence is used to describe the first image;

The description sentence is determined as an index of the first image, and the index is stored in correspondence with the first image.
The method according to claim 1, wherein the generating a description sentence according to the recognition result comprises:

Convert the recognition result into a first word vector;

The first word vector is processed through a language description model to obtain the description sentence.
The method according to claim 1, wherein after acquiring the first image, the method further comprises:

Acquiring associated information of the first image, the associated information including at least one of the following: location information, time information, and scene information;

The generating a description sentence based on the recognition result includes:

Convert the recognition result into a first word vector;

Convert the related information into a second word vector;

The first word vector and the second word vector are processed through a language description model to obtain the description sentence.
The method according to claim 1, wherein before determining the description sentence as an index of the first image and storing the index corresponding to the first image, further comprising:

Displaying inquiry information, the inquiry information is used to inquire whether to determine the description sentence as the index;

When receiving the confirmation instruction corresponding to the inquiry information, performing the step of determining the description sentence as an index of the first image, and storing the index in correspondence with the first image.
The method according to claim 4, wherein after displaying the inquiry information, the method further comprises:

When the confirmation instruction is not received, an input box is displayed;

Receiving the sentence input in the input box;

Determining the input sentence as an index of the first image, and storing the index corresponding to the first image.
The method according to any one of claims 1 to 5, wherein the performing image recognition on the first image to obtain a recognition result corresponding to the first image includes:

Performing image recognition on the first image through an image recognition model to obtain recognition results corresponding to at least one object in the first image respectively;

Wherein, the image recognition model is a neural network model trained by using multiple sample images, and the object in each sample image of the multiple sample images corresponds to a classification label.
The method according to any one of claims 1 to 5, wherein before generating the description sentence based on the recognition result, the method further comprises:

Acquiring a training sample set, the training sample set including a plurality of sample images, each sample image corresponding to a recognition result and an expected description sentence;

For the sample image, the recognition result is processed through a language description model, and the actual description sentence is output;

Calculating the error between the actual description sentence and the expected description sentence;

When the error is greater than a preset threshold, adjusting the parameters of the language description model, and resuming execution from the step of processing the recognition result through the language description model for the sample image and outputting the actual description sentence; when the error is less than or equal to the preset threshold, stopping the training to obtain the trained language description model, the language description model being used to generate the description sentence according to the recognition result.
An image search method, characterized in that the method includes:

Display the search box;

Receiving the first keyword entered in the search box;

Searching the album for a second image matching the first keyword, wherein the index corresponding to the second image includes a first target keyword, the first target keyword matches the first keyword, and the index corresponding to the second image is a description sentence generated according to the recognition result of the second image;

A search result is displayed, the search result including the second image.
The method according to claim 8, wherein before displaying the search results, the method further comprises:

When the number of the second images is greater than the preset number, prompt information is displayed, and the prompt information is used to prompt the input of the second keyword;

Acquiring the second keyword;

Searching for a third image matching the second keyword in the second image, wherein an index corresponding to the third image includes a second target keyword, and the second target keyword matches the second keyword;

Wherein, the search result includes the third image.
An image index generating device, characterized in that the device includes:

The image acquisition module is used to acquire the first image;

An image recognition module, configured to perform image recognition on the first image to obtain a recognition result corresponding to the first image;

A sentence generating module, configured to generate a description sentence according to the recognition result, and the description sentence is used to describe the first image;

The index generation module is configured to determine the description sentence as an index of the first image, and store the index corresponding to the first image.
The apparatus according to claim 10, wherein the sentence generation module is configured to:

Convert the recognition result into a first word vector;

The first word vector is processed through a language description model to obtain the description sentence.
The device of claim 10, wherein the device further comprises:

The information acquisition module is used to acquire the associated information of the first image, the associated information includes at least one of the following: location information, time information, scene information;

The sentence generation module is used to:

Convert the recognition result into a first word vector;

Convert the related information into a second word vector;

The first word vector and the second word vector are processed through a language description model to obtain the description sentence.
The device of claim 10, wherein the device further comprises:

The information display module is used to display inquiry information, and the inquiry information is used to inquire whether to determine the description sentence as the index;

The index generation module is further configured to, when receiving the confirmation instruction corresponding to the inquiry information, execute the step of determining the description sentence as the index of the first image and storing the index in correspondence with the first image.
The device according to claim 13, wherein the device further comprises:

The input box display module is used to display the input box when the confirmation instruction is not received;

A sentence receiving module, configured to receive a sentence input in the input box;

The index generation module is further configured to determine the input sentence as an index of the first image, and store the index corresponding to the first image.
The device according to any one of claims 10 to 14, wherein the image recognition module is configured to:

Performing image recognition on the first image through an image recognition model to obtain recognition results corresponding to at least one object in the first image respectively;

Wherein, the image recognition model is a neural network model trained by using multiple sample images, and the object in each sample image of the multiple sample images corresponds to a classification label.
The device according to any one of claims 10 to 14, wherein the device further comprises:

A sample set acquisition module, for acquiring a training sample set, the training sample set including a plurality of sample images, each sample image corresponding to a recognition result and an expected description sentence;

The sentence output module is used to process the recognition result through the language description model for the sample image and output the actual description sentence;

An error calculation module, used to calculate the error between the actual description sentence and the expected description sentence;

The model training module is used to adjust the parameters of the language description model when the error is greater than a preset threshold, and to resume execution from the step of processing the recognition result through the language description model for each sample image and outputting the actual description sentence; when the error is less than or equal to the preset threshold, the training is stopped and the trained language description model is obtained, the language description model being used to generate the description sentence according to the recognition result.
An image search device, characterized in that the device includes:

Search box display module, used to display the search box;

A keyword receiving module, configured to receive the first keyword input in the search box;

An image search module, used to search an album for a second image matching the first keyword, wherein an index corresponding to the second image includes a first target keyword, the first target keyword matches the first keyword, and the index corresponding to the second image is a description sentence generated according to the recognition result of the second image;

The result display module is used to display search results, and the search results include the second image.
The device of claim 17, wherein the device further comprises:

An information display module, configured to display prompt information when the number of the second images is greater than a preset number, and the prompt information is used to prompt input of a second keyword;

A keyword acquisition module for acquiring the second keyword;

The image search module is further configured to search for a third image matching the second keyword in the second image, and an index corresponding to the third image includes a second target keyword, the second The target keyword matches the second keyword;

Wherein, the search result includes the third image.
A terminal, characterized in that the terminal includes a processor and a memory, the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the image index generation method according to any one of claims 1 to 7, or to implement the image search method according to any one of claims 8 to 9.
A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the image index generation method according to any one of claims 1 to 7, or to implement the image search method according to any one of claims 8 to 9.