CN112084359A - Picture retrieval method and device and electronic equipment - Google Patents


Info

Publication number
CN112084359A
Authority
CN
China
Prior art keywords
pictures
picture
named
matching degree
named entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010988787.2A
Other languages
Chinese (zh)
Inventor
李琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202010988787.2A
Publication of CN112084359A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53 Querying
    • G06F 16/538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Library & Information Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a picture retrieval method, a picture retrieval device and electronic equipment, belonging to the field of communication technology. The method comprises the following steps: acquiring first voice information input by a user; converting the first voice information into a first text sequence; identifying P first named entities in the first text sequence; determining N first comprehensive matching degrees of N pictures with the P first named entities; and displaying, based on the N first comprehensive matching degrees, M target pictures matched with the first voice information. The user does not need to browse through a large number of pictures or type in search terms: the desired pictures can be retrieved by voice input alone, quickly, conveniently and accurately, which improves picture retrieval efficiency.

Description

Picture retrieval method and device and electronic equipment
Technical Field
The application belongs to the technical field of communication, and particularly relates to a picture retrieval method and device and electronic equipment.
Background
With the continuous expansion of electronic device storage and the continuous improvement of their shooting capabilities, users increasingly rely on electronic devices for picture storage and management. As a result, large numbers of pictures accumulate on the device, coming from shooting, screen capture, saving from applications or web browsers, and so on.
In the process of implementing the present application, the inventor found at least the following problem in the prior art: to search for pictures stored on the electronic device, a user can manually input a single search condition, for example "2019.5" to find pictures shot in May 2019, or "kitten" to find photos of kittens. However, if the user needs a kitten photo taken in May 2019 and inputs the compound condition "kitten photo taken in May 2019", matching pictures cannot be found even when they are stored on the device, and the user has to search manually among a large number of pictures. Picture retrieval efficiency is therefore low.
Disclosure of Invention
The embodiment of the application aims to provide a picture retrieval method, a picture retrieval device and electronic equipment, which can solve the problem of low efficiency of the existing picture retrieval.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a picture retrieval method, where the method includes:
acquiring first voice information input by a user;
converting the first voice information into a first text sequence;
identifying P first named entities in the first text sequence;
determining N first comprehensive matching degrees of the N pictures and the P first named entities;
displaying M target pictures matched with the first voice information based on the N first comprehensive matching degrees;
wherein N, M and P are positive integers, and M is less than or equal to N; and the first comprehensive matching degree of each of the M target pictures is greater than a preset matching degree threshold value.
In a second aspect, an embodiment of the present application provides a picture retrieval apparatus, including:
the acquisition module is used for acquiring first voice information input by a user;
the conversion module is used for converting the first voice information into a first text sequence;
the identification module is used for identifying P first named entities in the first text sequence;
the first determination module is used for determining N first comprehensive matching degrees of the N pictures and the P first named entities;
the first display module is used for displaying M target pictures matched with the first voice information based on the N first comprehensive matching degrees;
wherein N, M and P are positive integers, and M is less than or equal to N; and the first comprehensive matching degree of each of the M target pictures is greater than a preset matching degree threshold value.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiment of the application, first voice information input by a user is acquired and converted into a first text sequence, P first named entities in the first text sequence are identified, N first comprehensive matching degrees of N pictures with the P first named entities are determined, and M target pictures matched with the first voice information are displayed based on the N first comprehensive matching degrees. The user does not need to browse through a large number of pictures or type in search terms: the desired pictures can be retrieved by voice input alone, quickly, conveniently and accurately, which improves picture retrieval efficiency.
Drawings
Fig. 1 is a flowchart illustrating steps of a method for retrieving pictures provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a NER model identifying named entities in a text sequence according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a target picture display interface according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an image retrieval apparatus according to an embodiment of the present application;
fig. 5 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present disclosure;
fig. 6 is a schematic hardware structure diagram of another electronic device for implementing the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It is to be understood that the data so used are interchangeable under appropriate circumstances, so that the embodiments of the application can operate in sequences other than those illustrated or described herein. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the preceding and succeeding objects.
The following describes the image retrieval method provided by the embodiment of the present application in detail through specific embodiments and application scenarios thereof with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a picture retrieval method provided in an embodiment of the present application, where the method may include the following steps:
step 101, acquiring first voice information input by a user.
Step 102, converting the first voice information into a first text sequence.
For steps 101 and 102: after the user opens the gallery application, the user may press and hold the "Hold to speak" button displayed on the interface to perform voice input. For example, if the user wants to find photos of a puppy with roses taken in a certain province last year, the user may hold the button and input the first voice information "photos of a puppy and roses taken in Guangdong Province last year", then release the button after finishing speaking to complete the input. After the electronic device acquires the first voice information, it converts the first voice information into a first text sequence.
The first voice information is converted into the first text sequence using Automatic Speech Recognition (ASR) technology: the first voice information is input into a trained ASR model, which converts it into the first text sequence. It should be noted that the ASR model can additionally be trained with hot words common in the field of picture browsing and retrieval, which improves the accuracy with which the ASR model recognizes voice information and therefore the accuracy of the conversion from voice information to text sequence.
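As an illustration only, the following is a minimal sketch of this conversion step using the off-the-shelf SpeechRecognition package as a stand-in; the patent assumes a dedicated ASR model fine-tuned with gallery-domain hot words, so the library, audio file name and language code here are assumptions, not the patent's implementation.

```python
# Hedged stand-in for steps 101-102 (not the patent's own ASR model):
# transcribe a recorded voice query into the first text sequence.
import speech_recognition as sr  # third-party "SpeechRecognition" package

recognizer = sr.Recognizer()
with sr.AudioFile("voice_query.wav") as source:  # assumed recording of the query
    audio = recognizer.record(source)

# A production system would use an on-device, hot-word-tuned ASR model instead.
first_text_sequence = recognizer.recognize_google(audio, language="zh-CN")
print(first_text_sequence)
```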
Step 103, identifying P first named entities in the first text sequence.
Wherein P is a positive integer. Named entities of value to retrieval are identified in the first text sequence. Named entities generally refer to entities in text that have a specific meaning or a strong reference, typically including times, place names, organization names, proper nouns and the like; here they are time named entities, geographic-location named entities and semantic named entities. In this embodiment, each first named entity in the first text sequence is identified by labeling each character of the sequence. B (Begin) denotes the starting position of a named entity, I (Inside) denotes an internal position (here also the ending position) of a named entity, and O (Outside) denotes a character that does not belong to any named entity. To further distinguish named entity types (e.g., time named entities, geographic-location named entities and other named entities), Bt is defined as the starting position of a time named entity, It as an internal position of a time named entity, Bp as the starting position of a geographic-location named entity, Ip as an internal position of a geographic-location named entity, Bo as the starting position of an other named entity (Object), and Io as an internal position of an other named entity.
In this embodiment, the BIO tag of each character in the first text sequence may be predicted by a trained Named Entity Recognition (NER) model, so as to identify the P first named entities in the first text sequence. For example, referring to fig. 2, fig. 2 is a schematic diagram of an NER model identifying named entities in a text sequence according to an embodiment of the present application. The NER model comprises an Embedding layer, a bidirectional Long Short-Term Memory network (BiLSTM) model and a Conditional Random Field (CRF) model. The Embedding layer converts each character in the first text sequence (for example, "certificate taken recently") into a word vector and inputs it into the BiLSTM model; the BiLSTM model extracts features from the word vectors to obtain feature vectors; and the CRF model labels the feature vectors, e.g., labeling "recently" as Bt and It, "taken" as O, and "certificate" as Bo and Io (one tag per character of the original text). From the labeling result, "recently" is identified as a time named entity and "certificate" as an other named entity; that is, the two named entities identified are "recently" and "certificate".
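For concreteness, below is a minimal, hedged sketch in PyTorch of the Embedding + BiLSTM tagging stack described above. The CRF layer is replaced by a per-character argmax for brevity, and the tag set, vocabulary size and layer dimensions are illustrative assumptions rather than values from the patent.

```python
# Minimal Embedding + BiLSTM character tagger (CRF omitted; argmax instead).
# Tag set, vocabulary size and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

TAGS = ["O", "Bt", "It", "Bp", "Ip", "Bo", "Io"]  # BIO tags per entity type

class BiLstmTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, len(TAGS))  # per-character tag scores

    def forward(self, char_ids):          # char_ids: (batch, seq_len)
        x = self.embed(char_ids)          # (batch, seq_len, embed_dim)
        h, _ = self.bilstm(x)             # (batch, seq_len, 2 * hidden_dim)
        return self.out(h)                # (batch, seq_len, len(TAGS))

model = BiLstmTagger(vocab_size=5000)
chars = torch.randint(0, 5000, (1, 6))    # e.g. the 6 characters of the query
pred = model(chars).argmax(-1)            # untrained here, so tags are random
print([TAGS[t] for t in pred[0].tolist()])
```

A trained CRF layer would replace the argmax to enforce valid BIO transitions, e.g. that It can only follow Bt or It.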
And step 104, determining N first comprehensive matching degrees of the N pictures and the P first named entities.
Wherein N and M are positive integers, and M is less than or equal to N. Determining the N first comprehensive matching degrees of the N pictures with the P first named entities may be implemented as follows:
determining a first individual matching degree of the N pictures with each of the P first named entities;
and determining N first comprehensive matching degrees of the N pictures and the P first named entities according to the first individual matching degrees of the N pictures.
Determining the first individual matching degree of the N pictures with each of the P first named entities may be implemented as follows:
under the condition that the ith first named entity in the P first named entities is a time named entity, acquiring the ith time period of the ith first named entity and the creation time of N pictures;
determining the first individual matching degree of each picture in the N pictures and the ith first named entity according to the ith time period and the creation time of the N pictures;
wherein i is a positive integer, i is not more than P, and the ith first named entity is any one of the P first named entities; the ith time period is a preset time period corresponding to the ith first named entity or a time period determined based on the entity content of the ith first named entity.
For example, the first voice information is "certificate taken recently"; in step 103, two named entities are identified, i.e., P equals 2, and the 1st first named entity of the P named entities is a time named entity. A preset time period corresponding to this time named entity is obtained, for example the period extending at most 3 months back from the current time. If the current time is 2020.05.01 0:00:00, the preset time period corresponding to the time named entity is 2020.02.01 0:00:00 to 2020.04.30 23:59:59. If instead the first voice information is "photos of a puppy and roses taken in Guangdong Province last year" and the system year of the electronic device is 2020, the entity content of the time named entity is "last year", and the time period of the time named entity "last year" may be determined as 2019.01.01 0:00:00 to 2019.12.31 23:59:59.
Taking the time period of the time named entity "last year", 2019.01.01 0:00:00 to 2019.12.31 23:59:59, as an example, the starting time (2019.01.01 0:00:00) of the period can be denoted t1 and the ending time (2019.12.31 23:59:59) t2; the time period of the time named entity is then the interval greater than or equal to t1 and less than or equal to t2. All pictures are traversed to obtain the creation time t of each picture (for a picture obtained by screenshot, the creation time is the time the screenshot was taken; for a picture shot by the user, the creation time is the time the photo was taken). If the creation time t of a picture satisfies t >= t1 (2019.01.01 0:00:00) and t <= t2 (2019.12.31 23:59:59), the matching degree of the picture is recorded as 1.0; otherwise, the minimum value Δt (in seconds) of the distance between t and t1 and the distance between t and t2 is taken, i.e., Δt = min(|t - t1|, |t - t2|), and the individual matching degree of the picture with the time named entity is calculated by the formula s_t = e^(-α·Δt), where α is a hyper-parameter set according to actual service requirements; a larger α represents a larger penalty coefficient. By traversing all pictures, the individual matching degree of each picture with the time named entity is output.
As another example, for the time named entity "yesterday", the corresponding time period may be normalized to the interval greater than or equal to t1 and less than or equal to t2; if the current system time of the electronic device is 2019.02.02, then t1 is 2019.02.01 00:00:00 and t2 is 2019.02.01 23:59:59. It is then checked whether the creation time of each picture lies in this period: if it does, the highest score 1.0 is returned directly; otherwise, the shorter of the distances from the creation time of the picture to t1 and to t2 is computed, and the individual matching degree of each picture with the time named entity is calculated with the formula s_t = e^(-α·Δt).
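A minimal sketch of the time matching rule above, assuming an illustrative α; the datetimes reproduce the "last year" example.

```python
# Time matching: score 1.0 inside [t1, t2], otherwise s_t = e^(-α·Δt).
import math
from datetime import datetime

def time_match(created: datetime, t1: datetime, t2: datetime,
               alpha: float = 1e-6) -> float:  # alpha is an assumed value
    if t1 <= created <= t2:                    # creation time inside the period
        return 1.0
    dt = min(abs((created - t1).total_seconds()),
             abs((created - t2).total_seconds()))  # Δt = min(|t-t1|, |t-t2|)
    return math.exp(-alpha * dt)

# "last year" with system year 2020: 2019.01.01 0:00:00 .. 2019.12.31 23:59:59
t1, t2 = datetime(2019, 1, 1), datetime(2019, 12, 31, 23, 59, 59)
print(time_match(datetime(2019, 6, 5, 14, 30), t1, t2))  # inside -> 1.0
print(time_match(datetime(2020, 1, 3), t1, t2))          # 2 days out -> < 1.0
```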
Optionally, determining the first individual matching degree of the N pictures with each of the P first named entities may be implemented by:
under the condition that the jth first named entity in the P first named entities is a geographical position named entity, acquiring the jth first longitude and latitude corresponding to the jth first named entity and the second longitude and latitude of the shooting place of the N pictures;
determining the first individual matching degree of each picture in the N pictures with the jth first named entity according to the jth first longitude and latitude and the second longitudes and latitudes of the shooting places of the N pictures;
wherein j is a positive integer, j is not more than P, and the jth first named entity is any one of the P first named entities.
The jth first longitude and latitude corresponding to the jth first named entity is the longitude and latitude of the geographic position represented by the jth first named entity. The longitude and latitude range of the geographic-location named entity can be queried by calling an Application Programming Interface (API), and the longitude and latitude L0 corresponding to the geographic-location named entity is determined from that range. For example, if the first voice information is "photos of a puppy and roses taken in Guangzhou last year", the time named entity "last year", the geographic-location named entity "Guangzhou", the semantic named entity 1 "dog" and the semantic named entity 2 "rose" are identified in step 103, and the 2nd of the 4 first named entities, "Guangzhou", is the geographic-location named entity; in this case j equals 2, and the first longitude and latitude corresponding to the geographic-location named entity may be the midpoint of the longitude and latitude range of Guangzhou. For example, Guangzhou spans east longitude 112°57′ to 114°3′ and north latitude 22°26′ to 23°56′; the midpoint of this range, east longitude 113°30′ and north latitude 23°11′, is taken as the first longitude and latitude L0 corresponding to "Guangzhou". All pictures are traversed to obtain the second longitude and latitude L of the shooting place of each picture, the actual physical straight-line distance d (in meters) between L and L0 is calculated for each picture, and the first individual matching degree of each picture with the geographic-location named entity is calculated by the formula s_p = e^(-β·d), where β is a hyper-parameter set according to actual service requirements; a larger β represents a larger penalty coefficient.
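A minimal sketch of the location matching rule, assuming the straight-line distance is approximated with the haversine formula (the patent only specifies an actual physical straight-line distance d) and an illustrative β.

```python
# Location matching: s_p = e^(-β·d), with d approximated by haversine (assumed).
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geo_match(pic_lat, pic_lon, l0_lat, l0_lon, beta=1e-5):  # beta is assumed
    d = haversine_m(pic_lat, pic_lon, l0_lat, l0_lon)
    return math.exp(-beta * d)

# L0 for "Guangzhou": midpoint 23°11'N, 113°30'E from the example above
print(geo_match(22.54, 114.06, 23.1833, 113.5))  # a picture shot in Shenzhen
```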
Optionally, determining the first individual matching degree of the N pictures with each of the P first named entities may be implemented as follows:
under the condition that k first named entities in the P first named entities are semantic named entities, acquiring k word vectors of the k first named entities and image feature vectors of N pictures;
and determining the first individual matching degree of each picture in the N pictures with the k first named entities according to the k word vectors and the image feature vectors of the N pictures.
In the case that the k first named entities are semantic named entities, a Deep Structured Semantic Model (DSSM) is used to determine the first individual matching degree of each picture with the k semantic named entities; that is, when k semantic named entities exist, the semantic matching degree of each picture with the k semantic named entities is determined. The image feature vector of a picture and the word vectors corresponding to the k semantic named entities are input into the DSSM model, and the DSSM model outputs the first individual matching degree of the picture with the k semantic named entities. The image feature vector of the picture can be obtained through a Convolutional Neural Network (CNN) model: the picture is input into the CNN model, and the image feature vector is extracted by the CNN model. The word vector corresponding to a semantic named entity may be a word vector trained with a GloVe model. If there are multiple semantic named entities, i.e., k is greater than 1, for example when the first voice information is "photos of a puppy and roses taken in Guangzhou last year" and the semantic named entities include semantic named entity 1 "dog" and semantic named entity 2 "rose" (k equals 2), the first individual matching degree of each of the N pictures with the k first named entities must be determined from the word vector of semantic named entity 1, the word vector of semantic named entity 2 and the image feature vectors of the N pictures. Specifically, the word vectors corresponding to the semantic named entities may be added to obtain a text feature vector of the semantic named entities; the image feature vector and the text feature vector are then input into the DSSM model, which outputs the first individual matching degree of the picture with the k semantic named entities, where an output closer to 0 indicates the picture and the text are less related, and an output closer to 1 indicates they are more related.
In this way the first individual matching degree of each picture in the N pictures with the k first named entities is determined. For example, if the first voice information is "photos of a puppy and roses taken in Guangzhou last year", the word vectors of semantic named entity 1 and semantic named entity 2 are initialized with word vectors trained by the GloVe model and added to obtain the text feature vector of "rose and dog". If the album contains 100 photos, photo 1, photo 2, ..., photo 100, the image feature vector of photo 1 and the text feature vector are input into the DSSM model, which outputs the first individual matching degree of photo 1 with semantic named entity 1 and semantic named entity 2; similarly, the image feature vector of photo 2 and the text feature vector are input into the DSSM model, which outputs the first individual matching degree of photo 2 with semantic named entity 1 and semantic named entity 2; and the first individual matching degree of each photo with semantic named entity 1 and semantic named entity 2 is calculated in turn.
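A minimal sketch of the semantic matching step under stated assumptions: the word vectors of the k semantic named entities are summed into a text feature vector, and a cosine similarity mapped to [0, 1] stands in for the trained DSSM matching layer; the random vectors below stand in for real CNN image features and GloVe word vectors, and a shared 300-dimensional space is assumed.

```python
# Semantic matching: summed entity word vectors vs. the picture's image vector.
# Cosine similarity mapped to [0, 1] stands in for the trained DSSM layer.
import numpy as np

def semantic_match(image_vec: np.ndarray, word_vecs: list) -> float:
    text_vec = np.sum(word_vecs, axis=0)            # e.g. vec("dog") + vec("rose")
    cos = float(np.dot(image_vec, text_vec)
                / (np.linalg.norm(image_vec) * np.linalg.norm(text_vec) + 1e-8))
    return (cos + 1.0) / 2.0                        # map [-1, 1] onto [0, 1]

rng = np.random.default_rng(0)
image_vec = rng.normal(size=300)       # stand-in for a projected CNN feature
dog, rose = rng.normal(size=300), rng.normal(size=300)  # stand-in GloVe vectors
print(semantic_match(image_vec, [dog, rose]))
```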
And 105, displaying M target pictures matched with the first voice information based on the N first comprehensive matching degrees.
And the first comprehensive matching degree of the M target pictures is greater than a preset matching degree threshold value.
If the P first named entities include time named entity 1, geographic-location named entity 2 and semantic named entity 3, the first individual matching degrees of a picture with these three named entities need to be normalized and combined to obtain the first comprehensive matching degree of the picture. The normalized combination formula is: s_final = α_t·s_t + α_p·s_p + α_o·s_o.
Wherein α_t, α_p and α_o are weight coefficients set according to business needs, controlling respectively the weight of the picture's individual matching degree with the time named entity, with the geographic-location named entity and with the semantic named entity; and s_t, s_p and s_o denote respectively the picture's individual matching degree with the time named entity, with the geographic-location named entity and with the semantic named entity.
If the P first named entities do not include a time named entity, the first individual matching degree of each picture with the time named entity is 0; if they do not include a geographic-location named entity, the first individual matching degree of each picture with the geographic-location named entity is 0; and if they do not include a semantic named entity, the first individual matching degree of each picture with the semantic named entity is 0.
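A minimal sketch of the combination rule, with illustrative weight coefficients; an entity type absent from the query contributes 0, as described above.

```python
# Combination: s_final = α_t·s_t + α_p·s_p + α_o·s_o, with assumed weights.
def combined_score(s_t=0.0, s_p=0.0, s_o=0.0,
                   a_t=0.4, a_p=0.3, a_o=0.3) -> float:
    # an entity type absent from the query simply contributes 0
    return a_t * s_t + a_p * s_p + a_o * s_o

print(combined_score(s_t=1.0, s_p=0.8, s_o=0.9))  # -> 0.91
```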
Based on the N first comprehensive matching degrees, M target pictures matched with the first voice information are displayed, and the method can be realized through the following steps:
sorting the N pictures according to the magnitude of the N first comprehensive matching degrees;
sequentially selecting, from the sorted N pictures, M target pictures whose first comprehensive matching degree falls within a preset matching degree value range, and displaying the M target pictures in a first preset screen area;
the display area of a first target picture in the M target pictures is larger than that of a second target picture, and the first comprehensive matching degree of the first target picture is higher than that of the second target picture.
The N pictures can be sorted from large to small according to the first comprehensive matching degree, i.e., the first comprehensive matching degree of the ith picture is greater than or equal to that of the (i+1)th picture, where i is less than or equal to N-1; the N pictures may also be sorted from small to large according to the first comprehensive matching degree, i.e., the first comprehensive matching degree of the ith picture is less than or equal to that of the (i+1)th picture.
Referring to fig. 3, fig. 3 is a schematic view of a target picture display interface provided in an embodiment of the present application. For example, the first voice information is "photos of a puppy and roses taken in Guangdong Province last year", 500 pictures are stored in the album, and the pictures are sorted by the size of their first comprehensive matching degree. If the preset matching degree value range is, for example, 50%-100%, and the first comprehensive matching degree of the first sorted picture is 90%, that picture falls within the preset value range, and M target pictures are selected in order from the N sorted pictures. As shown in fig. 3, for example, 17 pictures are selected in order and displayed in the dashed-box area. Among the 17 selected pictures, picture 1 has the largest first comprehensive matching degree and the first comprehensive matching degree of picture 2 is greater than that of picture 3; picture 1 is first target picture 1, picture 2 is first target picture 2, picture 3 is first target picture 3, and the second target pictures are the remaining pictures, other than the first target pictures, among the M selected target pictures.
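A minimal sketch of this selection step, with illustrative scores, the 50% threshold from the example and M = 17.

```python
# Selection: sort by first comprehensive matching degree, keep scores above
# the threshold, take at most M for the first preset screen area.
def select_targets(scores: dict, threshold: float = 0.5, m: int = 17) -> list:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [pic for pic, s in ranked if s > threshold][:m]

scores = {"pic1": 0.90, "pic2": 0.72, "pic3": 0.65, "pic4": 0.31}
print(select_targets(scores))  # -> ['pic1', 'pic2', 'pic3']
```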
Optionally, the method may further include the following steps:
grouping the N pictures according to the shooting positions of the N pictures to obtain S picture groups;
determining the color depth of the thumbnail of each picture according to the comprehensive matching degree of each picture in the S picture groups;
according to the color depth of the thumbnail of each picture, displaying the thumbnail of at least one picture in the S picture groups in a partition manner in a second preset screen area;
the thumbnails of pictures with different comprehensive matching degrees have different color depths, and S is a positive integer.
As shown in fig. 3, the pictures taken in Shenzhen are placed in picture group 1, the pictures taken in Guangzhou in picture group 2, and the pictures taken in Foshan in picture group 3. The color depth of the thumbnail of each picture is determined by the size of the first comprehensive matching degree of each picture in the S picture groups; thumbnails of pictures with different first comprehensive matching degrees have different color depths. For example, the thumbnail of a picture whose first comprehensive matching degree is greater than or equal to a first preset threshold is displayed in dark red; the thumbnail of a picture whose first comprehensive matching degree is greater than or equal to a second preset threshold and less than the first preset threshold is displayed in light red; the thumbnail of a picture whose first comprehensive matching degree is greater than or equal to a third preset threshold and less than the second preset threshold is displayed in light yellow; and the thumbnail of a picture whose first comprehensive matching degree is less than the third preset threshold is displayed in light gray. The first preset threshold is larger than the second preset threshold, and the second preset threshold is larger than the third preset threshold.
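A minimal sketch of the color-depth rule, with illustrative values for the three preset thresholds.

```python
# Color depth: map the first comprehensive matching degree to a thumbnail tint
# via three descending preset thresholds (values here are assumptions).
def thumbnail_color(score: float, th1=0.8, th2=0.6, th3=0.4) -> str:
    if score >= th1:
        return "dark red"
    if score >= th2:
        return "light red"
    if score >= th3:
        return "light yellow"
    return "light gray"

print([thumbnail_color(s) for s in (0.92, 0.70, 0.50, 0.10)])
# -> ['dark red', 'light red', 'light yellow', 'light gray']
```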
In a second preset screen area (for example, the solid-line frame area on the right side of the screen area shown in fig. 3), the thumbnails of at least one picture in the S picture groups are displayed in a partitioned manner. As shown in fig. 3, the S (S = 3) picture groups comprise picture group 1, picture group 2 and picture group 3; the thumbnails of the pictures in picture group 1 are displayed in sub-area 301 of the second preset screen area, those of picture group 2 in sub-area 302, and those of picture group 3 in sub-area 303, so that the thumbnails of each picture group are displayed in their own partition. Grouping the N pictures by shooting position into S picture groups and displaying the thumbnails of each group in separate partitions, as sketched below, makes it easy for the user to distinguish thumbnails taken at different shooting positions, to identify the shooting position of a picture quickly, and to find a picture taken at a particular position.
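A minimal sketch of the grouping step, with illustrative picture records; S is simply the number of distinct shooting locations.

```python
# Grouping: bucket pictures into S picture groups by shooting location so that
# each group is rendered in its own sub-area (301, 302, 303, ...).
from collections import defaultdict

pictures = [("pic1", "Shenzhen"), ("pic2", "Guangzhou"),
            ("pic3", "Foshan"), ("pic4", "Guangzhou")]  # illustrative records

groups = defaultdict(list)
for pic, city in pictures:
    groups[city].append(pic)            # S = number of distinct locations

for sub_area, (city, pics) in enumerate(groups.items(), start=301):
    print(f"sub-area {sub_area}: {city} -> {pics}")
```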
Displaying the thumbnails of pictures with different first comprehensive matching degrees in different color depths makes it easy for the user to distinguish pictures with different first comprehensive matching degrees. In addition, the picture with the highest first comprehensive matching degree is displayed enlarged in the dashed-box area, so that the user can locate the best-matching pictures at a glance and quickly identify the pictures they need.
It should be noted that if the user finds that the retrieved first target pictures do not meet the requirement, the user may press the "Hold to speak" button shown in fig. 3 again to input second voice information. The second voice information may be converted into a second text sequence, and G second named entities identified in the second text sequence; a second individual matching degree of the N pictures with each of the G second named entities is determined; N second comprehensive matching degrees of the N pictures with the P first named entities and the G second named entities are determined from the first individual matching degrees and the second individual matching degrees of the N pictures; and Y target pictures matched with the first voice information and the second voice information are displayed based on the N second comprehensive matching degrees, where G and Y are positive integers and Y is less than or equal to N.
For example, the first voice information input by the user is "photos taken in 2019". After the M target pictures corresponding to "photos taken in 2019" are retrieved with the method of this embodiment, the user finds that what is actually needed is a photo of a kitten taken in 2019. The user may then press the "Hold to speak" button shown in fig. 3 and input second voice information, for example "kitten". If the second voice information is acquired within a preset time length (for example, 10 seconds) after the electronic device acquired the first voice information, it is regarded as a retrieval requirement within the same scene; the first voice information and the second voice information are combined, and Y target pictures meeting both are retrieved.
It should also be noted that if second voice information is input by the user multiple times within the preset time length, each of those inputs is merged with the first voice information, so that target pictures meeting the first voice information and all of the second voice information are retrieved, and the user does not need to repeat a single voice input containing both the first and the second voice information.
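A minimal sketch of this merging rule, assuming the 10-second window from the example; entity lists and timing values are illustrative.

```python
# Refinement: a second voice input inside the preset window is merged with the
# first query; outside the window it is treated as a new retrieval.
import time

WINDOW_S = 10                              # assumed preset time length
first_entities = ["2019"]                  # from "photos taken in 2019"
first_time = time.time()

def refine(second_entities, now):
    if now - first_time <= WINDOW_S:       # same retrieval scene
        return first_entities + second_entities
    return second_entities                 # stale: start a new query

print(refine(["kitten"], first_time + 4))  # -> ['2019', 'kitten']
```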
Optionally, after identifying each first named entity in the first text sequence, the following steps may be further included:
and displaying the search terms corresponding to the P first named entities in a floating window form.
As shown in fig. 3, if the first voice information of the user is "photos of a puppy and roses taken in Guangdong Province last year", the search terms displayed in the floating window below the first preset area include "last year", "Guangdong Province", "dog" and "rose".
Optionally, the method may further include the following steps:
displaying a preset progress bar, wherein a first position of the preset progress bar comprises a sliding block, and the first position indicates a first value of the first comprehensive matching degree;
receiving touch input of a user to a slider or a preset progress bar;
in response to the touch input, updating the slider to a second position of the preset progress bar and updating the M target pictures to T pictures;
and the T pictures are matched with the first voice information, and the comprehensive matching degree is greater than or equal to the second value.
The touch input may be an operation in which the user slides the slider, drags the slider, or taps a position on the preset progress bar; in response to the touch input, the slider is updated to a second position of the preset progress bar, and the M target pictures are updated to T pictures.
For example, as shown in fig. 3, suppose the first value is 60%. If the user taps a second position on the preset progress bar 304, the slider is updated to that second position, which indicates, for example, a first comprehensive matching degree of 80%. In this case the M target pictures are updated to T pictures, where the T pictures are the pictures that match the first voice information and whose comprehensive matching degree is greater than or equal to the second value (e.g., 80%). Displaying the preset progress bar thus lets the user quickly adjust the retrieval result, i.e., quickly adjust the displayed target pictures.
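A minimal sketch of the slider interaction, reusing the illustrative scores from the selection sketch above: moving the slider to a new position raises the threshold and re-filters the displayed set.

```python
# Slider: a touch moving the slider to a new position raises the matching
# threshold, and the displayed target pictures are re-filtered accordingly.
scores = {"pic1": 0.90, "pic2": 0.72, "pic3": 0.65, "pic4": 0.31}

def on_slider_moved(new_threshold: float) -> list:
    return [p for p, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s >= new_threshold]

print(on_slider_moved(0.60))  # slider at 60% -> ['pic1', 'pic2', 'pic3']
print(on_slider_moved(0.80))  # slider at 80% -> ['pic1']
```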
In the picture retrieval method provided by this embodiment, first voice information input by a user is acquired and converted into a first text sequence, P first named entities in the first text sequence are identified, N first comprehensive matching degrees of N pictures with the P first named entities are determined, and M target pictures matched with the first voice information are displayed based on the N first comprehensive matching degrees. The user does not need to browse through a large number of pictures or type in search terms: the desired pictures can be retrieved by voice input alone, quickly, conveniently and accurately, which improves picture retrieval efficiency.
In addition, with the picture retrieval method provided by this embodiment, the pictures the user needs can be retrieved by voice input alone; when finer retrieval conditions are required, the user only needs to speak first voice information containing those finer conditions, without browsing and searching among a large number of pictures, which improves picture retrieval efficiency.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a picture retrieval apparatus according to an embodiment of the present application, where the apparatus 400 includes:
an obtaining module 410, configured to obtain first voice information input by a user;
a conversion module 420, configured to convert the first voice information into a first text sequence;
an identifying module 430 for identifying P first named entities in the first text sequence;
a first determining module 440, configured to determine N first comprehensive matching degrees between the N pictures and the P first named entities;
a first display module 450, configured to display, based on the N first comprehensive matching degrees, M target pictures matched with the first voice information.
The picture retrieval apparatus provided by this embodiment acquires the first voice information of the user, converts it into a first text sequence, identifies the P first named entities in the first text sequence, determines the first individual matching degree of each picture stored on the electronic device with each first named entity, determines the first comprehensive matching degree of each picture from these individual matching degrees, and displays the target pictures corresponding to the first voice information according to the first comprehensive matching degree of each picture, thereby retrieving the target pictures that meet the first voice information. If the user needs to retrieve with finer retrieval conditions, the user only needs to input first voice information containing those finer conditions, without browsing and searching among a large number of pictures; picture retrieval efficiency under finer retrieval conditions is therefore improved.
Optionally, the first determining module 440 includes:
a first determining unit, configured to determine a first individual matching degree of the N pictures with each of the P first named entities;
a second determining unit, configured to determine, according to the first individual matching degrees of the N pictures, N first comprehensive matching degrees of the N pictures and the P first named entities.
Optionally, the first determining unit is specifically configured to, when an ith first named entity of the P first named entities is a time named entity, obtain an ith time period of the ith first named entity and creation times of the N pictures;
determining a first individual matching degree of each picture in the N pictures and the ith first named entity according to the ith time period and the creation time of the N pictures;
wherein i is a positive integer, i is not more than P, and the ith first named entity is any one of the P first named entities; the ith time period is a preset time period corresponding to the ith first named entity or a time period determined based on the entity content of the ith first named entity.
Optionally, the first determining unit is specifically configured to, when a jth first named entity in the P first named entities is a geographic location named entity, obtain a jth first longitude and latitude corresponding to the jth first named entity and a second longitude and latitude of the shooting location of the N pictures;
determining the first individual matching degree of each picture in the N pictures with the jth first named entity according to the jth first longitude and latitude and the second longitudes and latitudes of the shooting places of the N pictures;
wherein j is a positive integer, j is not more than P, and the jth first named entity is any one of the P first named entities.
Optionally, the first determining unit is specifically configured to, when k first named entities in the P first named entities are semantic named entities, obtain k word vectors of the k first named entities and image feature vectors of the N pictures;
determining a first individual matching degree of each picture in the N pictures with the k first named entities according to the k word vectors and the image feature vectors of the N pictures.
optionally, the first display module 450 is specifically configured to sort the N pictures according to the size of the N first comprehensive matching degrees;
sequentially selecting, from the sorted N pictures, M target pictures whose first comprehensive matching degree falls within a preset matching degree value range, and displaying the M target pictures in a first preset screen area;
the display area of a first target picture in the M target pictures is larger than that of a second target picture, and the first comprehensive matching degree of the first target picture is higher than that of the second target picture.
Optionally, the method further includes:
the grouping module is used for grouping the N pictures according to the shooting positions of the N pictures to obtain S picture groups;
the second determining module is used for determining the color depth of the thumbnail of each picture according to the first comprehensive matching degree of each picture in the S picture groups;
the second display module is used for displaying the thumbnail of at least one picture in the S picture groups in a partition manner in a second preset screen area according to the color depth of the thumbnail of each picture;
the thumbnails of pictures with different first comprehensive matching degrees have different color depths, and S is a positive integer.
Optionally, the conversion module 420 is further configured to convert the second voice information into a second text sequence when the second voice information input by the user is acquired within a preset time length after the first voice information is acquired;
an identifying module 430, further configured to identify G second named entities in the second text sequence;
the first determining module 440 is further configured to determine a second individual matching degree between the N pictures and each of the G second named entities;
the first determining module 440 is further configured to determine, according to the first individual matching degree and the second individual matching degree of the N pictures, N second comprehensive matching degrees of the N pictures with the P first named entities and the G second named entities;
the first display module 450 is further configured to display Y target pictures matched with the first voice information and the second voice information based on the N second comprehensive matching degrees;
wherein G and Y are positive integers, and Y is less than or equal to N.
Optionally, the method further includes:
the third display module is used for displaying a preset progress bar, wherein the first position of the preset progress bar comprises a sliding block, and the first position indicates a first value of the first comprehensive matching degree;
the receiving module is used for receiving touch input of a user to the slider or the preset progress bar;
the fourth display module is further configured to update the slider to a second position of the preset progress bar for display in response to the touch input, and update the M target pictures into T pictures;
the second position indicates a second value of the first comprehensive matching degree, and the T pictures are pictures which are matched with the first voice information and have the comprehensive matching degree larger than or equal to the second value.
The picture retrieval device in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.
The picture retrieval device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiments of the present application are not specifically limited.
The picture retrieval device provided in the embodiment of the present application can implement each process implemented by the picture retrieval device in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Optionally, an embodiment of the present application further provides an electronic device. As shown in fig. 5, fig. 5 is a schematic diagram of a hardware structure of the electronic device provided in the embodiment of the present application. The electronic device 500 includes a processor 501 and a memory 502; the memory 502 stores a program or instructions executable on the processor 501, and when executed by the processor 501, the program or instructions implement the processes of the above picture retrieval method embodiments and can achieve the same technical effects, which are not described again here to avoid repetition.
It should be noted that the electronic devices in the embodiments of the present application include the mobile electronic devices and the non-mobile electronic devices described above.
Fig. 6 is a schematic hardware structure diagram of another electronic device for implementing the embodiment of the present application.
The electronic device 600 includes, but is not limited to: a radio frequency unit 601, a network module 602, an audio output unit 603, an input unit 604, a sensor 605, a display unit 606, a user input unit 607, an interface unit 608, a memory 609, a processor 610, and the like.
Those skilled in the art will appreciate that the electronic device 600 may further comprise a power source (e.g., a battery) for supplying power to the various components; the power source may be logically connected to the processor 610 through a power management system, so as to manage charging, discharging and power consumption through the power management system. The electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than those shown, combine some components, or use a different arrangement of components, which is not described again here.
The processor 610 is configured to obtain first voice information input by a user;
converting the first voice information into a first text sequence;
identifying P first named entities in the first text sequence;
determining N first comprehensive matching degrees of the N pictures and the P first named entities;
displaying, by a display unit 606, M target pictures matched with the first voice information based on the N first comprehensive matching degrees;
wherein N, M and P are positive integers, and M is less than or equal to N; and the first comprehensive matching degree of each of the M target pictures is greater than a preset matching degree threshold value.
By acquiring first voice information input by a user, converting it into a first text sequence, identifying P first named entities in the first text sequence, determining N first comprehensive matching degrees of N pictures with the P first named entities, and displaying M target pictures matched with the first voice information based on the N first comprehensive matching degrees, the user does not need to browse through a large number of pictures or type in search terms: the desired pictures can be retrieved by voice input alone, quickly, conveniently and accurately, which improves picture retrieval efficiency.
A processor 610, further configured to determine a first individual matching degree of the N pictures with each of the P first named entities;
and determining N first comprehensive matching degrees of the N pictures and the P first named entities according to the first individual matching degrees of the N pictures.
The processor 610 is further configured to, when an ith first named entity in the P first named entities is a time named entity, obtain an ith time period of the ith first named entity and creation times of the N pictures;
determining a first individual matching degree of each picture in the N pictures and the ith first named entity according to the ith time period and the creation time of the N pictures;
wherein i is a positive integer, i is not more than P, and the ith first named entity is any one of the P first named entities; the ith time period is a preset time period corresponding to the ith first named entity or a time period determined based on the entity content of the ith first named entity.
The processor 610 is further configured to, when a jth first named entity in the P first named entities is a geographic location named entity, obtain a jth first longitude and latitude corresponding to the jth first named entity and second longitudes and latitudes of the shooting locations of the N pictures;
and determine a first individual matching degree of each picture in the N pictures and the jth first named entity according to the jth first longitude and latitude and the second longitudes and latitudes of the shooting locations of the N pictures;
wherein j is a positive integer, j is not more than P, and the jth first named entity is any one of the P first named entities.
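One standard way to compare the jth first longitude and latitude with a picture's shooting location is the haversine great-circle distance, converted into a score that decays with distance; the disclosure names no distance formula, so both the haversine choice and the 10 km scale below are our assumptions:

from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    # Great-circle distance between two latitude/longitude pairs, in km.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius of about 6371 km

def geo_match(entity_lat: float, entity_lon: float,
              photo_lat: float, photo_lon: float, scale_km: float = 10.0) -> float:
    # Matching degree stays near 1.0 within roughly scale_km of the entity
    # location and decays as the shooting location moves farther away.
    return 1.0 / (1.0 + haversine_km(entity_lat, entity_lon, photo_lat, photo_lon) / scale_km)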
The processor 610 is further configured to, when k first named entities of the P first named entities are semantic named entities, obtain k word vectors of the k first named entities and image feature vectors of the N pictures;
and determine a first individual matching degree of each picture in the N pictures and the k first named entities according to the k word vectors and the image feature vectors of the N pictures.
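Assuming the k word vectors and the image feature vectors live in a shared embedding space (a CLIP-style setup, which the disclosure does not mandate), cosine similarity is one plausible matching measure; averaging over the k entities, as below, is likewise an assumption:

from math import sqrt

def cosine(u: list[float], v: list[float]) -> float:
    # Cosine similarity of two vectors; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_match(word_vectors: list[list[float]], image_vector: list[float]) -> float:
    # Average similarity between each of the k word vectors and the picture's
    # image feature vector.
    if not word_vectors:
        return 0.0
    return sum(cosine(w, image_vector) for w in word_vectors) / len(word_vectors)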
the processor 610 is further configured to sort the N pictures according to the size of the N first comprehensive matching degrees;
sequentially selecting M target pictures from the pictures with the value of the first comprehensive matching degree in the range of the preset matching degree in the sequenced N pictures, and displaying the M target pictures in a first preset screen area through a display unit 606;
the display area of a first target picture in the M target pictures is larger than that of a second target picture, and the first comprehensive matching degree of the first target picture is higher than that of the second target picture.
The processor 610 is further configured to group the N pictures according to the shooting positions of the N pictures to obtain S picture groups;
determine the color depth of the thumbnail of each picture according to the first comprehensive matching degree of each picture in the S picture groups;
and display, through the display unit 606 and according to the color depth of the thumbnail of each picture, the thumbnails of at least one picture in the S picture groups in a partitioned manner in a second preset screen area;
wherein thumbnails of pictures with different first comprehensive matching degrees have different color depths, and S is a positive integer.
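The following sketch groups pictures by a coarse location key and maps each picture's first comprehensive matching degree to a thumbnail shade; the one-decimal rounding of latitude and longitude (cells of roughly 11 km) and the alpha-channel encoding of color depth are our assumptions:

from collections import defaultdict

def group_and_shade(pictures: list[dict]) -> dict:
    # Group by a coarse shooting-position key, then let each thumbnail's
    # shade (alpha in [0, 1]) track its comprehensive matching degree,
    # so a deeper color marks a better match within its partition.
    groups = defaultdict(list)
    for p in pictures:
        key = (round(p["lat"], 1), round(p["lon"], 1))
        groups[key].append(p)
    for members in groups.values():
        for p in members:
            p["thumb_alpha"] = p["match_degree"]
    return groups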
The processor 610 is further configured to, when second voice information input by the user is acquired within a preset duration after the first voice information is acquired, convert the second voice information into a second text sequence;
identify G second named entities in the second text sequence;
determine a second individual matching degree of the N pictures with each of the G second named entities;
determine, according to the first individual matching degrees and the second individual matching degrees of the N pictures, N second comprehensive matching degrees of the N pictures with the P first named entities and the G second named entities;
and display, through the display unit 606, Y target pictures matched with the first voice information and the second voice information based on the N second comprehensive matching degrees;
wherein G and Y are each a positive integer, and Y is less than or equal to N.
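Neither the weighting between old and new evidence nor the exact formula is fixed by the disclosure; one assumed rule, sketched below, simply averages all P + G individual matching degrees of a picture into its second comprehensive matching degree:

def second_comprehensive(first_individual: list[float],
                         second_individual: list[float]) -> float:
    # Assumption: equal weight for the P first and the G second individual
    # matching degrees of one picture.
    degrees = first_individual + second_individual
    return sum(degrees) / len(degrees) if degrees else 0.0

A refinement such as "only the ones at the beach", spoken shortly after "show me last weekend's photos", would thus tighten the ranking without restarting the search.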
The display unit 606 is further configured to display a preset progress bar, where a first position of the preset progress bar includes a slider, and the first position indicates a first value of the first comprehensive matching degree;
The user input unit 607 is further configured to receive a touch input of the user on the slider or the preset progress bar;
the display unit 606 is further configured to, in response to the touch input, display the slider at a second position of the preset progress bar and update the M target pictures to T pictures;
wherein the second position indicates a second value of the first comprehensive matching degree, and the T pictures are pictures that match the first voice information and whose comprehensive matching degree is greater than or equal to the second value.
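In code terms, the slider interaction reduces to re-filtering the scored pictures against the new threshold value indicated by the second position; the sketch below shows only that re-filter (the UI wiring and the names are ours):

def on_slider_moved(scored_pictures: list, new_threshold: float) -> list:
    # Return the T pictures matched with the first voice information whose
    # comprehensive matching degree is greater than or equal to the new value.
    return [p for p, s in scored_pictures if s >= new_threshold]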
An embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium. When the program or the instruction is executed by a processor, the processes of the above picture retrieval method embodiment are implemented and the same technical effects can be achieved; to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device of the above embodiment. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is to be understood that, in the embodiment of the present application, the input unit 604 may include a Graphics Processing Unit (GPU) 6041 and a microphone 6042, and the graphics processing unit 6041 processes image data of a still picture or a video obtained by an image capturing apparatus (such as a camera) in a video capturing mode or an image capturing mode. The display unit 606 may include a display panel 6061, and the display panel 6061 may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 607 includes a touch panel 6071 and other input devices 6072. The touch panel 6071, also referred to as a touch screen, may include two parts: a touch detection device and a touch controller. The other input devices 6072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here. The memory 609 may be used to store software programs as well as various data, including but not limited to application programs and an operating system. The processor 610 may integrate an application processor, which mainly handles the operating system, user interfaces, and applications, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 610.
An embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement the processes of the above picture retrieval method embodiment, with the same technical effects achievable; to avoid repetition, details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a chip system, or a system-on-a-chip.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatuses of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed; the functions may be performed in a substantially simultaneous manner or in a reverse order depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (15)

1. A picture retrieval method, the method comprising:
acquiring first voice information input by a user;
converting the first voice information into a first text sequence;
identifying P first named entities in the first text sequence;
determining N first comprehensive matching degrees of the N pictures and the P first named entities;
displaying M target pictures matched with the first voice information based on the N first comprehensive matching degrees;
wherein N, M and P are each a positive integer, and M is less than or equal to N; and the first comprehensive matching degrees of the M target pictures are greater than a preset matching degree threshold.
2. The method according to claim 1, wherein the determining N first comprehensive matching degrees of the N pictures and the P first named entities comprises:
determining a first individual matching degree of the N pictures and each of the P first named entities;
and determining N first comprehensive matching degrees of the N pictures and the P first named entities according to the first individual matching degrees of the N pictures.
3. The method according to claim 2, wherein the determining the first individual matching degree of the N pictures with each of the P first named entities comprises:
under the condition that the ith first named entity in the P first named entities is a time named entity, acquiring the ith time period of the ith first named entity and the creation time of the N pictures;
determining a first individual matching degree of each picture in the N pictures and the ith first named entity according to the ith time period and the creation time of the N pictures;
wherein i is a positive integer, i is not more than P, and the ith first named entity is any one of the P first named entities; the ith time period is a preset time period corresponding to the ith first named entity or a time period determined based on the entity content of the ith first named entity.
4. The method according to claim 1, wherein the determining the first individual matching degree of the N pictures with each of the P first named entities comprises:
under the condition that the jth first named entity in the P first named entities is a geographic location named entity, acquiring a jth first longitude and latitude corresponding to the jth first named entity and second longitudes and latitudes of the shooting locations of the N pictures;
determining a first individual matching degree of each picture in the N pictures and the jth first named entity according to the jth first longitude and latitude and the second longitudes and latitudes of the shooting locations of the N pictures;
wherein j is a positive integer, j is not more than P, and the jth first named entity is any one of the P first named entities.
5. The method according to claim 1, wherein the determining the first individual matching degree of the N pictures with each of the P first named entities comprises:
under the condition that k first named entities in the P first named entities are semantic named entities, acquiring k word vectors of the k first named entities and image feature vectors of the N pictures;
and determining the first individual matching degree of each picture in the N pictures and the k first named entities according to the k word vectors and the image feature vectors of the N pictures.
6. The method according to claim 1, wherein the displaying M target pictures matched with the first voice information based on the N first comprehensive matching degrees comprises:
sorting the N pictures according to the magnitude of the N first comprehensive matching degrees;
sequentially selecting, from the sorted N pictures, M target pictures whose first comprehensive matching degree falls within a preset matching degree value range, and displaying the M target pictures in a first preset screen area;
the display area of a first target picture in the M target pictures is larger than that of a second target picture, and the first comprehensive matching degree of the first target picture is higher than that of the second target picture.
7. The method of claim 1, further comprising:
grouping the N pictures according to the shooting positions of the N pictures to obtain S picture groups;
determining the color depth of the thumbnail of each picture according to the first comprehensive matching degree of each picture in the S picture groups;
according to the color depth of the thumbnail of each picture, displaying, in a second preset screen area, the thumbnails of at least one picture in the S picture groups in a partitioned manner;
wherein thumbnails of pictures with different first comprehensive matching degrees have different color depths, and S is a positive integer.
8. The method of claim 1, wherein after obtaining the first voice information input by the user, further comprising:
converting second voice information input by the user into a second text sequence under the condition that the second voice information is acquired within a preset time length after the first voice information is acquired;
identifying G second named entities in the second text sequence;
determining a second individual matching degree of the N pictures with each of the G second named entities;
determining N second comprehensive matching degrees of the N pictures and the P first named entities and the G second named entities according to the first individual matching degree and the second individual matching degree of the N pictures;
displaying Y target pictures matched with the first voice information and the second voice information based on the N second comprehensive matching degrees;
wherein G and Y are each a positive integer, and Y is less than or equal to N.
9. The method of claim 1, further comprising:
displaying a preset progress bar, wherein a first position of the preset progress bar comprises a sliding block, and the first position indicates a first value of the first comprehensive matching degree;
receiving touch input of a user to the slider or the preset progress bar;
in response to the touch input, displaying the slider at a second position of the preset progress bar, and updating the M target pictures to T pictures;
wherein the second position indicates a second value of the first comprehensive matching degree, and the T pictures are pictures that match the first voice information and whose comprehensive matching degree is greater than or equal to the second value.
10. A picture retrieval apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire first voice information input by a user;
a conversion module, configured to convert the first voice information into a first text sequence;
an identification module, configured to identify P first named entities in the first text sequence;
a first determination module, configured to determine N first comprehensive matching degrees of the N pictures and the P first named entities;
and a first display module, configured to display M target pictures matched with the first voice information based on the N first comprehensive matching degrees;
wherein N, M and P are each a positive integer, and M is less than or equal to N; and the first comprehensive matching degrees of the M target pictures are greater than a preset matching degree threshold.
11. The apparatus of claim 10, wherein the first determining module comprises:
a first determining unit, configured to determine a first individual matching degree of the N pictures with each of the P first named entities;
a second determining unit, configured to determine, according to the first individual matching degrees of the N pictures, N first comprehensive matching degrees of the N pictures and the P first named entities.
12. The apparatus according to claim 11, wherein the first determining unit is specifically configured to, when an ith first named entity of the P first named entities is a time named entity, obtain an ith time period of the ith first named entity and creation times of the N pictures;
determining a first individual matching degree of each picture in the N pictures and the ith first named entity according to the ith time period and the creation time of the N pictures;
wherein i is a positive integer, i is not more than P, and the ith first named entity is any one of the P first named entities; the ith time period is a preset time period corresponding to the ith first named entity or a time period determined based on the entity content of the ith first named entity.
13. The apparatus according to claim 11, wherein the first determining unit is specifically configured to, when a jth first named entity of the P first named entities is a geographic location named entity, obtain a jth first longitude and latitude corresponding to the jth first named entity and second longitudes and latitudes of the shooting locations of the N pictures;
and determine a first individual matching degree of each picture in the N pictures and the jth first named entity according to the jth first longitude and latitude and the second longitudes and latitudes of the shooting locations of the N pictures;
wherein j is a positive integer, j is not more than P, and the jth first named entity is any one of the P first named entities.
14. The apparatus according to claim 10, wherein the first determining unit is specifically configured to, in a case where k first named entities of the P first named entities are semantic named entities, obtain k word vectors of the k first named entities and image feature vectors of the N pictures;
and determining the first individual matching degree of each picture in the N pictures and the k first named entities according to the k word vectors and the image feature vectors of the N pictures.
15. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the picture retrieval method as claimed in any one of claims 1-9.
CN202010988787.2A 2020-09-18 2020-09-18 Picture retrieval method and device and electronic equipment Pending CN112084359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010988787.2A CN112084359A (en) 2020-09-18 2020-09-18 Picture retrieval method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112084359A true CN112084359A (en) 2020-12-15

Family

ID=73739226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010988787.2A Pending CN112084359A (en) 2020-09-18 2020-09-18 Picture retrieval method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112084359A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107222386A (en) * 2017-05-02 2017-09-29 珠海市魅族科技有限公司 A kind of message back method and terminal
CN107908770A (en) * 2017-11-30 2018-04-13 维沃移动通信有限公司 A kind of photo searching method and mobile terminal
KR20200083159A (en) * 2018-12-28 2020-07-08 고려대학교 산학협력단 Method and system for searching picture on user terminal
CN109710796A (en) * 2019-01-14 2019-05-03 Oppo广东移动通信有限公司 Voice-based image searching method, device, storage medium and terminal
CN110399515A (en) * 2019-06-28 2019-11-01 中山大学 Picture retrieval method, device and picture retrieval system
CN110472090A (en) * 2019-08-20 2019-11-19 腾讯科技(深圳)有限公司 Image search method and relevant apparatus, storage medium based on semantic label
CN111046203A (en) * 2019-12-10 2020-04-21 Oppo广东移动通信有限公司 Image retrieval method, image retrieval device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US20220075806A1 (en) Natural language image search
CN109522424B (en) Data processing method and device, electronic equipment and storage medium
EP2457183B1 (en) System and method for tagging multiple digital images
US20130041903A1 (en) Categorization of Digital Media Based on Media Characteristics
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
US10380256B2 (en) Technologies for automated context-aware media curation
WO2018171047A1 (en) Photographing guide method, device and system
CN111611490A (en) Resource searching method, device, equipment and storage medium
CN110781323A (en) Method and device for determining label of multimedia resource, electronic equipment and storage medium
CN104520848A (en) Searching for events by attendants
CN111814538B (en) Method and device for identifying category of target object, electronic equipment and storage medium
CN113806588A (en) Method and device for searching video
CN113596601A (en) Video picture positioning method, related device, equipment and storage medium
CN110929176A (en) Information recommendation method and device and electronic equipment
US20160012078A1 (en) Intelligent media management system
CN108009251A (en) A kind of image file searching method and device
CN111274389B (en) Information processing method, device, computer equipment and storage medium
CN112287141A (en) Photo album processing method and device, electronic equipment and storage medium
CN110110046B (en) Method and device for recommending entities with same name
CN108052506B (en) Natural language processing method, device, storage medium and electronic equipment
CN113315691B (en) Video processing method and device and electronic equipment
CN112084359A (en) Picture retrieval method and device and electronic equipment
US11651280B2 (en) Recording medium, information processing system, and information processing method
CN112825076B (en) Information recommendation method and device and electronic equipment
CN110795178B (en) Application sign-in method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination