CN110674342B - Method and device for querying a target image

Method and device for querying a target image

Info

Publication number
CN110674342B
CN110674342B (application CN201810615126.8A)
Authority
CN
China
Prior art keywords
semantic, attribute information, image, model, features
Prior art date
Legal status
Active
Application number
CN201810615126.8A
Other languages
Chinese (zh)
Other versions
CN110674342A (en)
Inventor
郭阶添
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810615126.8A
Publication of CN110674342A
Application granted
Publication of CN110674342B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a device for querying a target image, and belongs to the field of intelligent analysis. The method comprises the following steps: extracting images of monitoring objects of a target type from the recorded video; inputting the images of the same monitoring object extracted from the same video segment into a first semantic extraction model to obtain the semantic features corresponding to the monitoring object in the video segment, and storing the obtained semantic features; when a monitoring object query request carrying attribute information of a target monitoring object is received, inputting the attribute information into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information; and determining, among the stored semantic features, a second semantic feature that meets a preset similarity condition with the first semantic feature, acquiring at least one image corresponding to the second semantic feature, and feeding back the at least one image in response to the monitoring object query request. By adopting the invention, the accuracy of the query result can be improved.

Description

Method and device for querying a target image
Technical Field
The invention relates to the field of intelligent analysis, and in particular to a method and a device for querying a target image.
Background
With the development of the field of electronic technology, monitoring equipment in public places has become more and more widespread, and accordingly, intelligent analysis of surveillance video has become more and more important. For example, when a user loses an article in a public place equipped with monitoring devices, the surveillance video can be analyzed and queried using the information about the article provided by the user, so that an image of the person who took the article can be found in the surveillance video.
At present, an intelligent analysis method for a surveillance video generally extracts images of monitoring objects of a target type from the video, where the target type is a type preset by a technician. For example, if the target type is automobile, the images of all automobiles detected in the image frames of the surveillance video are extracted, which may include images of trucks, buses, cars, and the like. Then, among all the extracted images, it is identified which images come from the same automobile, all the images of the same automobile are grouped together, and finally one group of images is obtained for each of a plurality of monitoring objects. An optimal image is selected from each monitoring object's group of images according to criteria such as sharpness or completeness, the similarity between each optimal image and the query target image is calculated, and the optimal image with the largest similarity is taken as the query result.
In carrying out the present invention, the inventors have found that the related art has at least the following problems:
although the selected optimal image is clear, it may not reflect the characteristics of the monitoring object well, so the accuracy of the query result is low.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a method and a device for querying a target image. The technical solution is as follows:
in a first aspect, a method of querying a target image is provided, the method comprising:
extracting an image of a monitoring object of a target type from the recorded video;
inputting the images of the same monitoring object extracted from the same video segment into a first semantic extraction model to obtain semantic features corresponding to the monitoring object in the video segment, and storing the obtained semantic features;
when a monitoring object query request carrying attribute information of a target monitoring object is received, inputting the attribute information into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information;
and determining, among the stored semantic features, a second semantic feature that meets a preset similarity condition with the first semantic feature, acquiring at least one image corresponding to the second semantic feature, and feeding back the at least one image in response to the monitoring object query request.
Optionally, the first semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model;
inputting the image of the same monitoring object extracted in the same video segment into a first semantic extraction model to obtain semantic features corresponding to the monitoring object in the video segment, wherein the method comprises the following steps:
inputting the images of the same monitoring object extracted from the same video segment into the image feature extraction sub-model to obtain image features;
inputting the image features into the image semantic generation sub-model to obtain a semantic description character string;
and inputting the semantic description character string into the semantic feature extraction sub-model to obtain the semantic features corresponding to the monitoring objects in the video segment.
Optionally, inputting the attribute information into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information, including:
determining a second semantic extraction model corresponding to the attribute information according to the data type of the attribute information and the corresponding relation between the data type and the semantic extraction model;
and inputting the attribute information into the second semantic extraction model corresponding to the attribute information to obtain a first semantic feature corresponding to the attribute information.
Optionally, the data type of the attribute information includes one or more of an image type, an audio type, and a character type.
Optionally, if the attribute information includes attribute information of an image type, the second semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model and a semantic feature extraction sub-model, and the inputting the attribute information into the second semantic extraction model to obtain a first semantic feature corresponding to the attribute information includes:
inputting the attribute information of the image type into the image feature extraction sub-model to obtain image features corresponding to the attribute information of the image type;
inputting image features corresponding to the attribute information of the image type into the image semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the image type;
and inputting the semantic description character strings corresponding to the attribute information of the image types into the semantic feature extraction sub-model to obtain first semantic features corresponding to the attribute information of the image types.
Optionally, if the attribute information includes attribute information of an audio type, the second semantic extraction model includes an audio semantic generation sub-model and a semantic feature extraction sub-model, and the inputting the attribute information into the second semantic extraction model to obtain a first semantic feature corresponding to the attribute information includes:
inputting the attribute information of the audio type into the audio semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the audio type;
and inputting the semantic description character strings corresponding to the attribute information of the audio type into the semantic feature extraction sub-model to obtain first semantic features corresponding to the attribute information of the audio type.
Optionally, if the attribute information includes attribute information of a character type, the second semantic extraction model includes a semantic feature extraction sub-model, and the inputting the attribute information into the second semantic extraction model to obtain a first semantic feature corresponding to the attribute information includes:
and inputting the attribute information of the character type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the character type.
Optionally, the determining, in the stored semantic features, a second semantic feature that meets a preset similarity condition with the first semantic feature includes:
and determining a second semantic feature with similarity larger than a preset similarity threshold value from the stored semantic features.
In a second aspect, there is provided an apparatus for querying a target image, the apparatus comprising:
the extraction module is used for extracting the image of the monitoring object of the target type from the recorded video;
the first acquisition module is used for inputting the images of the same monitoring object extracted from the same video segment into a first semantic extraction model to obtain semantic features corresponding to the monitoring object in the video segment, and storing the obtained semantic features;
the second acquisition module is used for inputting the attribute information into a second semantic extraction model when receiving a monitoring object query request carrying the attribute information of a target monitoring object to obtain a first semantic feature corresponding to the attribute information;
the feedback module is used for determining, among the stored semantic features, a second semantic feature meeting a preset similarity condition with the first semantic feature, acquiring at least one image corresponding to the second semantic feature, and feeding back the at least one image in response to the monitoring object query request.
Optionally, the first semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model;
the first acquisition module is configured to:
inputting the images of the same monitoring object extracted from the same video segment into the image feature extraction sub-model to obtain image features;
inputting the image features into the image semantic generation sub-model to obtain a semantic description character string;
and inputting the semantic description character string into the semantic feature extraction sub-model to obtain the semantic features corresponding to the monitoring objects in the video segment.
Optionally, the second obtaining module is configured to:
determining a second semantic extraction model corresponding to the attribute information according to the data type of the attribute information and the corresponding relation between the data type and the semantic extraction model;
and inputting the attribute information into the second semantic extraction model corresponding to the attribute information to obtain a first semantic feature corresponding to the attribute information.
Optionally, the data type of the attribute information includes one or more of an image type, an audio type, and a character type.
Optionally, if the attribute information includes attribute information of an image type, the second semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model, and the second acquisition module is configured to:
inputting the attribute information of the image type into the image feature extraction sub-model to obtain image features corresponding to the attribute information of the image type;
inputting image features corresponding to the attribute information of the image type into the image semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the image type;
and inputting the semantic description character strings corresponding to the attribute information of the image types into the semantic feature extraction sub-model to obtain first semantic features corresponding to the attribute information of the image types.
Optionally, if the attribute information includes attribute information of an audio type, the second semantic extraction model includes an audio semantic generation sub-model and a semantic feature extraction sub-model, and the second acquisition module is configured to:
inputting the attribute information of the audio type into the audio semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the audio type;
and inputting the semantic description character strings corresponding to the attribute information of the audio type into the semantic feature extraction sub-model to obtain first semantic features corresponding to the attribute information of the audio type.
Optionally, if the attribute information includes attribute information of a character type, the second semantic extraction model includes a semantic feature extraction sub-model, and the second obtaining module is configured to:
and inputting the attribute information of the character type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the character type.
Optionally, the feedback module is configured to:
and determining a second semantic feature with similarity larger than a preset similarity threshold value from the stored semantic features.
In a third aspect, a computer device is provided, the computer device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus; the memory is used for storing a computer program; and the processor is used for executing the program stored in the memory to implement the method for querying a target image according to the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of querying a target image as described in the first aspect above.
The technical scheme provided by the embodiment of the invention has the beneficial effects that:
in the embodiment of the invention, images of monitoring objects of a target type are extracted from the recorded video; the images of the same monitoring object extracted from the same video segment are input into a first semantic extraction model to obtain the semantic features corresponding to the monitoring object in the video segment, and the obtained semantic features are stored; when a monitoring object query request carrying attribute information of a target monitoring object is received, the attribute information is input into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information; and a second semantic feature meeting a preset similarity condition with the first semantic feature is determined among the stored semantic features, at least one image corresponding to the second semantic feature is acquired, and the at least one image is fed back in response to the monitoring object query request. In this way, the obtained semantic features of each monitoring object better reflect the characteristics of that monitoring object, so the accuracy of the query result can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for querying a target image according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for querying a target image provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a method for querying a target image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an interface for querying a target image according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an interface for querying a target image according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for querying a target image according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a server structure according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention provides a method for querying a target image, which may be implemented by a server.
The server may include a processor, a memory, a transceiver, and the like. The processor, which may be a CPU (Central Processing Unit) or the like, may be configured to extract images of monitoring objects of the target type, obtain the semantic features corresponding to a monitoring object in a video segment, obtain the first semantic feature corresponding to attribute information, determine a second semantic feature that meets a preset similarity condition with the first semantic feature, and perform feedback processing on the monitoring object query request. The memory may be RAM (Random Access Memory), Flash memory, or the like, and may be used to store received data, data required by the processing procedure, and data generated in the processing procedure, such as images of monitoring objects of the target type, the first semantic extraction model, semantic features corresponding to monitoring objects, monitoring object query requests, attribute information, the second semantic extraction model, the first semantic feature, the second semantic feature, and the preset similarity condition. The transceiver may be used for data transmission with a terminal or other servers (such as a positioning server); for example, the transceiver may include an antenna, a matching circuit, and a modem, and receives the monitoring object query request sent by the terminal.
As shown in fig. 1, the process flow of the method may include the following steps:
in step 101, in a recorded video, an image of a monitoring object of a target type is extracted.
The target type is a type preset by a technician. For example, the target type may be the automobile type, in which case, when extracting images of monitoring objects of the target type, the server extracts the images of all automobiles in the image frames of the video, including images of trucks, buses, cars, and the like. The target type may be one type or a plurality of types, which the present invention does not limit.
In implementation, in order to facilitate querying the recorded video for a monitoring object that a user wants to find, the recorded video may be processed in advance.
The user can train a target detection model in advance, input the recorded video into the trained target detection model, and extract images of monitoring objects of the target type. For example, the user trains the target detection model in advance to extract images of monitoring objects of the automobile type, and then inputs the recorded video into the trained target detection model. When extracting an image of a monitoring object of the automobile type, the server can identify the monitoring object with a detection frame and obtain the position information of the first occurrence of each monitoring object in the video, where the position information can be the coordinate information of the four corners of the detection frame. The target detection model may be an HOG (Histogram of Oriented Gradients) model, SSD (Single Shot MultiBox Detector), DPM (Deformable Parts Model), Fast RCNN (Region-based Convolutional Neural Network), YOLO (You Only Look Once), or another target detection method; the present invention is not limited in this respect.
Taking the YOLO model as an example of the target detection model, the YOLO model includes 24 convolutional layers and 2 fully connected layers, where the convolutional layers are used to extract image features and the fully connected layers are used to predict image positions and category probability values.
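As an illustration of this step only, the following minimal sketch crops automobile detections from one frame using torchvision's Faster R-CNN as a stand-in detector; the patent lists several usable detectors but prescribes none, and the COCO class index for "car" and the 0.5 score threshold are assumptions.

```python
# Minimal sketch, not the patent's implementation: detecting car-type
# monitoring objects in one frame with an off-the-shelf Faster R-CNN.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

CAR_CLASS_ID = 3        # "car" in the COCO label map (assumption)
SCORE_THRESHOLD = 0.5   # detection confidence cut-off (assumption)

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def extract_car_images(frame):
    """Return cropped car images and their detection-frame corner coordinates."""
    tensor = to_tensor(frame)  # PIL image or HxWxC uint8 array -> CxHxW float
    with torch.no_grad():
        pred = detector([tensor])[0]
    crops = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if label.item() == CAR_CLASS_ID and score.item() >= SCORE_THRESHOLD:
            x1, y1, x2, y2 = (int(v) for v in box.tolist())
            crops.append((tensor[:, y1:y2, x1:x2], (x1, y1, x2, y2)))
    return crops
```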
After the position information of a monitoring object is extracted, target tracking is performed on each extracted monitoring object to obtain a series of position information of that object across video frames, that is, a coordinate sequence. Then, target component segmentation is performed on each frame image of the monitoring object: the pixel points inside the detection frame are classified, so as to mark at a finer granularity which pixel points belong to which part of the monitoring object, yielding a series of attribute labels. For example, if the monitoring object is a person and target detection has identified the person's head with a detection frame, target component segmentation can identify which pixel points in the detection frame belong to the hair features and which belong to the eye features. Afterwards, the extracted series of images is scored according to criteria such as image sharpness, suitability of illumination intensity, and completeness of components; a higher score indicates higher sharpness, more suitable illumination intensity, and more complete components.
Finally, as shown in fig. 2, a series of images of each monitoring object is obtained after the above processing, and the obtained series of images of each monitoring object is stored. The stored images may be the images screened according to the scores, or the original unscored images, as chosen according to specific requirements; the present invention is not limited thereto.
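One way the scoring described above could be realized is sketched below; the Laplacian-variance sharpness measure, the mid-gray illumination heuristic, and the weights are all assumptions, since the text names the criteria but not their formulas.

```python
# Sketch of a per-image quality score; measures and weights are assumptions.
import cv2

def quality_score(image_bgr, completeness, weights=(0.5, 0.3, 0.2)):
    """Score an image by sharpness, illumination suitability and completeness.

    completeness: fraction of the object's components visible, in [0, 1],
    e.g. derived from the target component segmentation above.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Sharpness: variance of the Laplacian, capped to [0, 1].
    sharpness = min(cv2.Laplacian(gray, cv2.CV_64F).var() / 1000.0, 1.0)
    # Illumination: closer to mid-gray (128) scores higher.
    illumination = 1.0 - abs(float(gray.mean()) - 128.0) / 128.0
    w_s, w_i, w_c = weights
    return w_s * sharpness + w_i * illumination + w_c * completeness
```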
It should be noted that the above steps assume that an extraction model capable of extracting images of monitoring objects of a plurality of types has been trained in advance. Based on this extraction model, images of monitoring objects of all types are extracted in advance, and the extracted images are stored; when a user wants to query a target monitoring object in the video, the pre-stored images of monitoring objects belonging to the same type as the target monitoring object are acquired, and subsequent operations are performed on those images.
In addition, instead of the above manner, a plurality of extraction models may be trained in advance, each used to extract images of monitoring objects of one type. When a user wants to query a target monitoring object in a video, the type of the target monitoring object is determined as the target type, the extraction model corresponding to the target type is acquired, and images of monitoring objects of the target type are extracted from the video based on that extraction model. The present invention is not limited in this regard.
In step 102, the images of the same monitoring object extracted from the same video segment are input into a first semantic extraction model to obtain semantic features corresponding to the monitoring object in the video segment, and the obtained semantic features are stored.
The first semantic extraction model is used for extracting semantic features corresponding to the target type monitoring objects from the recorded video.
In implementation, after multiple groups of images of multiple monitoring objects of the target type are extracted through step 101, a group of images of the same monitoring object is input into the pre-trained first semantic extraction model, and the first semantic extraction model performs semantic extraction on the input group of images to obtain the semantic features corresponding to that monitoring object. The semantic features may be text describing the monitoring object; for example, if the monitoring object is a dog, the semantic feature may be 'a dog with yellow fur'.
Inputting a group of images of each monitoring object into a semantic extraction model according to the steps to obtain semantic features corresponding to each monitoring object, and storing the obtained semantic features of each monitoring object in a corresponding database.
Alternatively, the first semantic extraction model may include an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model; the specific process of step 102 may be: inputting the images of the same monitoring object extracted from the same video segment into an image feature extraction sub-model to obtain image features; inputting the image features into an image semantic generation sub-model to obtain a semantic description character string; inputting the semantic description character strings into a semantic feature extraction sub-model to obtain semantic features corresponding to the monitoring objects in the video segment.
In implementation, the first semantic extraction model may include the following models: an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model.
Taking the same monitoring object in the same video segment as an example, the group of images of that monitoring object extracted in the above steps is input into the image feature extraction sub-model. The image feature extraction sub-model can be a convolutional neural network model, which extracts image features from the group of images and respectively obtains the global image features, the local image features, the motion features, and the association features of the group of images with other objects.
The global image features describe the overall feature information of the monitoring object, such as shape, color, and the distribution of local features. The local image features describe partial detail features of the monitoring object; taking a person as the monitoring object for example, the local features can be detail features such as eye shape, body scars, and black nevi. The motion features describe the motion trend of the monitoring object, such as the displacement of its center position and changes in its motion trajectory. The association features with other objects describe the association between the monitoring object and other objects; for example, if the monitoring object is a person holding a bag in one hand and leading a pet dog with the other, the association features may describe the relationship between the person and the held bag and between the person and the led pet dog. The global image features, local image features, motion features, and association features are each a feature vector; after the convolutional neural network model generates these features, the feature vectors are synthesized into one feature vector, namely the image feature.
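The text says the four vectors are "synthesized" without fixing the operation; a simple reading is concatenation, as in this sketch.

```python
# Sketch: combining the four feature vectors into one image feature by
# concatenation (an assumption; the patent does not fix the operation).
import numpy as np

def synthesize_image_feature(global_f, local_f, motion_f, association_f):
    """Each argument is a 1-D feature vector produced by the CNN sub-model."""
    return np.concatenate([global_f, local_f, motion_f, association_f])
```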
The obtained image features are input into the image semantic generation sub-model, which can be a recurrent neural network model, a time-recursive neural network model, a gated recurrent neural network model, or the like. The recurrent neural network model performs semantic generation on the image features to obtain a semantic description character string, which can describe the monitoring object in text form; the generated semantic description character string is then input into the semantic feature extraction sub-model.
The semantic feature extraction sub-model may contain two modules: a word segmentation module and a natural language processing module. After the semantic description character string is input into the semantic feature extraction sub-model, the word segmentation module performs word segmentation on the character string, extracts the keywords in it, and inputs the extracted keywords into the natural language processing module. The natural language processing module may be word2vec (a tool for computing word vectors), which converts the input keywords into vector form to obtain the semantic features of the monitoring object.
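A sketch of this sub-model follows, assuming jieba for word segmentation and a gensim word2vec model; the model path, the crude keyword filter, and averaging the keyword vectors into one semantic feature are assumptions, not the patent's prescription.

```python
# Sketch of the semantic feature extraction sub-model: word segmentation
# followed by word2vec; the details below are illustrative assumptions.
import numpy as np
import jieba
from gensim.models import Word2Vec

w2v = Word2Vec.load("word2vec.model")  # hypothetical pre-trained model file

def semantic_feature(description: str) -> np.ndarray:
    # Word segmentation module: split the description and keep keywords
    # (here, naively, words longer than one character).
    keywords = [w for w in jieba.lcut(description) if len(w) > 1]
    # Natural language processing module: vectorize keywords with word2vec.
    vectors = [w2v.wv[w] for w in keywords if w in w2v.wv]
    if not vectors:
        return np.zeros(w2v.vector_size)
    return np.mean(vectors, axis=0)  # one vector = the semantic feature
```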
And processing a group of images of each monitoring object according to the steps to obtain the semantic features of each monitoring object, and storing the semantic features of each monitoring object.
When the semantic features of each monitoring object are stored in the above step, in order to let the user acquire related information of a monitoring object more easily, information such as the target type of the monitoring object, the start time and end time of the monitoring object's group of pictures in the video, and the position information of the monitoring object in the video may be stored together with the semantic features of the monitoring object, so that this information can be shown to the user when the monitoring object is queried in the video, letting the user understand the related information of the monitoring object more clearly.
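Stored together, one record per monitoring object might look like the following sketch; all field names and the layout are illustrative, not defined by the patent.

```python
# Sketch of one stored record per monitoring object; field names are
# illustrative assumptions.
import numpy as np

semantic_vector = np.zeros(128)   # placeholder for a real semantic feature
record = {
    "object_id": "obj_0001",
    "target_type": "car",
    "semantic_feature": semantic_vector,
    "start_time": "00:01:12",      # first appearance in the video segment
    "end_time": "00:01:54",        # last appearance in the video segment
    "positions": [],               # per-frame detection-frame corner coordinates
    "images": ["obj_0001/frame_0123.jpg"],  # the object's group of pictures
}
```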
It should be noted that the first semantic extraction model may be a pre-trained model; that is, the image feature extraction sub-model, the image semantic generation sub-model, and the semantic feature extraction sub-model included in the first semantic extraction model may be trained in advance. The training process may be as follows:
first, a plurality of training samples are acquired; each training sample may include a sample image, a sample label, and sample feature information. Training may be iterative, with the training samples input into the model one after another, starting with the first training sample. The sample image of the first training sample is input into the image feature extraction sub-model to be trained, which extracts the image features of the sample image to obtain first sample image features; the first sample image features are input into the image semantic generation sub-model to be trained to obtain a first sample semantic description character string. This semantic description character string is input into the semantic feature extraction sub-model, which does not need training, and the first sample semantic features are output. The first sample semantic features are compared with the sample feature information to obtain an error value, and the parameters of the image feature extraction sub-model and the image semantic generation sub-model are adjusted according to the error value. The model is then trained on the second training sample, and this process is repeated until the obtained error value is smaller than a preset error threshold. At that point, the parameters of the image feature extraction sub-model and the image semantic generation sub-model are fixed, the two sub-models are determined as the trained image feature extraction sub-model and image semantic generation sub-model, and the trained first semantic extraction model is obtained.
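Schematically, the iterative process above could be written as follows; the loss function, the optimizer, and computing the error directly on the generation sub-model's output (the word-segmentation/word2vec stage is not differentiable) are simplifying assumptions.

```python
# Schematic training loop for the two trainable sub-models; loss, optimizer
# and the differentiable error computation are simplifying assumptions.
import torch

def train(feature_model, semantic_model, samples, error_threshold=0.01):
    params = list(feature_model.parameters()) + list(semantic_model.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    for sample_image, sample_feature_info in samples:
        image_feature = feature_model(sample_image)   # sub-model 1
        generated = semantic_model(image_feature)     # sub-model 2
        # The semantic feature extraction sub-model needs no training; here
        # the error value is computed directly on the generated output.
        loss = torch.nn.functional.mse_loss(generated, sample_feature_info)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < error_threshold:  # stop once the error is small enough
            break
    return feature_model, semantic_model
```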
In step 103, when a monitoring object query request carrying attribute information of a target monitoring object is received, the attribute information is input into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information.
The second semantic extraction model is used for extracting semantic features corresponding to the attribute information input by the user; because the data types of attribute information differ, the second semantic extraction model can include semantic extraction models of different types. Extracting semantic features from attribute information of different data types and querying the target monitoring object in the video according to the extracted semantic features is, in essence, a semantic retrieval method.
In implementation, a user may have a query requirement on recorded video; for example, when the user's pet dog is lost, the user wants to find monitoring images of the pet dog by querying the surveillance video and then infer the dog's possible destination. When the server receives a monitoring object query request sent by the terminal, the attribute information carried in the request is input into the second semantic extraction model, and the second semantic extraction model performs semantic extraction on the attribute information to obtain the semantic feature corresponding to the attribute information of the target monitoring object (that is, the first semantic feature).
It should be noted that, the attribute information may be one or more of attribute information of an image type, attribute information of an audio type, and attribute information of a character type, and when the attribute information is different, the corresponding second semantic extraction model is different, and the corresponding processing may be: determining a second semantic extraction model corresponding to the attribute information according to the data type of the attribute information and the corresponding relation between the data type of the attribute information and the semantic extraction model; and inputting the attribute information into a second semantic extraction model to obtain first semantic features corresponding to the attribute information.
In implementation, the inventors found that current methods for querying a target image can only query the video according to a picture provided by the user, and cannot query according to voice or text information provided by the user; as a result, when the user cannot provide an image of the target monitoring object, the accuracy of the query decreases. The inventors therefore conceived that the image of the target monitoring object can also be queried in the video according to attribute information of the audio type and the character type, so that a joint query over several kinds of attribute information can be performed; even if the user cannot provide an image of the target monitoring object, the image can still be queried according to the user's voice description or text description, which improves query accuracy.
In order to facilitate determining the semantic extraction model corresponding to attribute information of different data types, a technician may store, in the server in advance, the correspondence between the data type of the attribute information and the semantic extraction model, which may be as shown in Table 1 below.
TABLE 1

Data type of attribute information      | Semantic extraction model
Attribute information of image type     | Semantic extraction model for the image type
Attribute information of audio type     | Semantic extraction model for the audio type
Attribute information of character type | Semantic extraction model for the character type
According to the data type of the acquired attribute information, the correspondence table is queried to find the semantic extraction model corresponding to that data type, namely the second semantic extraction model. The attribute information is input into the determined second semantic extraction model, which extracts the semantic features of the attribute information to obtain the semantic feature corresponding to the attribute information, namely the first semantic feature.
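The Table 1 lookup amounts to a dispatch on the data type; a sketch with illustrative extractor names (not from the patent) follows.

```python
# Sketch of the Table 1 lookup: the data type of the attribute information
# selects the corresponding second semantic extraction model.
def image_semantic_extractor(info):
    """CNN features -> RNN description -> word segmentation -> word2vec."""
    raise NotImplementedError

def audio_semantic_extractor(info):
    """Speech recognition -> description string -> word segmentation -> word2vec."""
    raise NotImplementedError

def character_semantic_extractor(info):
    """Word segmentation -> word2vec."""
    raise NotImplementedError

SEMANTIC_EXTRACTORS = {
    "image": image_semantic_extractor,
    "audio": audio_semantic_extractor,
    "character": character_semantic_extractor,
}

def first_semantic_feature(attribute_info, data_type):
    """Dispatch attribute information to the second semantic extraction model."""
    return SEMANTIC_EXTRACTORS[data_type](attribute_info)
```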
Alternatively, if the attribute information includes attribute information of an image type, the determined second semantic extraction model may include a semantic extraction model of the image type, where the semantic extraction model of the image type includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model, and the processing step of generating the first semantic feature corresponding to the attribute information of the image type according to the semantic extraction model of the image type may be as follows: inputting the attribute information of the image type into an image feature extraction sub-model to obtain image features corresponding to the attribute information of the image type; inputting image characteristics corresponding to the attribute information of the image type into an image semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the image type; inputting the semantic description character strings corresponding to the attribute information of the image types into a semantic feature extraction sub-model to obtain first semantic features corresponding to the attribute information of the image types.
In an implementation, if the attribute information includes attribute information of an image type and the attribute information of the image type is a picture or a group of pictures, the attribute information of the image type is input to a semantic extraction model of the image type, and a process of extracting semantic features of the attribute information of the image type according to the semantic extraction model of the image type refers to the processing of step 102, which is not described herein.
If the attribute information includes attribute information of an image type and the attribute information of the image type is a video, the processing of step 101 is referred to first, an image of the target monitoring object is extracted from the attribute information, the image is input to a semantic extraction model of the image type, and a process of extracting semantic features of the attribute information of the image type according to the semantic extraction model of the image type is referred to the processing of step 102, which is not described herein.
Optionally, if the attribute information includes attribute information of an audio type, the determined second semantic extraction model includes a semantic extraction model of the audio type, where the semantic extraction model of the audio type includes an audio semantic generation sub-model and a semantic feature extraction sub-model, and the processing step of generating the semantic feature corresponding to the attribute information of the audio type according to the semantic extraction model of the audio type may be as follows: inputting the attribute information of the audio type into an audio semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the audio type; inputting the semantic description character string corresponding to the attribute information of the audio type into a semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the audio type.
In an implementation, if the attribute information includes attribute information of the audio type, the second semantic extraction model includes at least a semantic extraction model of the audio type. The attribute information of the audio type is input into the audio semantic generation sub-model of that model; the audio semantic generation sub-model can recognize the words in the audio and generate, from the recognized words, a semantic description character string corresponding to the attribute information of the audio type, which can be a character string in text form. The obtained semantic description character string is input into a semantic feature extraction sub-model whose structure is the same as that of the semantic feature extraction sub-model described above, that is, it includes a word segmentation module and a natural language processing module; the specific process of generating the semantic features refers to the process of generating semantic features with the semantic feature extraction sub-model in the above steps, and is not repeated here.
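As a sketch of the audio semantic generation sub-model, a stand-in speech recognizer such as OpenAI's Whisper could produce the description string; the patent does not name a specific recognizer, so this choice is an assumption.

```python
# Sketch: audio-type attribute information -> semantic description string,
# using Whisper as a stand-in recognizer (an assumption; none is named).
import whisper

asr_model = whisper.load_model("base")

def audio_to_description(audio_path: str) -> str:
    """Recognize the words in the audio and return the description string."""
    return asr_model.transcribe(audio_path)["text"]

# The returned string is then passed to the semantic feature extraction
# sub-model sketched earlier to obtain the first semantic feature.
```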
Optionally, if the attribute information includes attribute information of a character type, the second semantic extraction model includes a semantic feature extraction sub-model, the attribute information is input into the second semantic extraction model, and a first semantic feature corresponding to the attribute information is obtained, including:
inputting the attribute information of the character type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the character type.
In an implementation, if the attribute information includes at least one type of attribute information, and the attribute information includes attribute information of the character type, the second semantic extraction model for the character-type attribute information may include a semantic feature extraction sub-model, and the process of extracting the semantic feature may be as follows: the attribute information of the character type is input into the semantic feature extraction sub-model; the word segmentation module in the sub-model performs word segmentation on the character-type attribute information and extracts its keywords; the extracted keywords are input into the natural language processing module in the sub-model, such as word2vec, which vectorizes the keywords and converts them into semantic features in vector form, thereby obtaining the semantic feature corresponding to the attribute information of the character type (that is, the first semantic feature).
It should be noted that the attribute information input by the user may be of a single type (attribute information of the image type, the audio type, or the character type) or a combination of several types: image and audio, image and character, audio and character, or image, audio, and character together. In the combined case, after the terminal sends the attribute information to the server, the server classifies the attribute information by type and then performs semantic feature extraction on each type according to the above steps, obtaining a plurality of semantic features. The server then fuses the obtained semantic features into one semantic feature, namely the first semantic feature, according to a preset fusion mode, as shown in fig. 3. The fusion mode may be, for example, taking the average of the plurality of semantic features. This provides users with multiple input modes, so that the attribute information available for the query can be applied flexibly; moreover, the semantic feature obtained by fusing several semantic features is more representative, which improves both the utilization of the information and the accuracy of the query.
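The fusion step, taking the average as the text suggests, can be sketched as:

```python
# Sketch of the preset fusion mode: average the semantic features obtained
# from the different input types into one first semantic feature.
import numpy as np

def fuse_semantic_features(features):
    """features: list of equal-length semantic feature vectors."""
    return np.mean(np.stack(features), axis=0)
```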
In step 104, among the stored semantic features, a second semantic feature satisfying a preset similarity condition with the first semantic feature is determined, at least one image corresponding to the second semantic feature is acquired, and the at least one image is fed back in response to the monitoring object query request.
In implementation, after the first semantic feature is obtained through the above steps, each stored semantic feature corresponding to each monitoring object (which may be called a candidate semantic feature) is acquired, the similarity between each candidate semantic feature and the first semantic feature is calculated one by one, and it is judged whether the obtained similarity meets the preset similarity condition. If a similarity meets the preset similarity condition, the candidate semantic feature corresponding to that similarity is determined to be a second semantic feature, at least one image corresponding to the second semantic feature is acquired, and the monitoring object query request is fed back according to the obtained at least one image. If none of the calculated similarities meets the preset similarity condition, a query-failure message is fed back in response to the monitoring object query request, as shown in fig. 4.
It should be noted that, when the attribute information input by the user is attribute information of the image type, in order to improve the accuracy of the calculated similarity, the similarity between each candidate semantic feature and the first semantic feature is calculated one by one, and at the same time the similarity between the image feature corresponding to each candidate semantic feature and the image feature corresponding to the first semantic feature (that is, the image feature generated from the image-type attribute information input by the user) is calculated. In other words, a first similarity is calculated from the image feature obtained from the user's image-type attribute information and the pre-stored image feature of a given monitoring object, a second similarity is calculated from the semantic feature of the user's image-type attribute information and the semantic feature of that monitoring object, and the first similarity and the second similarity are fused according to a preset fusion mode, such as taking the average, to obtain the final similarity, namely the similarity between the image-type attribute information input by the user and the monitoring object.
Alternatively, the above-mentioned preset similarity condition may be that the similarity is greater than a preset similarity threshold, and the corresponding operation may be as follows: and determining a second semantic feature with the similarity with the first semantic feature being greater than a preset similarity threshold value from the stored semantic features.
In the implementation, after the first semantic feature is obtained, each stored semantic feature corresponding to each monitoring object (each candidate semantic feature) is acquired, the similarity between each candidate semantic feature and the first semantic feature is calculated one by one, and each calculated similarity is compared with the preset similarity threshold. If a similarity is greater than the preset similarity threshold, the candidate semantic feature corresponding to that similarity is similar to the first semantic feature and can therefore be determined as a second semantic feature.
It should be noted that, when feeding back the monitoring object query request according to the obtained at least one image, the similarities meeting the preset similarity condition may be ranked from largest to smallest, the images corresponding to the ranked similarities acquired, and the acquired images together with the similarity of each image sent to the terminal, so that the terminal displays the images and their corresponding similarities, as shown in fig. 5. In addition, the server may acquire information such as the time corresponding to each image, the position of the image in the video, and the semantic description character string obtained from the image, and feed this information back to the terminal, so that the terminal displays more detailed information and the user can understand the monitoring object corresponding to each image more clearly.
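Putting step 104's comparison, thresholding, and ranking together, a sketch might look like the following; the cosine similarity measure and the 0.8 threshold value are assumptions.

```python
# Sketch of the retrieval step: cosine similarity against every stored
# candidate, filtered by the preset threshold and ranked largest-first.
# The similarity measure and the 0.8 threshold are assumptions.
import numpy as np

def query(first_feature, stored_records, threshold=0.8):
    """stored_records: iterable of (candidate_feature, image_refs) pairs."""
    results = []
    for candidate, image_refs in stored_records:
        sim = float(np.dot(first_feature, candidate) /
                    (np.linalg.norm(first_feature) * np.linalg.norm(candidate)))
        if sim > threshold:
            results.append((sim, image_refs))
    results.sort(key=lambda r: r[0], reverse=True)  # largest similarity first
    return results  # an empty list -> feed back a query-failure message
```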
In the embodiment of the invention, images of monitoring objects of a target type are extracted from the recorded video; the images of the same monitoring object extracted from the same video segment are input into a first semantic extraction model to obtain the semantic features corresponding to the monitoring object in the video segment, and the obtained semantic features are stored; when a monitoring object query request carrying attribute information of a target monitoring object is received, the attribute information is input into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information; and a second semantic feature meeting a preset similarity condition with the first semantic feature is determined among the stored semantic features, at least one image corresponding to the second semantic feature is acquired, and the at least one image is fed back in response to the monitoring object query request. In this way, the obtained semantic features of each monitoring object better reflect the characteristics of that monitoring object, so the accuracy of the query result can be improved.
Based on the same technical concept, the embodiment of the present invention further provides an apparatus for querying a target image, where the apparatus may be a server in the foregoing embodiment, as shown in fig. 6, and the apparatus includes: the extraction module 610, the first acquisition module 620, the second acquisition module 630, and the feedback module 640.
The extracting module 610 is configured to extract an image of a monitoring object of a target type from the recorded video;
the first obtaining module 620 is configured to input the image of the same monitoring object extracted in the same video segment into a first semantic extraction model to obtain semantic features corresponding to the monitoring object in the video segment, and store the obtained semantic features;
the second obtaining module 630 is configured to, when receiving a monitoring object query request carrying attribute information of a target monitoring object, input the attribute information into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information;
the feedback module 640 is configured to determine, among the stored semantic features, a second semantic feature that meets a preset similarity condition with the first semantic feature, acquire at least one image corresponding to the second semantic feature, and feed back the at least one image in response to the monitoring object query request.
Optionally, the first semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model;
the first obtaining module 620 is configured to:
inputting the images of the same monitoring object extracted from the same video segment into the image feature extraction sub-model to obtain image features;
inputting the image features into the image semantic generation sub-model to obtain a semantic description character string;
and inputting the semantic description character string into the semantic feature extraction sub-model to obtain the semantic features corresponding to the monitoring objects in the video segment.
Optionally, the second obtaining module 630 is configured to:
determining a second semantic extraction model corresponding to the attribute information according to the data type of the attribute information and the corresponding relation between the data type and the semantic extraction model;
and inputting the attribute information into the second semantic extraction model corresponding to the attribute information to obtain a first semantic feature corresponding to the attribute information.
Optionally, the data type of the attribute information includes one or more of an image type, an audio type, and a character type.
Optionally, if the attribute information includes attribute information of an image type, the second semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model, and the second obtaining module 630 is configured to:
inputting the attribute information of the image type into the image feature extraction sub-model to obtain image features corresponding to the attribute information of the image type;
inputting image features corresponding to the attribute information of the image type into the image semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the image type;
and inputting the semantic description character string corresponding to the attribute information of the image type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the image type.
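In other words, the image-type branch reuses the same three sub-models that index the video. A query image could be pushed through the sketch above like this, again with hypothetical names:

```python
# Query-side reuse of the three sub-models from the earlier sketch;
# feature_net, caption_net, text_encoder and query_image are assumed inputs.
model = FirstSemanticExtractionModel(feature_net, caption_net, text_encoder)
first_feature = model.extract([query_image])  # one query image in, first semantic feature out
```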
Optionally, if the attribute information includes attribute information of an audio type, the second semantic extraction model includes an audio semantic generation sub-model and a semantic feature extraction sub-model, and the second obtaining module 630 is configured to:
inputting the attribute information of the audio type into the audio semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the audio type;
and inputting the semantic description character string corresponding to the attribute information of the audio type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the audio type.
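A minimal sketch of the audio branch, assuming the audio semantic generation sub-model behaves like a speech-to-text step; transcribe() is an assumed interface, not a call from any named library:

```python
# Hypothetical audio branch: speech-to-text stands in for the audio semantic
# generation sub-model, followed by the shared text encoder.
def audio_to_first_semantic_feature(audio_clip, asr_model, text_encoder):
    description = asr_model.transcribe(audio_clip)  # audio -> semantic description string
    return text_encoder(description)                # string -> first semantic feature
```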
Optionally, if the attribute information includes attribute information of a character type, the second semantic extraction model includes a semantic feature extraction sub-model, and the second obtaining module 630 is configured to:
and inputting the attribute information of the character type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the character type.
Optionally, the feedback module 640 is configured to:
and determining, from the stored semantic features, a second semantic feature whose similarity to the first semantic feature is greater than a preset similarity threshold.
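The disclosure fixes only a threshold test, not the similarity metric; the following sketch assumes cosine similarity over stored feature vectors, and the 0.8 default is likewise an illustrative assumption:

```python
# Threshold test over stored semantic features, assuming cosine similarity.
import numpy as np

def matching_feature_indices(query: np.ndarray, stored: np.ndarray,
                             threshold: float = 0.8) -> np.ndarray:
    """Indices of stored feature rows whose cosine similarity to the
    query feature exceeds the preset threshold."""
    stored_unit = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarities = stored_unit @ query_unit
    return np.flatnonzero(similarities > threshold)
```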
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method and will not be repeated here.
It should be noted that the apparatus for querying a target image provided in the above embodiments is illustrated only by the division of the functional modules described above; in practical applications, these functions may be allocated to different functional modules as needed, that is, the internal structure of the server may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for querying a target image provided in the above embodiments belongs to the same concept as the method embodiments for querying a target image; the specific implementation process is detailed in the method embodiments and is not repeated here.
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device 700 may vary considerably in configuration and performance, and may include one or more processors (central processing units, CPUs) 701 and one or more memories 702, where the memory 702 stores at least one instruction that is loaded and executed by the processor 701 to implement the following method steps of querying a target image:
extracting an image of a monitoring object of a target type from the recorded video;
inputting the images of the same monitoring object extracted from the same video segment into a first semantic extraction model to obtain semantic features corresponding to the monitoring object in the video segment, and storing the obtained semantic features;
when a monitoring object query request carrying attribute information of a target monitoring object is received, inputting the attribute information into a second semantic extraction model to obtain a first semantic feature corresponding to the attribute information;
and determining a second semantic feature that satisfies a preset similarity condition with the first semantic feature among the stored semantic features, acquiring at least one image corresponding to the second semantic feature, and feeding back the at least one image in response to the monitoring object query request.
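Putting the stored-feature side and the query side together, the server's query path could look like the following sketch; the store layouts and helper names are assumptions for illustration:

```python
# End-to-end sketch of handling a monitoring object query request; the
# feature/image stores and injected helpers are hypothetical.
def handle_query_request(attribute_info, data_type,
                         extract_first_feature,    # data-type dispatch, as sketched earlier
                         matching_feature_indices, # similarity test, as sketched earlier
                         stored_features, images_by_index):
    first_feature = extract_first_feature(attribute_info, data_type)
    hit_indices = matching_feature_indices(first_feature, stored_features)
    # feed back at least one image per matching second semantic feature
    return [images_by_index[i] for i in hit_indices]
```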
Optionally, the at least one instruction is loaded and executed by the processor 701 to implement the following method steps:
inputting the images of the same monitoring object extracted from the same video segment into the image feature extraction sub-model to obtain image features;
inputting the image features into the image semantic generation sub-model to obtain a semantic description character string;
and inputting the semantic description character string into the semantic feature extraction sub-model to obtain the semantic features corresponding to the monitoring objects in the video segment.
Optionally, the at least one instruction is loaded and executed by the processor 701 to implement the following method steps:
determining a second semantic extraction model corresponding to the attribute information according to the data type of the attribute information and the corresponding relation between the data type and the semantic extraction model;
and inputting the attribute information into the second semantic extraction model corresponding to the attribute information to obtain a first semantic feature corresponding to the attribute information.
Optionally, the at least one instruction is loaded and executed by the processor 701 to implement the following method steps:
inputting the attribute information of the image type into the image feature extraction sub-model to obtain image features corresponding to the attribute information of the image type;
inputting image features corresponding to the attribute information of the image type into the image semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the image type;
and inputting the semantic description character string corresponding to the attribute information of the image type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the image type.
Optionally, the at least one instruction is loaded and executed by the processor 701 to implement the following method steps:
inputting the attribute information of the audio type into the audio semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the audio type;
and inputting the semantic description character string corresponding to the attribute information of the audio type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the audio type.
Optionally, the at least one instruction is loaded and executed by the processor 701 to implement the following method steps:
and inputting the attribute information of the character type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the character type.
Optionally, the at least one instruction is loaded and executed by the processor 701 to implement the following method steps:
and determining, from the stored semantic features, a second semantic feature whose similarity to the first semantic feature is greater than a preset similarity threshold.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program that instructs relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the invention.

Claims (12)

1. A method of querying a target image, the method comprising:
extracting a series of images of each monitoring object of a target type from a recorded video, wherein the series of images of a monitoring object is obtained by the following processing: obtaining a coordinate sequence based on a series of position information of the monitoring object in video frames of the video, and carrying out target part segmentation on each video frame of the monitoring object according to the coordinate sequence so as to obtain the series of images of the monitoring object;
inputting a series of images of the same monitoring object extracted from the same video segment into an image feature extraction sub-model to obtain global image features, local image features, motion features and associated features with other objects of the monitoring object, and synthesizing the global image features, the local image features, the motion features and the associated features with other objects to obtain image features of the monitoring object;
inputting the image features into an image semantic generation sub-model to obtain a semantic description character string;
inputting the semantic description character strings into a semantic feature extraction sub-model to obtain semantic features corresponding to the monitoring objects in the video segment, and storing the obtained semantic features; the semantic feature extraction sub-model comprises a word segmentation module and a natural language processing module, wherein the word segmentation module is used for extracting keywords after word segmentation of the semantic description character string, and the natural language processing module is used for converting the keywords extracted by the word segmentation module into semantic features;
when a monitoring object query request carrying attribute information of a target monitoring object is received, determining a second semantic extraction model corresponding to the attribute information according to the data type of the attribute information and the corresponding relation between the data type and the semantic extraction model; inputting the attribute information into the second semantic extraction model corresponding to the attribute information to obtain a first semantic feature corresponding to the attribute information; the data type of the attribute information includes one or more of an image type, an audio type, and a character type;
and determining, among the stored semantic features, a second semantic feature that satisfies a preset similarity condition with the first semantic feature, acquiring at least one image corresponding to the second semantic feature, and feeding back the at least one image in response to the monitoring object query request.
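Claim 1 leaves open how the four per-object feature types are synthesized into one image feature; one plausible, purely illustrative reading is simple concatenation:

```python
# Illustrative synthesis of the global, local, motion, and association
# features from claim 1; concatenation is an assumption, since the claim
# does not fix the combining operation.
import numpy as np

def synthesize_image_feature(global_feat: np.ndarray, local_feat: np.ndarray,
                             motion_feat: np.ndarray, assoc_feat: np.ndarray) -> np.ndarray:
    return np.concatenate([global_feat, local_feat, motion_feat, assoc_feat])
```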
2. The method according to claim 1, wherein if the attribute information includes attribute information of an image type, the second semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model, and the inputting the attribute information into the second semantic extraction model to obtain a first semantic feature corresponding to the attribute information includes:
inputting the attribute information of the image type into the image feature extraction sub-model to obtain image features corresponding to the attribute information of the image type;
inputting image features corresponding to the attribute information of the image type into the image semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the image type;
and inputting the semantic description character string corresponding to the attribute information of the image type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the image type.
3. The method according to claim 1, wherein if the attribute information includes attribute information of an audio type, the second semantic extraction model includes an audio semantic generation sub-model and a semantic feature extraction sub-model, and the inputting the attribute information into the second semantic extraction model to obtain the first semantic feature corresponding to the attribute information includes:
inputting the attribute information of the audio type into the audio semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the audio type;
and inputting the semantic description character string corresponding to the attribute information of the audio type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the audio type.
4. The method according to claim 1, wherein if the attribute information includes attribute information of a character type, the second semantic extraction model includes a semantic feature extraction sub-model, and the inputting the attribute information into the second semantic extraction model to obtain the first semantic feature corresponding to the attribute information includes:
and inputting the attribute information of the character type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the character type.
5. The method according to claim 1, wherein determining, among the stored semantic features, a second semantic feature satisfying a preset similarity condition with the first semantic feature comprises:
and determining, from the stored semantic features, a second semantic feature whose similarity to the first semantic feature is greater than a preset similarity threshold.
6. An apparatus for querying a target image, the apparatus comprising:
the extraction module is used for extracting a series of images of each monitoring object of the target type from the recorded video, wherein the series of images of the monitoring objects are obtained through the following processing: obtaining a coordinate sequence based on a series of position information of the monitoring object in video frames of the video, and carrying out target part segmentation on each video frame of the monitoring object according to the coordinate sequence so as to obtain a series of images of the monitoring object;
the first acquisition module is used for inputting a series of images of the same monitoring object extracted in the same video segment into an image feature extraction sub-model to obtain global image features, local image features, motion features and associated features with other objects of the monitoring object, and synthesizing the global image features, the local image features, the motion features and the associated features with other objects to obtain image features of the monitoring object; inputting the image features into an image semantic generation sub-model to obtain a semantic description character string; inputting the semantic description character strings into a semantic feature extraction sub-model to obtain semantic features corresponding to the monitoring objects in the video segment, and storing the obtained semantic features; the semantic feature extraction sub-model comprises a word segmentation module and a natural language processing module, wherein the word segmentation module is used for extracting keywords after word segmentation of the semantic description character string, and the natural language processing module is used for converting the keywords extracted by the word segmentation module into semantic features;
the second acquisition module is used for, when a monitoring object query request carrying the attribute information of the target monitoring object is received, determining a second semantic extraction model corresponding to the attribute information according to the data type of the attribute information and the corresponding relation between the data type and the semantic extraction model; inputting the attribute information into the second semantic extraction model corresponding to the attribute information to obtain a first semantic feature corresponding to the attribute information; the data type of the attribute information includes one or more of an image type, an audio type, and a character type;
the feedback module is used for determining, among the stored semantic features, a second semantic feature that satisfies a preset similarity condition with the first semantic feature, acquiring at least one image corresponding to the second semantic feature, and feeding back the at least one image in response to the monitoring object query request.
7. The apparatus of claim 6, wherein if the attribute information includes attribute information of an image type, the second semantic extraction model includes an image feature extraction sub-model, an image semantic generation sub-model, and a semantic feature extraction sub-model, the second acquisition module to:
inputting the attribute information of the image type into the image feature extraction sub-model to obtain image features corresponding to the attribute information of the image type;
inputting image features corresponding to the attribute information of the image type into the image semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the image type;
and inputting the semantic description character string corresponding to the attribute information of the image type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the image type.
8. The apparatus of claim 6, wherein if the attribute information includes attribute information of an audio type, the second semantic extraction model includes an audio semantic generation sub-model and a semantic feature extraction sub-model, the second acquisition module to:
inputting the attribute information of the audio type into the audio semantic generation sub-model to obtain a semantic description character string corresponding to the attribute information of the audio type;
and inputting the semantic description character string corresponding to the attribute information of the audio type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the audio type.
9. The apparatus of claim 6, wherein if the attribute information includes attribute information of a character type, the second semantic extraction model includes a semantic feature extraction sub-model, the second acquisition module to:
and inputting the attribute information of the character type into the semantic feature extraction sub-model to obtain a first semantic feature corresponding to the attribute information of the character type.
10. The apparatus of claim 6, wherein the feedback module is configured to:
and determining a second semantic feature with similarity larger than a preset similarity threshold value from the stored semantic features.
11. A computer device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory to implement the method steps of any one of claims 1-5.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the method of querying a target image as claimed in any one of claims 1 to 5.
CN201810615126.8A 2018-06-14 2018-06-14 Method and device for inquiring target image Active CN110674342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810615126.8A CN110674342B (en) 2018-06-14 2018-06-14 Method and device for inquiring target image

Publications (2)

Publication Number Publication Date
CN110674342A (en) 2020-01-10
CN110674342B (en) 2023-04-25

Family

ID=69065922

Country Status (1)

Country Link
CN (1) CN110674342B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220202B (en) * 2021-05-31 2023-08-11 中国联合网络通信集团有限公司 Control method and device for Internet of things equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145648B2 (en) * 2008-09-03 2012-03-27 Samsung Electronics Co., Ltd. Semantic metadata creation for videos

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186634A (en) * 2011-12-31 2013-07-03 无锡物联网产业研究院 Method and device for retrieving intelligent traffic monitoring video
CN104680123A (en) * 2013-11-26 2015-06-03 富士通株式会社 Object identification device, object identification method and program
CN107391505A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 A kind of image processing method and system
CN106599198A (en) * 2016-12-14 2017-04-26 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description method for multi-stage connection recurrent neural network
CN106874258A (en) * 2017-02-16 2017-06-20 西南石油大学 A kind of text similarity computational methods and system based on Hanzi attribute vector representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Algorithms for a Content-Based Traffic Video Retrieval System; Li Wangwei; Wanfang Data Knowledge Service Platform (万方数据知识服务平台); 2007-08-16; page 37 paragraph 3 to page 38 paragraph 4, and page 41 paragraph 2 to page 45 paragraph 4 *

Also Published As

Publication number Publication date
CN110674342A (en) 2020-01-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant