WO2021017289A1 - Method and apparatus for locating object in video, and computer device and storage medium - Google Patents

Method and apparatus for locating object in video, and computer device and storage medium

Info

Publication number
WO2021017289A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
candidate
video
face
Prior art date
Application number
PCT/CN2019/117702
Other languages
French (fr)
Chinese (zh)
Inventor
张磊
宋晨
李雪冰
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021017289A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7847 - Retrieval using low-level visual features of the video content
    • G06F16/785 - Retrieval using low-level visual features of the video content using colour or luminescence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7847 - Retrieval using low-level visual features of the video content
    • G06F16/7854 - Retrieval using low-level visual features of the video content using shape
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification

Definitions

  • This application belongs to the field of artificial intelligence, and in particular relates to a method, device, computer equipment, and storage medium for locating an object in a video.
  • security systems use large numbers of video capture devices to monitor scenes in real time through video and to record video data for later review, in order to maintain public safety.
  • This application provides a method, an apparatus, a computer device and a storage medium for locating an object in a video, to address the problem that locating an object is time-consuming.
  • this application proposes a method for locating an object in a video, which includes the following steps:
  • the first image feature of the object to be located includes an image contour feature and/or an image color feature;
  • the facial feature of the object to be located is compared with the images of the candidate objects, and a candidate object that matches the facial feature of the object to be located is determined to be the object to be located.
  • an embodiment of the present application also provides an apparatus for locating an object in a video, including:
  • the first acquisition module is configured to acquire a first image feature of the object to be positioned, the first image feature including image contour and/or image color feature;
  • a retrieval module configured to retrieve a preset video database according to the first image feature of the object to be located, and obtain an image of a candidate object that matches the first image feature of the object to be located;
  • the second acquisition module is used to acquire the facial features of the object to be located
  • a processing module, configured to compare the facial feature of the object to be located with the images of the candidate objects, and determine that the object among the candidate objects that matches the facial feature of the object to be located is the object to be located.
  • an embodiment of the present application further provides a computer device including a memory and a processor.
  • the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for locating an object in the video.
  • the embodiments of the present application further provide one or more non-volatile readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by a processor, they cause the processor to execute the steps of the method for locating an object in the video.
  • FIG. 1 is a schematic diagram of the basic flow of a method for locating an object in a video according to an embodiment of this application;
  • FIG. 2 is a schematic diagram of a process of acquiring a first image feature of an object to be positioned according to an embodiment of the application;
  • FIG. 3 is a schematic diagram of a process for determining a candidate object according to an embodiment of the application
  • FIG. 4 is a schematic diagram of a training process of a convolutional neural network model according to an embodiment of the application
  • FIG. 5 is a schematic diagram of a process of determining an object to be located according to an embodiment of the application
  • FIG. 6 is a block diagram of the basic structure of an apparatus for locating an object in a video according to an embodiment of this application;
  • FIG. 7 is a block diagram of the basic structure of the computer equipment implemented in this application.
  • the terms "terminal" and "terminal device" used herein cover both devices that have only a wireless signal receiver with no transmitting capability and devices that have receiving and transmitting hardware capable of two-way communication over a two-way communication link.
  • such devices may include: cellular or other communication devices with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio-frequency receiver.
  • the "terminal" and "terminal device" used here may be portable, transportable, installed in a vehicle (air, sea and/or land), or suitable for and/or configured to operate locally and/or, in a distributed form, at any location on the earth and/or in space.
  • the "terminal" and "terminal device" used here may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playback function, and may also be a device such as a smart TV or a set-top box.
  • the terminals involved in this embodiment are the aforementioned terminals.
  • FIG. 1 is a schematic flowchart of a method for locating an object in a video according to this embodiment.
  • a method for positioning an object in a video includes the following steps:
  • the first image feature of the object to be positioned is received through an interactive interface, where the object to be positioned refers to a specific person, and the first image feature here refers to the contour feature or color feature of the image containing the object to be positioned, or a combination of the two.
  • contour features include a person's build, such as being tall, short, heavy or slim
  • color features include people's skin color, hair color, and clothing color.
  • the aforementioned features can be input by the user through an interactive interface.
  • the contour feature extraction algorithm and the color feature extraction algorithm are used to obtain the first image feature of the object to be located. Specifically, please refer to FIG. 2.
  • S102 Search a preset video database according to the first image feature of the object to be located, and obtain an image of the candidate object that matches the first image feature of the object to be located;
  • the preset video database is retrieved according to the first image feature, where the preset video database refers to the storage space where the video collected by the video surveillance device is saved.
  • the existing semantic-based retrieval requires pre-marking of the semantic attributes of the image.
  • the image comes from the real-time collection of the video surveillance equipment, and pre-marking is not applicable.
  • similar feature comparison is used.
  • for the specific algorithm, please refer to Figure 3.
  • the face feature of the object to be located is acquired through an interactive interface, where the face feature is an n-dimensional vector representing the feature of the face image.
  • Image features are the (essential) characteristics that distinguish one class of objects from other classes, or a collection of such characteristics.
  • Features are data that can be extracted through measurement or processing. Every image has characteristics of its own that distinguish it from other images: some are natural features that can be perceived intuitively, such as brightness, edges, texture and color; others can be obtained only through transformation or processing, such as moments, histograms and principal components.
  • Grayscale conversion treats the image as a three-dimensional function of x, y and z (gray level).
  • a pre-trained convolutional neural network is used to extract the feature vectors of a face image.
  • Compared with other methods, features extracted by a convolutional neural network are less prone to over-fitting, and the fitting capacity of the overall model can be controlled more flexibly through the choice of convolution and pooling layers and the size of the final output feature vector. See Figure 4 for the training steps.
  • In this step, the facial feature of the object to be located is compared with the images of the candidate objects obtained in step S102, and the candidate object having the same facial feature as the object to be located is determined to be the final object to be located.
  • the face image of the candidate object is cropped out, the face feature vector of the candidate object is obtained in the same manner as in step S103, and the similarity between the two vectors is compared.
  • Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal.
  • the specific calculation formula is: cos(θ) = Σ(A_i × B_i) / ( √(Σ A_i²) × √(Σ B_i²) ), with the sums taken over i = 1, ..., n;
  • A_i and B_i represent the components of vectors A and B, respectively.
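  • As an illustrative sketch only (not part of the original disclosure), the cosine similarity between two face feature vectors can be computed in Python as follows; the vector values and the 0.8 threshold are assumed for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, a value in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional face feature vectors; real embeddings are larger (e.g. 128-D).
query_feature = [0.12, -0.45, 0.88, 0.03]
candidate_feature = [0.10, -0.40, 0.90, 0.00]

similarity = cosine_similarity(query_feature, candidate_feature)
if similarity > 0.8:  # assumed threshold; the application leaves the exact value to the implementer
    print("candidate matches the object to be located:", similarity)
```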
  • the facial features of the candidate objects are obtained with a preset facial feature extraction model, and the similarity between the facial feature of the object to be located and the facial features of the candidate objects is then compared. Please refer to FIG. 5 for details.
  • In step S101, the following steps are further included:
  • S112. Process the image of the object to be located according to an image contour feature extraction algorithm and/or a color feature extraction algorithm to obtain the first image feature of the object to be located.
  • Image contour feature extraction is performed on the image of the object to be positioned.
  • the contour features of the image can be extracted with an image gradient algorithm.
  • the gradient of the image function f(x, y) at the point (x, y) is a vector with magnitude and direction; let Gx and Gy denote the gradients in the x direction and the y direction, respectively.
  • the gradient vector can be expressed as: ∇f(x, y) = (Gx, Gy) = (∂f/∂x, ∂f/∂y)
  • in a digital image, the gradient can be approximated as: Gx = f(x, y) - f(x-1, y) and Gy = f(x, y) - f(x, y-1)
  • f(x, y) is the image function of the image whose contour is to be calculated
  • f(x, y), f(x-1, y) and f(x, y-1) are the values of the image function at the points (x, y), (x-1, y) and (x, y-1), respectively
  • Gx and Gy are the gradients of the image function f(x, y) in the x direction and the y direction, respectively.
  • the direction of the gradient is the direction in which the function f(x, y) changes fastest.
  • where there are edges in the image, the gradient value is large; conversely, in relatively smooth parts of the image the gray value changes little and the corresponding gradient is also small.
  • the image gradient algorithm considers the grayscale change in a neighborhood of each pixel of the image and, using the first-order or second-order derivative behavior near edges, applies a gradient operator, such as the Sobel operator, Robinson operator or Laplace operator, over a neighborhood of each pixel of the original image; convolving the original image with the gradient operator yields the contour of the target object image.
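  • A minimal Python sketch of such contour extraction, assuming OpenCV, the Sobel operator and an input file named person.jpg, is shown below; the kernel size and edge threshold are illustrative choices, not values specified in this application.

```python
import cv2
import numpy as np

# Read the image of the object to be located and convert it to grayscale.
image = cv2.imread("person.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Convolve with the Sobel operator to approximate Gx and Gy.
gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

# Gradient magnitude sqrt(Gx^2 + Gy^2); large values correspond to edges.
magnitude = cv2.magnitude(gx, gy)
contour_map = np.uint8(np.clip(magnitude, 0, 255))

# Keep only strong edges as the contour of the target object (threshold is illustrative).
_, contour_mask = cv2.threshold(contour_map, 60, 255, cv2.THRESH_BINARY)
cv2.imwrite("contour.png", contour_mask)
```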
  • the color is a global feature that describes the surface properties of the scene corresponding to the image or image area.
  • color features are generally pixel-based features.
  • the target detection algorithm is realized by the cascaded convolutional neural network model.
  • the color characteristics of the target image are obtained by calculating the color histogram of the cropped image.
  • the color histogram can be calculated by the API function calcHist provided in OpenCV for calculating the image histogram.
  • the images in the video stream are also down-sampled by the same factor, and the down-sampled images are matched against the contour feature or color feature of the extracted target object to obtain the candidate target objects.
  • the pixel data is reduced, which can reduce the amount of calculation and speed up the calculation.
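  • The following Python sketch, assuming OpenCV, computes the color histogram of a cropped target image with calcHist and compares it with a down-sampled video frame; the file names, down-sampling factor and bin counts are assumptions made only for illustration.

```python
import cv2

def color_histogram(image_bgr, bins=(8, 8, 8)):
    """Per-channel BGR histogram, normalized so images of different sizes are comparable."""
    hist = cv2.calcHist([image_bgr], [0, 1, 2], None, list(bins), [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

target = cv2.imread("target_crop.png")   # cropped image of the object to be located
frame = cv2.imread("video_frame.png")    # one frame decoded from the video database

# Down-sample the frame (a factor of 2 is assumed here) to reduce pixel data
# and speed up the comparison, as described above.
frame_small = cv2.resize(frame, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)

similarity = cv2.compareHist(color_histogram(target), color_histogram(frame_small),
                             cv2.HISTCMP_CORREL)
print("colour similarity:", similarity)
```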
  • In step S102, the following steps are further included:
  • the video image frame is the decomposition of the video, and third-party software can be used to decompose the video to obtain the video image frame.
  • S122. Input the video image frame into a preset target detection model, and obtain an image of the target object output by the target detection model in response to the video image frame, wherein the preset target detection model is based on a pre-trained deep learning neural network; the deep learning neural network performs target detection on the input video image frame, and the output image of the target object is a human body image;
  • the obtained video image frames often contain more content than just the target image.
  • the target detection is performed on the video image frame.
  • the target detection in this application is to detect the human body, and the purpose is to remove other parts except the human body image.
  • the image of the target object obtained after detection is a human body image.
  • a pre-trained deep learning neural network is used to detect the target object.
  • the video image frame is first divided into equal parts.
  • the input image is divided into a 7*7 grid of cells.
  • the gridded image is input into the deep learning neural network, and for each grid cell the deep learning neural network predicts 2 prediction boxes.
  • each predicted prediction box contains 5 values: x, y, w, h and confidence.
  • x and y are the center coordinates of the prediction box, and w and h are the width and height of the prediction box.
  • the third convolutional neural network outputs a 7x7x(2x5+1) prediction tensor that is used in the next step to determine the target object prediction boxes. After the prediction tensor is obtained, filtering is performed by setting a confidence threshold.
  • prediction boxes with a confidence below the threshold are filtered out, leaving only the higher-confidence boxes as the remaining boxes. Then, for each remaining prediction box, the IOU (overlap) between that box and each of the other remaining boxes is calculated in turn; if the IOU is greater than the preset threshold, the overlapping prediction box is eliminated. This process is repeated over the remaining prediction boxes until all prediction boxes have been processed and the image of the target object is obtained.
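  • The confidence filtering and IOU-based elimination described above can be sketched in Python as follows; the box coordinates, confidences and both thresholds are hypothetical values used only to illustrate the procedure.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_predictions(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Drop low-confidence boxes, then suppress overlapping boxes (non-maximum suppression)."""
    keep_idx = [i for i, s in enumerate(scores) if s >= conf_thresh]
    keep_idx.sort(key=lambda i: scores[i], reverse=True)
    selected = []
    for i in keep_idx:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in selected):
            selected.append(i)
    return [boxes[i] for i in selected]

# Hypothetical prediction boxes (x1, y1, x2, y2) and confidences from the 7x7x(2x5+1) tensor.
boxes = [(40, 30, 120, 210), (45, 35, 125, 215), (300, 60, 360, 200)]
scores = [0.92, 0.85, 0.30]
print(filter_predictions(boxes, scores))
```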
  • S123 Calculate the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
  • the image of the target object is calculated according to the image contour feature extraction algorithm and/or the color feature extraction algorithm, and the specific algorithm is the same as in step S112.
  • contour feature matching is realized by the contour moment matching method.
  • Contour moments can be spatial moments, central moments, etc.
  • the spatial moment of order (p, q) has the form m_pq = Σ I(x, y) · x^p · y^q, with the sum taken over the points on the contour, where:
  • I(x, y) is the value of the image contour at the pixel point (x, y)
  • n is the number of points on the contour
  • p and q are the orders of the moment in the x and y dimensions, respectively, giving the moments m00, m10, m01, ..., m03
  • the zero-order moment m00 is a simple accumulation of the points on the contour, that is, the number of points on the contour.
  • the first moments m10 and m01 are the accumulation in the x and y directions, respectively.
  • the spatial moment can be calculated by the OpenCV function cvGetSpatialMoment().
  • Color feature matching uses the histogram comparison function compareHist() provided by OpenCV to compare the similarity.
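  • As an illustration, the following Python sketch computes contour moments and a moment-based contour match using OpenCV's Python interface (cv2.moments and cv2.matchShapes, the counterparts of the C-style cvGetSpatialMoment() mentioned above); the file names, the threshold value and OpenCV 4's two-value findContours return are assumptions.

```python
import cv2

def largest_contour(gray):
    """Binarize the grayscale image and return its largest external contour."""
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea)

target = cv2.imread("target_crop.png", cv2.IMREAD_GRAYSCALE)
candidate = cv2.imread("candidate_crop.png", cv2.IMREAD_GRAYSCALE)

# Spatial moments m00, m10, m01, ... of the target's largest contour.
moments_target = cv2.moments(largest_contour(target))
print("m00:", moments_target["m00"])

# Moment-based contour matching; smaller values mean more similar shapes.
shape_distance = cv2.matchShapes(largest_contour(target), largest_contour(candidate),
                                 cv2.CONTOURS_MATCH_I1, 0.0)
print("contour distance:", shape_distance)
```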
  • the training of the pre-trained convolutional neural network model includes the following steps:
  • the training samples are face images marked with identity identifiers.
  • the training samples are input into the convolutional neural network model, and the convolutional neural network model outputs a prediction result for the identity of each sample.
  • the loss function takes the form loss = -(1/N) Σ y_i · log(h_i), with the sum taken over i = 1, ..., N, where:
  • N is the number of training samples
  • y_i is the labeled result corresponding to the i-th sample
  • h = (h1, h2, ..., h_i) is the prediction result of sample i.
  • the loss function is used to compare whether the prediction result of the identity of the training sample is consistent with the marked identity, and the Softmax cross-entropy loss function is used in the embodiment of the application.
  • when the value of the loss function no longer decreases, or begins to increase, the training of the convolutional neural network can be considered finished.
  • the gradient descent method is an optimization algorithm used in machine learning and artificial intelligence that recursively approximates the model with minimum deviation.
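  • A framework-agnostic sketch of this training procedure is given below: a toy linear classifier stands in for the convolutional neural network, and one hundred gradient-descent steps are run on a softmax cross-entropy loss; the sample data, dimensions and learning rate are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for face features and identity labels: 6 samples, 4 features, 3 identities.
features = rng.normal(size=(6, 4))
labels = np.array([0, 1, 2, 0, 1, 2])

weights = np.zeros((4, 3))   # linear classifier standing in for the CNN's final layer
learning_rate = 0.1

for step in range(100):
    logits = features @ weights
    # Softmax over the identity classes.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Cross-entropy loss: -(1/N) * sum of log-probability of each sample's true identity.
    loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    # Gradient descent: step against the gradient of the loss with respect to the weights.
    grad_logits = probs.copy()
    grad_logits[np.arange(len(labels)), labels] -= 1.0
    grad_logits /= len(labels)
    weights -= learning_rate * features.T @ grad_logits

print("final loss:", loss)
```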
  • In step S104, the following steps are further included:
  • In step S102, an image of the candidate object is obtained; face detection is performed on the candidate object image, and the face image of the candidate object is cropped out.
  • the face detection method is the same as the method described in step S122.
  • S142 Input the face image of the candidate object into the preset face feature extraction model, and obtain the face feature of the candidate object.
  • the face image of the candidate object is input to the preset face feature extraction model.
  • the preset face feature extraction model uses a pre-trained convolutional neural network model, and the training steps are the same as in FIG. 4.
  • Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal.
  • the specific calculation formula is: cos(θ) = Σ(A_i × B_i) / ( √(Σ A_i²) × √(Σ B_i²) ), with the sums taken over i = 1, ..., n;
  • A_i and B_i represent the components of vectors A and B, respectively.
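  • The following Python sketch pieces these steps together for one candidate image; the Haar-cascade detector stands in for the face detection step, extract_face_features is a hypothetical placeholder for the preset face feature extraction model, and the 0.9 threshold is assumed for illustration.

```python
import cv2
import numpy as np

def extract_face_features(face_image):
    """Placeholder for the preset face feature extraction model (a pre-trained CNN in
    the application); here it simply returns a flattened, L2-normalized thumbnail."""
    thumb = cv2.resize(cv2.cvtColor(face_image, cv2.COLOR_BGR2GRAY), (16, 16)).astype(np.float32)
    vec = thumb.flatten()
    return vec / (np.linalg.norm(vec) + 1e-9)

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

candidate = cv2.imread("candidate.png")
query_feature = extract_face_features(cv2.imread("query_face.png"))

gray = cv2.cvtColor(candidate, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
    face = candidate[y:y + h, x:x + w]            # crop the candidate's face image
    similarity = float(np.dot(query_feature, extract_face_features(face)))
    if similarity > 0.9:                          # assumed second threshold
        print("candidate located at", (x, y, w, h), "similarity", similarity)
```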
  • FIG. 6 is a basic structural block diagram of an apparatus for positioning an object in a video in this embodiment.
  • an apparatus for locating an object in a video includes a first acquisition module 210, a retrieval module 220, a second acquisition module 230, and a processing module 240.
  • the first acquisition module 210 is used to acquire the first image feature of the object to be located, where the first image feature includes an image contour feature and/or an image color feature;
  • the retrieval module 220 is configured to search the preset video database according to the first image feature of the object to be located and obtain the images of the candidate objects that match the first image feature of the object to be located;
  • the second acquisition module 230 is used to obtain the facial feature of the object to be located;
  • the processing module 240 is used to compare the facial feature of the object to be located with the images of the candidate objects and determine that the candidate object matching the facial feature of the object to be located is the object to be located.
  • In the embodiment of this application, a first image feature of the object to be located is acquired, the first image feature including an image contour feature and/or an image color feature; a preset video database is searched according to the first image feature of the object to be located to obtain images of candidate objects that match the first image feature of the object to be located; a facial feature of the object to be located is acquired; and the facial feature of the object to be located is compared with the images of the candidate objects to determine that the candidate object matching the facial feature of the object to be located is the object to be located. Searching the video database with the first image feature quickly narrows the search down to the candidate objects, and the object to be located is then located according to the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object locating.
  • the first acquisition module 210 further includes: a first acquisition sub-module, used to acquire an image of the object to be located; and a first processing sub-module, used to process the image of the object to be located according to an image contour feature extraction algorithm and/or a color feature extraction algorithm to obtain the first image feature of the object to be located.
  • the second acquisition module 230 further includes: a second acquisition sub-module, used to acquire a face image of the object to be located; and a second processing sub-module, used to input the face image of the object to be located into a preset face feature extraction model to obtain the facial feature of the object to be located.
  • the retrieval module 220 further includes: a third acquisition sub-module, configured to acquire video image frames, where the video image frames are decompositions of the videos stored in the preset video database; a detection sub-module, used to input a video image frame into a preset target detection model and obtain an image of the target object output by the target detection model in response to the video image frame, where the preset target detection model is based on a pre-trained deep learning neural network, the image of the target object includes a human body image, and the deep learning neural network performs target detection on the input video image frame to output the human body image; a first calculation sub-module, used to calculate the first image feature of the target object from the image of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm; and a third processing sub-module, used to determine the target object to be a candidate object when the degree of matching between the first image feature of the object to be located and the first image feature of the target object is greater than a preset first threshold.
  • the processing module 240 further includes: a fourth acquisition sub-module, configured to acquire the face image of the candidate object, the face image of the candidate object is intercepted from the image of the candidate object
  • the second calculation sub-module is used to input the face image of the candidate object into the preset facial feature extraction model to obtain the facial features of the candidate object
  • the fourth processing sub-module is used to calculate the degree of matching between the facial feature of the object to be located and the facial feature of the candidate object and, when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located.
  • the preset facial feature extraction model is based on a pre-trained convolutional neural network model, and the second calculation sub-module further includes: a fifth acquisition sub-module, used to obtain training samples marked with identity identifiers, the training samples being face images marked with different identity identifiers; a first prediction sub-module, used to input the training samples into the convolutional neural network model and obtain the identity identifier prediction result of each training sample; and a first comparison sub-module, used to compare, according to a loss function, whether the identity identifier prediction result of a training sample is consistent with its marked identity identifier, where the loss function takes the form loss = -(1/N) Σ y_i · log(h_i), in which:
  • N is the number of training samples
  • y_i is the labeled result corresponding to the i-th sample
  • the fifth processing sub-module is used to, when the identity identifier prediction result is inconsistent with the identity identifier, iteratively update the weights in the convolutional neural network model until the loss function converges.
  • the image contour feature extraction algorithm adopts an image gradient algorithm, and the gradient is expressed as: Gx = f(x, y) - f(x-1, y), Gy = f(x, y) - f(x, y-1), where:
  • f(x, y) is the image function of the image whose contour is to be calculated
  • f(x, y), f(x-1, y) and f(x, y-1) are the values of the image function at the points (x, y), (x-1, y) and (x, y-1), respectively
  • Gx and Gy are the gradients of the image function f(x, y) in the x direction and the y direction, respectively.
  • FIG. 7 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store control information sequences.
  • when the computer-readable instructions are executed by the processor, the processor can implement a method for locating an object in a video.
  • the processor of the computer equipment is used to provide calculation and control capabilities, and supports the operation of the entire computer equipment.
  • computer-readable instructions may be stored in the memory of the computer device; when the computer-readable instructions are executed by the processor, they may cause the processor to execute a method for locating an object.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • FIG. 7 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the processor is used to execute the specific content of the first acquisition module 210, the retrieval module 220, the second acquisition module 230, and the processing module 240 in FIG. 6, and the memory stores computer readable instructions and various types of instructions required to execute the above modules. data.
  • the network interface is used for data transmission between user terminals or servers.
  • the memory in this embodiment stores the computer-readable instructions and data required to execute all sub-modules in the method for locating objects in the video, and the server can call the computer-readable instructions and data of the server to perform the functions of all the sub-modules.
  • the computer device acquires a first image feature of the object to be located, the first image feature including an image contour feature and/or an image color feature; searches the preset video database according to the first image feature of the object to be located to obtain images of candidate objects that match the first image feature of the object to be located; acquires the facial feature of the object to be located; and compares the facial feature of the object to be located with the images of the candidate objects to determine that the candidate object matching the facial feature of the object to be located is the object to be located. Searching the video database with the first image feature quickly narrows the search down to the candidate objects, and the object to be located is then located according to the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object locating.
  • the present application also provides one or more non-volatile storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the method for locating an object in a video described in any of the foregoing embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present application belongs to the field of artificial intelligence. Disclosed are a method and apparatus for locating an object in a video, and a computer device and a storage medium. The method comprises the following steps: acquiring a first image feature of an object to be located, wherein the first image feature includes an image contour and/or image color feature; searching a preset video database according to the first image feature of the object to be located, and acquiring images of candidate objects matching the first image feature of the object to be located; acquiring a facial feature of the object to be located; and comparing the facial feature of the object to be located with the images of the candidate objects to determine an object, matching the facial feature of the object to be located, from among the candidate objects to be the object to be located. The video database is searched according to the first image feature, such that the candidate objects can be rapidly located, and the object to be located is then located according to the facial feature, thereby reducing the amount of calculation to a great extent, and improving the efficiency of locating an object in terms of time.

Description

Method, apparatus, computer device and storage medium for locating an object in a video
This application is based on the Chinese invention patent application No. 201910707924.8, filed on August 1, 2019 and titled "Method, apparatus, computer device and storage medium for locating an object in a video", and claims priority thereto.
Technical Field
This application belongs to the field of artificial intelligence, and in particular relates to a method, an apparatus, a computer device and a storage medium for locating an object in a video.
Background
With economic and social development and accelerating urbanization, population density in cities keeps increasing and the mobility of people grows day by day, giving rise to new problems in urban administration such as traffic, public order and counter-terrorism protection of key areas, and making social governance increasingly difficult. Security systems therefore use large numbers of video capture devices to monitor scenes in real time through video and to record video data for later review, in order to maintain public safety.
Analyzing the data collected by video surveillance equipment to identify, locate and track specific objects is routine work for public security agencies. However, relying only on manual inspection to recognize objects in the huge volume of video data is time-consuming, labor-intensive and inaccurate.
Although some video surveillance systems have introduced face recognition technology to locate objects, the inventors realized that face recognition requires high-precision video capture equipment: the higher the precision, the larger the video data produced, and the face recognition computation itself is complex. Retrieving the face of the object to be located from the huge volume of video data therefore takes a long time or requires substantial computing resources, which often cannot satisfy retrieval requirements in scenarios where computing resources are limited but timeliness is critical.
Summary of the Invention
This application provides a method, an apparatus, a computer device and a storage medium for locating an object in a video, to address the problem that locating an object is time-consuming.
To solve the above technical problem, this application proposes a method for locating an object in a video, including the following steps:
acquiring a first image feature of the object to be located, the first image feature including an image contour feature and/or an image color feature;
searching a preset video database according to the first image feature of the object to be located, and acquiring images of candidate objects that match the first image feature of the object to be located;
acquiring a facial feature of the object to be located; and
comparing the facial feature of the object to be located with the images of the candidate objects, and determining a candidate object that matches the facial feature of the object to be located to be the object to be located.
To solve the above technical problem, an embodiment of this application further provides an apparatus for locating an object in a video, including:
a first acquisition module, configured to acquire a first image feature of the object to be located, the first image feature including an image contour feature and/or an image color feature;
a retrieval module, configured to search a preset video database according to the first image feature of the object to be located, and acquire images of candidate objects that match the first image feature of the object to be located;
a second acquisition module, configured to acquire a facial feature of the object to be located; and
a processing module, configured to compare the facial feature of the object to be located with the images of the candidate objects, and determine that the candidate object matching the facial feature of the object to be located is the object to be located.
To solve the above technical problem, an embodiment of this application further provides a computer device, including a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for locating an object in a video described above.
To solve the above technical problem, an embodiment of this application further provides one or more non-volatile readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by a processor, they cause the processor to perform the steps of the method for locating an object in a video described above.
The details of one or more embodiments of this application are set forth in the drawings and the description below; other features and advantages of this application will become apparent from the description, the drawings and the claims.
Description of the Drawings
In order to describe the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application; those skilled in the art can obtain other drawings from these drawings without creative work.
FIG. 1 is a schematic flowchart of a method for locating an object in a video according to an embodiment of this application;
FIG. 2 is a schematic flowchart of acquiring a first image feature of an object to be located according to an embodiment of this application;
FIG. 3 is a schematic flowchart of determining candidate objects according to an embodiment of this application;
FIG. 4 is a schematic flowchart of training a convolutional neural network model according to an embodiment of this application;
FIG. 5 is a schematic flowchart of determining the object to be located according to an embodiment of this application;
FIG. 6 is a block diagram of the basic structure of an apparatus for locating an object in a video according to an embodiment of this application;
FIG. 7 is a block diagram of the basic structure of a computer device according to an embodiment of this application.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application.
Some of the flows described in the specification, the claims and the above drawings of this application contain multiple operations that appear in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, such as 101 and 102, are only used to distinguish different operations; the numbers themselves do not represent any execution order. In addition, these flows may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules and the like; they do not represent a sequence, nor do they require the "first" and the "second" to be of different types.
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application rather than all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative work fall within the protection scope of this application.
Embodiments
Those skilled in the art will understand that the terms "terminal" and "terminal device" used herein cover both devices that have only a wireless signal receiver with no transmitting capability and devices that have receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such devices may include: cellular or other communication devices with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio-frequency receiver. The "terminal" and "terminal device" used herein may be portable, transportable, installed in a vehicle (air, sea and/or land), or suitable for and/or configured to operate locally and/or, in a distributed form, at any location on the earth and/or in space. The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playback function, and may also be a device such as a smart TV or a set-top box.
The terminals involved in this embodiment are the terminals described above.
Specifically, referring to FIG. 1, FIG. 1 is a schematic flowchart of a method for locating an object in a video according to this embodiment.
As shown in FIG. 1, a method for locating an object in a video includes the following steps:
S101. Acquire a first image feature of an object to be located, the first image feature including an image contour feature and/or an image color feature.
The first image feature of the object to be located is received through an interactive interface. The object to be located here refers to a specific person, and the first image feature refers to a contour feature of an image containing the object to be located, a color feature of that image, or a combination of the two.
Contour features include a person's build (tall, short, heavy or slim), and color features include a person's skin color, hair color and clothing color. Specifically, these features can be input by the user through an interactive interface.
In the embodiment of this application, an image of the object to be located is acquired, and a contour feature extraction algorithm and a color feature extraction algorithm are used to obtain the first image feature of the object to be located; for details, refer to FIG. 2.
S102. Search a preset video database according to the first image feature of the object to be located, and acquire images of candidate objects that match the first image feature of the object to be located.
The preset video database is searched according to the first image feature; the preset video database here refers to the storage space in which the video collected by the video surveillance devices is saved. Existing semantic-based image retrieval requires the semantic attributes of images to be annotated in advance; in the embodiment of this application, the images come from real-time collection by the video surveillance devices, so advance annotation is not applicable, and a similar-feature comparison algorithm is used instead. For the specific algorithm, refer to FIG. 3.
S103. Acquire a facial feature of the object to be located.
The facial feature of the object to be located is acquired through an interactive interface; here the facial feature is an n-dimensional vector representing the features of a face image. Image features are the (essential) characteristics that distinguish one class of objects from other classes, or a collection of such characteristics. A feature is data that can be extracted through measurement or processing. Every image has characteristics of its own that distinguish it from other images: some are natural features that can be perceived intuitively, such as brightness, edges, texture and color; others can be obtained only through transformation or processing, such as moments, histograms and principal components.
There are many methods for extracting a feature vector, for example the histogram of oriented gradients (HOG) method, which forms features by computing and accumulating histograms of gradient directions over local regions of the image. The main idea of this method is that, within an image, the appearance and shape of a local target can be well described by the density distribution of gradients or edge directions. A concrete implementation for an image is as follows:
1) grayscale conversion (treating the image as a three-dimensional function of x, y and z, where z is the gray level);
2) color-space normalization of the input image with Gamma correction, in order to adjust the contrast of the image, reduce the influence of local shadows and illumination changes, and suppress noise;
3) computing the gradient (magnitude and direction) of every pixel of the image, mainly to capture contour information while further weakening the interference of illumination;
4) dividing the image into small cells (for example 6*6 pixels per cell);
5) accumulating a histogram of gradients (the counts of the different gradient directions) for each cell, which forms the descriptor of that cell;
6) grouping every few cells into a block (for example 3*3 cells per block); the feature descriptors of all cells in a block are concatenated to obtain the gradient histogram feature descriptor of the block;
7) concatenating the gradient histogram feature descriptors of all blocks in the image to obtain the gradient histogram feature descriptor of the image (the target to be detected). This is the final feature vector that can be used for image recognition.
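As an illustration only, the gradient-histogram descriptor described above can be computed with scikit-image's hog function; the library choice, the file name and the omission of the Gamma-correction step are assumptions made for this sketch and are not specified in this application.

```python
import cv2
from skimage.feature import hog

# Steps 1-2: read the image and convert it to grayscale (Gamma correction omitted for brevity).
gray = cv2.cvtColor(cv2.imread("person.jpg"), cv2.COLOR_BGR2GRAY)

# Steps 3-7: per-pixel gradients, 6x6-pixel cells, 3x3-cell blocks; the concatenated
# block histograms form the final HOG feature vector of the image.
feature_vector = hog(gray,
                     orientations=9,
                     pixels_per_cell=(6, 6),
                     cells_per_block=(3, 3),
                     block_norm="L2-Hys")
print("HOG feature length:", feature_vector.shape[0])
```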
In the embodiment of this application, a pre-trained convolutional neural network is used to extract the feature vector of the face image. Compared with other methods, features extracted by a convolutional neural network are less prone to over-fitting, and the fitting capacity of the overall model can be controlled more flexibly through the choice of convolution and pooling layers and the size of the final output feature vector. For the training steps, refer to FIG. 4.
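Purely as an illustration of extracting an n-dimensional feature vector with a pre-trained convolutional neural network, the sketch below uses a generic ResNet-18 backbone from torchvision in place of the face feature network trained as in FIG. 4; the model choice, input size and file name are assumptions, not part of this application.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# A generic pre-trained backbone stands in for the face feature network of FIG. 4;
# the application trains its own CNN, so ResNet-18 here is purely illustrative.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 512-D embedding
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

face = preprocess(Image.open("query_face.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = backbone(face).squeeze(0)   # n-dimensional face feature vector
print("embedding size:", embedding.shape[0])
```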
S104. Compare the facial feature of the object to be located with the images of the candidate objects, and determine that the candidate object matching the facial feature of the object to be located is the object to be located.
In the embodiment of this application, the facial feature of the object to be located is compared with the images of the candidate objects obtained in step S102, and the candidate object having the same facial feature as the object to be located is determined to be the final object to be located.
Specifically, the face image of a candidate object is cropped out, the face feature vector of the candidate object is obtained in the same manner as in step S103, and the similarity between the two vectors is compared. The Euclidean distance or the cosine similarity between the vectors is calculated to measure the similarity between them; when the similarity is greater than a set threshold, the candidate target object is confirmed to be the target object to be located. Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal. The specific calculation formula is:
cos(θ) = Σ(A_i × B_i) / ( √(Σ A_i²) × √(Σ B_i²) ), with the sums taken over i = 1, ..., n,
where A_i and B_i represent the components of vectors A and B, respectively.
In the embodiment of this application, the facial features of the candidate objects are obtained with a preset facial feature extraction model, and the similarity between the facial feature of the object to be located and the facial feature of each candidate object is then compared; for details, refer to FIG. 5.
As shown in FIG. 2, step S101 further includes the following steps:
S111. Acquire an image of the object to be located.
The image of the object to be located is acquired through an interactive interface.
S112. Process the image of the object to be located according to an image contour feature extraction algorithm and/or a color feature extraction algorithm to obtain the first image feature of the object to be located.
对待定位对象的图像进行图像轮廓特征提取。图像的轮廓特征提取可以采用图像梯度算法提取，图像函数f(x,y)在点(x,y)的梯度是一个具有大小和方向的矢量，设Gx和Gy分别表示x方向和y方向的梯度，这个梯度的矢量可以表示为：Image contour feature extraction is performed on the image of the object to be positioned. The contour features of the image can be extracted with an image gradient algorithm. The gradient of the image function f(x,y) at the point (x,y) is a vector with magnitude and direction; letting Gx and Gy denote the gradients in the x direction and y direction respectively, the gradient vector can be expressed as:
$$\nabla f(x,y)=\begin{bmatrix}G_x\\ G_y\end{bmatrix}=\begin{bmatrix}\dfrac{\partial f}{\partial x}\\[4pt]\dfrac{\partial f}{\partial y}\end{bmatrix}$$
在数字图像中,梯度可以近似表示为:In digital images, the gradient can be approximately expressed as:
Gx = f(x,y) - f(x-1,y)
Gy = f(x,y) - f(x,y-1)
其中，f(x,y)为待计算轮廓的图像的图像函数，f(x,y)、f(x-1,y)与f(x,y-1)分别是图像函数f(x,y)在点(x,y)、点(x-1,y)与点(x,y-1)的值，Gx、Gy分别为图像函数f(x,y)在x方向和y方向的梯度。Here f(x,y) is the image function of the image whose contour is to be calculated; f(x,y), f(x-1,y) and f(x,y-1) are the values of the image function f(x,y) at the points (x,y), (x-1,y) and (x,y-1); and Gx and Gy are the gradients of the image function f(x,y) in the x direction and the y direction, respectively.
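The backward-difference approximation above can be written directly in NumPy; this sketch is illustrative and assumes a single-channel image array.

```python
import numpy as np

def image_gradients(f: np.ndarray):
    """Backward differences: Gx = f(x,y) - f(x-1,y), Gy = f(x,y) - f(x,y-1)."""
    f = f.astype(np.float32)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, 1:] = f[:, 1:] - f[:, :-1]   # difference along x (columns)
    gy[1:, :] = f[1:, :] - f[:-1, :]   # difference along y (rows)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)  # large values indicate edges
    return gx, gy, magnitude
```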
梯度的方向是函数f(x,y)变化最快的方向，当图像中存在边缘时，一定有较大的梯度值，相反，当图像中有比较平滑的部分时，灰度值变化较小，则相应的梯度也较小。图像梯度算法是考虑图像的每个像素的某个邻域内的灰度变化，利用边缘临近的一阶或二阶导数变化规律，对原始图像中像素某个邻域设置梯度算子，例如Sobel算子、Robinson算子、Laplace算子等，将原始图像与梯度算子进行卷积运算，得到目标对象图像的轮廓。The direction of the gradient is the direction in which the function f(x,y) changes fastest. Where the image contains an edge there must be a large gradient value; conversely, in relatively smooth parts of the image the gray value changes little and the corresponding gradient is small. The image gradient algorithm considers the gray-level change within a neighborhood of each pixel and exploits the behavior of the first- or second-order derivatives near an edge: a gradient operator, such as the Sobel operator, Robinson operator or Laplace operator, is applied to a neighborhood of each pixel, and the original image is convolved with the gradient operator to obtain the contour of the target object image.
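As an example of the operator-based variant, the following sketch convolves an image with the Sobel operator via OpenCV; the file name and the threshold of 100 are assumptions.

```python
import cv2
import numpy as np

gray = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

# Convolve the image with the Sobel gradient operator in each direction.
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
magnitude = cv2.magnitude(gx, gy)

# Pixels with a large gradient magnitude form the object contour;
# the threshold value 100 is illustrative only.
contour_mask = np.uint8(magnitude > 100) * 255
```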
当第一图像特征为颜色特征时，颜色是一种全局特征，描述了图像或图像区域所对应的景物的表面性质。一般颜色特征是基于像素点的特征。为了使定位更准确，在利用颜色特征进行候选目标对象匹配时，为避免背景颜色信息对目标对象进行干扰，先通过目标检测算法，识别出图像中的目标图像，然后对图像进行裁剪，只保留目标对象本身。其中目标检测算法通过级联的卷积神经网络模型实现。通过计算裁剪后图像的颜色直方图获取目标图像的颜色特征，颜色直方图可以通过OpenCV里面提供的计算图像直方图的API函数calcHist计算。在对视频流中图像进行匹配时，计算视频流中图像各目标的颜色直方图，然后通过OpenCV提供的直方图比较函数compareHist()进行相似度的比较，得到候选的目标对象。When the first image feature is a color feature: color is a global feature that describes the surface properties of the scene corresponding to an image or image region, and color features are generally pixel-based. To make the positioning more accurate and to prevent background color information from interfering with the target object when color features are used to match candidate target objects, a target detection algorithm is first used to identify the target in the image, and the image is then cropped so that only the target object itself remains. The target detection algorithm is implemented with a cascaded convolutional neural network model. The color feature of the target image is obtained by computing the color histogram of the cropped image; the color histogram can be calculated with the OpenCV API function calcHist. When matching images in the video stream, the color histogram of each target in the video stream is calculated, and the similarity is then compared with the histogram comparison function compareHist() provided by OpenCV to obtain the candidate target objects.
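A possible realization of this histogram-based matching with OpenCV's calcHist and compareHist is sketched below; the HSV bin counts, file names, and the use of correlation as the comparison metric are illustrative choices, not values from the application.

```python
import cv2

def color_hist(img):
    # Hue-saturation histogram in HSV space; the bin counts are illustrative.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

target = cv2.imread("target_cropped.jpg")    # target object after detection and cropping
candidate = cv2.imread("frame_object.jpg")   # an object detected in a video frame

# Correlation comparison: values closer to 1 indicate more similar color distributions.
similarity = cv2.compareHist(color_hist(target), color_hist(candidate), cv2.HISTCMP_CORREL)
print(similarity)
```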
也可以先对含有目标对象的图像进行降采样,然后进行轮廓特征提取或颜色特征提取。在进行候选目标对象匹配时,同样将视频流中的图像进行相同倍数的降采样,并使用降采样后的图像与提取的目标对象的轮廓特征或颜色特征进行匹配,获取候选的目标对象。图像经过降采样后,像素数据减少,可以减少计算量,加快计算速度。It is also possible to down-sample the image containing the target object first, and then perform contour feature extraction or color feature extraction. When performing candidate target object matching, the image in the video stream is also down-sampled by the same multiple, and the down-sampled image is used to match the contour feature or color feature of the extracted target object to obtain the candidate target object. After the image is down-sampled, the pixel data is reduced, which can reduce the amount of calculation and speed up the calculation.
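A simple down-sampling helper consistent with this description might look as follows; the factor of 4 is an assumed value.

```python
import cv2

def downsample(img, factor=4):
    # Reduce both dimensions by the same factor to cut the pixel count and speed up matching.
    h, w = img.shape[:2]
    return cv2.resize(img, (w // factor, h // factor), interpolation=cv2.INTER_AREA)

target_small = downsample(cv2.imread("target.jpg"))
frame_small = downsample(cv2.imread("frame.jpg"))   # video frames use the same factor
```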
如图3所示,在步骤S102中,还包括下述步骤:As shown in FIG. 3, in step S102, the following steps are further included:
S121、获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;S121. Obtain a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
视频图像帧是视频的分解,可以采用第三方软件对视频进行分解,得到视频图像帧。The video image frame is the decomposition of the video, and third-party software can be used to decompose the video to obtain the video image frame.
S122、将所述视频图像帧输入到预设的目标检测模型中，获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像，其中，所述预设的目标检测模型基于预先训练的深度学习神经网络，所述深度学习神经网络对输入的所述视频图像帧进行目标检测而输出的所述目标对象的图像为人体图像；S122. Input the video image frame into a preset target detection model, and obtain an image of the target object output by the target detection model in response to the video image frame, wherein the preset target detection model is based on a pre-trained deep learning neural network, and the image of the target object output by the deep learning neural network after performing target detection on the input video image frame is a human body image;
得到的视频图像帧往往不只包含目标图像，为了避免背景的干扰，先对视频图像帧进行目标检测，本申请的目标检测是对人体进行检测，目的是去除除了人体图像外的其他部分，经过目标检测后得到的目标对象的图像为人体图像。本申请实施例中采用预先训练的深度学习神经网络对目标对象进行检测。The obtained video image frames often contain more than the target image. To avoid background interference, target detection is first performed on the video image frame. The target detection in this application detects the human body, with the purpose of removing everything except the human body image, so the image of the target object obtained after target detection is a human body image. In the embodiments of the present application, a pre-trained deep learning neural network is used to detect the target object.
具体地，先将视频图像帧进行等分切割。本申请实施例中，输入的图像划分为7*7的拼图图像。接着将拼图图像输入深度学习神经网络，对于每个拼图格子深度学习神经网络都会预测2个预测框。预测出的预测框包含5个值：x,y,w,h和置信度。x和y是预测框的中心坐标，w和h是预测框的宽与高。我们取两个预测框中的一个，即目标对象的预测框，最后第三卷积神经网络输出一个7x7x(2x5+1)的预测张量用于下一步目标对象预测框的确定。在获取到预测张量之后，通过设置置信度阈值进行筛选，置信度小于该阈值的预测框将被过滤掉，仅留下置信度比较高的预测框作为剩余框。然后对于剩下的每个预测框，依次计算一个剩下的预测框与剩余框的IOU(重合度)值，如果IOU值大于预设阈值，那么就将该预测框剔除，并对剩余的预测框重复上述过程，直到处理完所有的预测框，得到目标对象的图像。Specifically, the video image frame is first divided into equal parts. In the embodiment of the application, the input image is divided into a 7*7 grid. The grid image is then input into the deep learning neural network, which predicts 2 prediction boxes for each grid cell. Each predicted box contains 5 values: x, y, w, h and a confidence. x and y are the center coordinates of the prediction box, and w and h are its width and height. One of the two prediction boxes is taken as the prediction box of the target object, and finally the third convolutional neural network outputs a 7x7x(2x5+1) prediction tensor used for determining the target object prediction box in the next step. After the prediction tensor is obtained, filtering is performed with a confidence threshold: prediction boxes whose confidence is smaller than the threshold are filtered out, and only the prediction boxes with higher confidence are kept as remaining boxes. Then, for each of the remaining prediction boxes, the IOU (intersection over union) value between that box and the other remaining boxes is calculated in turn; if the IOU value is greater than a preset threshold, the box is discarded. This process is repeated for the remaining prediction boxes until all prediction boxes have been processed, yielding the image of the target object.
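The confidence filtering and IOU-based suppression described above correspond to standard non-maximum suppression; a generic sketch is given below, in which the thresholds of 0.5 are assumptions.

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_boxes(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    # 1) drop low-confidence boxes, 2) suppress boxes overlapping a higher-scoring kept box.
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if scores[i] < conf_thresh:
            continue
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```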
S123、将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;S123: Calculate the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
将目标对象的图像按照图像轮廓特征提取算法和\或颜色特征提取算法计算目标对象的第一图像特征，具体的算法与步骤S112中相同。The first image feature of the target object is calculated from the image of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm; the specific algorithm is the same as in step S112.
S124、计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度，当所述匹配度大于预设的第一阈值时，确定所述目标对象为所述候选对象。S124. Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate object.
轮廓特征匹配通过轮廓矩匹配法来实现。轮廓矩可以是空间矩、中心矩等,我们以空间矩为例,如下所示:The contour feature matching is realized by the contour moment matching method. Contour moments can be spatial moments, central moments, etc. We take spatial moments as an example, as shown below:
$$m_{pq}=\sum_{i=1}^{n}I(x_i,y_i)\,x_i^{\,p}\,y_i^{\,q}$$
mpq表示图像的(p+q)阶矩,一般计算所有3阶的矩(p+q<=3)。mpq represents the (p+q) order moments of the image, and generally calculates all 3 order moments (p+q<=3).
其中I(x,y)是图像轮廓象素点(x,y)的值，一般是1，n是轮廓上点的个数，p和q分别是x维度和y维度上的阶次，即m00,m10,m01…m03。Here I(x,y) is the value of the contour pixel (x,y), generally 1; n is the number of points on the contour; and p and q are the orders in the x dimension and the y dimension respectively, giving m00, m10, m01 ... m03.
零阶矩m00是轮廓上点的简单累加,即轮廓上有多少个点。The zero-order moment m00 is a simple accumulation of points on the contour, that is, how many points are there on the contour.
一阶矩m10,m01分别是x和y方向上的累加。可以通过OpenCV的函数cvGetSpatialMoment()计算空间矩。The first moments m10 and m01 are the accumulation in the x and y directions, respectively. The spatial moment can be calculated by the OpenCV function cvGetSpatialMoment().
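In current OpenCV (version 4 assumed), cv2.moments plays the role of cvGetSpatialMoment; a small example:

```python
import cv2

gray = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# OpenCV 4 signature; cv2.moments returns the spatial, central and normalized moments.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
m = cv2.moments(contours[0])
print(m["m00"], m["m10"], m["m01"])   # zero-order and first-order spatial moments

# Two contours can also be compared through their moments:
# score = cv2.matchShapes(contour_a, contour_b, cv2.CONTOURS_MATCH_I1, 0)
```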
颜色特征匹配通过OpenCV提供的直方图比较函数compareHist()进行相似度的比较。Color feature matching uses the histogram comparison function compareHist() provided by OpenCV to compare the similarity.
如图4所述,预先训练的卷积神经网络模型的训练包括下述步骤:As shown in Figure 4, the training of the pre-trained convolutional neural network model includes the following steps:
S131、获取标记有身份标识的训练样本,所述训练样本为标记有不同身份标识的人脸图像;S131. Obtain training samples marked with identities, where the training samples are face images marked with different identities;
本申请实施例中，训练样本为标记了身份标识的人脸图像。In this embodiment of the application, the training samples are face images labeled with identity identifiers.
S132、将所述训练样本输入到卷积神经网络模型中,获取所述训练样本的身份标识预测结果;S132. Input the training sample into the convolutional neural network model, and obtain the identity prediction result of the training sample;
将训练样本输入到卷积神经网络模型中，卷积神经网络模型输出每个样本的身份标识预测结果。The training samples are input into the convolutional neural network model, and the convolutional neural network model outputs the identity prediction result for each sample.
S133、通过损失函数比对所述训练样本的身份标识预测结果与所述身份标识是否一致,其中,所述损失函数为:S133. Compare whether the identity prediction result of the training sample is consistent with the identity by a loss function, where the loss function is:
$$L=-\frac{1}{N}\sum_{i=1}^{N}y_i\log h_i$$
其中,N为训练样本数,针对第i个样本对应的yi是标记的结果,h=(h1,h2,...,hi)为样本i的预测结果。Among them, N is the number of training samples, yi corresponding to the i-th sample is the marked result, and h=(h1, h2,...,hi) is the prediction result of sample i.
通过损失函数比对训练样本的身份标识预测结果与标注的身份标识是否一致，本申请实施例采用Softmax交叉熵损失函数。在训练过程中，调整卷积神经网络模型中的权重，使Softmax交叉熵损失函数尽可能收敛，也就是说继续调整权重，在得到的损失函数的值不再缩小，反而增大时，认为卷积神经网络训练可以结束。The loss function is used to check whether the identity prediction result of the training sample is consistent with the labeled identity; the embodiment of this application uses the Softmax cross-entropy loss function. During training, the weights in the convolutional neural network model are adjusted so that the Softmax cross-entropy loss function converges as far as possible; that is, the weights keep being adjusted, and when the value of the loss function no longer decreases but instead increases, training of the convolutional neural network is considered finished.
S134、当所述身份标识预测结果与所述身份标识不一致时,反复循环迭代的更新所述卷积神经网络模型中的权重,至所述损失函数收敛时结束。S134: When the prediction result of the identity identifier is inconsistent with the identity identifier, iteratively update the weights in the convolutional neural network model repeatedly and iteratively, until the loss function converges.
如前所述当损失函数没有收敛时，更新卷积神经网络模型中的权重，本申请实施例中采用梯度下降法，梯度下降法是一个最优化算法，用于机器学习和人工智能当中用来递归性地逼近最小偏差模型。As mentioned above, when the loss function has not converged, the weights in the convolutional neural network model are updated. The embodiment of this application uses the gradient descent method, an optimization algorithm used in machine learning and artificial intelligence to recursively approach the minimum-deviation model.
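A hedged sketch of such a training loop with a softmax cross-entropy loss and gradient descent is shown below; the network architecture, image size, number of identities and learning rate are all placeholder assumptions, not the application's actual configuration.

```python
import torch
import torch.nn as nn

# Placeholder model and data; the application does not disclose its network,
# image size or number of identities, so all of these values are assumed.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()                          # softmax cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent

images = torch.randn(32, 1, 64, 64)      # dummy face images
labels = torch.randint(0, 10, (32,))     # dummy identity labels

prev_loss = float("inf")
for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    # Stop once the loss no longer decreases, as described above.
    if loss.item() >= prev_loss:
        break
    prev_loss = loss.item()
```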
如图5所示,在步骤104中,还包括下述步骤:As shown in Figure 5, in step 104, the following steps are further included:
S141、获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;S141. Obtain a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
通过步骤S102得到了候选对象的图像,对候选对象图像进行人脸检测,截取候选对象的人脸图像。人脸检测方法与步骤S122中所述的方法相同。Through step S102, an image of the candidate object is obtained, and face detection is performed on the candidate object image, and the face image of the candidate object is intercepted. The face detection method is the same as the method described in step S122.
S142、将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;S142: Input the face image of the candidate object into the preset face feature extraction model, and obtain the face feature of the candidate object.
将候选对象的人脸图像输入到预设的人脸特征提取模型，本申请实施例中，预设的人脸特征提取模型采用预先训练的卷积神经网络模型，训练步骤与图4相同。The face image of the candidate object is input into the preset face feature extraction model. In the embodiment of the present application, the preset face feature extraction model uses a pre-trained convolutional neural network model, and the training steps are the same as in FIG. 4.
S143、计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度，当所述匹配度大于预设的第二阈值时，确定所述候选对象为所述待定位对象。S143. Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located.
比较两个向量之间的相似度。计算向量之间的欧氏距离或余弦相似度来衡量两者之间的相似度，当相似度大于设定的阈值时，确认该候选目标对象为待定位目标对象。其中余弦相似度，指两个向量之间夹角的余弦值取值范围在[-1,1]之间，值越趋近于1，代表两个向量的方向越接近，两个向量越相似；越趋近于-1，他们的方向越相反；接近于0，表示两个向量近乎于正交。具体的计算公式为The similarity between the two vectors is compared. The Euclidean distance or cosine similarity between the vectors is calculated to measure how similar they are; when the similarity is greater than the set threshold, the candidate target object is confirmed as the target object to be located. Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 indicates that the two vectors are nearly orthogonal. The specific calculation formula is
$$\cos\theta=\frac{A\cdot B}{\|A\|\,\|B\|}=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
其中,Ai、Bi分别代表向量A和B的各分量。Among them, Ai and Bi represent the components of vectors A and B, respectively.
为解决上述技术问题,本申请实施例还提供一种在视频中定位对象的装置。具体请参阅图6,图6为本实施例在视频中定位对象的装置的基本结构框图。To solve the above technical problem, an embodiment of the present application also provides a device for locating an object in a video. Please refer to FIG. 6 for details. FIG. 6 is a basic structural block diagram of an apparatus for positioning an object in a video in this embodiment.
如图6所示，一种在视频中定位对象的装置，包括第一获取模块210、检索模块220、第二获取模块230和处理模块240，其中第一获取模块210，用于获取待定位对象的第一图像特征，所述第一图像特征包含图像轮廓和\或图像颜色特征；检索模块220，用于根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像；第二获取模块230，用于获取待定位对象的人脸特征；处理模块240，用于将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。As shown in FIG. 6, an apparatus for locating an object in a video includes a first acquisition module 210, a retrieval module 220, a second acquisition module 230 and a processing module 240. The first acquisition module 210 is configured to acquire a first image feature of the object to be located, the first image feature including image contour and/or image color features; the retrieval module 220 is configured to search a preset video database according to the first image feature of the object to be located and obtain images of candidate objects matching the first image feature of the object to be located; the second acquisition module 230 is configured to acquire the facial features of the object to be located; and the processing module 240 is configured to compare the facial features of the object to be located with the images of the candidate objects and determine that the candidate object whose facial features match those of the object to be located is the object to be located.
本申请实施例通过获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;获取待定位对象的人脸特征;将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。通过第一图像特征检索视频数据库,可以快速定位候选对象,再根据人脸特征定位待定位对象,很大程度的减少了计算量,提高了对象定位的时效性。The embodiment of the application obtains the first image feature of the object to be located, the first image feature includes the image contour and/or the image color feature; according to the first image feature of the object to be located, the preset video database is retrieved to obtain The image of the candidate object that matches the first image feature of the object to be located; obtain the facial feature of the object to be located; compare the facial feature of the object to be located with the image of the candidate object to determine the Among the candidate objects, an object that matches the facial feature of the object to be located is the object to be located. Searching the video database through the first image feature can quickly locate the candidate object, and then locate the object to be located according to the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object positioning.
在一些实施方式中，所述第一获取模块210中，还包括：第一获取子模块，用于获取待定位对象的图像；第一处理子模块，用于根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理，获取所述待定位对象的第一图像特征。In some embodiments, the first acquisition module 210 further includes: a first acquisition sub-module, configured to acquire an image of the object to be positioned; and a first processing sub-module, configured to process the image of the object to be positioned according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
在一些实施方式中，所述第二获取模块230中，还包括：第二获取子模块，用于获取待定位对象的人脸图像；第二处理子模块，用于将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中，获取所述待定位对象图像的人脸特征。In some embodiments, the second acquisition module 230 further includes: a second acquisition sub-module, configured to acquire a face image of the object to be located; and a second processing sub-module, configured to input the face image of the object to be located into a preset face feature extraction model and obtain the facial features of the image of the object to be located.
在一些实施方式中，所述检索模块220中，还包括：第三获取子模块，用于获取视频图像帧，所述视频图像帧为所述预设的视频数据库中保存的视频的分解；第一检测子模块，用于将所述视频图像帧输入到预设的目标检测模型中，获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像，其中，所述预设的目标检测模型基于预先训练的深度学习神经网络，所述目标对象的图像包括人体图像，所述深度学习神经网络对输入的所述视频图像帧进行目标检测而输出所述人体图像；第一计算子模块，用于将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法，计算所述目标对象的第一图像特征；第三处理子模块，用于计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度，当所述匹配度大于预设的第一阈值时，确定所述目标对象为所述候选对象。In some implementations, the retrieval module 220 further includes: a third acquisition sub-module, configured to acquire video image frames, the video image frames being the decomposition of videos stored in the preset video database; a first detection sub-module, configured to input the video image frame into a preset target detection model and obtain the image of the target object output by the target detection model in response to the video image frame, where the preset target detection model is based on a pre-trained deep learning neural network, the image of the target object includes a human body image, and the deep learning neural network performs target detection on the input video image frame and outputs the human body image; a first calculation sub-module, configured to calculate the first image feature of the target object from the image of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm; and a third processing sub-module, configured to calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object and, when the matching degree is greater than a preset first threshold, determine that the target object is the candidate object.
在一些实施方式中，所述处理模块240中，还包括：第四获取子模块，用于获取所述候选对象的人脸图像，所述候选对象的人脸图像截取自所述候选对象的图像；第二计算子模块，用于将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中，获取所述候选对象的人脸特征；第四处理子模块，用于计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度，当所述匹配度大于预设的第二阈值时，确定所述候选对象为所述待定位对象。In some embodiments, the processing module 240 further includes: a fourth acquisition sub-module, configured to acquire the face image of the candidate object, the face image of the candidate object being cropped from the image of the candidate object; a second calculation sub-module, configured to input the face image of the candidate object into the preset face feature extraction model and obtain the facial features of the candidate object; and a fourth processing sub-module, configured to calculate the matching degree between the facial features of the object to be located and the facial features of the candidate object and, when the matching degree is greater than a preset second threshold, determine that the candidate object is the object to be located.
在一些实施方式中，所述第二计算子模块中，所述预设的人脸特征提取模型基于预先训练的卷积神经网络模型，其中，所述第二计算子模块中，还包括：第五获取子模块，用于获取标记有身份标识的训练样本，所述训练样本为标记有不同身份标识的人脸图像；第一预测子模块，用于将所述训练样本输入到卷积神经网络模型获取所述训练样本的身份标识预测结果；第一比对子模块，用于根据损失函数比对所述训练样本的身份标识预测结果与所述身份标识是否一致，其中，所述损失函数为：In some embodiments, in the second calculation sub-module, the preset facial feature extraction model is based on a pre-trained convolutional neural network model, and the second calculation sub-module further includes: a fifth acquisition sub-module, configured to acquire training samples labeled with identity identifiers, the training samples being face images labeled with different identity identifiers; a first prediction sub-module, configured to input the training samples into the convolutional neural network model and obtain the identity prediction results of the training samples; and a first comparison sub-module, configured to compare, according to a loss function, whether the identity prediction result of a training sample is consistent with its identity identifier, where the loss function is:
$$L=-\frac{1}{N}\sum_{i=1}^{N}y_i\log h_i$$
其中，N为训练样本数，针对第i个样本对应的yi是标记的结果，h=(h1,h2,...,hi)为样本i的预测结果；第五处理子模块，用于当所述身份标识预测结果与所述身份标识不一致时，反复循环迭代的更新所述卷积神经网络模型中的权重，至所述损失函数收敛时结束。Here N is the number of training samples, yi is the labeled result corresponding to the i-th sample, and h=(h1, h2, ..., hi) is the prediction result of sample i; and a fifth processing sub-module, configured to, when the identity prediction result is inconsistent with the identity identifier, iteratively update the weights in the convolutional neural network model until the loss function converges.
在一些实施方式中,在所述第一计算子模块中,所述图像轮廓特征提取算法采取图像梯度算法,梯度表示为:In some embodiments, in the first calculation submodule, the image contour feature extraction algorithm adopts an image gradient algorithm, and the gradient is expressed as:
Gx = f(x,y) - f(x-1,y)
Gy = f(x,y) - f(x,y-1)
其中，f(x,y)为待计算轮廓的图像的图像函数，f(x,y)、f(x-1,y)与f(x,y-1)分别是图像函数f(x,y)在点(x,y)、点(x-1,y)与点(x,y-1)的值，Gx、Gy分别为图像函数f(x,y)在x方向和y方向的梯度。Here f(x,y) is the image function of the image whose contour is to be calculated; f(x,y), f(x-1,y) and f(x,y-1) are the values of the image function f(x,y) at the points (x,y), (x-1,y) and (x,y-1); and Gx and Gy are the gradients of the image function f(x,y) in the x direction and the y direction, respectively.
为解决上述技术问题,本申请实施例还提供一种计算机设备。具体请参阅图7,图7为本实施例计算机设备基本结构框图。To solve the above technical problem, the embodiment of the present application also provides a computer device. Please refer to FIG. 7 for details. FIG. 7 is a block diagram of the basic structure of the computer device in this embodiment.
如图7所示，计算机设备的内部结构示意图。如图7所示，该计算机设备包括通过系统总线连接的处理器、非易失性存储介质、存储器和网络接口。其中，该计算机设备的非易失性存储介质存储有操作系统、数据库和计算机可读指令，数据库中可存储有控件信息序列，该计算机可读指令被处理器执行时，可使得处理器实现一种对象的定位的方法。该计算机设备的处理器用于提供计算和控制能力，支撑整个计算机设备的运行。该计算机设备的存储器中可存储有计算机可读指令，该计算机可读指令被处理器执行时，可使得处理器执行一种对象的定位的方法。该计算机设备的网络接口用于与终端连接通信。本领域技术人员可以理解，图7中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。FIG. 7 is a schematic diagram of the internal structure of the computer device. As shown in FIG. 7, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions, and the database may store a sequence of control information; when the computer-readable instructions are executed by the processor, the processor is caused to implement a method for locating an object. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. Computer-readable instructions may be stored in the memory of the computer device, and when they are executed by the processor, the processor is caused to execute a method for locating an object. The network interface of the computer device is used to connect and communicate with a terminal. Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
本实施方式中处理器用于执行图6中第一获取模块210、检索模块220、第二获取模块230和处理模块240的具体内容,存储器存储有执行上述模块所需的计算机可读指令和各类数据。网络接口用于向用户终端或服务器之间的数据传输。本实施方式中的存储器存储有在视频中定位对象的方法中执行所有子模块所需的计算机可读指令及数据,服务器能够调用服务器的计算机可读指令及数据执行所有子模块的功能。In this embodiment, the processor is used to execute the specific content of the first acquisition module 210, the retrieval module 220, the second acquisition module 230, and the processing module 240 in FIG. 6, and the memory stores computer readable instructions and various types of instructions required to execute the above modules. data. The network interface is used for data transmission between user terminals or servers. The memory in this embodiment stores the computer-readable instructions and data required to execute all sub-modules in the method for locating objects in the video, and the server can call the computer-readable instructions and data of the server to perform the functions of all the sub-modules.
计算机设备通过获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;获取待定位对象的人脸特征;将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。通过第一图像特征检索视频数据库,可以快速定位候选对象,再根据人脸特征定位待定位对象,很大程度地减少了计算量,提高了对象定位的时效性。The computer device obtains the first image feature of the object to be located, and the first image feature includes the image outline and/or the image color feature; searches the preset video database according to the first image feature of the object to be located, and obtains the The image of the candidate object matched by the first image feature of the object to be located; obtain the facial feature of the object to be located; compare the facial feature of the object to be located with the image of the candidate object to determine the candidate object The object in which matches the facial feature of the object to be located is the object to be located. Retrieving the video database through the first image feature can quickly locate the candidate object, and then locate the object to be positioned based on the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object positioning.
本申请还提供一个或多个存储有计算机可读指令的非易失性存储介质，所述计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器执行上述任一实施例所述在视频中定位对象的方法的步骤。The present application also provides one or more non-volatile storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the method for locating an object in a video described in any of the foregoing embodiments.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机可读指令来指令相关的硬件来完成，该计算机可读指令可存储于一非易失性可读存储介质中，该可读指令在执行时，可包括如上述各方法的实施例的流程。其中，前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质，或随机存储记忆体(Random Access Memory,RAM)等。A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a non-volatile readable storage medium; when executed, the readable instructions may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的 说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the various steps in the flowchart of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless explicitly stated in this article, there is no strict order for the execution of these steps, and they can be executed in other orders. Moreover, at least part of the steps in the flowchart of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and the order of execution is also It is not necessarily performed sequentially, but may be performed alternately or alternately with other steps or at least a part of sub-steps or stages of other steps.
以上所述仅是本申请的部分实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。The above are only part of the implementation of this application. It should be pointed out that for those of ordinary skill in the art, without departing from the principle of this application, several improvements and modifications can be made, and these improvements and modifications are also Should be regarded as the scope of protection of this application.

Claims (20)

  1. 一种在视频中定位对象的方法,其特征在于,包括下述步骤:A method for locating an object in a video, which is characterized in that it comprises the following steps:
    获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;Acquiring a first image feature of the object to be positioned, where the first image feature includes image contour and/or image color feature;
    根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;Searching a preset video database according to the first image feature of the object to be located, and acquiring an image of the candidate object that matches the first image feature of the object to be located;
    获取待定位对象的人脸特征;Obtain the facial features of the object to be located;
    将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。The face feature of the object to be located is compared with the image of the candidate object, and an object in the candidate object that matches the face feature of the object to be located is determined as the object to be located.
  2. 根据权利要求1所述的在视频中定位对象的方法,其特征在于,在所述获取待定位对象的第一图像特征的步骤中,包括下述步骤:The method for locating an object in a video according to claim 1, wherein the step of acquiring the first image feature of the object to be positioned includes the following steps:
    获取所述待定位对象的图像;Acquiring an image of the object to be positioned;
    根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征。The image of the object to be positioned is processed according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
  3. 根据权利要求1所述的在视频中定位对象的方法,其特征在于,在所述获取所述待定位对象的人脸特征的步骤中,包括下述步骤:The method for locating an object in a video according to claim 1, wherein the step of acquiring the facial features of the object to be positioned includes the following steps:
    获取所述待定位对象的人脸图像;Acquiring the face image of the object to be located;
    将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The face image of the object to be located is input into a preset face feature extraction model, and the face feature of the object image to be located is obtained.
  4. 根据权利要求1所述的在视频中定位对象的方法，其特征在于，在所述根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像的步骤中，包括下述步骤：The method for locating an object in a video according to claim 1, wherein the step of searching a preset video database according to the first image feature of the object to be located and acquiring an image of a candidate object matching the first image feature of the object to be located includes the following steps:
    获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;Acquiring a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
    将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像为人体图像;The video image frame is input into a preset target detection model, and an image of a target object output by the target detection model in response to the video image frame is obtained, wherein the preset target detection model is based on a pre-trained Deep learning neural network, the image of the target object is a human body image;
    将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;Calculating the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
    计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象。Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate Object.
  5. 根据权利要求3所述的在视频中定位对象的方法，其特征在于，在所述将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象的步骤中，包括下述步骤：The method for locating an object in a video according to claim 3, wherein the step of comparing the facial features of the object to be located with the image of the candidate object and determining that the object among the candidate objects whose facial features match those of the object to be located is the object to be located includes the following steps:
    获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;Acquiring a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
    将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;Input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located .
  6. 根据权利要求3所述的在视频中定位对象的方法,其特征在于,所述预设的人脸特征提取模型基于预先训练的卷积神经网络模型,其中,所述卷积神经网络模型的训练包括下述步骤:The method for locating an object in a video according to claim 3, wherein the preset facial feature extraction model is based on a pre-trained convolutional neural network model, wherein the training of the convolutional neural network model It includes the following steps:
    获取标记有身份标识的训练样本,所述训练样本为标记有不同身份标识的人脸图像;Acquiring training samples marked with identities, where the training samples are face images marked with different identities;
    将所述训练样本输入到卷积神经网络模型中,获取所述训练样本的身份标识预测结果;Input the training sample into a convolutional neural network model, and obtain an identity prediction result of the training sample;
    根据损失函数比对所述训练样本的身份标识预测结果与所述身份标识是否一致,其中,所述损失函数为:Check whether the identity prediction result of the training sample is consistent with the identity according to the loss function, wherein the loss function is:
    $$L=-\frac{1}{N}\sum_{i=1}^{N}y_i\log h_i$$
    其中,N为训练样本数,针对第i个样本对应的yi是标记的结果,h=(h1,h2,...,hi)为样本i的预测结果;Among them, N is the number of training samples, yi corresponding to the i-th sample is the marked result, and h=(h1,h2,...,hi) is the prediction result of sample i;
    当所述身份标识预测结果与所述身份标识不一致时,反复循环迭代的更新所述卷积神经网络模型中的权重,至所述损失函数收敛时结束。When the identity identifier prediction result is inconsistent with the identity identifier, the weights in the convolutional neural network model are updated repeatedly and iteratively until the loss function converges.
  7. 根据权利要求2所述的在视频中定位对象的方法,其特征在于,所述图像轮廓特征提取算法采取图像梯度算法,梯度表示为:The method for locating an object in a video according to claim 2, wherein the image contour feature extraction algorithm adopts an image gradient algorithm, and the gradient is expressed as:
    Gx = f(x,y) - f(x-1,y)
    Gy = f(x,y) - f(x,y-1)
    其中，f(x,y)为待计算轮廓的图像的图像函数，f(x,y)、f(x-1,y)与f(x,y-1)分别是图像函数f(x,y)在点(x,y)、点(x-1,y)与点(x,y-1)的值，Gx、Gy分别为图像函数f(x,y)在x方向和y方向的梯度。Here f(x,y) is the image function of the image whose contour is to be calculated; f(x,y), f(x-1,y) and f(x,y-1) are the values of the image function f(x,y) at the points (x,y), (x-1,y) and (x,y-1); and Gx and Gy are the gradients of the image function f(x,y) in the x direction and the y direction, respectively.
  8. 一种在视频中定位对象的装置,其特征在于,包括:A device for locating an object in a video, characterized in that it comprises:
    第一获取模块,用于获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;The first acquisition module is configured to acquire a first image feature of the object to be positioned, the first image feature including image contour and/or image color feature;
    检索模块,用于根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;A retrieval module, configured to retrieve a preset video database according to the first image feature of the object to be located, and obtain an image of a candidate object that matches the first image feature of the object to be located;
    第二获取模块,用于获取待定位对象的人脸特征;The second acquisition module is used to acquire the facial features of the object to be located;
    处理模块,用于将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。A processing module, configured to compare the facial features of the object to be located with the image of the candidate object, and determine that the object that matches the facial feature of the object to be located among the candidate objects is the object to be located .
  9. 根据权利要求8所述的在视频中定位对象的装置,其特征在于,所述第一获取模块还包括:The device for locating an object in a video according to claim 8, wherein the first acquiring module further comprises:
    第一获取子模块,用于获取待定位对象的图像;The first acquisition sub-module is used to acquire an image of the object to be positioned;
    第一处理子模块,用于根据图像轮廓特征提取算法和\或颜色特征提取算法 对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征;The first processing sub-module is configured to process the image of the object to be located according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be located;
    所述第二获取模块还包括:The second acquisition module further includes:
    第二获取子模块,用于获取待定位对象的人脸图像;The second acquisition sub-module is used to acquire the face image of the object to be located;
    第二处理子模块,用于将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The second processing sub-module is configured to input the face image of the object to be located into a preset face feature extraction model to obtain the face feature of the object image to be located.
  10. 根据权利要求8所述的在视频中定位对象的装置,其特征在于,所述检索模块还包括:The device for locating an object in a video according to claim 8, wherein the retrieval module further comprises:
    第三获取子模块,用于获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;The third acquisition sub-module is configured to acquire video image frames, where the video image frames are the decomposition of the video stored in the preset video database;
    第一检测子模块,用于将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像包括人体图像,所述深度学习神经网络对输入的所述视频图像帧进行目标检测而输出所述人体图像;The first detection submodule is configured to input the video image frame into a preset target detection model, and obtain an image of the target object output by the target detection model in response to the video image frame, wherein the preset The target detection model of is based on a pre-trained deep learning neural network, the image of the target object includes a human body image, and the deep learning neural network performs target detection on the input video image frame to output the human body image;
    第一计算子模块,用于将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;The first calculation sub-module is used to calculate the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm;
    第三处理子模块,用于计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象;The third processing sub-module is used to calculate the degree of matching between the first image feature of the object to be positioned and the first image feature of the target object, and determine when the degree of matching is greater than a preset first threshold The target object is the candidate object;
    所述处理模块还包括:The processing module further includes:
    第四获取子模块,用于获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;The fourth acquisition submodule is used to acquire the face image of the candidate object, and the face image of the candidate object is intercepted from the image of the candidate object;
    第二计算子模块,用于将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;The second calculation sub-module is configured to input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    第四处理子模块,用于计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。The fourth processing sub-module is used to calculate the matching degree between the facial features of the object to be located and the facial features of the candidate object, and when the matching degree is greater than a preset second threshold, determine the The candidate object is the object to be positioned.
  11. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如下步骤:A computer device includes a memory and a processor. The memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor executes the following steps:
    获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;Acquiring a first image feature of the object to be positioned, where the first image feature includes image contour and/or image color feature;
    根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;Searching a preset video database according to the first image feature of the object to be located, and acquiring an image of the candidate object that matches the first image feature of the object to be located;
    获取待定位对象的人脸特征;Obtain the facial features of the object to be located;
    将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。The face feature of the object to be located is compared with the image of the candidate object, and an object in the candidate object that matches the face feature of the object to be located is determined as the object to be located.
  12. 根据权利要求1所述的计算机设备,其特征在于,在所述获取待定位对象的第一图像特征的步骤中,包括下述步骤:The computer device according to claim 1, wherein the step of acquiring the first image feature of the object to be positioned comprises the following steps:
    获取所述待定位对象的图像;Acquiring an image of the object to be positioned;
    根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征。The image of the object to be positioned is processed according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
  13. 根据权利要求1所述的计算机设备,其特征在于,在所述获取所述待定位对象的人脸特征的步骤中,包括下述步骤:The computer device according to claim 1, wherein the step of obtaining the facial features of the object to be located comprises the following steps:
    获取所述待定位对象的人脸图像;Acquiring the face image of the object to be located;
    将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The face image of the object to be located is input into a preset face feature extraction model, and the face feature of the object image to be located is obtained.
  14. 根据权利要求1所述的计算机设备，其特征在于，在所述根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像的步骤中，包括下述步骤：The computer device according to claim 1, wherein the step of searching a preset video database according to the first image feature of the object to be located and acquiring an image of a candidate object matching the first image feature of the object to be located includes the following steps:
    获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;Acquiring a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
    将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像为人体图像;The video image frame is input into a preset target detection model, and an image of a target object output by the target detection model in response to the video image frame is obtained, wherein the preset target detection model is based on a pre-trained Deep learning neural network, the image of the target object is a human body image;
    将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;Calculating the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
    计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象。Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate Object.
  15. 根据权利要求13所述计算机设备，其特征在于，在所述将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象的步骤中，包括下述步骤：The computer device according to claim 13, wherein the step of comparing the facial features of the object to be located with the image of the candidate object and determining that the object among the candidate objects whose facial features match those of the object to be located is the object to be located includes the following steps:
    获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;Acquiring a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
    将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;Input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located .
  16. 一个或多个非易失性可读存储介质,所述非易失性可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下步骤:One or more non-volatile readable storage media having computer readable instructions stored on the non-volatile readable storage media, and when the computer readable instructions are executed by a processor, the following steps are implemented:
    获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;Acquiring a first image feature of the object to be positioned, where the first image feature includes image contour and/or image color feature;
    根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;Searching a preset video database according to the first image feature of the object to be located, and acquiring an image of the candidate object that matches the first image feature of the object to be located;
    获取待定位对象的人脸特征;Obtain the facial features of the object to be located;
    将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。The face feature of the object to be located is compared with the image of the candidate object, and an object in the candidate object that matches the face feature of the object to be located is determined as the object to be located.
  17. 根据权利要求1所述的非易失性可读存储介质,其特征在于,在所述 获取待定位对象的第一图像特征的步骤中,包括下述步骤:The non-volatile readable storage medium according to claim 1, wherein the step of obtaining the first image feature of the object to be positioned includes the following steps:
    获取所述待定位对象的图像;Acquiring an image of the object to be positioned;
    根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征。The image of the object to be positioned is processed according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
  18. 根据权利要求1所述的非易失性可读存储介质,其特征在于,在所述获取所述待定位对象的人脸特征的步骤中,包括下述步骤:The non-volatile readable storage medium according to claim 1, wherein the step of acquiring the facial features of the object to be located comprises the following steps:
    获取所述待定位对象的人脸图像;Acquiring the face image of the object to be located;
    将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The face image of the object to be located is input into a preset face feature extraction model, and the face feature of the object image to be located is obtained.
  19. 根据权利要求1所述的非易失性可读存储介质，其特征在于，在所述根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像的步骤中，包括下述步骤：The non-volatile readable storage medium according to claim 1, wherein the step of searching a preset video database according to the first image feature of the object to be located and acquiring an image of a candidate object matching the first image feature of the object to be located includes the following steps:
    获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;Acquiring a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
    将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像为人体图像;The video image frame is input into a preset target detection model, and an image of a target object output by the target detection model in response to the video image frame is obtained, wherein the preset target detection model is based on a pre-trained Deep learning neural network, the image of the target object is a human body image;
    将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;Calculating the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
    计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象。Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate Object.
  20. 根据权利要求18所述非易失性可读存储介质，其特征在于，在所述将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象的步骤中，包括下述步骤：The non-volatile readable storage medium according to claim 18, wherein the step of comparing the facial features of the object to be located with the image of the candidate object and determining that the object among the candidate objects whose facial features match those of the object to be located is the object to be located includes the following steps:
    获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;Acquiring a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
    将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;Input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located .
PCT/CN2019/117702 2019-08-01 2019-11-12 Method and apparatus for locating object in video, and computer device and storage medium WO2021017289A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910707924.8A CN110633627A (en) 2019-08-01 2019-08-01 Method, device, computer equipment and storage medium for positioning object in video
CN201910707924.8 2019-08-01

Publications (1)

Publication Number Publication Date
WO2021017289A1 true WO2021017289A1 (en) 2021-02-04

Family

ID=68969147

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117702 WO2021017289A1 (en) 2019-08-01 2019-11-12 Method and apparatus for locating object in video, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110633627A (en)
WO (1) WO2021017289A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023024749A1 (en) * 2021-08-24 2023-03-02 腾讯科技(深圳)有限公司 Video retrieval method and apparatus, device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170015639A (en) * 2015-07-29 2017-02-09 대한민국(관리부서: 행정자치부 국립과학수사연구원장) Personal Identification System And Method By Face Recognition In Digital Image
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
CN109190561A (en) * 2018-09-04 2019-01-11 四川长虹电器股份有限公司 Face identification method and system in a kind of video playing
CN109299642A (en) * 2018-06-08 2019-02-01 嘉兴弘视智能科技有限公司 Logic based on Identification of Images is deployed to ensure effective monitoring and control of illegal activities early warning system and method
CN109308463A (en) * 2018-09-12 2019-02-05 北京奇艺世纪科技有限公司 A kind of video object recognition methods, device and equipment
CN109344713A (en) * 2018-08-31 2019-02-15 电子科技大学 A kind of face identification method of attitude robust


Also Published As

Publication number Publication date
CN110633627A (en) 2019-12-31

Similar Documents

Publication Publication Date Title
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
US10515275B2 (en) Intelligent digital image scene detection
US12100192B2 (en) Method, apparatus, and electronic device for training place recognition model
CN109284733B (en) Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
WO2019001481A1 (en) Vehicle appearance feature identification and vehicle search method and apparatus, storage medium, and electronic device
US20190325197A1 (en) Methods and apparatuses for searching for target person, devices, and media
CN109101602A (en) Image encrypting algorithm training method, image search method, equipment and storage medium
WO2020228181A1 (en) Palm image cropping method and apparatus, computer device and storage medium
CN113392866A (en) Image processing method and device based on artificial intelligence and storage medium
WO2022247539A1 (en) Living body detection method, estimation network processing method and apparatus, computer device, and computer readable instruction product
CN114550053A (en) Traffic accident responsibility determination method, device, computer equipment and storage medium
CN109711443A (en) Floor plan recognition methods, device, equipment and storage medium neural network based
WO2022161302A1 (en) Action recognition method and apparatus, device, storage medium, and computer program product
Werner et al. DeepMoVIPS: Visual indoor positioning using transfer learning
CN104520848A (en) Searching for events by attendants
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112258254A (en) Internet advertisement risk monitoring method and system based on big data architecture
CN112651381A (en) Method and device for identifying livestock in video image based on convolutional neural network
CN117854156B (en) Training method and related device for feature extraction model
CN112232422A (en) Target pedestrian re-identification method and device, electronic equipment and storage medium
WO2021017289A1 (en) Method and apparatus for locating object in video, and computer device and storage medium
CN108694411A (en) A method of identification similar image
CN114299539B (en) Model training method, pedestrian re-recognition method and device
CN115393755A (en) Visual target tracking method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19940104

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19940104

Country of ref document: EP

Kind code of ref document: A1