WO2021017289A1 - Method and apparatus for locating object in video, and computer device and storage medium - Google Patents

Method and apparatus for locating object in video, and computer device and storage medium

Info

Publication number
WO2021017289A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
candidate
video
face
Prior art date
Application number
PCT/CN2019/117702
Other languages
French (fr)
Chinese (zh)
Inventor
张磊
宋晨
李雪冰
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021017289A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7847 - Retrieval using low-level visual features of the video content
    • G06F16/785 - Retrieval using low-level visual features of the video content using colour or luminescence
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/7847 - Retrieval using low-level visual features of the video content
    • G06F16/7854 - Retrieval using low-level visual features of the video content using shape
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification

Definitions

  • This application belongs to the field of artificial intelligence, and in particular relates to a method, device, computer equipment, and storage medium for locating an object in a video.
  • security systems use large numbers of video capture devices to monitor scenes in real time through video and to record video data for later review, in order to maintain public safety.
  • This application provides a method, an apparatus, a computer device and a storage medium for locating an object in a video, to address the problem that locating an object is time-consuming.
  • this application proposes a method for locating an object in a video, which includes the following steps:
  • the first image feature of the object to be located includes an image contour feature and/or an image color feature;
  • the facial feature of the object to be located is compared with the images of the candidate objects, and a candidate object that matches the facial feature of the object to be located is determined to be the object to be located.
  • an embodiment of the present application also provides an apparatus for locating an object in a video, including:
  • the first acquisition module is configured to acquire a first image feature of the object to be positioned, the first image feature including image contour and/or image color feature;
  • a retrieval module configured to retrieve a preset video database according to the first image feature of the object to be located, and obtain an image of a candidate object that matches the first image feature of the object to be located;
  • the second acquisition module is used to acquire the facial features of the object to be located
  • a processing module, configured to compare the facial feature of the object to be located with the images of the candidate objects, and determine that the object among the candidate objects that matches the facial feature of the object to be located is the object to be located.
  • an embodiment of the present application further provides a computer device including a memory and a processor.
  • the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for locating an object in the video.
  • the embodiments of the present application further provide one or more non-volatile readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by a processor, they cause the processor to execute the steps of the method for locating an object in the video.
  • FIG. 1 is a schematic diagram of the basic flow of a method for locating an object in a video according to an embodiment of this application;
  • FIG. 2 is a schematic diagram of a process of acquiring a first image feature of an object to be positioned according to an embodiment of the application;
  • FIG. 3 is a schematic diagram of a process for determining a candidate object according to an embodiment of the application
  • FIG. 4 is a schematic diagram of a training process of a convolutional neural network model according to an embodiment of the application
  • FIG. 5 is a schematic diagram of a process of determining an object to be located according to an embodiment of the application
  • FIG. 6 is a block diagram of the basic structure of an apparatus for locating an object in a video according to an embodiment of this application;
  • FIG. 7 is a block diagram of the basic structure of the computer equipment implemented in this application.
  • the terms "terminal" and "terminal device" used herein cover both devices that have only a wireless signal receiver with no transmitting capability and devices that have receiving and transmitting hardware capable of two-way communication over a two-way communication link.
  • such devices may include: cellular or other communication devices with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio-frequency receiver.
  • the "terminal" and "terminal device" used here may be portable, transportable, installed in a vehicle (air, sea and/or land), or suitable for and/or configured to operate locally and/or, in a distributed form, at any location on the earth and/or in space.
  • the "terminal" and "terminal device" used here may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playback function, and may also be a device such as a smart TV or a set-top box.
  • the terminals involved in this embodiment are the aforementioned terminals.
  • FIG. 1 is a schematic flowchart of a method for locating an object in a video according to this embodiment.
  • a method for positioning an object in a video includes the following steps:
  • the first image feature of the object to be positioned is received through an interactive interface, where the object to be positioned refers to a specific person, and the first image feature here refers to the contour feature or color feature of the image containing the object to be positioned, or a combination of the two.
  • contour features include a person's build, such as being tall, short, heavy or slim
  • color features include people's skin color, hair color, and clothing color.
  • the aforementioned features can be input by the user through an interactive interface.
  • the contour feature extraction algorithm and the color feature extraction algorithm are used to obtain the first image feature of the object to be located. Specifically, please refer to FIG. 2.
  • S102 Search a preset video database according to the first image feature of the object to be located, and obtain an image of the candidate object that matches the first image feature of the object to be located;
  • the preset video database is retrieved according to the first image feature, where the preset video database refers to the storage space where the video collected by the video surveillance device is saved.
  • the existing semantic-based retrieval requires pre-marking of the semantic attributes of the image.
  • the image comes from the real-time collection of the video surveillance equipment, and pre-marking is not applicable.
  • similar feature comparison is used.
  • for the specific algorithm, please refer to Figure 3.
  • the face feature of the object to be located is acquired through an interactive interface, where the face feature is an n-dimensional vector representing the feature of the face image.
  • Image features are the (essential) characteristics that distinguish one class of objects from other classes, or a collection of such characteristics.
  • Features are data that can be extracted through measurement or processing. Every image has characteristics of its own that distinguish it from other images: some are natural features that can be perceived intuitively, such as brightness, edges, texture and color; others can be obtained only through transformation or processing, such as moments, histograms and principal components.
  • Grayscale conversion treats the image as a three-dimensional function of x, y and z (gray level).
  • a pre-trained convolutional neural network is used to extract the feature vectors of a face image.
  • Compared with other methods, features extracted by a convolutional neural network are less prone to over-fitting, and the fitting capacity of the overall model can be controlled more flexibly through the choice of convolution and pooling layers and the size of the final output feature vector. See Figure 4 for the training steps.
  • In this step, the facial feature of the object to be located is compared with the images of the candidate objects obtained in step S102, and the candidate object having the same facial feature as the object to be located is determined to be the final object to be located.
  • the face image of the candidate object is cropped out, the face feature vector of the candidate object is obtained in the same manner as in step S103, and the similarity between the two vectors is compared.
  • Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal.
  • the specific calculation formula is: cos(θ) = Σ(A_i × B_i) / ( √(Σ A_i²) × √(Σ B_i²) ), with the sums taken over i = 1, ..., n;
  • A_i and B_i represent the components of vectors A and B, respectively.
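  • As an illustrative sketch only (not part of the original disclosure), the cosine similarity between two face feature vectors can be computed in Python as follows; the vector values and the 0.8 threshold are assumed for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, a value in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional face feature vectors; real embeddings are larger (e.g. 128-D).
query_feature = [0.12, -0.45, 0.88, 0.03]
candidate_feature = [0.10, -0.40, 0.90, 0.00]

similarity = cosine_similarity(query_feature, candidate_feature)
if similarity > 0.8:  # assumed threshold; the application leaves the exact value to the implementer
    print("candidate matches the object to be located:", similarity)
```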
  • the facial features of the candidate objects are obtained with a preset facial feature extraction model, and the similarity between the facial feature of the object to be located and the facial features of the candidate objects is then compared. Please refer to FIG. 5 for details.
  • In step S101, the following steps are further included:
  • S112. Process the image of the object to be located according to an image contour feature extraction algorithm and/or a color feature extraction algorithm to obtain the first image feature of the object to be located.
  • Image contour feature extraction is performed on the image of the object to be positioned.
  • the contour features of the image can be extracted with an image gradient algorithm.
  • the gradient of the image function f(x, y) at the point (x, y) is a vector with magnitude and direction; let Gx and Gy denote the gradients in the x direction and the y direction, respectively.
  • the gradient vector can be expressed as: ∇f(x, y) = (Gx, Gy) = (∂f/∂x, ∂f/∂y)
  • in a digital image, the gradient can be approximated as: Gx = f(x, y) - f(x-1, y) and Gy = f(x, y) - f(x, y-1)
  • f(x, y) is the image function of the image whose contour is to be calculated
  • f(x, y), f(x-1, y) and f(x, y-1) are the values of the image function at the points (x, y), (x-1, y) and (x, y-1), respectively
  • Gx and Gy are the gradients of the image function f(x, y) in the x direction and the y direction, respectively.
  • the direction of the gradient is the direction in which the function f(x, y) changes fastest.
  • where there are edges in the image, the gradient value is large; conversely, in relatively smooth parts of the image the gray value changes little and the corresponding gradient is also small.
  • the image gradient algorithm considers the grayscale change in a neighborhood of each pixel of the image and, using the first-order or second-order derivative behavior near edges, applies a gradient operator, such as the Sobel operator, Robinson operator or Laplace operator, over a neighborhood of each pixel of the original image; convolving the original image with the gradient operator yields the contour of the target object image.
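  • A minimal Python sketch of such contour extraction, assuming OpenCV, the Sobel operator and an input file named person.jpg, is shown below; the kernel size and edge threshold are illustrative choices, not values specified in this application.

```python
import cv2
import numpy as np

# Read the image of the object to be located and convert it to grayscale.
image = cv2.imread("person.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Convolve with the Sobel operator to approximate Gx and Gy.
gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)

# Gradient magnitude sqrt(Gx^2 + Gy^2); large values correspond to edges.
magnitude = cv2.magnitude(gx, gy)
contour_map = np.uint8(np.clip(magnitude, 0, 255))

# Keep only strong edges as the contour of the target object (threshold is illustrative).
_, contour_mask = cv2.threshold(contour_map, 60, 255, cv2.THRESH_BINARY)
cv2.imwrite("contour.png", contour_mask)
```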
  • the color is a global feature that describes the surface properties of the scene corresponding to the image or image area.
  • color features are generally pixel-based features.
  • the target detection algorithm is realized by the cascaded convolutional neural network model.
  • the color characteristics of the target image are obtained by calculating the color histogram of the cropped image.
  • the color histogram can be calculated by the API function calcHist provided in OpenCV for calculating the image histogram.
  • the images in the video stream are also down-sampled by the same factor, and the down-sampled images are matched against the contour feature or color feature of the extracted target object to obtain the candidate target objects.
  • the pixel data is reduced, which can reduce the amount of calculation and speed up the calculation.
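  • The following Python sketch, assuming OpenCV, computes the color histogram of a cropped target image with calcHist and compares it with a down-sampled video frame; the file names, down-sampling factor and bin counts are assumptions made only for illustration.

```python
import cv2

def color_histogram(image_bgr, bins=(8, 8, 8)):
    """Per-channel BGR histogram, normalized so images of different sizes are comparable."""
    hist = cv2.calcHist([image_bgr], [0, 1, 2], None, list(bins), [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

target = cv2.imread("target_crop.png")   # cropped image of the object to be located
frame = cv2.imread("video_frame.png")    # one frame decoded from the video database

# Down-sample the frame (a factor of 2 is assumed here) to reduce pixel data
# and speed up the comparison, as described above.
frame_small = cv2.resize(frame, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)

similarity = cv2.compareHist(color_histogram(target), color_histogram(frame_small),
                             cv2.HISTCMP_CORREL)
print("colour similarity:", similarity)
```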
  • In step S102, the following steps are further included:
  • the video image frame is the decomposition of the video, and third-party software can be used to decompose the video to obtain the video image frame.
  • S122. Input the video image frame into a preset target detection model, and obtain an image of the target object output by the target detection model in response to the video image frame, wherein the preset target detection model is based on a pre-trained deep learning neural network; the deep learning neural network performs target detection on the input video image frame, and the output image of the target object is a human body image;
  • the obtained video image frames often contain more content than just the target image.
  • the target detection is performed on the video image frame.
  • the target detection in this application is to detect the human body, and the purpose is to remove other parts except the human body image.
  • the image of the target object obtained after detection is a human body image.
  • a pre-trained deep learning neural network is used to detect the target object.
  • the video image frame is first divided into equal parts.
  • the input image is divided into a 7*7 grid of cells.
  • the gridded image is input into the deep learning neural network, and for each grid cell the deep learning neural network predicts 2 prediction boxes.
  • each predicted prediction box contains 5 values: x, y, w, h and confidence.
  • x and y are the center coordinates of the prediction box, and w and h are the width and height of the prediction box.
  • the third convolutional neural network outputs a 7x7x(2x5+1) prediction tensor that is used in the next step to determine the target object prediction boxes. After the prediction tensor is obtained, filtering is performed by setting a confidence threshold.
  • prediction boxes with a confidence below the threshold are filtered out, leaving only the higher-confidence boxes as the remaining boxes. Then, for each remaining prediction box, the IOU (overlap) between that box and each of the other remaining boxes is calculated in turn; if the IOU is greater than the preset threshold, the overlapping prediction box is eliminated. This process is repeated over the remaining prediction boxes until all prediction boxes have been processed and the image of the target object is obtained.
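  • The confidence filtering and IOU-based elimination described above can be sketched in Python as follows; the box coordinates, confidences and both thresholds are hypothetical values used only to illustrate the procedure.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_predictions(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Drop low-confidence boxes, then suppress overlapping boxes (non-maximum suppression)."""
    keep_idx = [i for i, s in enumerate(scores) if s >= conf_thresh]
    keep_idx.sort(key=lambda i: scores[i], reverse=True)
    selected = []
    for i in keep_idx:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in selected):
            selected.append(i)
    return [boxes[i] for i in selected]

# Hypothetical prediction boxes (x1, y1, x2, y2) and confidences from the 7x7x(2x5+1) tensor.
boxes = [(40, 30, 120, 210), (45, 35, 125, 215), (300, 60, 360, 200)]
scores = [0.92, 0.85, 0.30]
print(filter_predictions(boxes, scores))
```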
  • S123 Calculate the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
  • the image of the target object is calculated according to the image contour feature extraction algorithm and/or the color feature extraction algorithm, and the specific algorithm is the same as in step S112.
  • contour feature matching is realized by the contour moment matching method.
  • Contour moments can be spatial moments, central moments, etc.
  • the spatial moment of order (p, q) has the form m_pq = Σ I(x, y) · x^p · y^q, with the sum taken over the points on the contour, where:
  • I(x, y) is the value of the image contour at the pixel point (x, y)
  • n is the number of points on the contour
  • p and q are the orders of the moment in the x and y dimensions, respectively, giving the moments m00, m10, m01, ..., m03
  • the zero-order moment m00 is a simple accumulation of the points on the contour, that is, the number of points on the contour.
  • the first moments m10 and m01 are the accumulation in the x and y directions, respectively.
  • the spatial moment can be calculated by the OpenCV function cvGetSpatialMoment().
  • Color feature matching uses the histogram comparison function compareHist() provided by OpenCV to compare the similarity.
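  • As an illustration, the following Python sketch computes contour moments and a moment-based contour match using OpenCV's Python interface (cv2.moments and cv2.matchShapes, the counterparts of the C-style cvGetSpatialMoment() mentioned above); the file names, the threshold value and OpenCV 4's two-value findContours return are assumptions.

```python
import cv2

def largest_contour(gray):
    """Binarize the grayscale image and return its largest external contour."""
    _, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return max(contours, key=cv2.contourArea)

target = cv2.imread("target_crop.png", cv2.IMREAD_GRAYSCALE)
candidate = cv2.imread("candidate_crop.png", cv2.IMREAD_GRAYSCALE)

# Spatial moments m00, m10, m01, ... of the target's largest contour.
moments_target = cv2.moments(largest_contour(target))
print("m00:", moments_target["m00"])

# Moment-based contour matching; smaller values mean more similar shapes.
shape_distance = cv2.matchShapes(largest_contour(target), largest_contour(candidate),
                                 cv2.CONTOURS_MATCH_I1, 0.0)
print("contour distance:", shape_distance)
```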
  • the training of the pre-trained convolutional neural network model includes the following steps:
  • the training samples are face images marked with identity identifiers.
  • the training samples are input into the convolutional neural network model, and the convolutional neural network model outputs a prediction result for the identity of each sample.
  • the loss function takes the form loss = -(1/N) Σ y_i · log(h_i), with the sum taken over i = 1, ..., N, where:
  • N is the number of training samples
  • y_i is the labeled result corresponding to the i-th sample
  • h = (h1, h2, ..., h_i) is the prediction result of sample i.
  • the loss function is used to compare whether the prediction result of the identity of the training sample is consistent with the marked identity, and the Softmax cross-entropy loss function is used in the embodiment of the application.
  • when the value of the loss function no longer decreases, or begins to increase, the training of the convolutional neural network can be considered finished.
  • the gradient descent method is an optimization algorithm used in machine learning and artificial intelligence that recursively approximates the model with minimum deviation.
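  • A framework-agnostic sketch of this training procedure is given below: a toy linear classifier stands in for the convolutional neural network, and one hundred gradient-descent steps are run on a softmax cross-entropy loss; the sample data, dimensions and learning rate are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for face features and identity labels: 6 samples, 4 features, 3 identities.
features = rng.normal(size=(6, 4))
labels = np.array([0, 1, 2, 0, 1, 2])

weights = np.zeros((4, 3))   # linear classifier standing in for the CNN's final layer
learning_rate = 0.1

for step in range(100):
    logits = features @ weights
    # Softmax over the identity classes.
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Cross-entropy loss: -(1/N) * sum of log-probability of each sample's true identity.
    loss = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    # Gradient descent: step against the gradient of the loss with respect to the weights.
    grad_logits = probs.copy()
    grad_logits[np.arange(len(labels)), labels] -= 1.0
    grad_logits /= len(labels)
    weights -= learning_rate * features.T @ grad_logits

print("final loss:", loss)
```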
  • In step S104, the following steps are further included:
  • In step S102, an image of the candidate object is obtained; face detection is performed on the candidate object image, and the face image of the candidate object is cropped out.
  • the face detection method is the same as the method described in step S122.
  • S142 Input the face image of the candidate object into the preset face feature extraction model, and obtain the face feature of the candidate object.
  • the face image of the candidate object is input to the preset face feature extraction model.
  • the preset face feature extraction model uses a pre-trained convolutional neural network model, and the training steps are the same as in FIG. 4.
  • Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal.
  • the specific calculation formula is: cos(θ) = Σ(A_i × B_i) / ( √(Σ A_i²) × √(Σ B_i²) ), with the sums taken over i = 1, ..., n;
  • A_i and B_i represent the components of vectors A and B, respectively.
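  • The following Python sketch pieces these steps together for one candidate image; the Haar-cascade detector stands in for the face detection step, extract_face_features is a hypothetical placeholder for the preset face feature extraction model, and the 0.9 threshold is assumed for illustration.

```python
import cv2
import numpy as np

def extract_face_features(face_image):
    """Placeholder for the preset face feature extraction model (a pre-trained CNN in
    the application); here it simply returns a flattened, L2-normalized thumbnail."""
    thumb = cv2.resize(cv2.cvtColor(face_image, cv2.COLOR_BGR2GRAY), (16, 16)).astype(np.float32)
    vec = thumb.flatten()
    return vec / (np.linalg.norm(vec) + 1e-9)

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

candidate = cv2.imread("candidate.png")
query_feature = extract_face_features(cv2.imread("query_face.png"))

gray = cv2.cvtColor(candidate, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
    face = candidate[y:y + h, x:x + w]            # crop the candidate's face image
    similarity = float(np.dot(query_feature, extract_face_features(face)))
    if similarity > 0.9:                          # assumed second threshold
        print("candidate located at", (x, y, w, h), "similarity", similarity)
```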
  • FIG. 6 is a basic structural block diagram of an apparatus for positioning an object in a video in this embodiment.
  • an apparatus for locating an object in a video includes a first acquisition module 210, a retrieval module 220, a second acquisition module 230, and a processing module 240.
  • the first acquisition module 210 is used to acquire the first image feature of the object to be located, where the first image feature includes an image contour feature and/or an image color feature;
  • the retrieval module 220 is configured to search the preset video database according to the first image feature of the object to be located and obtain the images of the candidate objects that match the first image feature of the object to be located;
  • the second acquisition module 230 is used to obtain the facial feature of the object to be located;
  • the processing module 240 is used to compare the facial feature of the object to be located with the images of the candidate objects and determine that the candidate object matching the facial feature of the object to be located is the object to be located.
  • In the embodiment of this application, a first image feature of the object to be located is acquired, the first image feature including an image contour feature and/or an image color feature; a preset video database is searched according to the first image feature of the object to be located to obtain images of candidate objects that match the first image feature of the object to be located; a facial feature of the object to be located is acquired; and the facial feature of the object to be located is compared with the images of the candidate objects to determine that the candidate object matching the facial feature of the object to be located is the object to be located. Searching the video database with the first image feature quickly narrows the search down to the candidate objects, and the object to be located is then located according to the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object locating.
  • the first acquisition module 210 further includes: a first acquisition sub-module, used to acquire an image of the object to be located; and a first processing sub-module, used to process the image of the object to be located according to an image contour feature extraction algorithm and/or a color feature extraction algorithm to obtain the first image feature of the object to be located.
  • the second acquisition module 230 further includes: a second acquisition sub-module, used to acquire a face image of the object to be located; and a second processing sub-module, used to input the face image of the object to be located into a preset face feature extraction model to obtain the facial feature of the object to be located.
  • the retrieval module 220 further includes: a third acquisition sub-module, configured to acquire video image frames, where the video image frames are decompositions of the videos stored in the preset video database; a detection sub-module, used to input a video image frame into a preset target detection model and obtain an image of the target object output by the target detection model in response to the video image frame, where the preset target detection model is based on a pre-trained deep learning neural network, the image of the target object includes a human body image, and the deep learning neural network performs target detection on the input video image frame to output the human body image; a first calculation sub-module, used to calculate the first image feature of the target object from the image of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm; and a third processing sub-module, used to determine the target object to be a candidate object when the degree of matching between the first image feature of the object to be located and the first image feature of the target object is greater than a preset first threshold.
  • the processing module 240 further includes: a fourth acquisition sub-module, configured to acquire the face image of the candidate object, the face image of the candidate object is intercepted from the image of the candidate object
  • the second calculation sub-module is used to input the face image of the candidate object into the preset facial feature extraction model to obtain the facial features of the candidate object
  • the fourth processing sub-module is used to calculate the degree of matching between the facial feature of the object to be located and the facial feature of the candidate object and, when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located.
  • the preset facial feature extraction model is based on a pre-trained convolutional neural network model, and the second calculation sub-module further includes: a fifth acquisition sub-module, used to obtain training samples marked with identity identifiers, the training samples being face images marked with different identity identifiers; a first prediction sub-module, used to input the training samples into the convolutional neural network model and obtain the identity identifier prediction result of each training sample; and a first comparison sub-module, used to compare, according to a loss function, whether the identity identifier prediction result of a training sample is consistent with its marked identity identifier, where the loss function takes the form loss = -(1/N) Σ y_i · log(h_i), in which:
  • N is the number of training samples
  • y_i is the labeled result corresponding to the i-th sample
  • the fifth processing sub-module is used to, when the identity identifier prediction result is inconsistent with the identity identifier, iteratively update the weights in the convolutional neural network model until the loss function converges.
  • the image contour feature extraction algorithm adopts an image gradient algorithm, and the gradient is expressed as: Gx = f(x, y) - f(x-1, y), Gy = f(x, y) - f(x, y-1), where:
  • f(x, y) is the image function of the image whose contour is to be calculated
  • f(x, y), f(x-1, y) and f(x, y-1) are the values of the image function at the points (x, y), (x-1, y) and (x, y-1), respectively
  • Gx and Gy are the gradients of the image function f(x, y) in the x direction and the y direction, respectively.
  • FIG. 7 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus.
  • the non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions.
  • the database may store control information sequences.
  • when the computer-readable instructions are executed by the processor, the processor can implement a method for locating an object in a video.
  • the processor of the computer equipment is used to provide calculation and control capabilities, and supports the operation of the entire computer equipment.
  • computer-readable instructions may be stored in the memory of the computer device; when the computer-readable instructions are executed by the processor, they may cause the processor to execute a method for locating an object.
  • the network interface of the computer device is used to connect and communicate with the terminal.
  • FIG. 7 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the processor is used to execute the specific content of the first acquisition module 210, the retrieval module 220, the second acquisition module 230, and the processing module 240 in FIG. 6, and the memory stores computer readable instructions and various types of instructions required to execute the above modules. data.
  • the network interface is used for data transmission between user terminals or servers.
  • the memory in this embodiment stores the computer-readable instructions and data required to execute all sub-modules in the method for locating objects in the video, and the server can call the computer-readable instructions and data of the server to perform the functions of all the sub-modules.
  • the computer device acquires a first image feature of the object to be located, the first image feature including an image contour feature and/or an image color feature; searches the preset video database according to the first image feature of the object to be located to obtain images of candidate objects that match the first image feature of the object to be located; acquires the facial feature of the object to be located; and compares the facial feature of the object to be located with the images of the candidate objects to determine that the candidate object matching the facial feature of the object to be located is the object to be located. Searching the video database with the first image feature quickly narrows the search down to the candidate objects, and the object to be located is then located according to the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object locating.
  • the present application also provides one or more non-volatile storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the method for locating an object in a video described in any of the foregoing embodiments.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present application belongs to the field of artificial intelligence. Disclosed are a method and apparatus for locating an object in a video, and a computer device and a storage medium. The method comprises the following steps: acquiring a first image feature of an object to be located, wherein the first image feature includes an image contour and/or image color feature; searching a preset video database according to the first image feature of the object to be located, and acquiring images of candidate objects matching the first image feature of the object to be located; acquiring a facial feature of the object to be located; and comparing the facial feature of the object to be located with the images of the candidate objects to determine an object, matching the facial feature of the object to be located, from among the candidate objects to be the object to be located. The video database is searched according to the first image feature, such that the candidate objects can be rapidly located, and the object to be located is then located according to the facial feature, thereby reducing the amount of calculation to a great extent, and improving the efficiency of locating an object in terms of time.

Description

Method, apparatus, computer device and storage medium for locating an object in a video
This application is based on the Chinese invention patent application No. 201910707924.8, filed on August 1, 2019 and titled "Method, apparatus, computer device and storage medium for locating an object in a video", and claims priority thereto.
Technical Field
This application belongs to the field of artificial intelligence, and in particular relates to a method, an apparatus, a computer device and a storage medium for locating an object in a video.
Background
With economic and social development and accelerating urbanization, population density in cities keeps increasing and the mobility of people grows day by day, giving rise to new problems in urban administration such as traffic, public order and counter-terrorism protection of key areas, and making social governance increasingly difficult. Security systems therefore use large numbers of video capture devices to monitor scenes in real time through video and to record video data for later review, in order to maintain public safety.
Analyzing the data collected by video surveillance equipment to identify, locate and track specific objects is routine work for public security agencies. However, relying only on manual inspection to recognize objects in the huge volume of video data is time-consuming, labor-intensive and inaccurate.
Although some video surveillance systems have introduced face recognition technology to locate objects, the inventors realized that face recognition requires high-precision video capture equipment: the higher the precision, the larger the video data produced, and the face recognition computation itself is complex. Retrieving the face of the object to be located from the huge volume of video data therefore takes a long time or requires substantial computing resources, which often cannot satisfy retrieval requirements in scenarios where computing resources are limited but timeliness is critical.
Summary of the Invention
This application provides a method, an apparatus, a computer device and a storage medium for locating an object in a video, to address the problem that locating an object is time-consuming.
To solve the above technical problem, this application proposes a method for locating an object in a video, including the following steps:
acquiring a first image feature of the object to be located, the first image feature including an image contour feature and/or an image color feature;
searching a preset video database according to the first image feature of the object to be located, and acquiring images of candidate objects that match the first image feature of the object to be located;
acquiring a facial feature of the object to be located; and
comparing the facial feature of the object to be located with the images of the candidate objects, and determining a candidate object that matches the facial feature of the object to be located to be the object to be located.
To solve the above technical problem, an embodiment of this application further provides an apparatus for locating an object in a video, including:
a first acquisition module, configured to acquire a first image feature of the object to be located, the first image feature including an image contour feature and/or an image color feature;
a retrieval module, configured to search a preset video database according to the first image feature of the object to be located, and acquire images of candidate objects that match the first image feature of the object to be located;
a second acquisition module, configured to acquire a facial feature of the object to be located; and
a processing module, configured to compare the facial feature of the object to be located with the images of the candidate objects, and determine that the candidate object matching the facial feature of the object to be located is the object to be located.
To solve the above technical problem, an embodiment of this application further provides a computer device, including a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for locating an object in a video described above.
To solve the above technical problem, an embodiment of this application further provides one or more non-volatile readable storage media storing computer-readable instructions; when the computer-readable instructions are executed by a processor, they cause the processor to perform the steps of the method for locating an object in a video described above.
The details of one or more embodiments of this application are set forth in the drawings and the description below; other features and advantages of this application will become apparent from the description, the drawings and the claims.
Description of the Drawings
In order to describe the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of this application; those skilled in the art can obtain other drawings from these drawings without creative work.
FIG. 1 is a schematic flowchart of a method for locating an object in a video according to an embodiment of this application;
FIG. 2 is a schematic flowchart of acquiring a first image feature of an object to be located according to an embodiment of this application;
FIG. 3 is a schematic flowchart of determining candidate objects according to an embodiment of this application;
FIG. 4 is a schematic flowchart of training a convolutional neural network model according to an embodiment of this application;
FIG. 5 is a schematic flowchart of determining the object to be located according to an embodiment of this application;
FIG. 6 is a block diagram of the basic structure of an apparatus for locating an object in a video according to an embodiment of this application;
FIG. 7 is a block diagram of the basic structure of a computer device according to an embodiment of this application.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application.
Some of the flows described in the specification, the claims and the above drawings of this application contain multiple operations that appear in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein, or in parallel. The sequence numbers of the operations, such as 101 and 102, are only used to distinguish different operations; the numbers themselves do not represent any execution order. In addition, these flows may include more or fewer operations, and these operations may be executed sequentially or in parallel. It should be noted that terms such as "first" and "second" herein are used to distinguish different messages, devices, modules and the like; they do not represent a sequence, nor do they require the "first" and the "second" to be of different types.
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application rather than all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative work fall within the protection scope of this application.
Embodiments
Those skilled in the art will understand that the terms "terminal" and "terminal device" used herein cover both devices that have only a wireless signal receiver with no transmitting capability and devices that have receiving and transmitting hardware capable of two-way communication over a two-way communication link. Such devices may include: cellular or other communication devices with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistants), which may include a radio-frequency receiver, a pager, Internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio-frequency receiver. The "terminal" and "terminal device" used herein may be portable, transportable, installed in a vehicle (air, sea and/or land), or suitable for and/or configured to operate locally and/or, in a distributed form, at any location on the earth and/or in space. The "terminal" and "terminal device" used herein may also be a communication terminal, an Internet terminal, or a music/video playback terminal, for example a PDA, an MID (Mobile Internet Device) and/or a mobile phone with a music/video playback function, and may also be a device such as a smart TV or a set-top box.
The terminals involved in this embodiment are the terminals described above.
Specifically, referring to FIG. 1, FIG. 1 is a schematic flowchart of a method for locating an object in a video according to this embodiment.
As shown in FIG. 1, a method for locating an object in a video includes the following steps:
S101. Acquire a first image feature of an object to be located, the first image feature including an image contour feature and/or an image color feature.
The first image feature of the object to be located is received through an interactive interface. The object to be located here refers to a specific person, and the first image feature refers to a contour feature of an image containing the object to be located, a color feature of that image, or a combination of the two.
Contour features include a person's build (tall, short, heavy or slim), and color features include a person's skin color, hair color and clothing color. Specifically, these features can be input by the user through an interactive interface.
In the embodiment of this application, an image of the object to be located is acquired, and a contour feature extraction algorithm and a color feature extraction algorithm are used to obtain the first image feature of the object to be located; for details, refer to FIG. 2.
S102. Search a preset video database according to the first image feature of the object to be located, and acquire images of candidate objects that match the first image feature of the object to be located.
The preset video database is searched according to the first image feature; the preset video database here refers to the storage space in which the video collected by the video surveillance devices is saved. Existing semantic-based image retrieval requires the semantic attributes of images to be annotated in advance; in the embodiment of this application, the images come from real-time collection by the video surveillance devices, so advance annotation is not applicable, and a similar-feature comparison algorithm is used instead. For the specific algorithm, refer to FIG. 3.
S103. Acquire a facial feature of the object to be located.
The facial feature of the object to be located is acquired through an interactive interface; here the facial feature is an n-dimensional vector representing the features of a face image. Image features are the (essential) characteristics that distinguish one class of objects from other classes, or a collection of such characteristics. A feature is data that can be extracted through measurement or processing. Every image has characteristics of its own that distinguish it from other images: some are natural features that can be perceived intuitively, such as brightness, edges, texture and color; others can be obtained only through transformation or processing, such as moments, histograms and principal components.
There are many methods for extracting a feature vector, for example the histogram of oriented gradients (HOG) method, which forms features by computing and accumulating histograms of gradient directions over local regions of the image. The main idea of this method is that, within an image, the appearance and shape of a local target can be well described by the density distribution of gradients or edge directions. A concrete implementation for an image is as follows:
1) grayscale conversion (treating the image as a three-dimensional function of x, y and z, where z is the gray level);
2) color-space normalization of the input image with Gamma correction, in order to adjust the contrast of the image, reduce the influence of local shadows and illumination changes, and suppress noise;
3) computing the gradient (magnitude and direction) of every pixel of the image, mainly to capture contour information while further weakening the interference of illumination;
4) dividing the image into small cells (for example 6*6 pixels per cell);
5) accumulating a histogram of gradients (the counts of the different gradient directions) for each cell, which forms the descriptor of that cell;
6) grouping every few cells into a block (for example 3*3 cells per block); the feature descriptors of all cells in a block are concatenated to obtain the gradient histogram feature descriptor of the block;
7) concatenating the gradient histogram feature descriptors of all blocks in the image to obtain the gradient histogram feature descriptor of the image (the target to be detected). This is the final feature vector that can be used for image recognition.
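As an illustration only, the gradient-histogram descriptor described above can be computed with scikit-image's hog function; the library choice, the file name and the omission of the Gamma-correction step are assumptions made for this sketch and are not specified in this application.

```python
import cv2
from skimage.feature import hog

# Steps 1-2: read the image and convert it to grayscale (Gamma correction omitted for brevity).
gray = cv2.cvtColor(cv2.imread("person.jpg"), cv2.COLOR_BGR2GRAY)

# Steps 3-7: per-pixel gradients, 6x6-pixel cells, 3x3-cell blocks; the concatenated
# block histograms form the final HOG feature vector of the image.
feature_vector = hog(gray,
                     orientations=9,
                     pixels_per_cell=(6, 6),
                     cells_per_block=(3, 3),
                     block_norm="L2-Hys")
print("HOG feature length:", feature_vector.shape[0])
```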
In the embodiment of this application, a pre-trained convolutional neural network is used to extract the feature vector of the face image. Compared with other methods, features extracted by a convolutional neural network are less prone to over-fitting, and the fitting capacity of the overall model can be controlled more flexibly through the choice of convolution and pooling layers and the size of the final output feature vector. For the training steps, refer to FIG. 4.
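Purely as an illustration of extracting an n-dimensional feature vector with a pre-trained convolutional neural network, the sketch below uses a generic ResNet-18 backbone from torchvision in place of the face feature network trained as in FIG. 4; the model choice, input size and file name are assumptions, not part of this application.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# A generic pre-trained backbone stands in for the face feature network of FIG. 4;
# the application trains its own CNN, so ResNet-18 here is purely illustrative.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier, keep the 512-D embedding
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

face = preprocess(Image.open("query_face.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = backbone(face).squeeze(0)   # n-dimensional face feature vector
print("embedding size:", embedding.shape[0])
```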
S104. Compare the facial feature of the object to be located with the images of the candidate objects, and determine that the candidate object matching the facial feature of the object to be located is the object to be located.
In the embodiment of this application, the facial feature of the object to be located is compared with the images of the candidate objects obtained in step S102, and the candidate object having the same facial feature as the object to be located is determined to be the final object to be located.
Specifically, the face image of a candidate object is cropped out, the face feature vector of the candidate object is obtained in the same manner as in step S103, and the similarity between the two vectors is compared. The Euclidean distance or the cosine similarity between the vectors is calculated to measure the similarity between them; when the similarity is greater than a set threshold, the candidate target object is confirmed to be the target object to be located. Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal. The specific calculation formula is:
cos(θ) = Σ(A_i × B_i) / ( √(Σ A_i²) × √(Σ B_i²) ), with the sums taken over i = 1, ..., n,
where A_i and B_i represent the components of vectors A and B, respectively.
In the embodiment of this application, the facial features of the candidate objects are obtained with a preset facial feature extraction model, and the similarity between the facial feature of the object to be located and the facial feature of each candidate object is then compared; for details, refer to FIG. 5.
As shown in FIG. 2, step S101 further includes the following steps:
S111. Acquire an image of the object to be located.
The image of the object to be located is acquired through an interactive interface.
S112. Process the image of the object to be located according to an image contour feature extraction algorithm and/or a color feature extraction algorithm to obtain the first image feature of the object to be located.
对待定位对象的图像进行图像轮廓特征提取。图像的轮廓特征提取可以采用图像梯度算法提取，图像函数f(x,y)在点(x,y)的梯度是一个具有大小和方向的矢量，设Gx和Gy分别表示x方向和y方向的梯度，这个梯度的矢量可以表示为：Image contour feature extraction is performed on the image of the object to be positioned. The contour features of the image can be extracted with an image gradient algorithm. The gradient of the image function f(x,y) at the point (x,y) is a vector with magnitude and direction; letting Gx and Gy denote the gradients in the x direction and y direction respectively, the gradient vector can be expressed as:
$$\nabla f(x,y)=\begin{bmatrix}G_x\\ G_y\end{bmatrix}=\begin{bmatrix}\dfrac{\partial f}{\partial x}\\[4pt]\dfrac{\partial f}{\partial y}\end{bmatrix}$$
在数字图像中,梯度可以近似表示为:In digital images, the gradient can be approximately expressed as:
Gx = f(x,y) - f(x-1,y)
Gy = f(x,y) - f(x,y-1)
其中，f(x,y)为待计算轮廓的图像的图像函数，f(x,y)、f(x-1,y)与f(x,y-1)分别是图像函数f(x,y)在点(x,y)、点(x-1,y)与点(x,y-1)的值，Gx、Gy分别为图像函数f(x,y)在x方向和y方向的梯度。Here f(x,y) is the image function of the image whose contour is to be calculated; f(x,y), f(x-1,y) and f(x,y-1) are the values of the image function f(x,y) at the points (x,y), (x-1,y) and (x,y-1); and Gx and Gy are the gradients of the image function f(x,y) in the x direction and the y direction, respectively.
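The backward-difference approximation above can be written directly in NumPy; this sketch is illustrative and assumes a single-channel image array.

```python
import numpy as np

def image_gradients(f: np.ndarray):
    """Backward differences: Gx = f(x,y) - f(x-1,y), Gy = f(x,y) - f(x,y-1)."""
    f = f.astype(np.float32)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, 1:] = f[:, 1:] - f[:, :-1]   # difference along x (columns)
    gy[1:, :] = f[1:, :] - f[:-1, :]   # difference along y (rows)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)  # large values indicate edges
    return gx, gy, magnitude
```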
梯度的方向是函数f(x,y)变化最快的方向，当图像中存在边缘时，一定有较大的梯度值，相反，当图像中有比较平滑的部分时，灰度值变化较小，则相应的梯度也较小。图像梯度算法是考虑图像的每个像素的某个邻域内的灰度变化，利用边缘临近的一阶或二阶导数变化规律，对原始图像中像素某个邻域设置梯度算子，例如Sobel算子、Robinson算子、Laplace算子等，将原始图像与梯度算子进行卷积运算，得到目标对象图像的轮廓。The direction of the gradient is the direction in which the function f(x,y) changes fastest. Where the image contains an edge there must be a large gradient value; conversely, in relatively smooth parts of the image the gray value changes little and the corresponding gradient is small. The image gradient algorithm considers the gray-level change within a neighborhood of each pixel and exploits the behavior of the first- or second-order derivatives near an edge: a gradient operator, such as the Sobel operator, Robinson operator or Laplace operator, is applied to a neighborhood of each pixel, and the original image is convolved with the gradient operator to obtain the contour of the target object image.
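As an example of the operator-based variant, the following sketch convolves an image with the Sobel operator via OpenCV; the file name and the threshold of 100 are assumptions.

```python
import cv2
import numpy as np

gray = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

# Convolve the image with the Sobel gradient operator in each direction.
gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
magnitude = cv2.magnitude(gx, gy)

# Pixels with a large gradient magnitude form the object contour;
# the threshold value 100 is illustrative only.
contour_mask = np.uint8(magnitude > 100) * 255
```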
当第一图像特征为颜色特征时，颜色是一种全局特征，描述了图像或图像区域所对应的景物的表面性质。一般颜色特征是基于像素点的特征。为了使定位更准确，在利用颜色特征进行候选目标对象匹配时，为避免背景颜色信息对目标对象进行干扰，先通过目标检测算法，识别出图像中的目标图像，然后对图像进行裁剪，只保留目标对象本身。其中目标检测算法通过级联的卷积神经网络模型实现。通过计算裁剪后图像的颜色直方图获取目标图像的颜色特征，颜色直方图可以通过OpenCV里面提供的计算图像直方图的API函数calcHist计算。在对视频流中图像进行匹配时，计算视频流中图像各目标的颜色直方图，然后通过OpenCV提供的直方图比较函数compareHist()进行相似度的比较，得到候选的目标对象。When the first image feature is a color feature: color is a global feature that describes the surface properties of the scene corresponding to an image or image region, and color features are generally pixel-based. To make the positioning more accurate and to prevent background color information from interfering with the target object when color features are used to match candidate target objects, a target detection algorithm is first used to identify the target in the image, and the image is then cropped so that only the target object itself remains. The target detection algorithm is implemented with a cascaded convolutional neural network model. The color feature of the target image is obtained by computing the color histogram of the cropped image; the color histogram can be calculated with the OpenCV API function calcHist. When matching images in the video stream, the color histogram of each target in the video stream is calculated, and the similarity is then compared with the histogram comparison function compareHist() provided by OpenCV to obtain the candidate target objects.
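A possible realization of this histogram-based matching with OpenCV's calcHist and compareHist is sketched below; the HSV bin counts, file names, and the use of correlation as the comparison metric are illustrative choices, not values from the application.

```python
import cv2

def color_hist(img):
    # Hue-saturation histogram in HSV space; the bin counts are illustrative.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

target = cv2.imread("target_cropped.jpg")    # target object after detection and cropping
candidate = cv2.imread("frame_object.jpg")   # an object detected in a video frame

# Correlation comparison: values closer to 1 indicate more similar color distributions.
similarity = cv2.compareHist(color_hist(target), color_hist(candidate), cv2.HISTCMP_CORREL)
print(similarity)
```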
也可以先对含有目标对象的图像进行降采样,然后进行轮廓特征提取或颜色特征提取。在进行候选目标对象匹配时,同样将视频流中的图像进行相同倍数的降采样,并使用降采样后的图像与提取的目标对象的轮廓特征或颜色特征进行匹配,获取候选的目标对象。图像经过降采样后,像素数据减少,可以减少计算量,加快计算速度。It is also possible to down-sample the image containing the target object first, and then perform contour feature extraction or color feature extraction. When performing candidate target object matching, the image in the video stream is also down-sampled by the same multiple, and the down-sampled image is used to match the contour feature or color feature of the extracted target object to obtain the candidate target object. After the image is down-sampled, the pixel data is reduced, which can reduce the amount of calculation and speed up the calculation.
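A simple down-sampling helper consistent with this description might look as follows; the factor of 4 is an assumed value.

```python
import cv2

def downsample(img, factor=4):
    # Reduce both dimensions by the same factor to cut the pixel count and speed up matching.
    h, w = img.shape[:2]
    return cv2.resize(img, (w // factor, h // factor), interpolation=cv2.INTER_AREA)

target_small = downsample(cv2.imread("target.jpg"))
frame_small = downsample(cv2.imread("frame.jpg"))   # video frames use the same factor
```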
如图3所示,在步骤S102中,还包括下述步骤:As shown in FIG. 3, in step S102, the following steps are further included:
S121、获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;S121. Obtain a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
视频图像帧是视频的分解,可以采用第三方软件对视频进行分解,得到视频图像帧。The video image frame is the decomposition of the video, and third-party software can be used to decompose the video to obtain the video image frame.
S122、将所述视频图像帧输入到预设的目标检测模型中，获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像，其中，所述预设的目标检测模型基于预先训练的深度学习神经网络，所述深度学习神经网络对输入的所述视频图像帧进行目标检测而输出的所述目标对象的图像为人体图像；S122. Input the video image frame into a preset target detection model, and obtain an image of the target object output by the target detection model in response to the video image frame, wherein the preset target detection model is based on a pre-trained deep learning neural network, and the image of the target object output by the deep learning neural network after performing target detection on the input video image frame is a human body image;
得到的视频图像帧往往不只包含目标图像，为了避免背景的干扰，先对视频图像帧进行目标检测，本申请的目标检测是对人体进行检测，目的是去除除了人体图像外的其他部分，经过目标检测后得到的目标对象的图像为人体图像。本申请实施例中采用预先训练的深度学习神经网络对目标对象进行检测。The obtained video image frames often contain more than the target image. To avoid background interference, target detection is first performed on the video image frame. The target detection in this application detects the human body, with the purpose of removing everything except the human body image, so the image of the target object obtained after target detection is a human body image. In the embodiments of the present application, a pre-trained deep learning neural network is used to detect the target object.
具体地，先将视频图像帧进行等分切割。本申请实施例中，输入的图像划分为7*7的拼图图像。接着将拼图图像输入深度学习神经网络，对于每个拼图格子深度学习神经网络都会预测2个预测框。预测出的预测框包含5个值：x,y,w,h和置信度。x和y是预测框的中心坐标，w和h是预测框的宽与高。我们取两个预测框中的一个，即目标对象的预测框，最后第三卷积神经网络输出一个7x7x(2x5+1)的预测张量用于下一步目标对象预测框的确定。在获取到预测张量之后，通过设置置信度阈值进行筛选，置信度小于该阈值的预测框将被过滤掉，仅留下置信度比较高的预测框作为剩余框。然后对于剩下的每个预测框，依次计算一个剩下的预测框与剩余框的IOU(重合度)值，如果IOU值大于预设阈值，那么就将该预测框剔除，并对剩余的预测框重复上述过程，直到处理完所有的预测框，得到目标对象的图像。Specifically, the video image frame is first divided into equal parts. In the embodiment of the application, the input image is divided into a 7*7 grid. The grid image is then input into the deep learning neural network, which predicts 2 prediction boxes for each grid cell. Each predicted box contains 5 values: x, y, w, h and a confidence. x and y are the center coordinates of the prediction box, and w and h are its width and height. One of the two prediction boxes is taken as the prediction box of the target object, and finally the third convolutional neural network outputs a 7x7x(2x5+1) prediction tensor used for determining the target object prediction box in the next step. After the prediction tensor is obtained, filtering is performed with a confidence threshold: prediction boxes whose confidence is smaller than the threshold are filtered out, and only the prediction boxes with higher confidence are kept as remaining boxes. Then, for each of the remaining prediction boxes, the IOU (intersection over union) value between that box and the other remaining boxes is calculated in turn; if the IOU value is greater than a preset threshold, the box is discarded. This process is repeated for the remaining prediction boxes until all prediction boxes have been processed, yielding the image of the target object.
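The confidence filtering and IOU-based suppression described above correspond to standard non-maximum suppression; a generic sketch is given below, in which the thresholds of 0.5 are assumptions.

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_boxes(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    # 1) drop low-confidence boxes, 2) suppress boxes overlapping a higher-scoring kept box.
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if scores[i] < conf_thresh:
            continue
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```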
S123、将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;S123: Calculate the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
将目标对象的图像按照图像轮廓特征提取算法和\或颜色特征提取算法计算目标对象的第一图像特征，具体的算法与步骤S112中相同。The first image feature of the target object is calculated from the image of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm; the specific algorithm is the same as in step S112.
S124、计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度，当所述匹配度大于预设的第一阈值时，确定所述目标对象为所述候选对象。S124. Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate object.
轮廓特征匹配通过轮廓矩匹配法来实现。轮廓矩可以是空间矩、中心矩等,我们以空间矩为例,如下所示:The contour feature matching is realized by the contour moment matching method. Contour moments can be spatial moments, central moments, etc. We take spatial moments as an example, as shown below:
$$m_{pq}=\sum_{i=1}^{n}I(x_i,y_i)\,x_i^{\,p}\,y_i^{\,q}$$
mpq表示图像的(p+q)阶矩,一般计算所有3阶的矩(p+q<=3)。mpq represents the (p+q) order moments of the image, and generally calculates all 3 order moments (p+q<=3).
其中I(x,y)是图像轮廓象素点(x,y)的值，一般是1，n是轮廓上点的个数，p和q分别是x维度和y维度上的阶次，即m00,m10,m01…m03。Here I(x,y) is the value of the contour pixel (x,y), generally 1; n is the number of points on the contour; and p and q are the orders in the x dimension and the y dimension respectively, giving m00, m10, m01 ... m03.
零阶矩m00是轮廓上点的简单累加,即轮廓上有多少个点。The zero-order moment m00 is a simple accumulation of points on the contour, that is, how many points are there on the contour.
一阶矩m10,m01分别是x和y方向上的累加。可以通过OpenCV的函数cvGetSpatialMoment()计算空间矩。The first moments m10 and m01 are the accumulation in the x and y directions, respectively. The spatial moment can be calculated by the OpenCV function cvGetSpatialMoment().
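In current OpenCV (version 4 assumed), cv2.moments plays the role of cvGetSpatialMoment; a small example:

```python
import cv2

gray = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# OpenCV 4 signature; cv2.moments returns the spatial, central and normalized moments.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
m = cv2.moments(contours[0])
print(m["m00"], m["m10"], m["m01"])   # zero-order and first-order spatial moments

# Two contours can also be compared through their moments:
# score = cv2.matchShapes(contour_a, contour_b, cv2.CONTOURS_MATCH_I1, 0)
```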
颜色特征匹配通过OpenCV提供的直方图比较函数compareHist()进行相似度的比较。Color feature matching uses the histogram comparison function compareHist() provided by OpenCV to compare the similarity.
如图4所述,预先训练的卷积神经网络模型的训练包括下述步骤:As shown in Figure 4, the training of the pre-trained convolutional neural network model includes the following steps:
S131、获取标记有身份标识的训练样本,所述训练样本为标记有不同身份标识的人脸图像;S131. Obtain training samples marked with identities, where the training samples are face images marked with different identities;
本申请实施例中，训练样本为标记了身份标识的人脸图像。In this embodiment of the application, the training samples are face images labeled with identity identifiers.
S132、将所述训练样本输入到卷积神经网络模型中,获取所述训练样本的身份标识预测结果;S132. Input the training sample into the convolutional neural network model, and obtain the identity prediction result of the training sample;
将训练样本输入到卷积神经网络模型中，卷积神经网络模型输出每个样本的身份标识预测结果。The training samples are input into the convolutional neural network model, and the convolutional neural network model outputs the identity prediction result for each sample.
S133、通过损失函数比对所述训练样本的身份标识预测结果与所述身份标识是否一致,其中,所述损失函数为:S133. Compare whether the identity prediction result of the training sample is consistent with the identity by a loss function, where the loss function is:
$$L=-\frac{1}{N}\sum_{i=1}^{N}y_i\log h_i$$
其中,N为训练样本数,针对第i个样本对应的yi是标记的结果,h=(h1,h2,...,hi)为样本i的预测结果。Among them, N is the number of training samples, yi corresponding to the i-th sample is the marked result, and h=(h1, h2,...,hi) is the prediction result of sample i.
通过损失函数比对训练样本的身份标识预测结果与标注的身份标识是否一致，本申请实施例采用Softmax交叉熵损失函数。在训练过程中，调整卷积神经网络模型中的权重，使Softmax交叉熵损失函数尽可能收敛，也就是说继续调整权重，在得到的损失函数的值不再缩小，反而增大时，认为卷积神经网络训练可以结束。The loss function is used to check whether the identity prediction result of the training sample is consistent with the labeled identity; the embodiment of this application uses the Softmax cross-entropy loss function. During training, the weights in the convolutional neural network model are adjusted so that the Softmax cross-entropy loss function converges as far as possible; that is, the weights keep being adjusted, and when the value of the loss function no longer decreases but instead increases, training of the convolutional neural network is considered finished.
S134、当所述身份标识预测结果与所述身份标识不一致时,反复循环迭代的更新所述卷积神经网络模型中的权重,至所述损失函数收敛时结束。S134: When the prediction result of the identity identifier is inconsistent with the identity identifier, iteratively update the weights in the convolutional neural network model repeatedly and iteratively, until the loss function converges.
如前所述当损失函数没有收敛时，更新卷积神经网络模型中的权重，本申请实施例中采用梯度下降法，梯度下降法是一个最优化算法，用于机器学习和人工智能当中用来递归性地逼近最小偏差模型。As mentioned above, when the loss function has not converged, the weights in the convolutional neural network model are updated. The embodiment of this application uses the gradient descent method, an optimization algorithm used in machine learning and artificial intelligence to recursively approach the minimum-deviation model.
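A hedged sketch of such a training loop with a softmax cross-entropy loss and gradient descent is shown below; the network architecture, image size, number of identities and learning rate are all placeholder assumptions, not the application's actual configuration.

```python
import torch
import torch.nn as nn

# Placeholder model and data; the application does not disclose its network,
# image size or number of identities, so all of these values are assumed.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()                          # softmax cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent

images = torch.randn(32, 1, 64, 64)      # dummy face images
labels = torch.randint(0, 10, (32,))     # dummy identity labels

prev_loss = float("inf")
for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    # Stop once the loss no longer decreases, as described above.
    if loss.item() >= prev_loss:
        break
    prev_loss = loss.item()
```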
如图5所示,在步骤104中,还包括下述步骤:As shown in Figure 5, in step 104, the following steps are further included:
S141、获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;S141. Obtain a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
通过步骤S102得到了候选对象的图像,对候选对象图像进行人脸检测,截取候选对象的人脸图像。人脸检测方法与步骤S122中所述的方法相同。Through step S102, an image of the candidate object is obtained, and face detection is performed on the candidate object image, and the face image of the candidate object is intercepted. The face detection method is the same as the method described in step S122.
S142、将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;S142: Input the face image of the candidate object into the preset face feature extraction model, and obtain the face feature of the candidate object.
将候选对象的人脸图像输入到预设的人脸特征提取模型，本申请实施例中，预设的人脸特征提取模型采用预先训练的卷积神经网络模型，训练步骤与图4相同。The face image of the candidate object is input into the preset face feature extraction model. In the embodiment of the present application, the preset face feature extraction model uses a pre-trained convolutional neural network model, and the training steps are the same as in FIG. 4.
S143、计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度，当所述匹配度大于预设的第二阈值时，确定所述候选对象为所述待定位对象。S143. Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located.
比较两个向量之间的相似度。计算向量之间的欧氏距离或余弦相似度来衡量两者之间的相似度，当相似度大于设定的阈值时，确认该候选目标对象为待定位目标对象。其中余弦相似度，指两个向量之间夹角的余弦值取值范围在[-1,1]之间，值越趋近于1，代表两个向量的方向越接近，两个向量越相似；越趋近于-1，他们的方向越相反；接近于0，表示两个向量近乎于正交。具体的计算公式为The similarity between the two vectors is compared. The Euclidean distance or cosine similarity between the vectors is calculated to measure how similar they are; when the similarity is greater than the set threshold, the candidate target object is confirmed as the target object to be located. Cosine similarity is the cosine of the angle between two vectors and takes values in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors and the more similar they are; the closer to -1, the more opposite their directions; a value near 0 indicates that the two vectors are nearly orthogonal. The specific calculation formula is
$$\cos\theta=\frac{A\cdot B}{\|A\|\,\|B\|}=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\,\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
其中,Ai、Bi分别代表向量A和B的各分量。Among them, Ai and Bi represent the components of vectors A and B, respectively.
为解决上述技术问题,本申请实施例还提供一种在视频中定位对象的装置。具体请参阅图6,图6为本实施例在视频中定位对象的装置的基本结构框图。To solve the above technical problem, an embodiment of the present application also provides a device for locating an object in a video. Please refer to FIG. 6 for details. FIG. 6 is a basic structural block diagram of an apparatus for positioning an object in a video in this embodiment.
如图6所示，一种在视频中定位对象的装置，包括第一获取模块210、检索模块220、第二获取模块230和处理模块240，其中第一获取模块210，用于获取待定位对象的第一图像特征，所述第一图像特征包含图像轮廓和\或图像颜色特征；检索模块220，用于根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像；第二获取模块230，用于获取待定位对象的人脸特征；处理模块240，用于将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。As shown in FIG. 6, an apparatus for locating an object in a video includes a first acquisition module 210, a retrieval module 220, a second acquisition module 230 and a processing module 240. The first acquisition module 210 is configured to acquire a first image feature of the object to be located, the first image feature including image contour and/or image color features; the retrieval module 220 is configured to search a preset video database according to the first image feature of the object to be located and obtain images of candidate objects matching the first image feature of the object to be located; the second acquisition module 230 is configured to acquire the facial features of the object to be located; and the processing module 240 is configured to compare the facial features of the object to be located with the images of the candidate objects and determine that the candidate object whose facial features match those of the object to be located is the object to be located.
本申请实施例通过获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;获取待定位对象的人脸特征;将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。通过第一图像特征检索视频数据库,可以快速定位候选对象,再根据人脸特征定位待定位对象,很大程度的减少了计算量,提高了对象定位的时效性。The embodiment of the application obtains the first image feature of the object to be located, the first image feature includes the image contour and/or the image color feature; according to the first image feature of the object to be located, the preset video database is retrieved to obtain The image of the candidate object that matches the first image feature of the object to be located; obtain the facial feature of the object to be located; compare the facial feature of the object to be located with the image of the candidate object to determine the Among the candidate objects, an object that matches the facial feature of the object to be located is the object to be located. Searching the video database through the first image feature can quickly locate the candidate object, and then locate the object to be located according to the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object positioning.
在一些实施方式中，所述第一获取模块210中，还包括：第一获取子模块，用于获取待定位对象的图像；第一处理子模块，用于根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理，获取所述待定位对象的第一图像特征。In some embodiments, the first acquisition module 210 further includes: a first acquisition sub-module, configured to acquire an image of the object to be positioned; and a first processing sub-module, configured to process the image of the object to be positioned according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
在一些实施方式中，所述第二获取模块230中，还包括：第二获取子模块，用于获取待定位对象的人脸图像；第二处理子模块，用于将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中，获取所述待定位对象图像的人脸特征。In some embodiments, the second acquisition module 230 further includes: a second acquisition sub-module, configured to acquire a face image of the object to be located; and a second processing sub-module, configured to input the face image of the object to be located into a preset face feature extraction model and obtain the facial features of the image of the object to be located.
在一些实施方式中，所述检索模块220中，还包括：第三获取子模块，用于获取视频图像帧，所述视频图像帧为所述预设的视频数据库中保存的视频的分解；第一检测子模块，用于将所述视频图像帧输入到预设的目标检测模型中，获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像，其中，所述预设的目标检测模型基于预先训练的深度学习神经网络，所述目标对象的图像包括人体图像，所述深度学习神经网络对输入的所述视频图像帧进行目标检测而输出所述人体图像；第一计算子模块，用于将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法，计算所述目标对象的第一图像特征；第三处理子模块，用于计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度，当所述匹配度大于预设的第一阈值时，确定所述目标对象为所述候选对象。In some implementations, the retrieval module 220 further includes: a third acquisition sub-module, configured to acquire video image frames, the video image frames being the decomposition of videos stored in the preset video database; a first detection sub-module, configured to input the video image frame into a preset target detection model and obtain the image of the target object output by the target detection model in response to the video image frame, where the preset target detection model is based on a pre-trained deep learning neural network, the image of the target object includes a human body image, and the deep learning neural network performs target detection on the input video image frame and outputs the human body image; a first calculation sub-module, configured to calculate the first image feature of the target object from the image of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm; and a third processing sub-module, configured to calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object and, when the matching degree is greater than a preset first threshold, determine that the target object is the candidate object.
在一些实施方式中，所述处理模块240中，还包括：第四获取子模块，用于获取所述候选对象的人脸图像，所述候选对象的人脸图像截取自所述候选对象的图像；第二计算子模块，用于将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中，获取所述候选对象的人脸特征；第四处理子模块，用于计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度，当所述匹配度大于预设的第二阈值时，确定所述候选对象为所述待定位对象。In some embodiments, the processing module 240 further includes: a fourth acquisition sub-module, configured to acquire the face image of the candidate object, the face image of the candidate object being cropped from the image of the candidate object; a second calculation sub-module, configured to input the face image of the candidate object into the preset face feature extraction model and obtain the facial features of the candidate object; and a fourth processing sub-module, configured to calculate the matching degree between the facial features of the object to be located and the facial features of the candidate object and, when the matching degree is greater than a preset second threshold, determine that the candidate object is the object to be located.
在一些实施方式中，所述第二计算子模块中，所述预设的人脸特征提取模型基于预先训练的卷积神经网络模型，其中，所述第二计算子模块中，还包括：第五获取子模块，用于获取标记有身份标识的训练样本，所述训练样本为标记有不同身份标识的人脸图像；第一预测子模块，用于将所述训练样本输入到卷积神经网络模型获取所述训练样本的身份标识预测结果；第一比对子模块，用于根据损失函数比对所述训练样本的身份标识预测结果与所述身份标识是否一致，其中，所述损失函数为：In some embodiments, in the second calculation sub-module, the preset facial feature extraction model is based on a pre-trained convolutional neural network model, and the second calculation sub-module further includes: a fifth acquisition sub-module, configured to acquire training samples labeled with identity identifiers, the training samples being face images labeled with different identity identifiers; a first prediction sub-module, configured to input the training samples into the convolutional neural network model and obtain the identity prediction results of the training samples; and a first comparison sub-module, configured to compare, according to a loss function, whether the identity prediction result of a training sample is consistent with its identity identifier, where the loss function is:
$$L=-\frac{1}{N}\sum_{i=1}^{N}y_i\log h_i$$
其中，N为训练样本数，针对第i个样本对应的yi是标记的结果，h=(h1,h2,...,hi)为样本i的预测结果；第五处理子模块，用于当所述身份标识预测结果与所述身份标识不一致时，反复循环迭代的更新所述卷积神经网络模型中的权重，至所述损失函数收敛时结束。Here N is the number of training samples, yi is the labeled result corresponding to the i-th sample, and h=(h1, h2, ..., hi) is the prediction result of sample i; and a fifth processing sub-module, configured to, when the identity prediction result is inconsistent with the identity identifier, iteratively update the weights in the convolutional neural network model until the loss function converges.
在一些实施方式中,在所述第一计算子模块中,所述图像轮廓特征提取算法采取图像梯度算法,梯度表示为:In some embodiments, in the first calculation submodule, the image contour feature extraction algorithm adopts an image gradient algorithm, and the gradient is expressed as:
Gx = f(x,y) - f(x-1,y)
Gy = f(x,y) - f(x,y-1)
其中，f(x,y)为待计算轮廓的图像的图像函数，f(x,y)、f(x-1,y)与f(x,y-1)分别是图像函数f(x,y)在点(x,y)、点(x-1,y)与点(x,y-1)的值，Gx、Gy分别为图像函数f(x,y)在x方向和y方向的梯度。Here f(x,y) is the image function of the image whose contour is to be calculated; f(x,y), f(x-1,y) and f(x,y-1) are the values of the image function f(x,y) at the points (x,y), (x-1,y) and (x,y-1); and Gx and Gy are the gradients of the image function f(x,y) in the x direction and the y direction, respectively.
为解决上述技术问题,本申请实施例还提供一种计算机设备。具体请参阅图7,图7为本实施例计算机设备基本结构框图。To solve the above technical problem, the embodiment of the present application also provides a computer device. Please refer to FIG. 7 for details. FIG. 7 is a block diagram of the basic structure of the computer device in this embodiment.
如图7所示，计算机设备的内部结构示意图。如图7所示，该计算机设备包括通过系统总线连接的处理器、非易失性存储介质、存储器和网络接口。其中，该计算机设备的非易失性存储介质存储有操作系统、数据库和计算机可读指令，数据库中可存储有控件信息序列，该计算机可读指令被处理器执行时，可使得处理器实现一种对象的定位的方法。该计算机设备的处理器用于提供计算和控制能力，支撑整个计算机设备的运行。该计算机设备的存储器中可存储有计算机可读指令，该计算机可读指令被处理器执行时，可使得处理器执行一种对象的定位的方法。该计算机设备的网络接口用于与终端连接通信。本领域技术人员可以理解，图7中示出的结构，仅仅是与本申请方案相关的部分结构的框图，并不构成对本申请方案所应用于其上的计算机设备的限定，具体的计算机设备可以包括比图中所示更多或更少的部件，或者组合某些部件，或者具有不同的部件布置。FIG. 7 is a schematic diagram of the internal structure of the computer device. As shown in FIG. 7, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions, and the database may store a sequence of control information; when the computer-readable instructions are executed by the processor, the processor is caused to implement a method for locating an object. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. Computer-readable instructions may be stored in the memory of the computer device, and when they are executed by the processor, the processor is caused to execute a method for locating an object. The network interface of the computer device is used to connect and communicate with a terminal. Those skilled in the art can understand that the structure shown in FIG. 7 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
本实施方式中处理器用于执行图6中第一获取模块210、检索模块220、第二获取模块230和处理模块240的具体内容,存储器存储有执行上述模块所需的计算机可读指令和各类数据。网络接口用于向用户终端或服务器之间的数据传输。本实施方式中的存储器存储有在视频中定位对象的方法中执行所有子模块所需的计算机可读指令及数据,服务器能够调用服务器的计算机可读指令及数据执行所有子模块的功能。In this embodiment, the processor is used to execute the specific content of the first acquisition module 210, the retrieval module 220, the second acquisition module 230, and the processing module 240 in FIG. 6, and the memory stores computer readable instructions and various types of instructions required to execute the above modules. data. The network interface is used for data transmission between user terminals or servers. The memory in this embodiment stores the computer-readable instructions and data required to execute all sub-modules in the method for locating objects in the video, and the server can call the computer-readable instructions and data of the server to perform the functions of all the sub-modules.
计算机设备通过获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;获取待定位对象的人脸特征;将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。通过第一图像特征检索视频数据库,可以快速定位候选对象,再根据人脸特征定位待定位对象,很大程度地减少了计算量,提高了对象定位的时效性。The computer device obtains the first image feature of the object to be located, and the first image feature includes the image outline and/or the image color feature; searches the preset video database according to the first image feature of the object to be located, and obtains the The image of the candidate object matched by the first image feature of the object to be located; obtain the facial feature of the object to be located; compare the facial feature of the object to be located with the image of the candidate object to determine the candidate object The object in which matches the facial feature of the object to be located is the object to be located. Retrieving the video database through the first image feature can quickly locate the candidate object, and then locate the object to be positioned based on the facial feature, which greatly reduces the amount of calculation and improves the timeliness of object positioning.
本申请还提供一个或多个存储有计算机可读指令的非易失性存储介质，所述计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器执行上述任一实施例所述在视频中定位对象的方法的步骤。The present application also provides one or more non-volatile storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the method for locating an object in a video described in any of the foregoing embodiments.
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机可读指令来指令相关的硬件来完成，该计算机可读指令可存储于一非易失性可读存储介质中，该可读指令在执行时，可包括如上述各方法的实施例的流程。其中，前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质，或随机存储记忆体(Random Access Memory,RAM)等。A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions, which can be stored in a non-volatile readable storage medium; when executed, the readable instructions may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的 说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be understood that, although the various steps in the flowchart of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless explicitly stated in this article, there is no strict order for the execution of these steps, and they can be executed in other orders. Moreover, at least part of the steps in the flowchart of the drawings may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and the order of execution is also It is not necessarily performed sequentially, but may be performed alternately or alternately with other steps or at least a part of sub-steps or stages of other steps.
以上所述仅是本申请的部分实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。The above are only part of the implementation of this application. It should be pointed out that for those of ordinary skill in the art, without departing from the principle of this application, several improvements and modifications can be made, and these improvements and modifications are also Should be regarded as the scope of protection of this application.

Claims (20)

  1. 一种在视频中定位对象的方法,其特征在于,包括下述步骤:A method for locating an object in a video, which is characterized in that it comprises the following steps:
    获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;Acquiring a first image feature of the object to be positioned, where the first image feature includes image contour and/or image color feature;
    根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;Searching a preset video database according to the first image feature of the object to be located, and acquiring an image of the candidate object that matches the first image feature of the object to be located;
    获取待定位对象的人脸特征;Obtain the facial features of the object to be located;
    将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。The face feature of the object to be located is compared with the image of the candidate object, and an object in the candidate object that matches the face feature of the object to be located is determined as the object to be located.
  2. 根据权利要求1所述的在视频中定位对象的方法,其特征在于,在所述获取待定位对象的第一图像特征的步骤中,包括下述步骤:The method for locating an object in a video according to claim 1, wherein the step of acquiring the first image feature of the object to be positioned includes the following steps:
    获取所述待定位对象的图像;Acquiring an image of the object to be positioned;
    根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征。The image of the object to be positioned is processed according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
  3. 根据权利要求1所述的在视频中定位对象的方法,其特征在于,在所述获取所述待定位对象的人脸特征的步骤中,包括下述步骤:The method for locating an object in a video according to claim 1, wherein the step of acquiring the facial features of the object to be positioned includes the following steps:
    获取所述待定位对象的人脸图像;Acquiring the face image of the object to be located;
    将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The face image of the object to be located is input into a preset face feature extraction model, and the face feature of the object image to be located is obtained.
  4. 根据权利要求1所述的在视频中定位对象的方法，其特征在于，在所述根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像的步骤中，包括下述步骤：The method for locating an object in a video according to claim 1, wherein the step of searching a preset video database according to the first image feature of the object to be located and acquiring an image of a candidate object matching the first image feature of the object to be located includes the following steps:
    获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;Acquiring a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
    将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像为人体图像;The video image frame is input into a preset target detection model, and an image of a target object output by the target detection model in response to the video image frame is obtained, wherein the preset target detection model is based on a pre-trained Deep learning neural network, the image of the target object is a human body image;
    将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;Calculating the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
    计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象。Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate Object.
  5. 根据权利要求3所述的在视频中定位对象的方法，其特征在于，在所述将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象的步骤中，包括下述步骤：The method for locating an object in a video according to claim 3, wherein the step of comparing the facial features of the object to be located with the image of the candidate object and determining that the object among the candidate objects whose facial features match those of the object to be located is the object to be located includes the following steps:
    获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;Acquiring a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
    将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;Input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located .
  6. 根据权利要求3所述的在视频中定位对象的方法,其特征在于,所述预设的人脸特征提取模型基于预先训练的卷积神经网络模型,其中,所述卷积神经网络模型的训练包括下述步骤:The method for locating an object in a video according to claim 3, wherein the preset facial feature extraction model is based on a pre-trained convolutional neural network model, wherein the training of the convolutional neural network model It includes the following steps:
    获取标记有身份标识的训练样本,所述训练样本为标记有不同身份标识的人脸图像;Acquiring training samples marked with identities, where the training samples are face images marked with different identities;
    将所述训练样本输入到卷积神经网络模型中,获取所述训练样本的身份标识预测结果;Input the training sample into a convolutional neural network model, and obtain an identity prediction result of the training sample;
    根据损失函数比对所述训练样本的身份标识预测结果与所述身份标识是否一致,其中,所述损失函数为:Check whether the identity prediction result of the training sample is consistent with the identity according to the loss function, wherein the loss function is:
    $$L=-\frac{1}{N}\sum_{i=1}^{N}y_i\log h_i$$
    其中,N为训练样本数,针对第i个样本对应的yi是标记的结果,h=(h1,h2,...,hi)为样本i的预测结果;Among them, N is the number of training samples, yi corresponding to the i-th sample is the marked result, and h=(h1,h2,...,hi) is the prediction result of sample i;
    当所述身份标识预测结果与所述身份标识不一致时,反复循环迭代的更新所述卷积神经网络模型中的权重,至所述损失函数收敛时结束。When the identity identifier prediction result is inconsistent with the identity identifier, the weights in the convolutional neural network model are updated repeatedly and iteratively until the loss function converges.
  7. 根据权利要求2所述的在视频中定位对象的方法,其特征在于,所述图像轮廓特征提取算法采取图像梯度算法,梯度表示为:The method for locating an object in a video according to claim 2, wherein the image contour feature extraction algorithm adopts an image gradient algorithm, and the gradient is expressed as:
    Gx = f(x,y) - f(x-1,y)
    Gy = f(x,y) - f(x,y-1)
    其中，f(x,y)为待计算轮廓的图像的图像函数，f(x,y)、f(x-1,y)与f(x,y-1)分别是图像函数f(x,y)在点(x,y)、点(x-1,y)与点(x,y-1)的值，Gx、Gy分别为图像函数f(x,y)在x方向和y方向的梯度。Here f(x,y) is the image function of the image whose contour is to be calculated; f(x,y), f(x-1,y) and f(x,y-1) are the values of the image function f(x,y) at the points (x,y), (x-1,y) and (x,y-1); and Gx and Gy are the gradients of the image function f(x,y) in the x direction and the y direction, respectively.
  8. 一种在视频中定位对象的装置,其特征在于,包括:A device for locating an object in a video, characterized in that it comprises:
    第一获取模块,用于获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;The first acquisition module is configured to acquire a first image feature of the object to be positioned, the first image feature including image contour and/or image color feature;
    检索模块,用于根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;A retrieval module, configured to retrieve a preset video database according to the first image feature of the object to be located, and obtain an image of a candidate object that matches the first image feature of the object to be located;
    第二获取模块,用于获取待定位对象的人脸特征;The second acquisition module is used to acquire the facial features of the object to be located;
    处理模块,用于将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。A processing module, configured to compare the facial features of the object to be located with the image of the candidate object, and determine that the object that matches the facial feature of the object to be located among the candidate objects is the object to be located .
  9. 根据权利要求8所述的在视频中定位对象的装置,其特征在于,所述第一获取模块还包括:The device for locating an object in a video according to claim 8, wherein the first acquiring module further comprises:
    第一获取子模块,用于获取待定位对象的图像;The first acquisition sub-module is used to acquire an image of the object to be positioned;
    第一处理子模块,用于根据图像轮廓特征提取算法和\或颜色特征提取算法 对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征;The first processing sub-module is configured to process the image of the object to be located according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be located;
    所述第二获取模块还包括:The second acquisition module further includes:
    第二获取子模块,用于获取待定位对象的人脸图像;The second acquisition sub-module is used to acquire the face image of the object to be located;
    第二处理子模块,用于将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The second processing sub-module is configured to input the face image of the object to be located into a preset face feature extraction model to obtain the face feature of the object image to be located.
  10. 根据权利要求8所述的在视频中定位对象的装置,其特征在于,所述检索模块还包括:The device for locating an object in a video according to claim 8, wherein the retrieval module further comprises:
    第三获取子模块,用于获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;The third acquisition sub-module is configured to acquire video image frames, where the video image frames are the decomposition of the video stored in the preset video database;
    第一检测子模块,用于将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像包括人体图像,所述深度学习神经网络对输入的所述视频图像帧进行目标检测而输出所述人体图像;The first detection submodule is configured to input the video image frame into a preset target detection model, and obtain an image of the target object output by the target detection model in response to the video image frame, wherein the preset The target detection model of is based on a pre-trained deep learning neural network, the image of the target object includes a human body image, and the deep learning neural network performs target detection on the input video image frame to output the human body image;
    第一计算子模块,用于将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;The first calculation sub-module is used to calculate the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm;
    第三处理子模块,用于计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象;The third processing sub-module is used to calculate the degree of matching between the first image feature of the object to be positioned and the first image feature of the target object, and determine when the degree of matching is greater than a preset first threshold The target object is the candidate object;
    所述处理模块还包括:The processing module further includes:
    第四获取子模块,用于获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;The fourth acquisition submodule is used to acquire the face image of the candidate object, and the face image of the candidate object is intercepted from the image of the candidate object;
    第二计算子模块,用于将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;The second calculation sub-module is configured to input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    第四处理子模块,用于计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。The fourth processing sub-module is used to calculate the matching degree between the facial features of the object to be located and the facial features of the candidate object, and when the matching degree is greater than a preset second threshold, determine the The candidate object is the object to be positioned.
  11. 一种计算机设备,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如下步骤:A computer device includes a memory and a processor. The memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor executes the following steps:
    获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;Acquiring a first image feature of the object to be positioned, where the first image feature includes image contour and/or image color feature;
    根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;Searching a preset video database according to the first image feature of the object to be located, and acquiring an image of the candidate object that matches the first image feature of the object to be located;
    获取待定位对象的人脸特征;Obtain the facial features of the object to be located;
    将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。The face feature of the object to be located is compared with the image of the candidate object, and an object in the candidate object that matches the face feature of the object to be located is determined as the object to be located.
  12. 根据权利要求1所述的计算机设备,其特征在于,在所述获取待定位对象的第一图像特征的步骤中,包括下述步骤:The computer device according to claim 1, wherein the step of acquiring the first image feature of the object to be positioned comprises the following steps:
    获取所述待定位对象的图像;Acquiring an image of the object to be positioned;
    根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征。The image of the object to be positioned is processed according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
  13. 根据权利要求1所述的计算机设备,其特征在于,在所述获取所述待定位对象的人脸特征的步骤中,包括下述步骤:The computer device according to claim 1, wherein the step of obtaining the facial features of the object to be located comprises the following steps:
    获取所述待定位对象的人脸图像;Acquiring the face image of the object to be located;
    将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The face image of the object to be located is input into a preset face feature extraction model, and the face feature of the object image to be located is obtained.
  14. 根据权利要求1所述的计算机设备，其特征在于，在所述根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像的步骤中，包括下述步骤：The computer device according to claim 1, wherein the step of searching a preset video database according to the first image feature of the object to be located and acquiring an image of a candidate object matching the first image feature of the object to be located includes the following steps:
    获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;Acquiring a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
    将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像为人体图像;The video image frame is input into a preset target detection model, and an image of a target object output by the target detection model in response to the video image frame is obtained, wherein the preset target detection model is based on a pre-trained Deep learning neural network, the image of the target object is a human body image;
    将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;Calculating the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
    计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象。Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate Object.
  15. 根据权利要求13所述计算机设备，其特征在于，在所述将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象的步骤中，包括下述步骤：The computer device according to claim 13, wherein the step of comparing the facial features of the object to be located with the image of the candidate object and determining that the object among the candidate objects whose facial features match those of the object to be located is the object to be located includes the following steps:
    获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;Acquiring a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
    将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;Input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located .
  16. 一个或多个非易失性可读存储介质,所述非易失性可读存储介质上存储有计算机可读指令,所述计算机可读指令被处理器执行时实现如下步骤:One or more non-volatile readable storage media having computer readable instructions stored on the non-volatile readable storage media, and when the computer readable instructions are executed by a processor, the following steps are implemented:
    获取待定位对象的第一图像特征,所述第一图像特征包含图像轮廓和\或图像颜色特征;Acquiring a first image feature of the object to be positioned, where the first image feature includes image contour and/or image color feature;
    根据所述待定位对象的第一图像特征检索预设的视频数据库,获取与所述待定位对象的第一图像特征匹配的候选对象的图像;Searching a preset video database according to the first image feature of the object to be located, and acquiring an image of the candidate object that matches the first image feature of the object to be located;
    获取待定位对象的人脸特征;Obtain the facial features of the object to be located;
    将所述待定位对象的人脸特征与所述候选对象的图像比对,确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象。The face feature of the object to be located is compared with the image of the candidate object, and an object in the candidate object that matches the face feature of the object to be located is determined as the object to be located.
  17. 根据权利要求1所述的非易失性可读存储介质,其特征在于,在所述 获取待定位对象的第一图像特征的步骤中,包括下述步骤:The non-volatile readable storage medium according to claim 1, wherein the step of obtaining the first image feature of the object to be positioned includes the following steps:
    获取所述待定位对象的图像;Acquiring an image of the object to be positioned;
    根据图像轮廓特征提取算法和\或颜色特征提取算法对所述待定位对象的图像进行处理,获取所述待定位对象的第一图像特征。The image of the object to be positioned is processed according to the image contour feature extraction algorithm and/or the color feature extraction algorithm to obtain the first image feature of the object to be positioned.
  18. 根据权利要求1所述的非易失性可读存储介质,其特征在于,在所述获取所述待定位对象的人脸特征的步骤中,包括下述步骤:The non-volatile readable storage medium according to claim 1, wherein the step of acquiring the facial features of the object to be located comprises the following steps:
    获取所述待定位对象的人脸图像;Acquiring the face image of the object to be located;
    将所述待定位对象的人脸图像输入到预设的人脸特征提取模型中,获取所述待定位对象图像的人脸特征。The face image of the object to be located is input into a preset face feature extraction model, and the face feature of the object image to be located is obtained.
  19. 根据权利要求1所述的非易失性可读存储介质，其特征在于，在所述根据所述待定位对象的第一图像特征检索预设的视频数据库，获取与所述待定位对象的第一图像特征匹配的候选对象的图像的步骤中，包括下述步骤：The non-volatile readable storage medium according to claim 1, wherein the step of searching a preset video database according to the first image feature of the object to be located and acquiring an image of a candidate object matching the first image feature of the object to be located includes the following steps:
    获取视频图像帧,所述视频图像帧为所述预设的视频数据库中保存的视频的分解;Acquiring a video image frame, where the video image frame is a decomposition of a video stored in the preset video database;
    将所述视频图像帧输入到预设的目标检测模型中,获取所述目标检测模型响应所述视频图像帧而输出的目标对象的图像,其中,所述预设的目标检测模型基于预先训练的深度学习神经网络,所述目标对象的图像为人体图像;The video image frame is input into a preset target detection model, and an image of a target object output by the target detection model in response to the video image frame is obtained, wherein the preset target detection model is based on a pre-trained Deep learning neural network, the image of the target object is a human body image;
    将所述目标对象的图像根据图像轮廓特征提取算法和\或颜色特征提取算法,计算所述目标对象的第一图像特征;Calculating the first image feature of the target object according to the image contour feature extraction algorithm and/or the color feature extraction algorithm on the image of the target object;
    计算所述待定位对象的第一图像特征与所述目标对象的第一图像特征之间的匹配度,当所述匹配度大于预设的第一阈值时,确定所述目标对象为所述候选对象。Calculate the matching degree between the first image feature of the object to be located and the first image feature of the target object, and when the matching degree is greater than a preset first threshold, determine that the target object is the candidate Object.
  20. 根据权利要求18所述非易失性可读存储介质，其特征在于，在所述将所述待定位对象的人脸特征与所述候选对象的图像比对，确定所述候选对象中与所述待定位对象的人脸特征匹配的对象为所述待定位对象的步骤中，包括下述步骤：The non-volatile readable storage medium according to claim 18, wherein the step of comparing the facial features of the object to be located with the image of the candidate object and determining that the object among the candidate objects whose facial features match those of the object to be located is the object to be located includes the following steps:
    获取所述候选对象的人脸图像,所述候选对象的人脸图像截取自所述候选对象的图像;Acquiring a face image of the candidate object, where the face image of the candidate object is intercepted from the image of the candidate object;
    将所述候选对象的人脸图像输入到所述预设的人脸特征提取模型中,获取所述候选对象的人脸特征;Input the face image of the candidate object into the preset face feature extraction model to obtain the face feature of the candidate object;
    计算所述待定位对象的人脸特征与所述候选对象的人脸特征之间的匹配度,当所述匹配度大于预设的第二阈值时,确定所述候选对象为所述待定位对象。Calculate the degree of matching between the facial features of the object to be located and the facial features of the candidate object, and when the degree of matching is greater than a preset second threshold, determine that the candidate object is the object to be located .
PCT/CN2019/117702 2019-08-01 2019-11-12 Method and apparatus for locating object in video, and computer device and storage medium WO2021017289A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910707924.8A CN110633627A (en) 2019-08-01 2019-08-01 Method, device, computer equipment and storage medium for positioning object in video
CN201910707924.8 2019-08-01

Publications (1)

Publication Number Publication Date
WO2021017289A1 true WO2021017289A1 (en) 2021-02-04

Family

ID=68969147

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117702 WO2021017289A1 (en) 2019-08-01 2019-11-12 Method and apparatus for locating object in video, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110633627A (en)
WO (1) WO2021017289A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023024749A1 (en) * 2021-08-24 2023-03-02 腾讯科技(深圳)有限公司 Video retrieval method and apparatus, device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170015639A (en) * 2015-07-29 2017-02-09 대한민국(관리부서: 행정자치부 국립과학수사연구원장) Personal Identification System And Method By Face Recognition In Digital Image
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
CN109190561A (en) * 2018-09-04 2019-01-11 四川长虹电器股份有限公司 Face identification method and system in a kind of video playing
CN109299642A (en) * 2018-06-08 2019-02-01 嘉兴弘视智能科技有限公司 Logic based on Identification of Images is deployed to ensure effective monitoring and control of illegal activities early warning system and method
CN109308463A (en) * 2018-09-12 2019-02-05 北京奇艺世纪科技有限公司 A kind of video object recognition methods, device and equipment
CN109344713A (en) * 2018-08-31 2019-02-15 电子科技大学 A kind of face identification method of attitude robust


Also Published As

Publication number Publication date
CN110633627A (en) 2019-12-31

Similar Documents

Publication Publication Date Title
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
US10515275B2 (en) Intelligent digital image scene detection
US12100192B2 (en) Method, apparatus, and electronic device for training place recognition model
CN109284733B (en) Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
WO2019001481A1 (en) Vehicle appearance feature identification and vehicle search method and apparatus, storage medium, and electronic device
US20190325197A1 (en) Methods and apparatuses for searching for target person, devices, and media
CN109101602A (en) Image encrypting algorithm training method, image search method, equipment and storage medium
WO2020228181A1 (en) Palm image cropping method and apparatus, computer device and storage medium
CN113392866A (en) Image processing method and device based on artificial intelligence and storage medium
WO2022247539A1 (en) Living body detection method, estimation network processing method and apparatus, computer device, and computer readable instruction product
CN114550053A (en) Traffic accident responsibility determination method, device, computer equipment and storage medium
CN109711443A (en) Floor plan recognition methods, device, equipment and storage medium neural network based
WO2022161302A1 (en) Action recognition method and apparatus, device, storage medium, and computer program product
Werner et al. DeepMoVIPS: Visual indoor positioning using transfer learning
CN104520848A (en) Searching for events by attendants
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN112258254A (en) Internet advertisement risk monitoring method and system based on big data architecture
CN112651381A (en) Method and device for identifying livestock in video image based on convolutional neural network
CN117854156B (en) Training method and related device for feature extraction model
CN112232422A (en) Target pedestrian re-identification method and device, electronic equipment and storage medium
WO2021017289A1 (en) Method and apparatus for locating object in video, and computer device and storage medium
CN108694411A (en) A method of identification similar image
CN114299539B (en) Model training method, pedestrian re-recognition method and device
CN115393755A (en) Visual target tracking method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19940104

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19940104

Country of ref document: EP

Kind code of ref document: A1