WO2019242416A1 - Video image processing method and apparatus, computer-readable medium, and electronic device - Google Patents

Video image processing method and apparatus, computer-readable medium, and electronic device

Info

Publication number
WO2019242416A1
WO2019242416A1 · PCT/CN2019/085604 · CN2019085604W
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
target object
tracking
feature map
neural network
Prior art date
Application number
PCT/CN2019/085604
Other languages
English (en)
French (fr)
Inventor
王亚彪
葛彦昊
甘振业
黄渊
邓长友
赵亚峰
黄飞跃
吴永坚
黄小明
梁小龙
汪铖杰
李季檩
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2019242416A1
Priority to US16/922,196 (published as US11436739B2)

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • G06T5/30Erosion or dilatation, e.g. thinning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present application relates to the field of image processing technology, and in particular, to a video image processing method, a video image processing apparatus, a computer-readable medium, and an electronic device.
  • With the development of image processing technology, the detection, tracking, and recognition of various objects in video images have been widely applied in fields such as human-computer interaction, intelligent monitoring, security inspection, entertainment, digital cameras, and the like.
  • For example, face recognition technology can be used to apply beautification processing to the faces identified in a video.
  • An example of the present application provides a video image processing method, which is executed by an electronic device and includes: determining a target object position area in a current frame image in a video; determining a target object tracking image corresponding to the target object position area in a next frame image; and sequentially performing multiple sets of convolution processing on the target object tracking image to determine the target object position area in the next frame image, wherein the number of convolutions of the first set of convolution processing in the multiple sets of convolution processing is smaller than the number of convolutions of the other sets of convolution processing.
  • An example of the present application also provides a video image processing device, which includes a position determination module, a tracking image acquisition module, and a next position determination module.
  • Specifically, the position determination module may be used to determine a target object position area in the current frame image in the video; the tracking image acquisition module may be used to determine a target object tracking image corresponding to the target object position area in the next frame image; and the next position determination module may be used to sequentially perform multiple sets of convolution processing on the target object tracking image to determine the target object position area in the next frame image, wherein the number of convolutions of the first set of convolution processing in the multiple sets of convolution processing is smaller than the number of convolutions of the other sets of convolution processing.
  • the example of the present application also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processor, the video image processing method according to any one of the foregoing is implemented.
  • An example of the present application further provides an electronic device, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video image processing method according to any one of the above.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which an image processing method or an image processing apparatus of an example of the present application can be applied;
  • FIG. 2 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an example of the present application
  • FIG. 3 schematically illustrates a flowchart of a video image processing method according to an exemplary embodiment of the present application
  • FIG. 4 schematically illustrates a structural diagram of a basic neural network according to an exemplary embodiment of the present application
  • FIG. 5 schematically illustrates a structural diagram of a convolution processing module according to an exemplary embodiment of the present application
  • FIG. 6 schematically illustrates a comparison diagram of a separable convolution process and a general convolution process according to an exemplary embodiment of the present application
  • FIG. 7 schematically shows a model schematic diagram of a detection neural network according to an exemplary embodiment of the present application
  • FIG. 8 schematically illustrates a candidate region according to an exemplary embodiment of the present application
  • FIG. 9 schematically illustrates a structural diagram of a tracking neural network according to an exemplary embodiment of the present application.
  • FIG. 10 schematically illustrates a structure diagram of a deep residual network according to an exemplary embodiment of the present application.
  • FIG. 11 shows examples of gesture categories, taking gesture recognition as an example, according to the present application.
  • FIG. 12 shows a logical schematic diagram of the entire flow of a video image processing method according to an exemplary embodiment of the present application
  • FIG. 13 schematically illustrates a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application
  • FIG. 14 schematically illustrates a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application
  • FIG. 15 schematically illustrates a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application.
  • FIG. 16 schematically illustrates a block diagram of a tracking image acquisition module according to an exemplary embodiment of the present application
  • FIG. 17 schematically illustrates a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application.
  • FIG. 18 schematically illustrates a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application.
  • FIG. 19 schematically illustrates a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application.
  • FIG. 20 schematically illustrates a block diagram of a position determination module according to an exemplary embodiment of the present application.
  • In video image processing technology, take gesture recognition in a video stream as an example.
  • In some examples, gesture segmentation is used to implement gesture recognition. This method requires gesture segmentation in every frame, which makes good real-time performance difficult to obtain, and its effect is poor. In other examples, skin-color detection combined with gesture recognition can be used to determine the gestures in a video; in this solution, because the skin-color model is susceptible to interference from lighting, erroneous detection of gestures is prone to occur.
  • In addition, the models used for video image processing in the above examples are large and slow to compute.
  • In view of these problems, the examples of the present application propose a video image processing method and device, a computer-readable medium, and an electronic device.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which a video image processing method or a video image processing apparatus of an example of the present application can be applied.
  • the system architecture 100 may include one or more of terminal devices 101, 102, and 103, a network 104, and a server 105.
  • the network 104 is used as a medium for providing a communication link between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the numbers of terminal devices, networks, and servers in FIG. 1 are merely exemplary. Depending on the implementation needs, there can be any number of terminal devices, networks, and servers.
  • the server 105 may be a server cluster composed of multiple servers.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • the terminal devices 101, 102, 103 may be various electronic devices with a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services.
  • the server 105 may obtain the video uploaded by the terminal devices 101, 102, and 103, determine a target object location area in the current frame image in the video, determine a target object tracking image corresponding to the target object location area in the next frame image, and sequentially subject the target object tracking image to multiple sets of convolution processing to determine the target object position area in the next frame image, wherein the number of convolutions of the first group of convolution processing in the multiple groups of convolution processing is smaller than the number of convolutions of the other groups of convolution processing.
  • determining a target object position region in the current frame image may include inputting a feature map of the current frame image into a basic neural network for processing.
  • the basic neural network may include multiple stacked convolution processing modules, and the processing performed by each convolution processing module on the input feature map includes: performing 1×1 dimensionality-reduction convolution processing on the input feature map to obtain a first feature map; performing 1×1 expansion convolution processing on the first feature map to obtain a second feature map; performing depthwise separable convolution processing on the first feature map to obtain a third feature map; and stitching the second feature map and the third feature map to obtain the feature map output by the convolution processing module.
  • Performing depthwise separable convolution processing on the first feature map to obtain the third feature map may include: performing 3×3 convolution processing on each dimension of the first feature map to obtain an intermediate feature map with the same dimensions as the first feature map; and performing 1×1 convolution processing on the intermediate feature map to obtain the third feature map.
  • server 105 may also identify the target object in the target object position area of the next frame image to determine the type of the target object.
  • the server 105 specifically executes the video image processing method of the present application.
  • the video image processing apparatus is generally provided in the server 105.
  • the video image processing methods described in this application can also be directly performed by the terminal devices 101, 102, and 103.
  • the terminal devices 101, 102, and 103 can directly process the video image by using the method described below to detect and track the target object in the video.
  • the present application may not rely on a server.
  • the video image processing apparatus may also be provided in the terminal devices 101, 102, 103.
  • FIG. 2 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an example of the present application.
  • the computer system 200 includes a central processing unit (CPU) 201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 202 or a program loaded from a storage section 208 into a random access memory (RAM) 203.
  • The RAM 203 also stores various programs and data required for system operation.
  • the CPU 201, the ROM 202, and the RAM 203 are connected to each other through a bus 204.
  • An input / output (I / O) interface 205 is also connected to the bus 204.
  • the following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output portion 207 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card or a modem.
  • the communication section 209 performs communication processing via a network such as the Internet.
  • the driver 210 is also connected to the I / O interface 205 as needed.
  • a removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 210 as needed, so that a computer program read therefrom is installed into the storage section 208 as needed.
  • an example of the present application includes a computer program product including a computer program borne on a computer-readable medium, the computer program containing program code for performing a method shown in a flowchart.
  • the computer program may be downloaded and installed from a network through the communication section 209, and / or installed from a removable medium 211.
  • When this computer program is executed by the central processing unit (CPU) 201, the various functions defined in the system of the present application are executed.
  • the computer-readable medium shown in the present application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the foregoing.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical functions.
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • Each block in the block diagram or flowchart, and combinations of blocks in the block diagram or flowchart, can be implemented with a dedicated hardware-based system that performs the specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • the units described in the examples of this application can be implemented by software or hardware.
  • the described units can also be set in the processor.
  • In some cases, the names of these units do not constitute a limitation on the units themselves.
  • the present application also provides a computer-readable medium, which may be included in the electronic device described in the above examples; or may exist alone without being assembled into the electronic device .
  • the computer-readable medium carries one or more programs, and when the one or more programs are executed by one electronic device, the electronic device is caused to implement a method as described in the following example.
  • the video image processing scheme described below can be adapted to the needs of mobile Internet products. Specifically, it can be applied to face recognition in cameras, human body detection in portrait selfies, body-feature (for example, gesture) detection in short entertainment videos, and vehicle detection when photographing and recognizing vehicles.
  • FIG. 3 schematically illustrates a flowchart of a video image processing method according to an exemplary embodiment of the present application.
  • the video image processing method may include the following steps:
  • the target object may include, but is not limited to, a human face, a gesture, a car, a tree, and the like in the image. It should be understood that any element in the video image may be a target object described in this application.
  • the target object position area can be determined by the width W, height H, and the specific position (x, y) of the target object in the image.
  • the current frame image may refer to the first frame image of the video image, that is, the target object position region in the first frame image of the original video is detected in step S32.
  • the current frame image may also refer to an image in which the target object needs to be re-detected after an abnormality occurs during video image processing.
  • the abnormality described here may include that no target object is detected in the image, and may also include that the target object is not tracked in the following target object tracking scheme.
  • That is, the present application performs the process of detecting the target object location area only in these cases, and otherwise uses the tracking scheme described below to determine the position of the target object in the image.
  • a detection neural network may be used to determine the location area of the target object, and the detection neural network may include a basic neural network and an output neural network.
  • the following describes the process of detecting a target object location area in an image according to an exemplary embodiment of the present application.
  • The following takes the server performing the detection process as an example; however, a detection process performed by, for example, a mobile terminal device also falls within the concept of this application.
  • the server can input the feature map of the image into the basic neural network for processing.
  • the basic neural network may include multiple stacked convolution processing modules.
  • the processing performed by each convolution processing module may include: first, the feature map of the input image may be subjected to 1×1 dimensionality-reduction convolution processing to obtain a first feature map; then, the first feature map may be subjected to 1×1 expansion convolution processing to obtain a second feature map; in addition, the first feature map may be subjected to depthwise separable convolution processing to obtain a third feature map; next, the second feature map and the third feature map may be stitched to obtain the feature map output by the convolution processing module.
  • the specific process of the depthwise separable convolution may include: performing 3×3 convolution processing on each dimension of the first feature map to obtain an intermediate feature map with the same dimensions as the first feature map, and then performing 1×1 convolution processing on the intermediate feature map to obtain the third feature map.
  • the intermediate feature map may be sequentially subjected to batch normalization processing and linear rectification processing.
  • batch normalization processing and linear rectification processing may also be performed after the 1×1 convolution processing to obtain the third feature map.
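  • The following is a minimal PyTorch-style sketch of one such convolution processing module as just described (1×1 reduction, a 1×1 expansion branch, a depthwise separable 3×3 branch with batch normalization and linear rectification, and channel-wise stitching). The class name, channel sizes, and the exact placement of the normalization and activation layers are illustrative assumptions, not the patent's reference implementation.

      import torch
      import torch.nn as nn

      class ConvProcessingModule(nn.Module):
          """Sketch of the convolution processing module described above (assumed layout)."""

          def __init__(self, in_channels=128, squeeze_channels=16, expand_channels=64):
              super().__init__()
              # 1x1 dimensionality-reduction convolution -> first feature map
              self.squeeze = nn.Conv2d(in_channels, squeeze_channels, kernel_size=1)
              # 1x1 expansion convolution -> second feature map
              self.expand1x1 = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=1)
              # depthwise separable branch -> third feature map:
              # a 3x3 convolution applied per channel (groups == squeeze_channels) ...
              self.depthwise = nn.Conv2d(squeeze_channels, squeeze_channels, kernel_size=3,
                                         padding=1, groups=squeeze_channels)
              self.bn_dw = nn.BatchNorm2d(squeeze_channels)
              # ... followed by a 1x1 convolution across channels
              self.pointwise = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=1)
              self.bn_pw = nn.BatchNorm2d(expand_channels)
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x):
              first = self.relu(self.squeeze(x))                    # first feature map
              second = self.relu(self.expand1x1(first))             # second feature map
              mid = self.relu(self.bn_dw(self.depthwise(first)))    # intermediate feature map
              third = self.relu(self.bn_pw(self.pointwise(mid)))    # third feature map
              # stitch (concatenate) the second and third feature maps along the channel axis
              return torch.cat([second, third], dim=1)

      # Example: 128 channels in, 64 + 64 = 128 channels out (matching the FIG. 5 example below)
      out = ConvProcessingModule()(torch.randn(1, 128, 56, 56))
      print(out.shape)  # torch.Size([1, 128, 56, 56])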
  • the feature map may be subjected to maximum pooling processing.
  • The predetermined convolution processing module is related to the actual detection scenario; that is, for different detection scenarios, the positions and number of the predetermined convolution processing modules among the multiple stacked convolution processing modules may differ. The predetermined convolution processing modules can be configured by the developer, which is not particularly limited in this exemplary embodiment.
  • the role of the maximum pooling process is: on the one hand, it can achieve the effect of dimensionality reduction and facilitate the subsequent convolution process; on the other hand, it can ensure the invariance of features and make the detection process more robust.
  • This application does not specifically limit the process of the maximum pooling process.
  • the step size of the maximum pooling process may be set to two.
  • FIG. 4 illustrates a schematic diagram of a network structure of a basic neural network described in this application.
  • the exemplary network structure may include a convolution layer 401, maximum pooling layers 403 and 409, a convolution processing module 405, 407, 411, 413, and 415, a mean pooling layer 417, and a softmax layer 419.
  • the structure shown in FIG. 4 is only an exemplary description, and other convolutional layers may be included in the network.
  • the position and number of the convolution processing module and the maximum pooling layer will also change depending on the actual application scenario.
  • the structure of the convolution processing module will be exemplarily described below with reference to FIG. 5.
  • the dimension of the input feature map can be 128, that is, there are 128 feature maps input into the convolution processing module.
  • the input feature map may be processed by the first convolution unit 501 to generate a feature map with a dimension of 16, that is, a first feature map.
  • Specifically, the first convolution unit 501 may perform 1×1 dimensionality-reduction convolution processing with a dimension of 16, where the values of the 1×1 convolution kernels may differ according to actual detection needs.
  • Next, on the one hand, the first feature map may be processed by the second convolution unit 502 to generate a feature map with a dimension of 64, that is, the second feature map.
  • Specifically, the second convolution unit 502 may perform 1×1 expansion convolution processing with a dimension of 64. On the other hand, the first feature map may be processed by the third convolution unit 503 to generate a feature map with a dimension of 64, that is, the third feature map. Then, the second feature map generated by the second convolution unit 502 and the third feature map generated by the third convolution unit 503 may be input to the feature map stitching unit 504, and the feature map stitching unit 504 may stitch the second feature map and the third feature map by dimension to obtain a feature map with a dimension of 128, that is, the feature map output by the convolution processing module.
  • the third convolution unit 503 may further include a first convolution sub-unit 5031 and a second convolution sub-unit 5032.
  • Specifically, the first convolution subunit 5031 may perform 3×3 convolution processing on each dimension of the first feature map to obtain an intermediate feature map with the same dimensions as the first feature map, and the second convolution subunit 5032 may perform 1×1 convolution processing on the intermediate feature map to obtain the third feature map.
  • the third convolution unit 503 may further include a batch normalization unit and a linear rectification unit.
  • Specifically, the batch normalization unit may be a BN (Batch Normalization) layer, which is used to speed up network learning; the linear rectification unit may be a ReLU (Rectified Linear Unit) layer, which is used to increase network sparsity and increase training speed.
  • In addition, the first convolution subunit 5031 performs a depthwise convolution, that is, a convolution applied separately to each channel (depth) of its input.
  • FIG. 6 schematically shows a comparison of the convolution effects of an ordinary 3×3 convolution and a 3×3 depthwise convolution.
  • Assume that the size of the convolution kernel is D_K × D_K, the number of input feature maps is M, the number of output feature maps is N, and the size of the output feature maps is D_F × D_F.
  • For an ordinary convolution, the computational cost can be expressed as: D_K × D_K × M × N × D_F × D_F.
  • For the depthwise convolution alone, the computational cost is only: D_K × D_K × M × D_F × D_F.
  • Adding the N 1×1 convolutions, the total computational cost of the depthwise separable convolution (that is, the processing performed by the first convolution subunit 5031 together with the second convolution subunit 5032) can be expressed as: D_K × D_K × M × D_F × D_F + M × N × D_F × D_F.
  • For a 3×3 kernel, the computational cost of the depthwise separable convolution is therefore only about 1/9 to 1/8 of that of the ordinary convolution, so depthwise separable convolution can effectively improve the detection speed.
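  • As a quick check of the ratio above, the following sketch counts multiply-accumulate operations for an ordinary convolution and for the depthwise separable decomposition under the stated assumptions; the concrete values of D_K, M, N, and D_F are arbitrary examples.

      def conv_cost(dk, m, n, df):
          """Multiply-accumulates of an ordinary dk x dk convolution."""
          return dk * dk * m * n * df * df

      def separable_cost(dk, m, n, df):
          """Depthwise (per-channel dk x dk) plus pointwise (n 1x1) convolutions."""
          depthwise = dk * dk * m * df * df
          pointwise = m * n * df * df
          return depthwise + pointwise

      # Example with a 3x3 kernel: the ratio is 1/n + 1/9, i.e. roughly 1/9 to 1/8
      dk, m, n, df = 3, 16, 128, 56
      ratio = separable_cost(dk, m, n, df) / conv_cost(dk, m, n, df)
      print(f"separable / ordinary = {ratio:.3f}")  # ~0.119 for these values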
  • the convolutional neural network involved in the image processing method of the present application may include an output neural network.
  • the image processing method may further include: sending the feature map output by the basic neural network to an output neural network.
  • the output neural network is used to determine the position of the target object based on the feature map output by the basic neural network in a manner of predicting the candidate region.
  • a candidate region may be used to predict the coordinate position of the target object.
  • The candidate region may be understood as position coordinates (reference boxes) defined in advance on the feature map, shown as dotted lines in FIG. 8.
  • These preset position coordinates can be used as the initial position coordinates of the target object.
  • the position of the target object can be accurately determined through network learning.
  • the number of candidate regions corresponding to each pixel on the feature map can be set by the developer, for example, the number of candidate regions corresponding to each pixel is nine.
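  • A minimal sketch of how such predefined candidate regions might be laid out is shown below, assuming nine boxes per feature-map position built from three scales and three aspect ratios; the specific scales, ratios, and stride are illustrative assumptions rather than values given by the patent.

      import itertools

      def candidate_regions(feat_h, feat_w, stride, scales=(32, 64, 128),
                            aspect_ratios=(0.5, 1.0, 2.0)):
          """Return (cx, cy, w, h) candidate boxes: 9 per feature-map position."""
          boxes = []
          for i, j in itertools.product(range(feat_h), range(feat_w)):
              cx, cy = (j + 0.5) * stride, (i + 0.5) * stride  # cell centre in image coordinates
              for s, ar in itertools.product(scales, aspect_ratios):
                  w, h = s * (ar ** 0.5), s / (ar ** 0.5)      # same area, different aspect ratio
                  boxes.append((cx, cy, w, h))
          return boxes

      boxes = candidate_regions(feat_h=7, feat_w=7, stride=32)
      print(len(boxes))  # 7 * 7 * 9 = 441 candidate regions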
  • the video image processing method of the present application may further include a process of adjusting network parameters. details as follows:
  • Specifically, the loss function of the convolutional neural network composed of the basic neural network and the output neural network can be calculated; then, the parameters of the convolutional neural network that minimize the loss function can be determined; next, the parameters that minimize the loss function can be applied to the convolutional neural network to implement the network weight adjustment process.
  • the process of calculating the loss function may include: first, the classification loss function L_conf and the position loss function L_loc may be calculated separately.
  • the position loss function L_loc can be calculated by Equation 1, in which:
  • H and W are the height and width of the feature map respectively;
  • K is the total number of the candidate regions;
  • I_ijk is the detection evaluation parameter: when the IOU (Intersection Over Union, a detection evaluation function) between the k-th candidate region at position (i, j) and the real region is greater than a predetermined threshold (for example, 0.7), I_ijk is 1, and otherwise it is 0; δx_ijk, δy_ijk, δw_ijk, and δh_ijk are the predicted coordinate offsets relative to the candidate region; and the corresponding ground-truth terms are the offsets of the real area of the target object with respect to the candidate region, respectively.
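  • Equation 1 itself is not reproduced in this text. For orientation only, a standard anchor-based position loss consistent with the variables defined above would take a form such as the following; this is a hedged sketch, not necessarily the patent's exact Equation 1, with hatted quantities denoting the ground-truth offsets and N_obj an assumed normalization constant.

      L_{loc} = \frac{1}{N_{obj}} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{K} I_{ijk}
                \Big[ (\delta x_{ijk} - \delta\hat{x}_{ijk})^{2} + (\delta y_{ijk} - \delta\hat{y}_{ijk})^{2}
                    + (\delta w_{ijk} - \delta\hat{w}_{ijk})^{2} + (\delta h_{ijk} - \delta\hat{h}_{ijk})^{2} \Big]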
  • For the classification loss function L_conf (Equation 2): H and W are again the height and width of the feature map; K is the total number of the candidate regions; and C is the category to which the target object belongs.
  • I_ijk is the detection evaluation parameter defined above; the ground-truth term represents the distribution of the true area of the target object; and p_c is the predicted probability of the category to which the target object belongs.
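  • Equation 2 is likewise not reproduced here. A cross-entropy classification loss consistent with the variables above would look as follows; again this is only a hedged sketch under assumptions, with y^{G}_{c,ijk} denoting the assumed ground-truth class distribution of the k-th candidate region at position (i, j).

      L_{conf} = -\frac{1}{N_{obj}} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{K} I_{ijk}
                 \sum_{c=1}^{C} y^{G}_{c,ijk} \, \log p_{c}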
  • In some examples, a detection result with a confidence level higher than a predetermined confidence level may be determined as the coordinates of the target object.
  • In other examples, the target object position corresponding to the highest confidence level may be output and that position determined as the coordinates of the target object. This is not particularly limited in this exemplary embodiment.
  • With the convolutional neural network that implements the image processing method described above, on the one hand, better detection of the target object can be achieved; on the other hand, the convolutional neural network model is small (about 1.8 MB) and the detection speed is fast (up to about 60 ms/frame on a PC). Thereby, the needs of detecting target objects such as faces, gestures, pedestrians, and vehicles can be met.
  • the area can be used to obtain the target object tracking image from the next frame image.
  • the target object location area in the current frame image can be recorded as (x, y, w, h), where x and y represent the coordinates of the center point (or any specified point) of the location area in the current frame image, and w and h represent the width and height corresponding to the position area, respectively.
  • the position region can also be represented by a position representation manner other than a rectangular frame, for example, an oval position frame, a circular position frame, and the like.
  • the target tracking area can be obtained by enlarging the target object position area of the current frame image by a predetermined factor.
  • the predetermined multiple may be 1.5 to 2, and the enlargement may be performed about the center point of the rectangular frame.
  • the target tracking area may be recorded as (x ', y', w ', h').
  • an image corresponding to the target tracking area in the next frame of images may be determined as the target object tracking image.
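  • A small sketch of the cropping step just described is given below: the detected box (x, y, w, h) is enlarged by a predetermined factor about its centre, clipped to the image boundary, and the corresponding patch of the next frame is resized to the tracker's input resolution. The function name, the 1.8 factor, the 72×72 output size, and the use of OpenCV-style arrays are assumptions for illustration.

      import cv2  # assumed available for resizing

      def crop_tracking_image(next_frame, box, scale=1.8, out_size=(72, 72)):
          """box = (x, y, w, h), with (x, y) the centre point of the target in the current frame."""
          x, y, w, h = box
          w2, h2 = w * scale, h * scale                          # enlarged tracking area (x', y', w', h')
          img_h, img_w = next_frame.shape[:2]
          x0, y0 = int(max(0, x - w2 / 2)), int(max(0, y - h2 / 2))
          x1, y1 = int(min(img_w, x + w2 / 2)), int(min(img_h, y + h2 / 2))
          patch = next_frame[y0:y1, x0:x1]
          # adjust the resolution so the patch matches the tracking network's input
          return cv2.resize(patch, out_size)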
  • S36. Sequentially perform multiple sets of convolution processing on the target object tracking image to determine the target object location area in the next frame of images, wherein the number of convolutions of the first set of convolution processing in the multiple sets of convolution processing is smaller than the number of convolutions of the other sets of convolution processing.
  • a tracking neural network may be used to sequentially perform multiple sets of convolution processes on the target object tracking image.
  • the tracking neural network may include multiple stacked convolution blocks, each convolution block may include a convolution layer and a maximum pooling layer, and each convolution block correspondingly performs a set of convolution processing.
  • the number of convolution layers of the first convolution block among the plurality of stacked convolution blocks is smaller than the number of convolution layers of other convolution blocks.
  • an image resolution suitable for network input (for example, 72×72 or 100×100) may be determined according to the requirements of the network training structure.
  • the server can determine whether the resolution of the target object tracking image matches the resolution required by the network input. If it does not match, the resolution of the target object tracking image can be adjusted to adapt the target object tracking image to the tracking neural network.
  • the tracking neural network may include a first convolution block, a second convolution block, and a third convolution block. It should be noted that the tracking neural network may also include other convolutional blocks according to the actual video image tracking scene.
  • the first convolution block may include 1 convolution layer, and the second and third convolution blocks may each include 2 convolution layers.
  • the first convolution block may include a convolution layer 901 and a maximum pooling layer 903.
  • the size of the maximum pooling layer 903 is 2×2 and the step size is 4.
  • the convolutional layers in the second convolution block include a convolutional layer 905 composed of 16 convolution kernels of size 3×3 with a step size of 1 and a convolutional layer 907 composed of 24 convolution kernels of size 3×3 with a step size of 1;
  • the convolutional layers in the third convolution block include a convolutional layer 911 composed of 40 convolution kernels of size 3×3 with a step size of 1 and a convolutional layer 913 composed of 60 convolution kernels of size 3×3 with a step size of 1.
  • the maximum pooling layer 909 and the maximum pooling layer 915 are the same as the maximum pooling layer 903.
  • On the one hand, the dimension of the convolution layer 901 is set to 8; this relatively small value helps improve the overall network computation speed.
  • On the other hand, by configuring the convolution kernels of the convolutional layer 901 to have a size of 7×7 and a step size of 4, more features can be extracted at the initial stage of the network without consuming a large amount of computing resources. In addition, the second convolution block and the third convolution block are configured with the above structure and parameters, so that, while satisfying the tracking of the target object, the model is small and the calculation speed is fast.
  • In addition, the tracking neural network described in this application may further include an inner product layer 917 with a dimension of 96 and an inner product layer 919 with a dimension of 128, arranged in order between the third convolution block and the output of the tracking neural network.
  • the inner product layer here has a full connection function, and this configuration of two full connections contributes to the improvement of the overall network computing speed.
  • the tracking neural network of the present application has two output branches, namely the inner product layer 921 and the inner product layer 923 in FIG. 9.
  • the result is to determine the confidence level, that is, the probability, of the target object in the tracking image of the target object.
  • the range of this confidence is [0,1].
  • This application can compare the output confidence with a predetermined threshold (for example, 0.9). If it is less than the predetermined threshold, it can be determined that there is no target object in the target tracking image; at this time, the target object can be detected in the entire next frame image. The specific detection process has been described in detail in step S32 and is not repeated here.
  • The significance of having the neural network output the confidence is that wrong tracking can be avoided and the correct target position can be recovered in time.
  • the result is the location area of the target object in the next frame image, which can be characterized as (x1, y1, w1, h1).
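  • A PyTorch-style sketch of a tracking network with the FIG. 9 structure described above is given below (an 8-kernel 7×7 stride-4 convolution, convolution blocks of 16/24 and 40/60 3×3 kernels, max pooling after each block, inner product layers of 96 and 128, and two output branches for confidence and position). The paddings, the pooling stride of 2, and the assumed 72×72 input are illustrative choices made so the shapes work out; they are not taken from the patent.

      import torch
      import torch.nn as nn

      class TrackingNet(nn.Module):
          """Sketch of the FIG. 9 tracking network; layer widths follow the description, other details are assumed."""

          def __init__(self):
              super().__init__()
              self.features = nn.Sequential(
                  nn.Conv2d(3, 8, kernel_size=7, stride=4, padding=3), nn.ReLU(inplace=True),  # 901
                  nn.MaxPool2d(2, stride=2),                                                    # 903
                  nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),            # 905
                  nn.Conv2d(16, 24, kernel_size=3, padding=1), nn.ReLU(inplace=True),           # 907
                  nn.MaxPool2d(2, stride=2),                                                    # 909
                  nn.Conv2d(24, 40, kernel_size=3, padding=1), nn.ReLU(inplace=True),           # 911
                  nn.Conv2d(40, 60, kernel_size=3, padding=1), nn.ReLU(inplace=True),           # 913
                  nn.MaxPool2d(2, stride=2),                                                    # 915
              )
              self.fc = nn.Sequential(
                  nn.Flatten(),
                  nn.Linear(60 * 2 * 2, 96), nn.ReLU(inplace=True),                             # 917
                  nn.Linear(96, 128), nn.ReLU(inplace=True),                                    # 919
              )
              self.confidence = nn.Linear(128, 2)   # output branch 921: target present / absent
              self.box = nn.Linear(128, 4)          # output branch 923: (x1, y1, w1, h1)

          def forward(self, x):                     # x: (N, 3, 72, 72), assumed input resolution
              feat = self.fc(self.features(x))
              return torch.softmax(self.confidence(feat), dim=1), self.box(feat)

      conf, box = TrackingNet()(torch.randn(1, 3, 72, 72))
      print(conf.shape, box.shape)  # torch.Size([1, 2]) torch.Size([1, 4])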
  • the first loss function can be calculated according to the confidence level.
  • Here, S_j denotes the output of the j-th neuron after normalization processing, which can be obtained using Equation 5, where λ_j represents the j-th value in the inner product vector.
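  • Equation 5 itself is not reproduced in this text; assuming it is the usual softmax normalization of the inner product vector, it would read:

      S_{j} = \frac{e^{\lambda_{j}}}{\sum_{k} e^{\lambda_{k}}}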
  • the tracking neural network parameters that minimize the first loss function can be determined; then, the tracking neural network can be adjusted according to the tracking neural network parameters that minimize the first loss function.
  • the second loss function can be calculated according to the target object location area of the next frame of the image.
  • the second loss function L reg can be calculated using Equation 6:
  • the tracking neural network parameters that minimize the second loss function can be determined; then, the tracking neural network can be adjusted according to the tracking neural network parameters that minimize the second loss function.
  • the size of the aforementioned tracking neural network model is less than 1MB, which makes this model applicable to mobile phones and has good real-time tracking performance.
  • In addition, the exemplary embodiment of the present application may further include: using a deep residual network to identify the target object in the target object position area of the next frame image to determine the category of the target object.
  • the basic structure of the deep residual network is shown in FIG. 10.
  • the basic structure of the deep residual network used in this application is similar to that of an existing residual network and is not specially described here. The difference is that this application uses an 18-layer deep residual network; compared with solutions in the prior art that generally use 10 convolution kernels, this application uses 3 convolution kernels. The recognition accuracy is therefore slightly sacrificed, but the recognition speed is greatly improved and the model size is reduced. Testing shows that this slight sacrifice of recognition accuracy has no impact on determining the category of the target object, while the performance of the entire network is greatly improved.
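  • For the recognition step, a minimal illustrative sketch using a standard 18-layer residual network is shown below; torchvision's resnet18 is used purely as a stand-in for the patent's residual network, and the number of categories and the 224×224 input size are assumptions.

      import torch
      import torchvision

      num_classes = 6  # assumed number of target-object categories (e.g. gesture classes)
      model = torchvision.models.resnet18(num_classes=num_classes)

      # "crop" stands for the target object position area cropped from the next frame image
      crop = torch.randn(1, 3, 224, 224)
      category = model(crop).argmax(dim=1)   # predicted category of the target object
      print(category)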
  • Taking gesture recognition as an example and referring to FIG. 11, the above recognition method can achieve accurate recognition of gestures 1101 to 1111 in FIG. 11.
  • gestures may also include other categories.
  • the target object in the video image may be detected to determine the location area of the target object. For the detailed detection process, see step S32.
  • In step S122, it may be determined whether the target object is detected. If it is detected, the process proceeds to step S124; if it is not detected, the process returns to step S120 to detect the next frame of the video image.
  • In step S124, the target object can be tracked in real time; specifically, the tracking neural network in step S36 can be used to achieve real-time tracking. In step S126, it can be judged whether the target object is tracked. If it is tracked, the target object recognition process of step S128 can be performed; if it is not tracked, the process returns to step S120 to perform detection on the entire image currently being processed.
  • the position information of the target object can also be marked in advance in the video. In this case, the target object can be directly tracked in real time.
  • the identification process can be performed every predetermined frame, for example, the identification process is performed every 5 frames.
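  • The overall FIG. 12 flow (detect, then track frame by frame, fall back to detection when tracking fails, and recognize only every few frames) can be summarized with the following pseudo-loop; detect, track, and recognize stand for the detection, tracking, and recognition networks discussed above and are assumed interfaces, not APIs defined by the patent.

      def process_video(frames, detect, track, recognize, conf_threshold=0.9, recog_every=5):
          """Detect-then-track loop sketched from FIG. 12; the callables are assumed interfaces."""
          results, box = [], None
          for idx, frame in enumerate(frames):
              if box is None:
                  box = detect(frame)                    # full-frame detection (steps S120 / S32)
                  if box is None:
                      continue                           # nothing found: try detection again next frame
              else:
                  box, confidence = track(frame, box)    # track within the enlarged region (steps S124 / S36)
                  if confidence < conf_threshold:
                      box = None                         # tracking lost: fall back to detection
                      continue
              if idx % recog_every == 0:                 # recognize only every few frames (step S128)
                  results.append((idx, box, recognize(frame, box)))
          return results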
  • the video image processing method of the present application involves a small model and fast processing speed, and can be directly applied to a terminal device such as a mobile phone; on the other hand, the video image processing method of the present application can be applied to In various fields such as human-computer interaction, intelligent monitoring, security inspection, data entertainment, digital cameras, etc., it can achieve application purposes such as gesture recognition, face recognition, and vehicle detection with good performance. For example, in live broadcast, video chat, and other scenarios, after tracking and recognizing gestures in the video, you can add gesture pendants such as virtual bracelets or special effects such as color effects to your hand.
  • a video image processing apparatus is provided in this exemplary embodiment.
  • FIG. 13 schematically illustrates a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application.
  • a video image processing apparatus 13 may include a position determination module 131, a tracking image acquisition module 133, and a next position determination module 135.
  • Specifically, the position determination module 131 may be used to determine a target object position area in the current frame image in the video; the tracking image acquisition module 133 may be used to determine a target object tracking image corresponding to the target object position area in the next frame image; and the next position determination module 135 may be used to sequentially perform multiple sets of convolution processing on the target object tracking image to determine the target object position area in the next frame image, wherein the number of convolutions of the first group of convolution processing is smaller than the number of convolutions of the other groups of convolution processing.
  • convolution processing is performed on the target object tracking image instead of the entire next frame image, which greatly reduces the calculation amount and improves the efficiency of target object tracking; on the other hand, The number of convolutions of the first group of convolution processes in the multi-groups of convolution processes is smaller than the number of convolutions of other groups of convolution processes.
  • Such a network structure model is smaller and the processing speed is improved.
  • Compared with the video image processing device 13, the video image processing device 14 may include, in addition to the position determination module 131, the tracking image acquisition module 133, and the next position determination module 135, a target object recognition module 141.
  • the target object recognition module 141 may be configured to identify a target object in a target object position region of a next frame image by using a deep residual network to determine a category of the target object.
  • Compared with the video image processing device 13, the video image processing device 15 may include, in addition to the position determination module 131, the tracking image acquisition module 133, and the next position determination module 135, a confidence determination module 151, a confidence comparison module 153, and a next image detection module 155.
  • the confidence determination module 151 may be used to determine the confidence that the target object position region in the next frame image contains the target object; the confidence comparison module 153 may be used to compare the confidence with a predetermined threshold; An image detection module 155 may be configured to detect a target object in the next frame image if the confidence level is less than the predetermined threshold.
  • the tracking image acquisition module 133 may include an area enlargement unit 1601 and a tracking image determination unit 1603.
  • the area enlargement unit 1601 may be configured to enlarge the target object position area of the current frame image by a predetermined multiple to obtain a target object tracking area; the tracking image determination unit 1603 may be configured to compare the target frame in the next frame image with the target. The image corresponding to the object tracking area is determined as the target object tracking image.
  • the next position determination module 135 may also be used to sequentially perform multiple sets of convolution processing on the target object tracking image using a tracking neural network; wherein the tracking neural network includes multiple stacked convolution blocks, each Each convolution block includes a convolution layer and a maximum pooling layer and each convolution block performs a set of convolution processing.
  • the first convolution block in the plurality of stacked convolution blocks includes 1 convolution layer, and the other convolution blocks except the first convolution block include 2 convolution layers;
  • the convolution layer in the first convolution block includes 8 convolution kernels with a size of 7×7 and a step size of 4;
  • the convolution layers in the second convolution block include 16 convolution kernels with a size of 3×3 and a step size of 1 and 24 convolution kernels with a size of 3×3 and a step size of 1;
  • the convolution layers in the third convolution block include 40 convolution kernels with a size of 3×3 and a step size of 1 and 60 convolution kernels with a size of 3×3 and a step size of 1.
  • the tracking neural network further includes an inner product layer with a dimension of 96 and an inner product layer with a dimension of 128 arranged in order between the third convolution block and the output of the tracking neural network.
  • Using the tracking neural network, the confidence that the target object tracking image contains the target object is determined.
  • Compared with the video image processing device 15, the video image processing device 17 may further include, in addition to the position determination module 131, the tracking image acquisition module 133, the next position determination module 135, the confidence determination module 151, the confidence comparison module 153, and the next image detection module 155, a first loss function calculation module 171, a first network parameter determination module 173, and a first network adjustment module 175.
  • the first loss function calculation module 171 may be configured to calculate a first loss function according to the confidence level; the first network parameter determination module 173 may be used to determine a tracking neural network parameter that minimizes the first loss function; The network adjustment module 175 may be configured to adjust the tracking neural network according to the tracking neural network parameters that minimize the first loss function.
  • Compared with the video image processing device 13, the video image processing device 18 may further include, in addition to the position determination module 131, the tracking image acquisition module 133, and the next position determination module 135, a second loss function calculation module 181, a second network parameter determination module 183, and a second network adjustment module 185.
  • the second loss function calculation module 181 may be configured to calculate a second loss function according to a target object location area of the next frame image; the second network parameter determination module 183 may be used to determine a tracking neural network that minimizes the second loss function. Parameters; the second network adjustment module 185 may be configured to adjust the tracking neural network according to the tracking neural network parameters that minimize the second loss function.
  • It should be noted that the second loss function calculation module 181, the second network parameter determination module 183, and the second network adjustment module 185 may also be included in the video image processing device 17, so that the network parameters are determined and adjusted comprehensively by combining the two loss function calculation results.
  • Compared with the video image processing device 13, the video image processing device 19 may include, in addition to the position determination module 131, the tracking image acquisition module 133, and the next position determination module 135, a resolution adjustment module 191.
  • the resolution adjustment module 191 may be configured to adjust the resolution of the target object tracking image before the target object tracking image is input into the tracking neural network, so that the target object tracking image is adapted to the tracking neural network.
  • the position determination module 131 may include a position determination unit 2001.
  • the position determining unit 2001 may be used to input a feature map of a current frame image into a basic neural network for processing to determine a target object position area in the current frame image.
  • the basic neural network includes multiple stacked convolution processing modules.
  • The processing performed by each convolution processing module on the input feature map includes: performing 1×1 dimensionality-reduction convolution processing on the input feature map to obtain a first feature map; performing 1×1 expansion convolution processing on the first feature map to obtain a second feature map; performing depthwise separable convolution processing on the first feature map to obtain a third feature map; and stitching the second feature map and the third feature map to obtain the feature map output by the convolution processing module.
  • Performing depthwise separable convolution processing on the first feature map to obtain the third feature map includes: performing 3×3 convolution processing on each dimension of the first feature map to obtain an intermediate feature map with the same dimensions as the first feature map; and performing 1×1 convolution processing on the intermediate feature map to obtain the third feature map.
  • Although several modules or units of the device for action execution are mentioned in the detailed description above, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a video image processing method and apparatus, a computer-readable medium, and an electronic device, and relates to the field of image processing technology. The video image processing method includes: determining a target object position area in a current frame image of a video; determining, in a next frame image, a target object tracking image corresponding to the target object position area; and sequentially performing multiple groups of convolution processing on the target object tracking image to determine the target object position area in the next frame image, wherein the number of convolutions of the first group of convolution processing among the multiple groups of convolution processing is smaller than the number of convolutions of the other groups of convolution processing.

Description

Video image processing method and apparatus, computer-readable medium, and electronic device
This application claims priority to the Chinese patent application No. 201810639495.0, entitled "Video image processing method and apparatus, computer-readable medium, and electronic device", filed with the Chinese Patent Office on June 20, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of image processing technology, and in particular, to a video image processing method, a video image processing apparatus, a computer-readable medium, and an electronic device.
Background
With the development of image processing technology, the detection, tracking, and recognition of various objects in video images have been widely applied in fields such as human-computer interaction, intelligent monitoring, security inspection, entertainment, and digital cameras. For example, face recognition technology can be used to apply beautification processing to the faces recognized in a video.
It should be noted that the information disclosed in the Background section above is only intended to enhance the understanding of the background of the present application, and therefore may include information that does not constitute prior art already known to a person of ordinary skill in the art.
Summary
An example of the present application provides a video image processing method, executed by an electronic device, including: determining a target object position area in a current frame image of a video; determining a target object tracking image corresponding to the target object position area in the next frame image; and sequentially performing multiple groups of convolution processing on the target object tracking image to determine the target object position area in the next frame image, wherein the number of convolutions of the first group of convolution processing among the multiple groups of convolution processing is smaller than the number of convolutions of the other groups of convolution processing.
An example of the present application further provides a video image processing apparatus, including a position determination module, a tracking image acquisition module, and a next position determination module.
Specifically, the position determination module may be used to determine a target object position area in a current frame image of a video; the tracking image acquisition module may be used to determine a target object tracking image corresponding to the target object position area in the next frame image; and the next position determination module may be used to sequentially perform multiple groups of convolution processing on the target object tracking image to determine the target object position area in the next frame image, wherein the number of convolutions of the first group of convolution processing among the multiple groups of convolution processing is smaller than the number of convolutions of the other groups of convolution processing.
An example of the present application further provides a computer-readable medium having a computer program stored thereon, where the program, when executed by a processor, implements the video image processing method according to any one of the above.
An example of the present application further provides an electronic device, including: one or more processors; and a storage apparatus configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the video image processing method according to any one of the above.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
Brief Description of the Drawings
The accompanying drawings herein are incorporated into and constitute a part of this specification, illustrate examples consistent with the present application, and are used together with the specification to explain the principles of the present application. Obviously, the drawings in the following description are only some examples of the present application, and a person of ordinary skill in the art may obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic diagram of an exemplary system architecture to which the image processing method or image processing apparatus of an example of the present application can be applied;
FIG. 2 is a schematic structural diagram of a computer system suitable for implementing an electronic device of an example of the present application;
FIG. 3 schematically shows a flowchart of a video image processing method according to an exemplary embodiment of the present application;
FIG. 4 schematically shows a structural diagram of a basic neural network according to an exemplary embodiment of the present application;
FIG. 5 schematically shows a structural diagram of a convolution processing module according to an exemplary embodiment of the present application;
FIG. 6 schematically shows a comparison diagram of separable convolution processing and ordinary convolution processing according to an exemplary embodiment of the present application;
FIG. 7 schematically shows a model schematic diagram of a detection neural network according to an exemplary embodiment of the present application;
FIG. 8 schematically shows a schematic diagram of candidate regions according to an exemplary embodiment of the present application;
FIG. 9 schematically shows a structural diagram of a tracking neural network according to an exemplary embodiment of the present application;
FIG. 10 schematically shows a structural diagram of a deep residual network according to an exemplary embodiment of the present application;
FIG. 11 shows examples of gesture categories, taking gesture recognition as an example, according to the present application;
FIG. 12 shows a logical schematic diagram of the overall flow of a video image processing method according to an exemplary embodiment of the present application;
FIG. 13 schematically shows a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application;
FIG. 14 schematically shows a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application;
FIG. 15 schematically shows a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application;
FIG. 16 schematically shows a block diagram of a tracking image acquisition module according to an exemplary embodiment of the present application;
FIG. 17 schematically shows a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application;
FIG. 18 schematically shows a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application;
FIG. 19 schematically shows a block diagram of a video image processing apparatus according to an exemplary embodiment of the present application;
FIG. 20 schematically shows a block diagram of a position determination module according to an exemplary embodiment of the present application.
Description of Embodiments
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as being limited to the examples set forth herein; rather, these embodiments are provided so that the present application will be more comprehensive and complete, and the concept of the example embodiments will be fully conveyed to those skilled in the art. The described features, structures, or characteristics may be combined in one or more embodiments in any suitable manner. In the following description, many specific details are provided to give a full understanding of the embodiments of the present application. However, those skilled in the art will appreciate that the technical solutions of the present application may be practiced while omitting one or more of the specific details, or other methods, components, apparatuses, steps, and the like may be used. In other cases, well-known technical solutions are not shown or described in detail to avoid obscuring aspects of the present application.
In addition, the accompanying drawings are merely schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and repeated description thereof will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor apparatuses and/or microcontroller apparatuses.
The flowcharts shown in the drawings are merely exemplary and do not necessarily include all of the steps. For example, some steps may be further divided, while other steps may be combined or partially combined, so the actual order of execution may change according to the actual situation.
In video image processing technology, taking gesture recognition in a video stream as an example, some examples use gesture segmentation to implement gesture recognition. This method requires gesture segmentation in every frame, which makes good real-time performance difficult to obtain, and its effect is poor. In other examples, skin-color detection combined with gesture recognition may be used to determine the gestures in a video; in this solution, because the skin-color model is susceptible to interference from lighting, erroneous detection of gestures is prone to occur.
In addition, the models used for video image processing in the above examples are large and slow to compute.
In view of the above technical problems, the examples of the present application propose a video image processing method and apparatus, a computer-readable medium, and an electronic device.
图1示出了可以应用本申请实例的视频图像处理方法或视频图像处理装置的示例性系统架构的示意图。
如图1所示,系统架构100可以包括终端设备101、102、103中的一种或多种,网络104和服务器105。网络104用以在终端设备101、102、103和服 务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。比如服务器105可以是多个服务器组成的服务器集群等。
用户可以使用终端设备101、102、103通过网络104与服务器105交互,以接收或发送消息等。终端设备101、102、103可以是具有显示屏的各种电子设备,包括但不限于智能手机、平板电脑、便携式计算机和台式计算机等等。
服务器105可以是提供各种服务的服务器。例如,服务器105可以获取终端设备101、102、103上传的视频,确定视频中当前帧图像中的目标对象位置区域;确定下一帧图像中与所述目标对象位置区域对应的目标对象跟踪图像;对目标对象跟踪图像依次进行多组卷积处理以确定下一帧图像中的目标对象位置区域;其中,多组卷积处理中的第一组卷积处理的卷积次数小于其他组卷积处理的卷积次数。
其中,确定当前帧图像中的目标对象位置区域可以包括将当前帧图像的特征图(feature map)输入基础神经网络进行处理。具体的,基础神经网络可以包括多个堆叠的卷积处理模块,每一卷积处理模块对输入的特征图执行处理包括:对输入的特征图进行1×1降维卷积处理以得到第一特征图;对第一特征图进行1×1扩展卷积处理以得到第二特征图;对第一特征图进行深度可分离卷积处理以得到第三特征图;将第二特征图与第三特征图拼接,以得到该卷积处理模块输出的特征图。
对第一特征图进行深度可分离卷积处理以得到第三特征图可以包括:对第一特征图的各维度分别进行3×3卷积处理,以得到与第一特征图维度相同的中间特征图;对中间特征图进行1×1卷积处理以得到第三特征图。
此外,服务器105还可以对下一帧图像的目标对象位置区域中的目标对象进行识别,以确定目标对象的类别。
需要说明的是,上述的描述为服务器105具体执行本申请的视频图像处理方法的过程。在这种情况下,视频图像处理装置一般设置在服务器105中。
然而,应当理解的是,由于本申请所述视频图像处理方法具有采用模型小、处理速度快的特点,本申请另一些实例所提供的视频图像处理方法还可以直接由终端设备101、102、103执行,而不会使终端设备消耗大量的系统资源。也就是说,终端设备101、102、103可以直接利用采用下面描述的方法对视频图像进行处理,以检测并跟踪视频中的目标对象,在这种情况下,本申请可以不依靠服务器。相应地,视频图像处理装置也可以设置在终端设备101、102、103中。
图2示出了适于用来实现本申请实例的电子设备的计算机系统的结构示意图。
需要说明的是,图2示出的电子设备的计算机系统200仅是一个示例,不应对本申请实例的功能和使用范围带来任何限制。
如图2所示,计算机系统200包括中央处理单元(CPU)201,其可以根据存储在只读存储器(ROM)202中的程序或者从存储部分208加载到随机访问存储器(RAM)203中的程序而执行各种适当的动作和处理。在RAM 203中,还存储有系统操作所需的各种程序和数据。CPU 201、ROM 202以及RAM 203通过总线204彼此相连。输入/输出(I/O)接口205也连接至总线204。
以下部件连接至I/O接口205:包括键盘、鼠标等的输入部分206;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分207;包括硬盘等的存储部分208;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分209。通信部分209经由诸如因特网的网络执行通信处理。驱动器210也根据需要连接至I/O接口205。可拆卸介质211,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器210上,以便于从其上读出的计算机程序根据需要被安装入存储部分208。
特别地,根据本申请的实例,下文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请的实例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实例中,该计算机程序可以通过通信部分209从网络上被下载和安装,和/或从可拆卸介质211被安装。在该计算机程序被中央处理 单元(CPU)201执行时,执行本申请的系统中限定的各种功能。
需要说明的是,本申请所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本申请中,计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:无线、电线、光缆、RF等等,或者上述的任意合适的组合。
附图中的流程图和框图,图示了按照本申请各种实例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图或流程图中的每个方框、以及框图或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本申请实例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现,所描述的单元也可以设置在处理器中。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定。
在一些实例中,本申请还提供了一种计算机可读介质,该计算机可读介质可以是上述实例中描述的电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被一个该电子设备执行时,使得该电子设备实现如下述实例中所述的方法。
下面描述的视频图像处理方案可以适应移动互联网的产品需求。具体的,可以应用在相机的人脸识别、人像自拍中的人体检测、趣味短视频的身体特征(例如,手势)检测以及拍照识别车辆的车辆检测中。
图3示意性示出了本申请的示例性实施方式的视频图像处理方法的流程图。参考图3,所述视频图像处理方法可以包括以下步骤:
S32.确定视频中当前帧图像中的目标对象位置区域。
在本申请的示例性实施方式中，目标对象可以包括但不限于图像中的人脸、手势、汽车、树木等。应当理解的是，视频图像中的任何要素均可以作为本申请所述的目标对象。另外，目标对象位置区域可以由目标对象的宽W、高H以及在图像中所处的具体位置(x,y)来表示。
在本申请的一些实例中,当前帧图像可以指代视频图像的首帧图像,也就是说,步骤S32检测的是原始视频的第一帧图像中的目标对象位置区域。
在本申请的另一些实例中,当前帧图像可以指代在视频图像处理过程中可能出现异常时,重新对目标对象进行检测的图像。这里所述的异常可以包括在图像中并未检测到目标对象,还可以包括在下述目标对象跟踪的方案中,未跟踪到目标对象。
应当理解的是,在整个视频图像处理过程正常的情况下,本申请仅存在一次检测目标对象位置区域的过程,随后是依赖于下述跟踪方案来确定图像中目标对象的位置。
具体的,可以采用一检测神经网络确定目标对象位置区域,该检测神经网 络可以包括一基础神经网络和一输出神经网络。下面将对本申请的示例性实施方式的检测图像中目标对象位置区域的过程进行说明,另外,以服务器执行检测过程为例进行说明,然而,应当理解的是,以例如手机的终端设备为例执行检测过程也属于本申请的构思。
服务器可以将图像的特征图输入基础神经网络进行处理。其中,基础神经网络可以包括多个堆叠的卷积处理模块。每个卷积处理模块的处理过程可以包括:首先,可以对输入图像的特征图进行1×1降维卷积处理以得到第一特征图;随后,可以对第一特征图进行1×1扩展卷积处理以得到第二特征图;另外,可以对第一特征进行深度可分离卷积处理以得到第三特征图;接下来,可以将第二特征图与第三特征图拼接,以得到该卷积处理模块输出的特征图。
其中,深度可分离卷积的具体处理过程可以包括:对第一特征图的各维度分别进行3×3卷积处理,以得到与第一特征图维度相同的中间特征图,随后,可以对中间特征图进行1×1卷积处理,以得到第三特征图。
另外,在对中间特征图进行1×1卷积处理之前,可以对中间特征图依次进行批量归一化处理和线性整流处理。在对中间特征图进行1×1卷积处理之后,还可以对1×1卷积处理后的中间特征图进行批量归一化处理以及线性整流处理,以得到第三特征图。
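To make the depthwise-separable step described above concrete, here is a minimal PyTorch sketch of one such unit: a per-channel 3×3 convolution, batch normalization and ReLU, then a 1×1 pointwise convolution followed by a second batch normalization and ReLU. The channel sizes in the usage comment are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 per-channel (depthwise) conv -> BN -> ReLU -> 1x1 pointwise conv -> BN -> ReLU."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # groups=in_channels applies one 3x3 kernel to each input channel separately
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.bn1(self.depthwise(x)))      # intermediate map, same channel count as input
        return self.relu(self.bn2(self.pointwise(x)))   # "third feature map"

# x = torch.randn(1, 16, 56, 56)
# y = DepthwiseSeparableConv(16, 64)(x)   # -> torch.Size([1, 64, 56, 56])
```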
此外,在将特征图输入多个堆叠的卷积处理模块中预定卷积处理模块之前,可以对特征图进行最大池化处理。其中,预定卷积处理模块与实际检测场景相关,也就是说,检测场景不同,多个堆叠的卷积处理模块中的预定卷积处理模块的位置和数量可能不同,并且预定卷积处理模块可以由开发人员自行配置,本示例性实施方式中对此不做特殊限定。
最大池化处理的作用在于:一方面,可以实现降维的效果,便于后续卷积过程的处理;另一方面,可以确保特征不变性,使得检测过程更加鲁棒。本申请对最大池化处理的过程不做特殊限制,另外,例如可以将最大池化处理的步长设置为2。
应当理解的是,上述术语“第一”、“第二”、“第三”、“中间”仅是为了区分的目的,不应将其理解为是本申请内容的限制。
图4示例性示出了本申请所述的基础神经网络的网络结构的示意图。具体的,该示例性网络结构可以包括卷积层401,最大池化层403和409,卷积处理模块405、407、411、413和415,均值池化层417和softmax层419。应当理解的是,图4所示结构仅是一示例性描述,网络中还可以包括其他卷积层。另外,卷积处理模块、最大池化层的位置和数量也根据实际应用场景的不同会发生变化。
下面将参考图5对卷积处理模块的结构进行示例性说明。
输入的特征图的维度可以为128,也就是说,输入该卷积处理模块的特征图有128个。首先,输入的特征图可以经过第一卷积单元501的处理,生成维度为16的特征图,即,第一特征图。具体的,第一卷积单元501可以执行1×1且维度为16的降维卷积处理,其中,该1×1卷积核对应的值根据实际检测需要可能不同;接下来,一方面,第一特征图可以经过第二卷积单元502的处理,生成维度为64的特征图,即,第二特征图。具体的,第二卷积单元502可以执行1×1且维度为64的扩展卷积处理;另一方面,第一特征图可以经过第三卷积单元503的处理,生成维度为64的特征图,即,第三特征图;随后,可以将第二卷积单元502卷积后生成的第二特征图和第三卷积单元503卷积后生成的第三特征图输入特征图拼接单元504,特征图拼接单元504可以按维度对第二特征图和第三特征图进行拼接,以得到维度为128的特征图,即,该卷积处理模块输出的特征图。
第三卷积单元503还可以包括第一卷积子单元5031和第二卷积子单元5032。具体的,第一卷积子单元5031可以对第一特征图的各维度分别进行3×3卷积处理,以得到与第一特征图维度相同的中间特征图;第二卷积子单元5032可以对中间特征图进行1×1卷积处理,以得到第三特征图。
另外,在第一卷积子单元5031与第二卷积子单元5032之间,第三卷积单元503还可以包括批量归一化单元和线性整流单元。具体的,批量归一化单元可以为BN层(Batch Normalization layer,批量归一化层),用于加快网络学习的速度;线性整流单元可以为ReLU(Rectified Linear Unit,线性整流单元),用于增加网络的稀疏性并提高训练速度。
此外,在第二卷积子单元5032之后,第三卷积单元503还可以包括批量归一化单元和线性整流单元。
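As a rough illustration of the module of Figure 5, the sketch below (PyTorch, using the hypothetical 128→16→64+64 channel sizes of the example above) squeezes the input with a 1×1 convolution, expands it along two branches — a 1×1 expansion and the depthwise-separable path — and concatenates the two results along the channel dimension. It reuses the DepthwiseSeparableConv unit sketched earlier; this is an interpretation of the description, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class ConvProcessingModule(nn.Module):
    """1x1 squeeze -> {1x1 expand, depthwise-separable branch} -> concatenation along channels."""
    def __init__(self, in_channels: int = 128, squeeze: int = 16, expand: int = 64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, squeeze, kernel_size=1)   # first feature map (dim 16)
        self.expand_1x1 = nn.Conv2d(squeeze, expand, kernel_size=1)     # second feature map (dim 64)
        self.expand_dw = DepthwiseSeparableConv(squeeze, expand)        # third feature map (dim 64), from the earlier sketch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.relu(self.squeeze(x))
        # 64 + 64 = 128 channels, matching the module's input dimension in the example
        return torch.cat([self.relu(self.expand_1x1(s)), self.expand_dw(s)], dim=1)
```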
第一卷积子单元5031执行的是按深度逐层卷积(depthwise卷积)的过程。图6示意性示出了3×3的普通卷积与3×3的depthwise卷积的卷积效果对比图。
对于普通卷积，如果卷积核的大小为D_K×D_K，输入的特征图的数量为M，输出的特征图的数量为N，输出的特征图的大小为D_F×D_F，则运算复杂度可以表示为：D_K·D_K·M·N·D_F·D_F。
对于depthwise卷积，运算复杂度可以仅为：D_K·D_K·M·D_F·D_F。再加上N个1×1的卷积，则深度可分离卷积（即，第一卷积子单元5031执行的过程）的总运算复杂度可以表示为：
D_K·D_K·M·D_F·D_F + M·N·D_F·D_F
由此可见，深度可分离卷积相对于普通卷积的运算复杂度所占比例为：
(D_K·D_K·M·D_F·D_F + M·N·D_F·D_F) / (D_K·D_K·M·N·D_F·D_F) = 1/N + 1/(D_K·D_K)
对于3×3的卷积核，深度可分离卷积的运算复杂度仅为普通卷积的1/9至1/8，因此，深度可分离卷积可以有效提高检测速度。
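As a quick sanity check of that ratio, plugging in illustrative sizes (D_K = 3 and, say, N = 64 output channels) gives about 1/64 + 1/9 ≈ 0.13, consistent with the roughly 1/9 to 1/8 figure quoted above; the snippet below only evaluates this arithmetic.

```python
# Illustrative sizes only: 3x3 kernel, 64 input maps, 64 output maps, 56x56 output
D_K, M, N, D_F = 3, 64, 64, 56
standard  = D_K * D_K * M * N * D_F * D_F
separable = D_K * D_K * M * D_F * D_F + M * N * D_F * D_F
print(separable / standard)   # ~0.127, i.e. 1/N + 1/D_K**2
```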
参考图7,本申请的图像处理方法涉及的卷积神经网络除包括基础神经网络外,还可以包括输出神经网络。具体的,图像处理方法还可以包括:将基础神经网络输出的特征图发送至一输出神经网络。其中,输出神经网络用于采用预设候选区域预测的方式根据基础神经网络输出的特征图确定目标对象的位置。
具体的,可以采用候选区域(anchor)预测目标对象的坐标位置,此处,可以将候选区域理解为在特征图上预先定义的位置坐标(reference box),参考图8中虚线部分,这些预先设定的位置坐标可以作为目标对象的初始位置坐标,接下来,可以通过网络学习的方式准确地确定出目标对象的位置。另外,特征图上每个像素对应的候选区域的数量可以由开发人员自行设定,例如,每个像素对应的候选区域的数量为9个。
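One possible way to realize the "predefined reference boxes on the feature map" idea is sketched below: for every feature-map cell, a fixed set of anchors (nine here, from three assumed scales and three assumed aspect ratios) is generated around the cell centre. The scales, ratios, and stride are illustrative assumptions, not values specified in the text.

```python
import itertools
import torch

def make_anchors(feat_h, feat_w, stride=16, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (cx, cy, w, h) in image coordinates."""
    anchors = []
    for i, j in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride          # centre of this feature-map cell
        for s, r in itertools.product(scales, ratios):           # 3 scales x 3 ratios = 9 anchors
            w, h = s * (r ** 0.5), s / (r ** 0.5)                # keeps the anchor area close to s*s
            anchors.append([cx, cy, w, h])
    return torch.tensor(anchors)

# make_anchors(14, 14).shape  # -> torch.Size([1764, 4])
```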
根据本申请的一些实例,本申请的视频图像处理方法还可以包括调整网络参数的过程。具体如下:
首先,可以计算由基础神经网络和输出神经网络构成的卷积神经网络的损 失函数;随后,可以确定使损失函数最小化的卷积神经网络参数;接下来,可以将使损失函数最小化的卷积神经网络参数应用于卷积神经网络,以实现网络权重调整的过程。
在本申请的示例性描述中，计算损失函数的过程可以包括：首先，可以分别计算出分类损失函数L_conf和位置损失函数L_loc。在一实例中，可以通过公式1来计算位置损失函数L_loc：
（公式1，原文以图像形式给出）
其中，H和W分别为特征图的高度和宽度；K为所述候选区域的总数量；I_ijk为检测评价参数，当在位置(i,j)的第k个候选区域与真实区域的IOU（Intersection Over Union，检测评价函数）大于一预定阈值（例如，0.7）时，I_ijk为1，否则为0；δx_ijk、δy_ijk、δw_ijk、δh_ijk分别为所述卷积神经网络输出的相对于所述候选区域的坐标偏移量；δx_ijk^G、δy_ijk^G、δw_ijk^G、δh_ijk^G分别为目标对象真实区域相对于所述候选区域的偏移量。
另外，可以通过公式2来计算分类损失函数L_conf：
（公式2，原文以图像形式给出）
其中，H和W分别为特征图的高度和宽度，K为所述候选区域的总数量，C为目标对象所属类别，I_ijk为检测评价参数，（公式2中以图像形式给出的符号）表征目标对象真实区域的分布，p_c为目标对象所属类别的概率。
另外,可以确定与候选区域匹配的目标对象所在区域的数量N。
接下来，可以将分类损失函数L_conf与位置损失函数L_loc的和除以数量N的结果作为卷积神经网络的损失函数L。具体参见公式3：
L = (L_conf + L_loc) / N    （公式3）
根据另外一些实例，可以在确定出目标对象的位置后，将置信度高于一预定置信度的检测结果确定为目标对象的坐标。这里，通过网络学习后，不仅输出目标对象的位置，还输出该目标对象的位置包含目标对象的概率，即置信度，在置信度高于预定置信度时，将该置信度对应的目标对象的位置确定为目标对象的坐标。本示例性实施方式中对此不做特殊限定。
经过测试，采用上述实现图像处理方法的卷积神经网络，一方面，可以达到较好的目标对象的检测效果；另一方面，该卷积神经网络模型较小（约1.8MB），且检测速度快（在PC上的速度可达到60ms/帧）。由此，可以满足例如人脸、手势、行人、车辆等目标对象检测的需要。
S34.确定下一帧图像中与所述目标对象位置区域对应的目标对象跟踪图像。
在步骤S32中确定出当前帧图像中的目标对象位置区域后,可以利用该区域来从下一帧图像中获取目标对象跟踪图像。具体的,如果以矩形框的形式表示位置区域,则可以将当前帧图像中的目标对象位置区域记为(x,y,w,h),其中,x和y分别表示位置区域的中心点(或任一规定的一点)在当前帧图像中的坐标,w和h分别表示位置区域对应的宽度和高度。然而,容易理解的是,还可以采用除矩形框之外的位置表示方式来表征位置区域,例如,椭圆形位置框、圆形位置框等。
首先,由于在一帧的时间内,目标对象的位移通常较小,因此,可以将当前帧图像的目标对象位置区域放大预定倍数得到目标跟踪区域。具体的,预定倍数可以为1.5倍至2倍,并且可以基于矩形框中心点放大预定倍数,此时,可以将目标跟踪区域记为(x’,y’,w’,h’)。
接下来,可以将下一帧图像中与目标跟踪区域对应的图像确定为目标对象跟踪图像。
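A minimal sketch of this cropping step, assuming (x, y) denotes the centre of the previous-frame box and an enlargement factor of 2.0, could look as follows; the clipping to the frame border is an added safeguard not spelled out in the text.

```python
import numpy as np

def crop_tracking_patch(next_frame: np.ndarray, box, scale: float = 2.0):
    """box = (x, y, w, h), with (x, y) taken as the centre of the previous-frame target region."""
    x, y, w, h = box
    w2, h2 = w * scale, h * scale                                # enlarged tracking region (x', y', w', h')
    H, W = next_frame.shape[:2]
    x1, y1 = max(int(x - w2 / 2), 0), max(int(y - h2 / 2), 0)    # clip to the frame border
    x2, y2 = min(int(x + w2 / 2), W), min(int(y + h2 / 2), H)
    return next_frame[y1:y2, x1:x2], (x1, y1, x2 - x1, y2 - y1)
```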
S36.对目标对象跟踪图像依次进行多组卷积处理以确定下一帧图像中的目标对象位置区域;其中,多组卷积处理中的第一组卷积处理的卷积次数小于其他组卷积处理的卷积次数。
根据本申请的一些实例,可以采用跟踪神经网络对目标对象跟踪图像依次进行多组卷积过程。其中,跟踪神经网络可以包括多个堆叠的卷积块,每个卷积块可以包括卷积层和最大池化层,并且每个卷积块对应执行一组卷积处理。 在这种情况下,多个堆叠的卷积块中第一个卷积块的卷积层数量小于其他卷积块的卷积层数量。
在将步骤S34中确定出的目标对象跟踪图像输入跟踪神经网络之前,可以根据网络训练结构的要求,确定出适于网络输入的图像分辨率(例如,72×72,100×100)。服务器可以判断目标对象跟踪图像的分辨率是否与网络输入要求的分辨率匹配,如果不匹配,则可以对目标对象跟踪图像的分辨率进行调整,以使目标对象跟踪图像与跟踪神经网络适配。
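The resolution check itself can be as simple as the helper below (OpenCV resize, assuming a 100×100 tracker input as in the example above).

```python
import cv2
import numpy as np

def to_tracker_input(patch: np.ndarray, size=(100, 100)) -> np.ndarray:
    """Resize the cropped patch only if it does not already match the tracker's input resolution."""
    return patch if patch.shape[:2] == size[::-1] else cv2.resize(patch, size)
```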
下面将参考图9对本申请的跟踪神经网络进行示例性描述。
在图9所示实例中,跟踪神经网络可以包括第一个卷积块、第二个卷积块、第三个卷积块。应当注意的是,根据实际视频图像跟踪场景的不同,跟踪神经网络还可以包括其他卷积块。第一个卷积块可以包括1个卷积层,第二个卷积块和第三个卷积块均可以包括2个卷积层。
第一个卷积块可以包括卷积层901和最大池化层903。其中卷积层包括8(图中c=8)个大小为7×7(图中k=7)且步长为4(图中s=4)的卷积核,最大池化层903的大小为2×2且步长为4。
第二个卷积块中的卷积层包括由16个大小为3×3且步长为1的卷积核构成的卷积层905以及由24个3×3且步长为1的卷积核构成的卷积层907;第三个卷积块中的卷积层包括由40个大小为3×3且步长为1的卷积核构成的卷积层911以及由60个大小为3×3且步长为1的卷积核构成的卷积层913。此外,最大池化层909和最大池化层915与最大池化层903相同。
基于图9所示卷积的配置,一方面,在起始的第一个卷积块中,将卷积层901的维度设定为8,数值相对较小,有助于整体网络计算速度的提升。另外,通过将卷积层901的卷积核配置成大小为7×7且步长为4,可以在网络初始时提取更多的特征,而不会消耗大量的计算资源;另一方面,通过如上结构和参数设置第二个卷积块和第三个卷积块,在满足跟踪目标对象的同时,模型较小且计算速度快。
此外,应当理解的是,一方面,对于跟踪一些复杂的目标对象,也就是说,目标对象对应的特征较多,可以在跟踪神经网络中配置第四个卷积块、第五个 卷积块等,应当理解的是,新配置的卷积块的结构应当与第二和第三卷积块的结构类似。另一方面,对于跟踪一些简单的目标对象,也就是说,目标对象对应的特征较少,可以适当减小图9所示的卷积神经网络中各卷积层的维度和大小,而结构应与图9所示结构适应。这些均应属于本申请的构思。
除各卷积块之外,仍参考图9,本申请所述的跟踪神经网络还可以包括在第三个卷积块与跟踪神经网络的输出之间依次配置的维度为96的内积层917和维度为128的内积层919。其中,本领域技术人员容易理解的是,此处的内积层具有全连接功能,并且这种两次全连接的配置有助于整体网络计算速度的提升。
本申请的跟踪神经网络具有两个输出分支,即图9中的内积层921和内积层923。
针对由C=2表示的输出分支,其结果是判断目标对象跟踪图像中包含目标对象的置信度,即概率。该置信度的范围为[0,1]。本申请可以将输出的置信度与一预定阈值(例如,0.9)进行比较,如果小于该预定阈值,则可以确定该目标跟踪图像中不存在目标对象,此时,可以在整个下一帧图像中检测目标对象。具体的检测过程在步骤S32中已经详细描述,在此不再赘述。
跟踪神经网络输出置信度的意义在于:可以避免错误跟踪,从而及时调整到正确的目标位置。
针对由C=4表示的输出分支，其结果是目标对象在下一帧图像中的位置区域，可以将其表征为(x_1, y_1, w_1, h_1)。
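Read literally, Figure 9 and the surrounding text suggest a tracker along the lines of the PyTorch sketch below: three convolution blocks with 1, 2 and 2 convolution layers, 2×2 max pooling with stride 4 as stated, inner-product (fully connected) layers of 96 and 128 units, and the two output branches (C=2 confidence, C=4 box). The ReLU activations, the padding, the assumed 100×100 input and the resulting flattened size of 60 are my assumptions, not values given in the text.

```python
import torch
import torch.nn as nn

class TrackingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # block 1: one conv layer (8 kernels, 7x7, stride 4)
            nn.Conv2d(3, 8, kernel_size=7, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=4),
            # block 2: two conv layers (16 and 24 kernels, 3x3, stride 1)
            nn.Conv2d(8, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 24, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=4),
            # block 3: two conv layers (40 and 60 kernels, 3x3, stride 1)
            nn.Conv2d(24, 40, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(40, 60, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=4),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(60, 96), nn.ReLU(inplace=True),   # assumes a 100x100 input -> 60x1x1 features
            nn.Linear(96, 128), nn.ReLU(inplace=True),
        )
        self.conf_head = nn.Linear(128, 2)   # C=2: confidence that the target is present in the patch
        self.box_head = nn.Linear(128, 4)    # C=4: (x1, y1, w1, h1) of the target in the next frame

    def forward(self, x: torch.Tensor):
        f = self.fc(self.features(x))
        return self.conf_head(f), self.box_head(f)

# conf_logits, box = TrackingNet()(torch.randn(1, 3, 100, 100))
```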
另外，本申请还提供了对C=2输出分支进行损失计算以优化跟踪神经网络的方案。首先，可以根据置信度计算第一损失函数，具体的，可以根据公式4计算第一损失函数L_conf'：
（公式4，原文以图像形式给出）
其中，针对I{y_G=j}函数，y_G=j为真时值为1，否则为0；y_G表示类别标定的真实数据，K为输出的C=2输出分支的神经元数量。另外，S_j表示将第j个神经元执行归一化处理，利用公式5可得出：
S_j = e^(α_j) / ∑_{k=1}^{K} e^(α_k)    （公式5）
其中，α_j表示内积向量中第j个的值。
接下来,可以确定使第一损失函数最小化的跟踪神经网络参数;随后,可以根据使第一损失函数最小化的跟踪神经网络参数对跟踪神经网络进行调整。
此外，本申请还提供了对C=4输出分支进行损失计算以优化跟踪神经网络的方案。首先，可以根据下一帧图像的目标对象位置区域计算第二损失函数，具体的，可以利用公式6来计算第二损失函数L_reg：
（公式6，原文以图像形式给出）
其中，z_i为目标矩形框的四个分量，分别为x、y、w、h坐标（即p=4）；公式6中以图像形式给出的符号表示网络模型的预测输出，z_i表示目标的标定的真实坐标。
接下来,可以确定使第二损失函数最小化的跟踪神经网络参数;随后,可以根据使第二损失函数最小化的跟踪神经网络参数对跟踪神经网络进行调整。
应当注意的是,综合第一损失函数和第二损失函数对跟踪神经网络参数进行调整的方案也应当属于本申请的构思。
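A rough sketch of how the two branch losses could jointly drive one parameter update is given below (a standard PyTorch training step; the SGD optimizer, equal loss weighting and the use of mean-squared error for the box branch are assumptions, and TrackingNet refers to the earlier sketch).

```python
import torch
import torch.nn as nn

tracker = TrackingNet()                                        # from the sketch above
optimizer = torch.optim.SGD(tracker.parameters(), lr=1e-3)
ce_loss, box_loss = nn.CrossEntropyLoss(), nn.MSELoss()

def train_step(patches, labels, gt_boxes):
    """patches: N x 3 x 100 x 100, labels: N (0 = no target, 1 = target), gt_boxes: N x 4 (x, y, w, h)."""
    conf_logits, pred_boxes = tracker(patches)
    loss = ce_loss(conf_logits, labels) + box_loss(pred_boxes, gt_boxes)   # first + second loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```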
经测试,上述跟踪神经网络的模型大小小于1MB,使得这种模型可以应用于手机端,并具有较好的实时跟踪性能。
在确定出下一帧图像中的目标对象位置区域之后,本申请的示例性实施方式还可以包括:采用深度残差网络对下一帧图像的目标对象位置区域中的目标对象进行识别,以确定目标对象的类别。
具体的,深度残差网络的基本结构如图10所示,本申请所采用的深度残差网络的基本结构与现有的残差网络的基本结构类似,在此不做特殊说明。不同的是,本申请采用的是18层深度残差网络,相比于现有技术中一般采用10个卷积核的方案,本申请采用3个卷积核,由此,在略牺牲识别精确度的同时,大大提高了识别速度并减小了模型的大小。然而,经测试,这种略牺牲识别精确度的结果并不会对识别出目标对象的类别造成任何影响,而整个网络的性能大大得到了提升。
以手势识别为例,参考图11,采用上述识别方法,可以实现图11中手势1101至手势1111的精确识别。然而,不限于此,手势还可以包含其他类别。
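Since the text only specifies an 18-layer residual network for classifying the tracked region, one hedged way to sketch it is to take a standard ResNet-18 trunk from torchvision and set the output layer to a hypothetical number of gesture classes:

```python
from torchvision import models

NUM_CLASSES = 11                                   # hypothetical gesture-class count, for illustration
recognizer = models.resnet18(num_classes=NUM_CLASSES)
# logits = recognizer(crops)                       # crops: N x 3 x 224 x 224 patches of the tracked region
# gesture = logits.argmax(dim=1)
```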
下面将参考图12,对本申请的视频图像处理方法的整个过程进行说明。
在步骤S120中,可以对视频图像中的目标对象进行检测,以确定目标对象位置区域,具体检测过程详见步骤S32;在步骤S122中,可以判断是否检测到目标对象,如果检测到,进行步骤S124,如果未检测到,则返回步骤S120,以对视频图像的下一帧进行检测;在步骤S124中,可以对目标对象进行实时跟踪,具体的可以采用上述步骤S36中的跟踪神经网络实现实时跟踪;在步骤S126中,可以判断是否跟踪到目标对象,如果跟踪到则可以进行步骤S128的目标对象识别过程;如果未跟踪到,则返回步骤S120,以对当前进行跟踪的整体图像进行目标对象的检测。另外,视频中还可以预先标有目标对象的位置信息,在这种情况下,可以直接对目标对象进行实时跟踪。
在图12所描述的实例中,当跟踪到目标对象时,执行识别的处理过程。然而,考虑到负载以及手机端发热的问题,虽然每一帧均实现目标对象的跟踪,然而,可以每隔预定帧执行识别的过程,例如,每5帧执行识别的过程。
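Putting the steps of Figure 12 together, the per-frame control flow can be sketched roughly as below: detect until a target appears, track it frame by frame, fall back to whole-frame detection when the tracker's confidence drops, and run recognition only every few frames (five here, as in the example above). The detect, track and recognize callables stand in for the networks sketched earlier; the exact interfaces are assumptions.

```python
def process_video(frames, detect, track, recognize, conf_threshold=0.9, recognize_every=5):
    box = None
    for idx, frame in enumerate(frames):
        if box is None:
            box = detect(frame)                 # returns a box or None
            continue
        confidence, box = track(frame, box)     # track inside the enlarged previous-frame region
        if confidence < conf_threshold:
            box = detect(frame)                 # tracking lost: re-detect on the whole frame
            continue
        if box is not None and idx % recognize_every == 0:
            label = recognize(frame, box)       # e.g. gesture class of the tracked region
            yield idx, box, label
```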
综上所述,一方面,本申请的视频图像处理方法所涉及的模型小,处理速度快,可以直接应用于例如手机的终端设备上;另一方面,本申请的视频图像处理方法可以应用于人机交互、智能监控、安全检查、数据娱乐、数码相机等各个领域,以较好地性能实现例如手势识别、人脸识别、车辆检测等应用目的。例如,在直播、视频聊天等场景下,可以在对视频中手势进行跟踪并识别后,在手上添加虚拟手链等手势挂件或色彩效果等特效。
应当注意,尽管在附图中以特定顺序描述了本申请中方法的各个步骤,但是,这并非要求或者暗示必须按照该特定顺序来执行这些步骤,或是必须执行全部所示的步骤才能实现期望的结果。附加的或备选的,可以省略某些步骤,将多个步骤合并为一个步骤执行,以及/或者将一个步骤分解为多个步骤执行等。
进一步的,本示例实施方式中还提供了一种视频图像处理装置。
图13示意性示出了本申请的示例性实施方式的视频图像处理装置的方框图。参考图13,根据本申请的示例性实施方式的视频图像处理装置13可以包 括位置确定模块131、跟踪图像获取模块133和下一位置确定模块135。
具体的，位置确定模块131可以用于确定视频中当前帧图像中的目标对象位置区域；跟踪图像获取模块133可以用于确定下一帧图像中与所述目标对象位置区域对应的目标对象跟踪图像；下一位置确定模块135可以用于对目标对象跟踪图像依次进行多组卷积处理以确定下一帧图像中的目标对象位置区域；其中，多组卷积处理中的第一组卷积处理的卷积次数小于其他组卷积处理的卷积次数。
在本申请的视频图像处理装置中,一方面,对目标对象跟踪图像进行卷积处理,而非针对整个下一帧图像,大大降低了计算量,提高了目标对象跟踪的效率;另一方面,多组卷积处理中的第一组卷积处理的卷积次数小于其他组卷积处理的卷积次数,这样的网络结构模型较小,处理速度得到了提高。
根据本申请的示例性实例,参考图14,视频图像处理装置14相比于视频图像处理装置13,除包括位置确定模块131、跟踪图像获取模块133和下一位置确定模块135外,还可以包括目标对象识别模块141。
具体的,目标对象识别模块141可以用于采用深度残差网络对下一帧图像的目标对象位置区域中的目标对象进行识别,以确定所述目标对象的类别。
根据本申请的示例性实例,参考图15,视频图像处理装置15相比于视频图像处理装置13,除包括位置确定模块131、跟踪图像获取模块133和下一位置确定模块135外,还可以包括置信度确定模块151、置信度比较模块153和下一图像检测模块155。
具体的,置信度确定模块151可以用于确定下一帧图像中的目标对象位置区域包含目标对象的置信度;置信度比较模块153可以用于将所述置信度与一预定阈值进行比较;下一图像检测模块155可以用于如果所述置信度小于所述预定阈值,则在所述下一帧图像中检测目标对象。
根据本申请的示例性实例,参考图16,跟踪图像获取模块133可以包括区域放大单元1601和跟踪图像确定单元1603。
具体的,区域放大单元1601可以用于将所述当前帧图像的目标对象位置区域放大预定倍数得到目标对象跟踪区域;跟踪图像确定单元1603可以用于将所 述下一帧图像中与所述目标对象跟踪区域对应的图像确定为目标对象跟踪图像。
根据本申请的示例性实例,下一位置确定模块135还可以用于采用跟踪神经网络对目标对象跟踪图像依次进行多组卷积处理;其中,跟踪神经网络包括多个堆叠的卷积块,每个卷积块包括卷积层和最大池化层并且每个卷积块执行一组卷积处理。
根据本申请的示例性实例,多个堆叠的卷积块中第一个卷积块包括1个卷积层,除第一个卷积块外的其他卷积块均包括2个卷积层;其中,第一个卷积块中的卷积层包括8个大小为7×7且步长为4的卷积核;第二个卷积块中的卷积层包括16个大小为3×3且步长为1的卷积核以及24个3×3且步长为1的卷积核;第三个卷积块中的卷积层包括40个大小为3×3且步长为1的卷积核以及60个大小为3×3且步长为1的卷积核。
根据本申请的示例性实例,跟踪神经网络还包括在第三个卷积块与跟踪神经网络的输出之间依次配置的维度为96的内积层和维度为128的内积层。
根据本申请的示例性实例,针对跟踪神经网络确定目标对象跟踪图像中包含目标对象的置信度,参考图17,视频图像处理装置17相比于视频图像处理装置15,除包括位置确定模块131、跟踪图像获取模块133、下一位置确定模块135、置信度确定模块151、置信度比较模块153和下一图像检测模块155外,还可以包括第一损失函数计算模块171、第一网络参数确定模块173和第一网络调整模块175。
具体的,第一损失函数计算模块171可以用于根据所述置信度计算第一损失函数;第一网络参数确定模块173可以用于确定使第一损失函数最小化的跟踪神经网络参数;第一网络调整模块175可以用于根据使第一损失函数最小化的跟踪神经网络参数对所述跟踪神经网络进行调整。
根据本申请的示例性实例,针对跟踪神经网络确定出下一帧图像的目标对象位置区域,参考图18,视频图像处理装置18相比于视频图像处理装置13,除包括位置确定模块131、跟踪图像获取模块133和下一位置确定模块135外,还可以包括第二损失函数计算模块181、第二网络参数确定模块183和第二网络调整模块185。
第二损失函数计算模块181可以用于根据所述下一帧图像的目标对象位置区域计算第二损失函数;第二网络参数确定模块183可以用于确定使第二损失函数最小化的跟踪神经网络参数;第二网络调整模块185可以用于根据使第二损失函数最小化的跟踪神经网络参数对所述跟踪神经网络进行调整。
应当理解的是,第二损失函数计算模块181、第二网络参数确定模块183和第二网络调整模块185还可以包含于视频图像处理装置17中,以结合二者的损失函数计算结果综合确定调整的网络参数。
根据本申请的示例性实例,参考图19,视频图像处理装置19相比于视频图像处理装置13,除包括位置确定模块131、跟踪图像获取模块133和下一位置确定模块135外,还可以包括分辨率调整模块191。
具体的,分辨率调整模块191可以用于在将目标对象跟踪图像输入跟踪神经网络之前,对目标对象跟踪图像的分辨率进行调整,以使目标对象跟踪图像与跟踪神经网络适配。
根据本申请的示例性实例,参考图20,位置确定模块131可以包括位置确定单元2001。
具体的,位置确定单元2001可以用于将当前帧图像的特征图输入基础神经网络进行处理以确定当前帧图像中的目标对象位置区域;其中,基础神经网络包括多个堆叠的卷积处理模块,每一卷积处理模块对输入的特征图进行处理包括:对输入的特征图进行1×1降维卷积处理以得到第一特征图;对第一特征图进行1×1扩展卷积处理以得到第二特征图;对第一特征图进行深度可分离卷积处理以得到第三特征图;将第二特征图与第三特征图拼接,以得到卷积处理模块输出的特征图。
根据本申请的示例性实例,对第一特征图进行深度可分离卷积处理以得到第三特征图包括:对第一特征图的各维度分别进行3×3卷积处理,以得到与第一特征图维度相同的中间特征图;对中间特征图进行1×1卷积处理,以得到第三特征图。
由于本申请实施方式的视频图像处理装置的各个功能模块与上述方法实施方式中相同，因此在此不再赘述。
此外,上述附图仅是根据本申请示例性实例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其他实例。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实例仅被视为示例性的,本申请的真正范围和精神由权利要求指出。
应当理解的是，本申请并不局限于上面已经描述并在附图中示出的精确结构，并且可以在不脱离其范围的前提下进行各种修改和改变。本申请的范围仅由所附的权利要求来限定。

Claims (16)

  1. 一种视频图像处理方法,由电子设备执行,包括:
    确定视频中当前帧图像中的目标对象位置区域;
    确定下一帧图像中与所述目标对象位置区域对应的目标对象跟踪图像;
    对所述目标对象跟踪图像依次进行多组卷积处理以确定所述下一帧图像中的目标对象位置区域;其中,所述多组卷积处理中的第一组卷积处理的卷积次数小于其他组卷积处理的卷积次数。
  2. 根据权利要求1所述的视频图像处理方法,所述视频图像处理方法还包括:
    采用深度残差网络对所述下一帧图像的目标对象位置区域中的目标对象进行识别,以确定所述目标对象的类别。
  3. 根据权利要求1所述的视频图像处理方法,所述视频图像处理方法还包括:
    确定所述下一帧图像中的目标对象位置区域包含目标对象的置信度;
    将所述置信度与一预定阈值进行比较;
    如果所述置信度小于所述预定阈值,则在所述下一帧图像对应的下一帧图像中检测目标对象。
  4. 根据权利要求3所述的视频图像处理方法,所述视频图像处理方法还包括:
    如果所述置信度大于所述预定阈值,则在所述下一帧图像中检测目标对象。
  5. 根据权利要求1所述的视频图像处理方法,其中,确定下一帧图像中与所述目标对象位置区域对应的目标对象跟踪图像包括:
    将所述当前帧图像的目标对象位置区域放大预定倍数得到目标对象跟踪区域;
    将下一帧图像中与所述目标对象跟踪区域对应的图像确定为目标对象跟踪图像。
  6. 根据权利要求3或4所述的视频图像处理方法,其中,对所述目标对象 跟踪图像依次进行多组卷积处理包括:
    采用跟踪神经网络对所述目标对象跟踪图像依次进行多组卷积处理;
    其中,所述跟踪神经网络包括多个堆叠的卷积块,每个卷积块包括卷积层和最大池化层并且每个卷积块执行一组卷积处理。
  7. 根据权利要求6所述的视频图像处理方法,其中,所述多个堆叠的卷积块中第一个卷积块包括1个卷积层,除所述第一个卷积块外的其他卷积块均包括2个卷积层;
    其中,所述第一个卷积块中的卷积层包括8个大小为7×7且步长为4的卷积核;
    第二个卷积块中的卷积层包括16个大小为3×3且步长为1的卷积核以及24个3×3且步长为1的卷积核;
    第三个卷积块中的卷积层包括40个大小为3×3且步长为1的卷积核以及60个大小为3×3且步长为1的卷积核。
  8. 根据权利要求7所述的视频图像处理方法,其中,所述跟踪神经网络还包括在第三个卷积块与所述跟踪神经网络的输出之间依次配置的维度为96的内积层和维度为128的内积层。
  9. 根据权利要求6所述的视频图像处理方法,其中,针对所述确定所述目标对象跟踪图像中包含目标对象的置信度,所述视频图像处理方法还包括:
    根据所述置信度计算第一损失函数;
    确定使第一损失函数最小化的跟踪神经网络参数;
    根据使第一损失函数最小化的跟踪神经网络参数对所述跟踪神经网络进行调整。
  10. 根据权利要求6或9所述的视频图像处理方法,其中,针对所述跟踪神经网络确定出下一帧图像的目标对象位置区域,所述视频图像处理方法还包括:
    根据所述下一帧图像的目标对象位置区域计算第二损失函数;
    确定使第二损失函数最小化的跟踪神经网络参数;
    根据使第二损失函数最小化的跟踪神经网络参数对所述跟踪神经网络进行 调整。
  11. 根据权利要求6所述的视频图像处理方法,其中,在采用跟踪神经网络对所述目标对象跟踪图像依次进行多组卷积处理之前,所述视频图像处理方法还包括:
    对所述目标对象跟踪图像的分辨率进行调整,以使所述目标对象跟踪图像与所述跟踪神经网络适配。
  12. 根据权利要求1所述的视频图像处理方法,其中,确定视频中当前帧图像中的目标对象位置区域包括:
    将所述当前帧图像的特征图输入基础神经网络进行处理以确定所述当前帧图像中的目标对象位置区域;其中,所述基础神经网络包括多个堆叠的卷积处理模块,每一所述卷积处理模块对输入的特征图进行处理包括:
    对输入的特征图进行1×1降维卷积处理以得到第一特征图;
    对所述第一特征图进行1×1扩展卷积处理以得到第二特征图;
    对所述第一特征图进行深度可分离卷积处理以得到第三特征图;
    将所述第二特征图与所述第三特征图拼接,以得到所述卷积处理模块输出的特征图。
  13. 根据权利要求12所述的视频图像处理方法,其中,对所述第一特征图进行深度可分离卷积处理以得到第三特征图包括:
    对所述第一特征图的各维度分别进行3×3卷积处理,以得到与所述第一特征图维度相同的中间特征图;
    对所述中间特征图进行1×1卷积处理,以得到第三特征图。
  14. 一种视频图像处理装置,包括:
    位置确定模块,用于确定视频中当前帧图像中的目标对象位置区域;
    跟踪图像确定模块,用于确定下一帧图像中与所述目标对象位置区域对应的目标对象跟踪图像;
    下一位置确定模块,用于对所述目标对象跟踪图像依次进行多组卷积处理以确定所述下一帧图像中的目标对象位置区域;其中,所述多组卷积处理中的第一组卷积处理的卷积次数小于其他组卷积处理的卷积次数。
  15. 一种存储介质,存有处理器可执行指令,所述指令由一个或一个以上处理器执行时,实现如权利要求1至13中任一项所述的视频图像处理方法。
  16. 一种电子设备,包括:
    一个或多个处理器;
    存储装置,用于存储一个或多个程序,当所述一个或多个程序被所述一个或多个处理器执行时,使得所述一个或多个处理器实现如权利要求1至13中任一项所述的视频图像处理方法。
PCT/CN2019/085604 2018-06-20 2019-05-06 视频图像处理方法及装置、计算机可读介质和电子设备 WO2019242416A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/922,196 US11436739B2 (en) 2018-06-20 2020-07-07 Method, apparatus, and storage medium for processing video image

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810639495.0 2018-06-20
CN201810639495.0A CN108898086B (zh) 2018-06-20 2018-06-20 视频图像处理方法及装置、计算机可读介质和电子设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/922,196 Continuation US11436739B2 (en) 2018-06-20 2020-07-07 Method, apparatus, and storage medium for processing video image

Publications (1)

Publication Number Publication Date
WO2019242416A1 true WO2019242416A1 (zh) 2019-12-26

Family

ID=64345258

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085604 WO2019242416A1 (zh) 2018-06-20 2019-05-06 视频图像处理方法及装置、计算机可读介质和电子设备

Country Status (3)

Country Link
US (1) US11436739B2 (zh)
CN (1) CN108898086B (zh)
WO (1) WO2019242416A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598902A (zh) * 2020-05-20 2020-08-28 北京字节跳动网络技术有限公司 图像分割方法、装置、电子设备及计算机可读介质
CN113129338A (zh) * 2021-04-21 2021-07-16 平安国际智慧城市科技股份有限公司 基于多目标跟踪算法的图像处理方法、装置、设备及介质
CN113158867A (zh) * 2021-04-15 2021-07-23 微马科技有限公司 人脸特征的确定方法、装置和计算机可读存储介质

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898086B (zh) 2018-06-20 2023-05-26 腾讯科技(深圳)有限公司 视频图像处理方法及装置、计算机可读介质和电子设备
CN111488776B (zh) * 2019-01-25 2023-08-08 北京地平线机器人技术研发有限公司 对象检测方法、对象检测装置和电子设备
CN111783497A (zh) * 2019-04-03 2020-10-16 北京京东尚科信息技术有限公司 视频中目标的特征确定方法、装置和计算机可读存储介质
CN110473137B (zh) * 2019-04-24 2021-09-14 华为技术有限公司 图像处理方法和装置
CN110176024B (zh) * 2019-05-21 2023-06-02 腾讯科技(深圳)有限公司 在视频中对目标进行检测的方法、装置、设备和存储介质
CN112417932B (zh) * 2019-08-23 2023-04-07 中移雄安信息通信科技有限公司 视频中的目标对象的识别方法、装置及设备
CN110598649A (zh) * 2019-09-17 2019-12-20 中控智慧科技股份有限公司 车辆识别方法、装置及电子设备和存储介质
CN110648327B (zh) * 2019-09-29 2022-06-28 无锡祥生医疗科技股份有限公司 基于人工智能的超声影像视频自动追踪方法和设备
CN111027376A (zh) * 2019-10-28 2020-04-17 中国科学院上海微系统与信息技术研究所 一种确定事件图谱的方法、装置、电子设备及存储介质
CN111104920B (zh) * 2019-12-27 2023-12-01 深圳市商汤科技有限公司 视频处理方法及装置、电子设备和存储介质
CN111797728B (zh) * 2020-06-19 2024-06-14 浙江大华技术股份有限公司 一种运动物体的检测方法、装置、计算设备及存储介质
CN113919405B (zh) * 2020-07-07 2024-01-19 华为技术有限公司 数据处理方法、装置与相关设备
CN112529934B (zh) * 2020-12-02 2023-12-19 北京航空航天大学杭州创新研究院 多目标追踪方法、装置、电子设备和存储介质
CN112712124B (zh) * 2020-12-31 2021-12-10 山东奥邦交通设施工程有限公司 一种基于深度学习的多模块协同物体识别系统及方法
CN112802338B (zh) * 2020-12-31 2022-07-12 山东奥邦交通设施工程有限公司 一种基于深度学习的高速公路实时预警方法及系统
CN112863100B (zh) * 2020-12-31 2022-09-06 山东奥邦交通设施工程有限公司 一种智能施工安全监测系统及方法
CN112861780A (zh) * 2021-03-05 2021-05-28 上海有个机器人有限公司 一种行人重识别方法、装置、介质和移动机器人
CN113160244B (zh) * 2021-03-24 2024-03-15 北京达佳互联信息技术有限公司 视频处理方法、装置、电子设备及存储介质
CN112861830B (zh) * 2021-04-13 2023-08-25 北京百度网讯科技有限公司 特征提取方法、装置、设备、存储介质以及程序产品
CN113392743B (zh) * 2021-06-04 2023-04-07 北京格灵深瞳信息技术股份有限公司 异常动作检测方法、装置、电子设备和计算机存储介质
CN113706614B (zh) * 2021-08-27 2024-06-21 重庆赛迪奇智人工智能科技有限公司 一种小目标检测方法、装置、存储介质和电子设备
CN114581796B (zh) * 2022-01-19 2024-04-02 上海土蜂科技有限公司 目标物跟踪系统、方法及其计算机装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110229052A1 (en) * 2010-03-22 2011-09-22 Sony Corporation Blur function modeling for depth of field rendering
CN106296728A (zh) * 2016-07-27 2017-01-04 昆明理工大学 一种基于全卷积网络的非限制场景中运动目标快速分割方法
CN106875415A (zh) * 2016-12-29 2017-06-20 北京理工雷科电子信息技术有限公司 一种动态背景中弱小动目标的连续稳定跟踪方法
CN107871105A (zh) * 2016-09-26 2018-04-03 北京眼神科技有限公司 一种人脸认证方法和装置
CN108898086A (zh) * 2018-06-20 2018-11-27 腾讯科技(深圳)有限公司 视频图像处理方法及装置、计算机可读介质和电子设备
CN108960090A (zh) * 2018-06-20 2018-12-07 腾讯科技(深圳)有限公司 视频图像处理方法及装置、计算机可读介质和电子设备

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6102680B2 (ja) * 2013-10-29 2017-03-29 ソニー株式会社 符号化装置、復号装置、符号化データ、符号化方法、復号方法およびプログラム
JP6655878B2 (ja) * 2015-03-02 2020-03-04 キヤノン株式会社 画像認識方法及び装置、プログラム
US20160259980A1 (en) * 2015-03-03 2016-09-08 Umm Al-Qura University Systems and methodologies for performing intelligent perception based real-time counting
WO2017132830A1 (en) * 2016-02-02 2017-08-10 Xiaogang Wang Methods and systems for cnn network adaption and object online tracking
US9760806B1 (en) * 2016-05-11 2017-09-12 TCL Research America Inc. Method and system for vision-centric deep-learning-based road situation analysis
US20180129934A1 (en) * 2016-11-07 2018-05-10 Qualcomm Incorporated Enhanced siamese trackers
US10997421B2 (en) * 2017-03-30 2021-05-04 Hrl Laboratories, Llc Neuromorphic system for real-time visual activity recognition
US20180307912A1 (en) * 2017-04-20 2018-10-25 David Lee Selinger United states utility patent application system and method for monitoring virtual perimeter breaches
CN107492115B (zh) * 2017-08-30 2021-01-01 北京小米移动软件有限公司 目标对象的检测方法及装置
TWI651662B (zh) * 2017-11-23 2019-02-21 財團法人資訊工業策進會 影像標註方法、電子裝置及非暫態電腦可讀取儲存媒體
CN110096933B (zh) * 2018-01-30 2023-07-18 华为技术有限公司 目标检测的方法、装置及系统
US11205274B2 (en) * 2018-04-03 2021-12-21 Altumview Systems Inc. High-performance visual object tracking for embedded vision systems
US10671855B2 (en) * 2018-04-10 2020-06-02 Adobe Inc. Video object segmentation by reference-guided mask propagation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110229052A1 (en) * 2010-03-22 2011-09-22 Sony Corporation Blur function modeling for depth of field rendering
CN106296728A (zh) * 2016-07-27 2017-01-04 昆明理工大学 一种基于全卷积网络的非限制场景中运动目标快速分割方法
CN107871105A (zh) * 2016-09-26 2018-04-03 北京眼神科技有限公司 一种人脸认证方法和装置
CN106875415A (zh) * 2016-12-29 2017-06-20 北京理工雷科电子信息技术有限公司 一种动态背景中弱小动目标的连续稳定跟踪方法
CN108898086A (zh) * 2018-06-20 2018-11-27 腾讯科技(深圳)有限公司 视频图像处理方法及装置、计算机可读介质和电子设备
CN108960090A (zh) * 2018-06-20 2018-12-07 腾讯科技(深圳)有限公司 视频图像处理方法及装置、计算机可读介质和电子设备

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111598902A (zh) * 2020-05-20 2020-08-28 北京字节跳动网络技术有限公司 图像分割方法、装置、电子设备及计算机可读介质
CN111598902B (zh) * 2020-05-20 2023-05-30 抖音视界有限公司 图像分割方法、装置、电子设备及计算机可读介质
CN113158867A (zh) * 2021-04-15 2021-07-23 微马科技有限公司 人脸特征的确定方法、装置和计算机可读存储介质
CN113129338A (zh) * 2021-04-21 2021-07-16 平安国际智慧城市科技股份有限公司 基于多目标跟踪算法的图像处理方法、装置、设备及介质
CN113129338B (zh) * 2021-04-21 2024-01-26 平安国际智慧城市科技股份有限公司 基于多目标跟踪算法的图像处理方法、装置、设备及介质

Also Published As

Publication number Publication date
US11436739B2 (en) 2022-09-06
US20200334830A1 (en) 2020-10-22
CN108898086A (zh) 2018-11-27
CN108898086B (zh) 2023-05-26

Similar Documents

Publication Publication Date Title
WO2019242416A1 (zh) 视频图像处理方法及装置、计算机可读介质和电子设备
CN108846440B (zh) 图像处理方法及装置、计算机可读介质和电子设备
US11734851B2 (en) Face key point detection method and apparatus, storage medium, and electronic device
CN108960090B (zh) 视频图像处理方法及装置、计算机可读介质和电子设备
CN111476309B (zh) 图像处理方法、模型训练方法、装置、设备及可读介质
US20190108447A1 (en) Multifunction perceptrons in machine learning environments
WO2020078119A1 (zh) 模拟用户穿戴服装饰品的方法、装置和系统
CN108229419B (zh) 用于聚类图像的方法和装置
CN112200062B (zh) 一种基于神经网络的目标检测方法、装置、机器可读介质及设备
US20210342643A1 (en) Method, apparatus, and electronic device for training place recognition model
CN112052186B (zh) 目标检测方法、装置、设备以及存储介质
WO2020244075A1 (zh) 手语识别方法、装置、计算机设备及存储介质
WO2022041830A1 (zh) 行人重识别方法和装置
CN111967467A (zh) 图像目标检测方法、装置、电子设备和计算机可读介质
CN111950570B (zh) 目标图像提取方法、神经网络训练方法及装置
CN114677565B (zh) 特征提取网络的训练方法和图像处理方法、装置
CN112614110B (zh) 评估图像质量的方法、装置及终端设备
WO2023124040A1 (zh) 一种人脸识别方法及装置
WO2021164328A1 (zh) 图像生成方法、设备及存储介质
CN114627244A (zh) 三维重建方法及装置、电子设备、计算机可读介质
WO2020155984A1 (zh) 人脸表情图像处理方法、装置和电子设备
US20230036366A1 (en) Image attribute classification method, apparatus, electronic device, medium and program product
WO2024125267A1 (zh) 图像处理方法、装置、计算机可读存储介质、电子设备及计算机程序产品
CN116129228B (zh) 图像匹配模型的训练方法、图像匹配方法及其装置
WO2023061195A1 (zh) 图像获取模型的训练方法、图像检测方法、装置及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19823201

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19823201

Country of ref document: EP

Kind code of ref document: A1