WO2018153323A1 - Method, apparatus and electronic device for detecting objects in a video - Google Patents
Method, apparatus and electronic device for detecting objects in a video
- Publication number
- WO2018153323A1 (PCT/CN2018/076708)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image frame
- feature
- target object
- location area
- detected
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
Definitions
- the embodiments of the present application relate to the field of object detection in video, and in particular to a method, an apparatus, and an electronic device for detecting an object in a video.
- detection of objects in video extends still-image object detection to the video domain: one or more objects, of the same or different categories, must be detected in every frame of the video.
- the embodiment of the present application proposes a technical solution for detecting an object in a video.
- a method for detecting an object in a video, comprising: determining at least one image frame in the video to be detected as a detection image frame; acquiring a first location area corresponding to at least one target object included in the detection image frame; extracting, respectively, a first feature of each first location area in each detection image frame and a second feature of the same first location area in at least one subsequent image frame temporally consecutive to that detection image frame; predicting, according to the extracted first features and second features, motion information of the at least one target object in the at least one subsequent image frame; and determining a location area of the at least one target object in the at least one subsequent image frame according to at least the first location area of the at least one target object in the at least one detection image frame and the prediction result of the motion information in the at least one subsequent image frame.
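To make this flow concrete, here is a minimal Python sketch under stated assumptions: `detect_regions`, `feature_net` and `motion_net` are hypothetical callables standing in for the region proposer, the feature extractor and the motion predictor, and the `(cx, cy, w, h)` box and `(dx, dy, dw, dh)` motion encodings are illustrative choices, not fixed by the disclosure.

```python
# Minimal sketch of the claimed flow. Every helper passed in here is a
# hypothetical placeholder, not an API defined by this disclosure.
def apply_motion(box, motion):
    # box = (cx, cy, w, h); motion = (dx, dy, dw, dh), the relative change
    # information described later in this disclosure
    cx, cy, w, h = box
    dx, dy, dw, dh = motion
    return (cx + dx, cy + dy, w + dw, h + dh)

def detect_objects_in_video(frames, detect_regions, feature_net, motion_net):
    detection_frame = frames[0]                     # one option: first frame as detection image frame
    first_areas = detect_regions(detection_frame)   # first location areas of the target objects
    results = {0: first_areas}
    for t, frame in enumerate(frames[1:], start=1):
        areas = []
        for box in first_areas:
            f1 = feature_net(detection_frame, box)   # first feature, extracted in the first location area
            f2 = feature_net(frame, box)             # second feature, same area in the subsequent frame
            motion = motion_net(f1, f2)              # predicted motion information
            areas.append(apply_motion(box, motion))  # location area in the subsequent frame
        results[t] = areas
    return results
```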
- determining that at least one image frame in the video to be detected is a detected image frame comprises: using the first image frame of the video to be detected as the detected image frame.
- the determining that at least one image frame in the video to be detected is a detection image frame comprises: using any key frame of the video to be detected as the detection image frame.
- the determining that at least one image frame in the video to be detected is a detection image frame comprises: using at least one image frame in which the location area of the at least one target object is known in the video to be detected as the detection image frame.
- the video to be detected includes a plurality of temporally consecutive video sub-segments, and at least two temporally adjacent video sub-segments comprise at least one common image frame; determining at least one image frame in the video to be detected as a detection image frame includes: using the at least one common image frame as the detection image frame.
- each of the video sub-segments includes m temporally consecutive image frames; and determining at least one image frame in the video to be detected as a detection image frame includes: using the m-1 temporally preceding image frames as the detection image frames (see the sketch below).
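A minimal sketch of this sub-segment layout, assuming exactly one shared frame between adjacent sub-segments (the patent only requires at least one common image frame):

```python
def split_into_subsegments(frames, m):
    """Split a video into temporally consecutive sub-segments of m frames,
    where temporally adjacent sub-segments share one common frame (the last
    frame of one is the first frame of the next). One shared frame is an
    assumption; the patent requires 'at least one common image frame'."""
    step = m - 1
    return [frames[i:i + m] for i in range(0, len(frames) - step, step)]

# 9 frames, m = 5 -> [[0,1,2,3,4], [4,5,6,7,8]]; frame 4 is the shared frame,
# and in each sub-segment the first m-1 frames serve as detection image frames.
print(split_into_subsegments(list(range(9)), 5))
```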
- the acquiring the first location area corresponding to the at least one target object included in the detection image frame includes: marking, in the detection image frame, a first location area corresponding to each of the target objects.
- the acquiring the first location area corresponding to the at least one target object included in the detection image frame includes: determining the first location area according to the already-known location area of the at least one target object in the detection image frame.
- the acquiring the first location area corresponding to the at least one target object included in the detection image frame includes: for any two temporally adjacent video sub-segments, determining the first location area of the detection image frame in the temporally later video sub-segment according to the location area of the at least one target object in the at least one common image frame of the temporally earlier video sub-segment.
- the acquiring the first location area corresponding to the at least one target object included in the detection image frame includes: determining the first location area according to the circumscribed rectangular area or circumscribed contour area of the position of the at least one target object in the detection image frame.
- the predicting the motion information of the at least one target object in the at least one subsequent image frame according to the extracted first feature and the second feature includes: predicting the motion information of the at least one target object in any subsequent image frame according to the first feature of the at least one target object in any detection image frame and its second feature in that subsequent image frame.
- the predicting the motion information of the at least one target object in the at least one subsequent image frame according to the extracted first feature and the second feature includes: for each video sub-segment, predicting the motion information of the at least one target object in the temporally last (m-th) image frame according to the first features of the m-1 temporally preceding image frames and the first preset weights corresponding to those first features, together with the second feature of the m-th image frame and the second preset weight corresponding to that second feature, where m is an integer and m > 1.
- the weight matrix of the trained first neural network includes the first preset weight and the second preset weight.
- the pre-trained first neural network is obtained by the following training steps: dividing the weight matrix of a pre-trained second neural network into a third weight and a fourth weight; determining the third weight as the initial value of the first preset weight for the feature of the first image frame among the m image frames; and determining the fourth weight as the initial value of the second preset weight for the feature of the t-th image frame, where 2 ≤ t ≤ m and t is a positive integer. The pre-trained second neural network is in turn obtained by the following training steps: respectively extracting features of the target object in two temporally adjacent sample image frames of a labeled training video; predicting the motion information of the target object in the temporally later sample image frame according to the extracted features; and adjusting the weight matrix of the second neural network according to the prediction result of the motion information and the labeling information of the training video, until a predetermined training completion condition of the second neural network is satisfied.
- the predicting the motion information of the at least one target object in the at least one subsequent image frame according to the extracted first feature and the second feature includes: determining, according to the first feature and the second feature, relative change information of the at least one target object in the first location area of the subsequent image frame relative to the same target object in the first location area of the detection image frame; and predicting the motion information of the at least one target object in the at least one subsequent image frame based on at least that relative change information.
- the relative position change information includes: the movement amount, in the horizontal direction, of the center point of the first location area in the subsequent image frame relative to the center point of the first location area in the detection image frame, and the movement amount of that center point in the vertical direction.
- the relative position change information includes: the change amount, in the horizontal direction, of the first location area in the subsequent image frame relative to the first location area in the detection image frame, and the change amount of that location area in the vertical direction.
- the determining a location area of the at least one target object in the at least one subsequent image frame, according to at least the first location area of the at least one target object in the at least one detection image frame and the prediction result of the motion information in the at least one subsequent image frame, includes: determining the location area of the at least one target object in the at least one subsequent image frame according to the horizontal and vertical movement amounts of the center point of the first location area in the subsequent image frame relative to the center point of the first location area in the detection image frame, and the horizontal and vertical change amounts of the first location area in the subsequent image frame relative to the first location area in the detection image frame.
- predicting the motion information of the at least one target object in the at least one subsequent image frame based on the relative change information of the at least one target object includes: predicting that motion information according to the horizontal and vertical movement amounts of the center point of the first location area in the at least one subsequent image frame relative to the center point of the first location area in the detection image frame; wherein the horizontal movement amount is determined from the horizontal movement amount of the second feature of the target object in the subsequent image frame relative to the corresponding first feature of the target object, and the vertical movement amount is determined from the vertical movement amount of that second feature relative to the corresponding first feature.
- predicting the motion information of the at least one target object in the at least one subsequent image frame according to the relative change information of the at least one target object includes: predicting that motion information according to the horizontal and vertical change amounts of the first location area in the subsequent image frame relative to the first location area in the detection image frame; wherein the horizontal change amount is determined from the horizontal change amount of the second feature of the target object in the subsequent image frame relative to the corresponding first feature of the target object, and the vertical change amount is determined from the vertical change amount of that second feature relative to the corresponding first feature.
- the determining a location area of the at least one target object in the at least one subsequent image frame, according to at least the first location area of the at least one target object in the at least one detection image frame and the prediction result of the motion information in the at least one subsequent image frame, includes: using the first location area as a second location area of the at least one target object in the at least one subsequent image frame; and updating the second location area according to the relative change information of the target object in the first location area of the subsequent image frame relative to the target object in the first location area of the detection image frame, to obtain the location area of the at least one target object in the at least one subsequent image frame (a box-update sketch follows below).
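A minimal sketch of this update step; the `(cx, cy, w, h)` box form and the width/height-normalised, log-scale size encoding are assumptions borrowed from common detection regressors, not fixed by the patent, which only names the four change quantities:

```python
import math

def update_location_area(box, rel_change):
    # box = (cx, cy, w, h) is the first location area, taken as the initial
    # second location area; rel_change = (dx, dy, dw, dh) is the predicted
    # relative change information
    cx, cy, w, h = box
    dx, dy, dw, dh = rel_change
    return (cx + dx * w,        # horizontal movement of the centre point
            cy + dy * h,        # vertical movement of the centre point
            w * math.exp(dw),   # horizontal change of the location area
            h * math.exp(dh))   # vertical change of the location area
```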
- the method further includes: in response to the completion of location area determination in the at least one image frame of the video to be detected or of the video sub-segment, extracting a third feature of the at least one target object in its location area in the at least one image frame; and determining, according to the extracted third feature, the category of the target object in the at least one image frame.
- each video to be detected or each video sub-segment includes n temporally consecutive image frames, where n > 1 and n is an integer; and extracting the third feature of the at least one target object in its location area in the at least one image frame includes: extracting the third features of the n image frames in time order; and, for the i-th image frame, encoding its third feature together with the third features of the i-1 preceding image frames, until the third feature of the n-th image frame has been encoded, where 1 ≤ i ≤ n.
- determining, according to the extracted third feature, the category of the target object in the at least one image frame respectively includes: determining a decoding result of the third feature of the at least one image frame according to the extracted third feature and the encoding result of the third feature of the n-th image frame; and determining the category of the target object in the at least one image frame according to the decoding result of the third feature of that image frame.
- determining the decoding result of the third feature of the at least one image frame according to the extracted third feature and the encoding result of the third feature of the n-th image frame includes: decoding the encoding results of the third features of the n image frames in reverse time order; and, for the j-th image frame, determining the decoding result of its third feature according to the encoding result of the third feature of the j-th image frame and the third feature of the n-th image frame, until the third features of the n image frames have been decoded (see the encoder-decoder sketch below).
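A minimal encoder-decoder sketch of this encode-forward, decode-backward ordering; `enc_cell`, `dec_cell` and `classifier` are hypothetical callables (for instance RNN cells and a classification head), since the patent fixes only the ordering, not the concrete cells:

```python
def classify_per_frame(third_feats, enc_cell, dec_cell, classifier):
    n = len(third_feats)
    # Encode in time order: the i-th frame's third feature is encoded together
    # with the encodings of the i-1 preceding frames (carried in the state).
    enc, state = [], None
    for feat in third_feats:
        state = enc_cell(feat, state)
        enc.append(state)
    # Decode in reverse time order: the j-th frame is decoded from its own
    # encoding result and the encoding result of the n-th image frame.
    categories = [None] * n
    for j in reversed(range(n)):
        categories[j] = classifier(dec_cell(enc[j], enc[n - 1]))
    return categories
```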
- a method for detecting an object in a video, comprising: determining a location area of at least one target object in at least one image frame included in the video to be detected or in a video sub-segment thereof; extracting a third feature of the at least one target object in the location area in the at least one image frame; and determining, according to the extracted third feature, a category of the target object in the at least one image frame.
- the video to be detected or the video sub-segment includes n temporally consecutive image frames, where n > 1 and n is an integer; and extracting the third feature of the at least one target object in the location area in the at least one image frame comprises: extracting the third features of the n image frames in time order; and, for the i-th image frame, encoding its third feature together with the third features of the i-1 preceding image frames, until the third feature of the n-th image frame has been encoded, where 1 ≤ i ≤ n.
- determining, according to the extracted third feature, the category of the target object in the at least one image frame comprises: determining a decoding result of the third feature of the at least one image frame according to the extracted third feature and the encoding result of the third feature of the n-th image frame; and determining the category of the target object in the at least one image frame according to the decoding result of the third feature of that image frame.
- determining the decoding result of the third feature of the at least one image frame according to the extracted third feature and the encoding result of the third feature of the n-th image frame includes: decoding the encoding results of the third features of the n image frames in reverse time order; and, for the j-th image frame, determining the decoding result of its third feature according to the encoding result of the third feature of the j-th image frame and the third feature of the n-th image frame, until the third features of the n image frames have been decoded.
- an apparatus for detecting an object in a video, comprising: a detection image frame determining unit, configured to determine at least one image frame in the video to be detected as a detection image frame; a first location area determining unit, configured to acquire a first location area corresponding to at least one target object included in the detection image frame; a feature extraction unit, configured to respectively extract a first feature of each first location area in each detection image frame and a second feature of the same first location area in at least one subsequent image frame temporally consecutive to that detection image frame; a motion information prediction unit, configured to predict, according to the extracted first feature and second feature, motion information of the at least one target object in the at least one subsequent image frame; and a location area determining unit, configured to determine a location area of the at least one target object in the at least one subsequent image frame according to at least the first location area of the at least one target object in the at least one detection image frame and the prediction result of the motion information in the at least one subsequent image frame.
- the detection image frame determining unit is configured to: use the first image frame of the video to be detected as the detection image frame.
- the detection image frame determining unit is configured to: use any key frame of the video to be detected as the detection image frame.
- the detection image frame determining unit is configured to: use at least one image frame of the location area of the at least one target object in the video to be detected as the detection image frame.
- the video to be detected includes a plurality of sequential video sub-segments, and at least two temporally adjacent video sub-segments include at least one common image frame; the detected image frame determining unit is configured to: A common image frame is used as the above-described detected image frame.
- each of the video sub-segments includes m consecutive image frames; the detected image frame determining unit is configured to: use the m-1 image frames with the preceding timing as the detected image frame.
- the first location area determining unit is configured to: mark, in the detected image frame, a first location area corresponding to the at least one target object.
- the first location area determining unit is configured to: determine the first location area according to a location area of the at least one target object that is known in the detected image frame.
- the first location area determining unit is configured to: according to the at least one target object in the at least one common image frame in the video sub-segment of the preceding video sub-segment of any two timings a location area that determines a first location area of the detected image frame in a subsequent video sub-segment.
- the motion information prediction unit is configured to: predict the motion information of the at least one target object in any subsequent image frame according to the first feature of the at least one target object in any detection image frame and its second feature in that subsequent image frame.
- the motion information prediction unit is configured to: for each video sub-segment, according to a first feature of the first m-1 image frames of the time series, and a first preset weight corresponding to the first feature And a second feature of the mth image frame after the time sequence, and a second preset weight corresponding to the second feature, predicting the at least one target object in the mth image frame after the timing Motion information, m is an integer, and m>1.
- the motion information prediction unit is configured to: predict, by the pre-trained first neural network, the at least one target object after the timing according to the extracted first feature and the second feature The motion information in the mth image frame, wherein the weight matrix of the pre-trained first neural network includes the first preset weight and the second preset weight.
- the pre-trained first neural network is obtained by a first training module configured to: divide the weight matrix of a pre-trained second neural network into a third weight and a fourth weight; determine the third weight as the initial value of the first preset weight for the feature of the first image frame among the m image frames; and determine the fourth weight as the initial value of the second preset weight for the feature of the t-th image frame, where 2 ≤ t ≤ m and t is a positive integer. The pre-trained second neural network is obtained by a second training module configured to: respectively extract features of the target object in two temporally adjacent sample image frames of a labeled training video; predict the motion information of the target object in the temporally later sample image frame according to the extracted features; and adjust the weight matrix of the second neural network according to the prediction result of the motion information and the labeling information of the training video, until a predetermined training completion condition of the second neural network is satisfied.
- the motion information prediction unit includes: a relative change information determining module, configured to determine, according to the first feature and the second feature, the at least one subsequent image frame in the first location area Relative change information of the at least one target object relative to the target object in the first location area of the detected image frame; a prediction module, configured to predict the location based on at least the relative change information of the at least one target object And describing motion information of the at least one target object in the at least one subsequent image frame.
- the relative position change information includes: the movement amount, in the horizontal direction, of the center point of the first location area in the subsequent image frame relative to the center point of the first location area in the detection image frame, and the movement amount of that center point in the vertical direction.
- the relative position change information includes: the change amount, in the horizontal direction, of the first location area in the subsequent image frame relative to the first location area in the detection image frame, and the change amount of that location area in the vertical direction.
- the location area determining unit includes: a location area determining module, configured to determine the location area of the at least one target object in the at least one subsequent image frame according to the horizontal and vertical movement amounts of the center point of the first location area in the subsequent image frame relative to the center point of the first location area in the detection image frame, and the horizontal and vertical change amounts of the first location area in the subsequent image frame relative to the first location area in the detection image frame.
- the prediction module is configured to: predict the motion information of the at least one target object in the at least one subsequent image frame according to the horizontal and vertical movement amounts of the center point of the first location area in the subsequent image frame relative to the center point of the first location area in the detection image frame; wherein the horizontal movement amount is determined from the horizontal movement amount of the second feature of the target object in the subsequent image frame relative to the corresponding first feature of the target object, and the vertical movement amount is determined from the vertical movement amount of that second feature relative to the corresponding first feature.
- the prediction module is configured to: predict the motion information of the at least one target object in the at least one subsequent image frame according to the horizontal and vertical change amounts of the first location area in the subsequent image frame relative to the first location area in the detection image frame; wherein the horizontal change amount is determined from the horizontal change amount of the second feature of the target object in the subsequent image frame relative to the corresponding first feature of the target object, and the vertical change amount is determined from the vertical change amount of that second feature relative to the corresponding first feature.
- the location area determining unit is configured to: use the first location area as a second location area of the at least one target object in the at least one subsequent image frame; and update the second location area according to the relative change information of the target object in the first location area of the subsequent image frame relative to the target object in the first location area of the detection image frame, to obtain the location area of the at least one target object in the at least one subsequent image frame.
- the apparatus further includes: a third feature extraction unit, configured to extract, in response to the completion of location area determination of the at least one target object in the at least one image frame of the video to be detected or of the video sub-segment, a third feature of the at least one target object in its location area in the at least one image frame; and a category determining unit, configured to determine, according to the extracted third feature, the category of the target object in the at least one image frame, respectively.
- each of the above-mentioned video to be detected or each of the video sub-segments includes n time-series consecutive image frames, n>1, and n is an integer; and the third feature extraction unit is configured to: Extracting a third feature of the n image frames sequentially; for the ith image frame, encoding a third feature and a third feature of the i-1 image frames before the image frame until the nth image frame The third feature encoding is completed, where 1 ⁇ i ⁇ n.
- the foregoing class determining unit includes: a decoding result determining module, configured to determine a third feature of the at least one image frame according to the extracted third feature and the encoding result of the third feature of the nth image frame And a category determining module, configured to respectively determine a category of the target object in the at least one image frame according to a decoding result of the third feature of the at least one image frame.
- the decoding result determining module is configured to: decode the encoding result of the third feature of the n image frames in reverse order of time series; and perform the third image frame according to the jth image frame The encoding result of the feature and the third feature of the nth image frame determines a decoding result of the third feature of the jth image frame until the third feature decoding of the n image frames is completed.
- an apparatus for detecting an object in a video comprising: a second location area determining unit, configured to determine at least one target object in a video to be detected or the video a sub-segment includes a location area in the at least one image frame; the first feature extraction unit is configured to extract a third feature in the location area of the at least one target object in the at least one image frame; the first category determining unit And determining, according to the extracted third feature, a category of the target object in the at least one image frame.
- the video to be detected or the video sub-segment includes n time-series consecutive image frames, n>1, and n is an integer; the first feature extraction unit is configured to: extract the foregoing according to a time sequence a third feature of the n image frames; for the ith image frame, encoding the third feature and the third feature of the i-1 image frames preceding the image frame until the third of the nth image frame Feature encoding is completed, where 1 ⁇ i ⁇ n.
- the first class determining unit includes: a first decoding result determining module, configured to determine the at least one image frame according to the extracted third feature and the encoding result of the third feature of the nth image frame. a decoding result of the third feature, the first category determining module, configured to respectively determine a category of the target object in the at least one image frame according to a decoding result of the third feature of the at least one image frame.
- the first decoding result determining module is configured to: decode the encoding result of the third feature of the n image frames in reverse order of time series; and for the jth image frame, according to the jth image frame The encoding result of the third feature and the third feature of the nth image frame determines a decoding result of the third feature of the jth image frame until the decoding of the third feature of the n image frames is completed.
- an electronic device includes: a processor and a memory, the memory storing at least one executable instruction that causes the processor to perform the operations corresponding to the method of any of the foregoing embodiments.
- an electronic device comprising: a memory storing executable instructions; and one or more processors in communication with the memory to execute the executable instructions so as to perform the following operations: determining a location area of at least one target object in each image frame included in the video to be detected; extracting a third feature of each target object in its location area in each image frame of the video to be detected or of the video sub-segment; and determining, according to the third feature, the category of the target object in each image frame.
- a computer program comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions implementing the steps in the method of any embodiment of the present application.
- a computer readable storage medium for storing computer readable instructions that, when executed, implement the operations of the steps in the method of any of the embodiments of the present application.
- the method and apparatus for detecting an object in a video provided by the embodiments of the present application first determine one or more image frames in the video to be detected as detection image frames, then acquire the location area corresponding to each target object included in the detection image frames, respectively extract the first feature of each detection image frame in the first location area and the second feature, in the same first location area, of one or more subsequent image frames temporally consecutive to each detection image frame, predict the motion information of each target object in each subsequent image frame according to the extracted first and second features, and finally determine the location area of each target object in each subsequent image frame according to the first location area and the prediction result. Detection of the target object in the video is thereby realized, and computational efficiency is effectively improved.
- FIG. 1 is a flow diagram of one embodiment of a method for detecting an object in a video in accordance with the present application
- FIG. 1a is a schematic comparison between the detection results of a method for detecting an object in a video according to an embodiment of the present application and the detection results of the prior art;
- FIG. 2 is a flow chart of another embodiment of a method for detecting an object in a video in accordance with the present application
- FIG. 2a is a schematic diagram of initializing a 16-dimensional weight matrix using a four-dimensional weight matrix in the embodiment shown in FIG. 2;
- FIG. 2b is a schematic diagram of constructing a 20-frame prediction model using a 5-frame prediction model in the embodiment shown in FIG. 2;
- FIG. 3 is a flow chart of still another embodiment of a method for detecting an object in a video in accordance with the present application
- FIG. 4 is a flow diagram of still another embodiment of a method for detecting an object in a video in accordance with the present application
- FIG. 4a is a schematic diagram of the working relationship corresponding to the flow shown in FIG. 4;
- FIG. 5 is a block diagram showing an embodiment of an apparatus for detecting an object in a video according to the present application
- FIG. 6 is a schematic structural diagram of another embodiment of an apparatus for detecting an object in a video according to the present application.
- FIG. 7 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server of an embodiment of the present application.
- Embodiments of the present invention can be applied to electronic devices such as terminal devices, computer systems, servers, etc., which can operate with numerous other general purpose or special purpose computing system environments or configurations.
- Examples of well-known terminal devices, computing systems, environments, and/or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments including any of the above.
- Electronic devices such as terminal devices, computer systems, servers, etc., can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
- program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
- the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
- program modules may be located on a local or remote computing system storage medium including storage devices.
- Referring to FIG. 1, a flow 100 of one embodiment of a method for detecting an object in a video in accordance with the present application is illustrated.
- the method for detecting an object in a video of this embodiment includes the following steps:
- Step 101 Determine at least one image frame in the video to be detected as a detection image frame, and acquire a first location area corresponding to at least one target object included in the detection image frame.
- the video to be detected may include a plurality of temporally consecutive image frames. An electronic device, such as a terminal or a server, on which the method of the embodiments of the present application runs may determine one or more image frames in the video to be detected as detection image frames.
- when there is a single detection image frame, it may include a plurality of target objects, which may be of the same type or of different types.
- the detection image frames may be temporally consecutive or temporally discrete. Further, the number and/or type of the target objects included in each detection image frame may be the same or different.
- the target object may be preset various types of objects, for example, may include various vehicles such as airplanes, bicycles, automobiles, and the like, and may also include various animals such as birds, dogs, and lions.
- the first location areas in each detection image frame may be obtained by using a static region proposal method.
- the step 101 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by the detected image frame determining unit 501 and the first location area determining unit 502 executed by the processor.
- Step 102 Extract a first feature of each first location area in each detected image frame and a second feature of at least one subsequent image frame in each of the first location areas that are consecutive with respect to each detected image frame sequence in the video.
- after each detection image frame is determined, at least one subsequent image frame temporally consecutive to it must also be determined. Thus, if the detection image frames are temporally consecutive, they and the at least one subsequent image frame together still form a temporally consecutive set of images; if the detection image frames are temporally discrete, each detection image frame is followed by its at least one subsequent image frame, so the video to be detected includes a plurality of discrete image combinations, each comprising at least two image frames.
- since the location areas of the target object in two temporally adjacent image frames are close, the location area of the target object can be predicted more easily across temporally consecutive image frames, improving the accuracy of the prediction. For a plurality of discrete detection image frames, the larger time interval between them avoids detecting many temporally consecutive frames whose location areas nearly coincide, improving the effective detection rate.
- the first feature of each detection image frame in the first location area and the second feature of each subsequent image frame in the first location area may be extracted separately. The extraction of the first feature and the second feature can be implemented, for example, by using the convolutional layers of a convolutional neural network.
- Step 103 Predict motion information of the at least one target object in at least one subsequent image frame according to each of the extracted first features and each second feature.
- the extracted first features and second features may be used to predict the motion information of the at least one target object in the at least one subsequent image frame.
- the above motion information may include, but is not limited to, at least one of: the motion trend of each target object, the distance moved relative to the detection image frame, and the like.
- the step 103 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a motion information prediction unit 504 that is executed by the processor.
- Step 104 Determine a location area of the at least one target object in the at least one subsequent image frame according to at least the first location area of the at least one target object in the at least one detection image frame and the prediction result of the motion information of the at least one target object in the at least one subsequent image frame.
- after the location areas of each target object in the at least one detection image frame and the at least one subsequent image frame have been determined, further applications may be built on the acquired location areas; for example, detection of each target object may be implemented according to these location areas.
- the step 104 may be performed by a processor invoking a corresponding instruction stored in the memory, or may be performed by a location area determining unit 505 that is executed by the processor.
- the positional area of each consecutive image frame can be connected to form a tubular region penetrating through the entire video or video sub-segment to be detected.
- the area contains the information of the moving position of the target object, and also contains the time information of the moving of the target object in each image frame, that is, the motion information in each image frame has temporal correlation.
- FIG. 1a shows four rows of images: row (a) shows the original image frames in the video to be detected; row (b) shows the detection results obtained by a static region proposal method; row (c) shows the detection results obtained by a target regression method using the accurate position of the object; and row (d) shows the detection results obtained by the method for detecting an object in a video of the embodiment of the present application. The results in row (d) retain both the diversity of the detections and their temporal correlation.
- the method for detecting an object in a video provided by the foregoing embodiment of the present application first determines one or more image frames in the video to be detected as detection image frames, then acquires the location area corresponding to each target object included in the detection image frames, respectively extracts the first feature of each detection image frame in the first location area and the second feature, in the same first location area, of one or more subsequent image frames temporally consecutive to each detection image frame, predicts, according to the extracted first and second features, the motion information of the at least one target object in the at least one subsequent image frame, and finally determines, according to the first location area and the prediction result, the location area of the at least one target object in the at least one subsequent image frame. Detection of the target object in the video is thereby realized, and computational efficiency is improved while the temporal information of the target object's motion is preserved and the diversity of the detection results is ensured.
- At least one image frame in the video to be detected may be determined as a detection image frame by using the first image frame of the video to be detected as the detection image frame. In this way, the target object in every image frame of the video to be detected can be detected in turn, ensuring the comprehensiveness of the detection without reducing its accuracy.
- Alternatively, any key frame of the video to be detected may be used as the detection image frame. The key frame may be the image frame in which a certain type of target object first appears, the image frame in which the target object is most complete (here, completeness refers to how much of the target object as a whole appears in the image frame), the image frame with the largest number of target objects, or the image frame with the largest number of types of target objects. It will be appreciated that each image frame in the video to be detected may be traversed to determine the number and/or type and/or completeness of the target objects, from which the positions and number of key frames among the image frames can be determined; one such heuristic is sketched below.
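A minimal sketch of the "most target objects" heuristic; `detect` is a hypothetical per-frame detector returning a list of detected objects:

```python
def pick_key_frames(frames, detect):
    """Traverse every image frame, count the target objects it contains, and
    take the frame(s) with the largest number of objects as key frames."""
    counts = [len(detect(frame)) for frame in frames]
    best = max(counts)
    return [i for i, c in enumerate(counts) if c == best]
```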
- At least one image frame in the video to be detected may also be determined as a detection image frame by using at least one image frame in which the location area of the at least one target object is already known as the detection image frame. In this way, the target object no longer needs to be detected in the detection image frame, which further improves computational efficiency.
- the video to be detected may be divided into a plurality of sequential video sub-segments, and at least two video sub-segments adjacent to each other are defined to share at least one image frame. Then, in the above step 101, at least one image frame in the video to be detected may be determined as a detection image frame by using at least one image frame shared as the detection image frame.
- each of the video sub-segments includes m image frames. At least one image frame in the video to be detected may then be determined as a detection image frame as follows: the m-1 temporally preceding image frames are used as the detection image frames. That is, the first m-1 image frames of each video sub-segment serve as detection image frames, and their features are used to predict the location area of the target object in the last (m-th) image frame. In this way, the accuracy of the detection can be improved.
- the first location area corresponding to the at least one target object included in the detection image frame may be acquired by marking the at least one target object in the detection image frame: the target objects included in the detection image frame are marked, and the marked region determines the first location area of each target object.
- the first location area corresponding to the at least one target object included in the detection image frame may be acquired according to a known location area of the target object: the known location area is determined as the first location area.
- the first location area corresponding to the at least one target object included in the detection image frame may be acquired according to the location area of the at least one target object in the at least one common image frame of the temporally earlier of any two temporally adjacent video sub-segments, thereby determining the first location area of the detection image frame in the temporally later video sub-segment. Since the shared image frame is selected as the detection image frame, the target object no longer needs to be detected in it, which can further improve computational efficiency.
- the first location area corresponding to the at least one target object included in the detection image frame may be acquired according to the circumscribed rectangular area or circumscribed contour area of the position of the at least one target object in the detection image frame: the target object may be marked with the circumscribed rectangle or another external contour of its position, and that circumscribed rectangular area or contour area is determined as the first location area (a minimal OpenCV sketch follows below).
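A minimal sketch of computing the circumscribed rectangle; representing the object's position as a binary mask and using OpenCV's `findContours`/`boundingRect` are assumptions made for illustration:

```python
import cv2
import numpy as np

def circumscribed_rectangle(mask):
    """Determine a first location area as the circumscribed rectangle of the
    target object's outer contour, given a binary mask of its position."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)   # keep the dominant contour
    x, y, w, h = cv2.boundingRect(largest)
    return (x, y, w, h)
```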
- Referring to FIG. 2, a flow 200 of another embodiment of a method for detecting objects in a video in accordance with the present application is shown. As shown in FIG. 2, in the method for detecting an object in a video of this embodiment, predicting the motion information of a target object in each subsequent image frame may be implemented by the following steps:
- Step 201 Extract a first feature of each of the m-1 temporally preceding image frames in each first location area and a second feature of the m-th image frame in each first location area in the video sub-segment.
- m image frames are defined in each video sub-segment, and m-1 image frames with the preceding timing are used as the detection image frame, and the m-th image frame is used as the subsequent image frame.
- when features are extracted, the first features of the first m-1 image frames in the first location area and the second feature of the temporally last (m-th) image frame in the first location area are extracted respectively.
- the step 201 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a feature extraction unit 503 that is executed by the processor.
- Step 202 For each video sub-segment, predict the motion information of the at least one target object in the temporally last (m-th) image frame according to the extracted first features and the first preset weights corresponding to them, together with the extracted second feature and the second preset weight corresponding to it.
- here m is an integer greater than 1.
- each first feature may be weighted based on its first preset weight, and the second feature weighted based on the second preset weight; the weighted features are then combined to predict the motion information, as sketched below.
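A minimal sketch of this weighted combination, assuming a linear form and a hypothetical regression head (`head`); the patent fixes the weighting idea, not the exact operator:

```python
import numpy as np

def predict_mth_frame_motion(first_feats, first_weights,
                             second_feat, second_weight, head):
    """Weight the m-1 first features by their first preset weights and the
    second feature by the second preset weight, then map the combination to
    motion information with a regression head."""
    combined = second_weight @ second_feat
    for feat, weight in zip(first_feats, first_weights):
        combined = combined + weight @ feat
    return head(combined)   # e.g. (dx, dy, dw, dh) per target object
```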
- the step 202 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a motion information prediction unit 504 that is executed by the processor.
- when predicting motion information by using the first features and the second feature, a pre-trained first neural network may be used, whose network parameters include a weight matrix containing the first preset weight and the second preset weight.
- the pre-trained first neural network is obtained by the following training steps not shown in FIG. 2:
- Divide the weight matrix of the pre-trained second neural network into a third weight and a fourth weight; determine the third weight as the initial value of the first preset weight for the feature of the first image frame among the m image frames; and determine the fourth weight as the initial value of the second preset weight for the feature of the t-th image frame, where 2 ≤ t ≤ m and m and t are both positive integers.
- that is, the third weight is used to initialize the weight of the first feature of the first image frame among the m consecutive image frames, and the fourth weight is used to initialize the weights of the second features of the second to m-th image frames; in other words, initial values are set for the first preset weight and the second preset weight, yielding the weight matrix of the initial first neural network with those initial values. Training then adjusts these weights from their initial values to the final first preset weight and second preset weight, producing a first neural network with a new (m-1)²-dimensional weight matrix. The motion information of the target object in the second to m-th image frames can then be predicted simultaneously, effectively improving computational efficiency (a block-initialization sketch follows below).
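A sketch of that block initialization; the exact block layout is an assumption inferred from FIG. 2a, where a 4-block matrix initializes a 16-block one for m = 5:

```python
import numpy as np

def init_first_network_weights(weight_a, weight_b, m):
    """Initialise the (m-1)^2-block weight matrix of the m-frame prediction
    model from the two blocks of a trained 2-frame model: weight A seeds the
    weights tied to the first frame's features, weight B those tied to the
    features of frames 2..m."""
    blocks = [[weight_a if col == 0 else weight_b for col in range(m - 1)]
              for _ in range(m - 1)]
    return np.block(blocks)

# e.g. two 8x8 blocks -> a 32x32 initial weight matrix for a 5-frame model
W = init_first_network_weights(np.eye(8), np.zeros((8, 8)), m=5)
print(W.shape)   # (32, 32)
```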
- the second neural network is also referred to as a 2-frame prediction model.
- the weight matrix of the second convolutional neural network includes two weighting portions corresponding to the features extracted from the two image frames: weight A (corresponding to the third weight above) and weight B (corresponding to the fourth weight above).
- the second neural network may combine the first feature of the temporally preceding image frame, the second feature of the subsequent image frame, and the weights A and B to predict the motion information of the target object in the subsequent image frame.
- the weight matrix of the first neural network (such as the first convolutional neural network) for detecting the video sub-segments including the plurality of image frames may be constructed by using two weight portions included in the weight matrix of the second neural network.
- in FIG. 2a, the dotted-line frame on the right is the weight matrix of a first neural network, such as a first convolutional neural network, for detecting a video sub-segment including 5 image frames (also referred to as a 5-frame prediction model). Weight A is the initial value of the weight of the feature of the first image frame among the five consecutive image frames, and weight B is the initial value of the weights of the features of the second, third, fourth, and fifth image frames.
- After training is completed, the weight matrix includes the first preset weights and the second preset weight.
- The trained first neural network with the above weight matrix can simultaneously predict the location areas of the target object in the 2nd, 3rd, 4th, and 5th image frames. That is, the 2-frame prediction model combines the features of the 1st and 2nd image frames to predict the location area of the target object in the 2nd image frame, while the 5-frame prediction model combines the features of the 1st to 5th image frames to predict the location areas of the target object in the 2nd to 5th image frames. This scheme helps speed up the training of the neural network model and improves computational efficiency.
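- For concreteness, the following is a minimal sketch of this initialization scheme (an illustration under stated assumptions, not the patent's implementation): the prediction head is modeled as a single convolution over concatenated per-frame feature maps with an assumed channel count C, and, for simplicity, it outputs one set of motion parameters rather than tiling the weights into the full (m-1)²-dimensional matrix that predicts frames 2 through m simultaneously. All names are illustrative.

```python
import torch
import torch.nn as nn

C = 256  # assumed number of feature channels per frame


def build_m_frame_head(two_frame_head: nn.Conv2d, m: int) -> nn.Conv2d:
    """Initialize an m-frame prediction head from a trained 2-frame head.

    The 2-frame head's weights split into weight A (acting on the 1st
    frame's features, the "third weight") and weight B (acting on the 2nd
    frame's features, the "fourth weight"); A initializes the slice for
    frame 1, and B initializes the slices for frames 2..m.
    """
    A = two_frame_head.weight[:, :C]       # weight A / third weight
    B = two_frame_head.weight[:, C:2 * C]  # weight B / fourth weight
    head = nn.Conv2d(m * C, two_frame_head.out_channels,
                     kernel_size=two_frame_head.kernel_size,
                     padding=two_frame_head.padding)
    with torch.no_grad():
        head.weight[:, :C] = A                     # frame 1 <- weight A
        for t in range(1, m):                      # frames 2..m <- weight B
            head.weight[:, t * C:(t + 1) * C] = B
        if two_frame_head.bias is not None:
            head.bias.copy_(two_frame_head.bias)
    return head


# Example: build a 5-frame prediction model from a trained 2-frame model,
# then fine-tune its weights starting from these initial values.
two_frame = nn.Conv2d(2 * C, 4, kernel_size=3, padding=1)  # 4 motion parameters
five_frame = build_m_frame_head(two_frame, m=5)
```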
- the above-described 5-frame prediction model can be used to construct a prediction model with a longer length to simultaneously predict the location region of the target object in more image frames.
- For example, a 20-frame prediction model can be constructed from five of the above 5-frame prediction models: since the last image frame handled by each 5-frame prediction model serves as the first image frame of the next 5-frame prediction model, five such 5-frame prediction models can be chained into a 20-frame prediction model, and so on.
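- The chaining itself can be sketched as follows, with a hypothetical `predict_window` callable standing in for a trained m-frame prediction model; each window's last frame is reused as the first frame of the next window, as described above.

```python
from typing import Callable, List, Sequence


def chain_predict(predict_window: Callable[[Sequence], List],
                  frames: Sequence, m: int = 5) -> List:
    """Run an m-frame prediction model over a long video by chaining windows.

    Each window covers m frames and yields predictions for its last m-1
    frames; the last frame of one window is reused as the first frame of
    the next window, so consecutive windows share one boundary frame.
    """
    preds = []
    start = 0
    while start + m <= len(frames):
        preds.extend(predict_window(frames[start:start + m]))  # m-1 predictions
        start += m - 1  # advance so windows overlap by one shared frame
    return preds
```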
- The above process is an initialization process. Before it is carried out, the second neural network needs to be trained first; the pre-trained second neural network is obtained through the following training steps. First, a labeled training video is obtained. The training video includes multiple image frames, and the target objects in each image frame are annotated, so that each image frame can be used as a sample image frame.
- Then, the features of the target object are extracted from two temporally adjacent sample image frames of the training video, and the motion information of the target object in the temporally later sample image frame is predicted from the extracted features. It can be understood that the location area of the target object in the temporally later sample image frame can be determined from this motion information; the predicted location area and the annotated location area are then fed into the second neural network together, and the parameters of the second neural network are adjusted until the training completion condition of the second neural network is satisfied. The training completion condition may be any condition that can stop the training of the second neural network; for example, it may be that the error between the predicted location area and the annotated location area is less than a preset value.
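- A minimal training-loop sketch for the 2-frame model under these assumptions (a smooth-L1 regression loss between the predicted and annotated location areas, and a fixed error threshold as the training-completion condition; the data layout and all names are illustrative):

```python
import torch
import torch.nn as nn


def train_two_frame_model(model: nn.Module, loader, threshold: float = 1e-3,
                          lr: float = 1e-4, max_epochs: int = 100) -> nn.Module:
    """Train the 2-frame prediction model on adjacent labeled frame pairs.

    `loader` is assumed to yield (feat_prev, feat_next, target_box) tuples:
    the target's features in two temporally adjacent sample frames plus the
    annotated location area (as regression targets) in the later frame.
    `model` is assumed to map the two feature tensors to a predicted box.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.SmoothL1Loss()
    for _ in range(max_epochs):
        total, batches = 0.0, 0
        for feat_prev, feat_next, target_box in loader:
            pred_box = model(feat_prev, feat_next)
            loss = loss_fn(pred_box, target_box)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total, batches = total + loss.item(), batches + 1
        # training-completion condition: mean error below a preset value
        if batches and total / batches < threshold:
            break
    return model
```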
- In the method for detecting objects in a video provided by the above embodiment of the present application, after the first features of the detected image frames and the second feature of the subsequent image frame are extracted, the weights of the first features and of the second feature are initialized and adjusted, and combining the features with the adjusted weights enables more accurate prediction of the motion information of the target object in the subsequent image frame.
- Referring to FIG. 3, a flow 300 of yet another embodiment of a method for detecting objects in a video according to the present application is illustrated. As shown in FIG. 3, in this embodiment, predicting the motion information of the target object in each subsequent image frame may be implemented by the following steps:
- Step 301: Determine, according to the first feature and the second feature, the relative change information of the at least one target object in the first location area of the subsequent image frame with respect to the target object in the first location area of the detected image frame.
- In some embodiments, a pre-trained regression network may be used to determine, according to the extracted features, the relative position information of the target object in the first location area of the subsequent image frame with respect to the target object in the first location area of the detected image frame.
- In some embodiments, the relative position information may include: the movement amount Δx of the center point of the first location area in the subsequent image frame in the horizontal direction compared with the center point of the first location area in the detected image frame, and the movement amount Δy of the center point of the first location area in the subsequent image frame in the vertical direction compared with the center point of the first location area in the detected image frame.
- Since the first location area has been determined, its center point can be determined accordingly. The moving distance of the target object in the horizontal direction can be determined from the movement amount Δx of the center point of the first location area in the horizontal direction, and the moving distance of the target object in the vertical direction can be determined from the movement amount Δy of the center point of the first location area in the vertical direction.
- In some embodiments, the relative position information may further include: the change amount Δw of the first location area in the subsequent image frame in the horizontal direction compared with the first location area in the detected image frame, and the change amount Δh of the first location area in the subsequent image frame in the vertical direction compared with the first location area in the detected image frame. The width of the location area where the target object is located can be determined from the width change amount Δw in the horizontal direction, and its height can be determined from the height change amount Δh in the vertical direction.
- In some embodiments, the relative change information may be used as follows: the horizontal coordinate x_t of the center point of the first location area in the subsequent image frame is determined from the horizontal movement amount Δx of the center point together with the horizontal coordinate x_1 and width w_1 of the first location area in the detected image frame; the vertical coordinate y_t of the center point in the subsequent image frame is determined from the vertical movement amount Δy together with the vertical coordinate y_1 and height h_1 in the detected image frame; the width w_t of the first location area in the subsequent image frame is determined from the width change amount Δw together with the width w_1 in the detected image frame; and the height h_t of the first location area in the subsequent image frame is determined from the height change amount Δh together with the height h_1 in the detected image frame.
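- The exact formulas are not reproduced in the text above; one consistent instantiation, assuming the width/height-normalized offsets of standard R-CNN-style bounding-box regression (an assumption, not the patent's stated formula), is:

$$x_t = x_1 + \Delta x \cdot w_1, \qquad y_t = y_1 + \Delta y \cdot h_1,$$

$$w_t = w_1 \cdot e^{\Delta w}, \qquad h_t = h_1 \cdot e^{\Delta h}.$$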
- the step 301 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a relative change information determination module executed by the processor.
- Step 302: Predict the motion information of the at least one target object in the at least one subsequent image frame according to at least the relative change information of the at least one target object.
- motion information of the at least one target object in the at least one subsequent image frame may be predicted according to the obtained relative change information.
- In an optional example, step 302 may be performed by a processor invoking corresponding instructions stored in a memory, or by a prediction module run by the processor.
- In some embodiments, the motion information of the target object in each subsequent image frame can be predicted according to the movement amount Δx of the center point of the first location area in the horizontal direction and the movement amount Δy of the center point of the first location area in the vertical direction.
- In some embodiments, the motion information of the target object in each subsequent image frame may also be predicted according to the width change amount Δw of the first location area in the horizontal direction and the height change amount Δh of the first location area in the vertical direction.
- In some embodiments, the location area of the at least one target object in the at least one subsequent image frame may be determined by: taking the first location area as a second location area of the at least one target object in the at least one subsequent image frame, and updating the second location area according to the relative change information, to obtain the location area of the at least one target object in the at least one subsequent image frame.
- For example, after the horizontal coordinate x_t of the center point of the first location area in the subsequent image frame, its vertical coordinate y_t, the width w_t, and the height h_t are determined, the position of the second location area may be updated according to this relative change information, and the updated second location area is taken as the location area of the at least one target object in the at least one subsequent image frame.
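- A minimal sketch of this update step (assuming the R-CNN-style parameterization given above, and a (center x, center y, width, height) box format, both of which are assumptions):

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)


def update_second_location_area(first_area: Box, dx: float, dy: float,
                                dw: float, dh: float) -> Box:
    """Update the second location area (initialized to the first location
    area) with the relative change information (dx, dy, dw, dh)."""
    x1, y1, w1, h1 = first_area
    x_t = x1 + dx * w1        # horizontal center coordinate in the later frame
    y_t = y1 + dy * h1        # vertical center coordinate in the later frame
    w_t = w1 * math.exp(dw)   # width in the later frame
    h_t = h1 * math.exp(dh)   # height in the later frame
    return (x_t, y_t, w_t, h_t)
```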
- The method for detecting objects in a video provided by the above embodiment of the present application can accurately determine the location area of the at least one target object in at least one subsequent image frame by determining the relative change information, thereby ensuring the accuracy of target object detection.
- In some embodiments, tasks such as image classification and segmentation may further be performed based on the obtained location areas in each image frame; the present application does not limit the corresponding implementation. The following takes a classification task as an example. It can be understood that the classification task in the embodiments of the present application may use any of the methods for detecting objects in a video provided herein to determine the position information of the target object in each image frame of the video to be detected, or may use other existing methods in the art to detect that position information; the embodiments of the present application are not limited in this respect.
- Referring to FIG. 4, the method for detecting objects in a video includes the following steps (these steps may be performed after the flow shown in FIG. 1, or after the location areas of the target object in the image frames of the video have been obtained by a method different from that of FIG. 1; the embodiments of the present application are not limited in this respect):
- Step 401: Extract the third features of the at least one target object in its location areas in at least one image frame of the video or video sub-segment to be detected.
- This step may be performed in response to the determination of the location areas in the at least one image frame of the video or video sub-segment to be detected being completed. The location areas may be determined by any of the detection methods provided in the embodiments of the present application, or by other methods such as manual annotation or frame-by-frame static detection; the embodiments of the present application are not limited in this respect.
- In some embodiments, the location areas in the consecutive image frames may be connected to form a tubular region running through the entire video or video sub-segment to be detected, and the third features may then be extracted from this location area. It can be understood that, since the location area is determined to contain the target object, the extracted third features are the features of each target object.
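- As an illustrative sketch (not the patent's stated implementation), the per-frame location areas of one target can be stacked into such a tube and its third features cropped from each frame's feature map, for example with torchvision's RoI alignment; the shapes and names below are assumptions.

```python
import torch
from torchvision.ops import roi_align


def extract_tube_features(frame_feats: torch.Tensor, tube_boxes: torch.Tensor,
                          out_size: int = 7,
                          spatial_scale: float = 1.0) -> torch.Tensor:
    """Extract third features along a tubular region.

    frame_feats: (n, C, H, W) feature maps of n consecutive image frames.
    tube_boxes:  (n, 4) per-frame boxes (x1, y1, x2, y2) of one target,
                 forming a tube through the video or sub-segment.
    Returns (n, C, out_size, out_size) per-frame region features.
    """
    n = frame_feats.shape[0]
    # roi_align expects boxes as rows of (batch_index, x1, y1, x2, y2)
    idx = torch.arange(n, dtype=frame_feats.dtype).unsqueeze(1)
    rois = torch.cat([idx, tube_boxes.to(frame_feats.dtype)], dim=1)
    return roi_align(frame_feats, rois, output_size=(out_size, out_size),
                     spatial_scale=spatial_scale)
```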
- each video to be detected or each video sub-segment includes n time-series consecutive image frames, where n is an integer greater than one.
- The above step 401 may be implemented as follows: extract the third features of the n image frames in temporal order; for the i-th image frame, encode its third feature together with the third features of the i-1 image frames preceding it, until the encoding of the third feature of the n-th image frame is completed, where 1 ≤ i ≤ n. That is, the third features of the n image frames are extracted sequentially from the 1st image frame to the n-th image frame, and then, for each image frame, its third feature is encoded together with the third features of all image frames preceding it, until the encoding of the third feature of the n-th image frame is completed. The above encoding may be implemented, for example, using an encoding LSTM (long short-term memory) unit.
- the step 401 can be performed by a processor invoking a corresponding instruction stored in a memory, or can be performed by a first feature extraction unit 602 that is executed by the processor.
- Step 402: Determine, according to the extracted third features, the category of the target object in the at least one image frame.
- the category of the at least one target object may be determined. It can be understood that the categories of the target objects in different image frames may be the same or different.
- the step 402 may be performed by a processor invoking a corresponding instruction stored in a memory, or may be performed by a first class determining unit 603 that is executed by the processor.
- In some embodiments, step 402 may further be implemented by the following steps (not shown in FIG. 4): determining a decoding result of the third feature of the at least one image frame according to the extracted third features and the encoding result of the third feature of the n-th image frame; and determining the category of the target object in the at least one image frame according to the decoding result of the third feature of the at least one image frame.
- The above decoding may be implemented using a decoding LSTM unit.
- In some embodiments, the decoding of the third features of the at least one image frame may be implemented by the following steps (not shown in FIG. 4): decode the encoding results of the third features of the n image frames in reverse temporal order; for the j-th image frame, determine the decoding result of its third feature according to the third feature of the j-th image frame and the encoding result of the third feature of the n-th image frame, until the decoding of the third features of the n image frames is completed.
- That is, decoding proceeds sequentially from the n-th image frame to the 1st image frame; for each image frame, its decoding result is determined according to its own third feature and the encoding result of the third feature of the n-th image frame, until the decoding of the third features of all n image frames is completed.
- Since the encoding result of the third feature of the n-th image frame is the encoding result of the whole tubular region of the video or video sub-segment to be detected, each image frame is decoded by combining the encoding result of the tubular region with the third feature of that image frame, and the obtained decoding results preserve the temporal correlation between the target objects in the image frames.
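- A compact sketch of this encode-then-decode scheme under stated assumptions: per-frame third features are flattened vectors, an encoding LSTM processes them in temporal order, decoding runs in reverse order while combining each frame's feature with the n-th frame's encoding result, and a linear classifier (an assumption) produces per-frame categories.

```python
import torch
import torch.nn as nn


class TubeEncoderDecoder(nn.Module):
    """Encode per-frame tube features in temporal order, then decode them in
    reverse order against the n-th frame's encoding result, and classify."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # decoder input: frame feature concatenated with the tube encoding
        self.decoder = nn.LSTM(feat_dim + hidden_dim, hidden_dim,
                               batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n, feat_dim), frames in temporal order
        enc_out, _ = self.encoder(feats)
        tube_code = enc_out[:, -1:]               # n-th frame's encoding result
        n = feats.shape[1]
        rev = torch.flip(feats, dims=[1])         # last frame first
        dec_in = torch.cat([rev, tube_code.expand(-1, n, -1)], dim=-1)
        dec_out, _ = self.decoder(dec_in)
        logits = self.classifier(dec_out)         # per-frame category scores
        return torch.flip(logits, dims=[1])       # back to temporal order
```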
- FIG. 4a is a schematic diagram of the working relationship corresponding to the flow shown in FIG. 4.
- As shown in FIG. 4a, after the first location area of each target object in the first image frame is determined, a plurality of tubular regions are formed; the location area of each target object in each image frame is then predicted, and the tubular regions formed above are adjusted accordingly. After the adjustment is completed, the features of each image frame within the tubular region are extracted; once these features are obtained, the features in the image frames are encoded sequentially from the first image frame to the last image frame, yielding the encoding result of the entire tubular region.
- In the encoding process, an encoding LSTM may be used. The obtained encoding result is then decoded: the features of the tubular region in each image frame may be combined with the obtained encoding result of the tubular region, and the features in the image frames are decoded sequentially from the last image frame to the first image frame. In the decoding process, a decoding LSTM may be used. After decoding, the target objects contained in the at least one image frame of the video may be classified according to the decoding results.
- In the method for detecting objects in a video provided by the above embodiment of the present application, after the location areas of the at least one target object in at least one image frame are determined, the third features of the location areas are encoded to obtain an integrated representation of each target object over the entire tubular region, and each target object is then classified according to the decoding results, so that all features of each target object over the entire tubular region are comprehensively considered. When decoding the encoding result of the tubular region, the features in the image frames may be decoded sequentially from the last image frame to the first image frame, or from the first image frame to the last image frame; decoding sequentially from the last image frame to the first image frame, however, ensures that the detected category of each target object in each image frame is based on all features of that target object over the entire tubular region, which improves the accuracy of object detection in the video.
- the method provided by any of the foregoing embodiments of the present application may be performed by any suitable device having data processing capability, including but not limited to: a terminal device, a server, and the like.
- Alternatively, the method provided by any of the foregoing embodiments of the present application may be executed by a processor; for example, the processor executes corresponding instructions stored in a memory to perform the method provided by any of the foregoing embodiments of the present application. This will not be repeated below.
- Those of ordinary skill in the art will appreciate that all or part of the steps implementing the foregoing method embodiments may be completed by hardware related to program instructions. The foregoing program may be stored in a computer-readable storage medium, and when executed, performs the steps of the foregoing method embodiments. The foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
- Referring to FIG. 5, the apparatus 500 for detecting objects in a video of the present embodiment includes: a detected image frame determining unit 501, a first location area determining unit 502, a feature extraction unit 503, a motion information prediction unit 504, and a location area determining unit 505.
- the detection image frame determining unit 501 is configured to determine that at least one image frame in the video to be detected is a detection image frame.
- the first location area determining unit 502 is configured to acquire a first location area corresponding to the at least one target object included in the detection image frame.
- The feature extraction unit 503 is configured to respectively extract the first features of the first location areas in each detected image frame and the second features, in the first location areas, of at least one subsequent image frame temporally consecutive to each detected image frame.
- the motion information prediction unit 504 is configured to predict, according to the extracted first feature and the second feature, motion information of the at least one target object in the at least one subsequent image frame.
- The location area determining unit 505 is configured to determine the location area of the at least one target object in the at least one subsequent image frame, at least according to the first location area of the at least one target object in the at least one detected image frame and the prediction result of the motion information in the at least one subsequent image frame.
- The apparatus for detecting objects in a video provided by the foregoing embodiment of the present application first determines one or more image frames in the video to be detected as detected image frames, and then acquires the first location areas corresponding to the target objects contained in the detected image frames; it respectively extracts the first features of the at least one detected image frame in the first location areas and the second features, in the first location areas, of one or more subsequent image frames temporally consecutive to each detected image frame; according to the extracted first features and second features, it predicts the motion information of the at least one target object in the at least one subsequent image frame; and finally, according to the first location areas and the prediction result, it determines the location area of the at least one target object in the at least one subsequent image frame. In this way, by determining the location areas of the target object in the image frames of the video to be detected, detection of the target object in the video can be realized, and computational efficiency is effectively improved.
- the detected image frame determining unit 501 is configured to: use the first image frame of the video to be detected as the detected image frame.
- the detection image frame determining unit 501 is configured to: use any key frame of the video to be detected as the detection image frame.
- In some embodiments, the detected image frame determining unit 501 may be configured to: use at least one image frame in the video to be detected in which the location area of the at least one target object is known as the detected image frame.
- In some embodiments, the video to be detected includes a plurality of temporally consecutive video sub-segments, and at least two temporally adjacent video sub-segments include at least one common image frame.
- the detected image frame determining unit 501 is configured to: use the at least one common image frame as the detected image frame.
- each of the video sub-segments includes m consecutive image frames.
- the detected image frame determining unit 501 can be configured to use the m-1 image frames with the preceding timing as the detected image frame.
- the first location area determining unit 502 may be configured to: in the foregoing detection image frame, mark a first location area corresponding to each of the target objects.
- the first location area determining unit 502 is configured to: determine the first location area according to the location area of each of the target objects that are known in the detection image frame.
- In some embodiments, the first location area determining unit 502 may be configured to: determine the first location area of the detected image frame in the temporally later video sub-segment according to the location area of the at least one target object in the at least one common image frame of the temporally earlier of any two temporally adjacent video sub-segments.
- In some embodiments, the first location area determining unit 502 may be configured to: determine the first location area according to the circumscribed rectangular region or circumscribed contour region of the position of each target object in the detected image frame.
- In some embodiments, the motion information prediction unit 504 may be configured to: predict the motion information of the at least one target object in any subsequent image frame according to the first feature of each target object in any detected image frame and the second feature of the target object in that subsequent image frame.
- In some embodiments, the motion information prediction unit 504 may be configured to: for each video sub-segment, predict the motion information of the at least one target object in the temporally last m-th image frame according to the first features of the temporally first m-1 image frames, the first preset weights corresponding to the first features, the second feature of the temporally last m-th image frame, and the second preset weight corresponding to the second feature, where m is an integer and m > 1.
- In some embodiments, the motion information prediction unit 504 may be configured to: predict, using a pre-trained first neural network and according to the extracted first features and second feature, the motion information of the at least one target object in the temporally last m-th image frame, where the weight matrix of the pre-trained first neural network includes the first preset weights and the second preset weight.
- In some embodiments, in response to m being greater than 2, the pre-trained first neural network is obtained by a first training module, which is configured to: divide the weight matrix of a pre-trained second neural network into a third weight and a fourth weight; determine the third weight as the initial value of the first preset weight for the feature of the 1st image frame among the m image frames; and determine the fourth weight as the initial value of the second preset weight for the feature of the t-th image frame, where 2 ≤ t ≤ m and t is a positive integer.
- The pre-trained second neural network is obtained by a second training module, which is configured to: respectively extract the features of the target object in two temporally adjacent sample image frames of a labeled training video; predict the motion information of the target object in the temporally later sample image frame according to the extracted features; and adjust the weight matrix of the second neural network according to the prediction result of the motion information and the annotation information of the training video, until the predetermined training completion condition of the second neural network is satisfied.
- the motion information prediction unit 504 may further include a relative change information determining module and a prediction module not shown in FIG. 5.
- The relative change information determining module is configured to determine, according to the first feature and the second feature, the relative change information of the at least one target object in the first location area of the at least one subsequent image frame with respect to the target object in the first location area of the detected image frame.
- a prediction module configured to predict, according to the relative change information of the at least one target object, motion information of the at least one target object in the at least one subsequent image frame.
- In some embodiments, the relative position change information includes: the movement amount of the center point of the first location area in the subsequent image frame in the horizontal direction compared with the center point of the first location area in the detected image frame, and the movement amount of the center point of the first location area in the subsequent image frame in the vertical direction compared with the center point of the first location area in the detected image frame.
- In some embodiments, the relative position change information includes: the change amount of the first location area in the subsequent image frame in the horizontal direction compared with the first location area in the detected image frame, and the change amount of the first location area in the subsequent image frame in the vertical direction compared with the first location area in the detected image frame.
- In some embodiments, the location area determining unit 505 may further include a location area determining module (not shown in FIG. 5), configured to determine the location area of the target object in the at least one subsequent image frame according to: the first location area; the horizontal and vertical movement amounts of the center point of the first location area in the subsequent image frame compared with the center point of the first location area in the detected image frame; and the horizontal and vertical change amounts of the first location area in the subsequent image frame compared with the first location area in the detected image frame.
- In some embodiments, the prediction module may be configured to: predict the motion information of the target object in the subsequent image frame according to the horizontal and vertical movement amounts of the center point of the first location area in the subsequent image frame compared with the center point of the first location area in the detected image frame. The horizontal movement amount of the center point is determined according to the horizontal movement amount of the second feature of the target object in the subsequent image frame relative to the corresponding first feature of the target object, and the vertical movement amount of the center point is determined according to the vertical movement amount of the second feature relative to the corresponding first feature.
- In some embodiments, the prediction module may be configured to: predict the motion information of the target object in the subsequent image frame according to the horizontal and vertical change amounts of the first location area in the subsequent image frame compared with the first location area in the detected image frame. The horizontal change amount of the first location area is determined according to the horizontal change amount of the second feature of the target object in the subsequent image frame relative to the corresponding first feature of the target object, and the vertical change amount is determined according to the vertical change amount of the second feature relative to the corresponding first feature.
- In some embodiments, the location area determining unit 505 may be configured to: take the first location area as a second location area of the target object in the subsequent image frame; and update the second location area according to the relative change information of the target object in the first location area of the subsequent image frame with respect to the target object in the first location area of the detected image frame, to obtain the location area of the target object in the subsequent image frame.
- the foregoing apparatus 500 for detecting an object in a video may further include a third feature extraction unit and a category determination unit not shown in FIG. 5.
- The third feature extraction unit is configured to: in response to the determination of the location areas of the at least one target object in the image frames of the video or video sub-segment to be detected being completed, extract the third features of the at least one target object in the location areas in the at least one image frame of the video or video sub-segment to be detected.
- a category determining unit configured to respectively determine a category of the target object in the image frame according to the extracted third feature.
- In some embodiments, each video to be detected or each video sub-segment includes n temporally consecutive image frames, where n > 1 and n is an integer; and the third feature extraction unit may further be configured to: extract the third features of the n image frames in temporal order; and, for the i-th image frame, encode its third feature together with the third features of the i-1 image frames preceding it, until the encoding of the third feature of the n-th image frame is completed, where 1 ≤ i ≤ n.
- the category determining unit may include a decoding result determining module and a category determining module not shown in FIG. 5.
- the decoding result determining module is configured to determine a decoding result of the third feature of the at least one image frame according to the extracted third feature and the encoding result of the third feature of the nth image frame.
- a category determining module configured to respectively determine a category of the target object in the at least one image frame according to a decoding result of the third feature of the at least one image frame.
- In some embodiments, the decoding result determining module may be configured to: decode the encoding results of the third features of the n image frames in reverse temporal order; and, for the j-th image frame, determine the decoding result of its third feature according to the third feature of the j-th image frame and the encoding result of the third feature of the n-th image frame, until the decoding of the third features of the n image frames is completed.
- FIG. 6 shows a schematic structural diagram of an apparatus for detecting objects in a video according to another embodiment of the present application.
- the apparatus 600 for detecting an object in a video of the present embodiment includes: a second location area determining unit 601, a first feature extracting unit 602, and a first category determining unit 603.
- the second location area determining unit 601 is configured to determine a location area of the at least one target object in the at least one image frame included in the video or video sub-segment to be detected.
- the first feature extraction unit 602 is configured to extract a third feature of the at least one target object in the location area of the video to be detected or the at least one image frame of the video sub-segment.
- the first category determining unit 603 is configured to determine, according to the extracted third feature, a category of the target object in the at least one image frame.
- The apparatus for detecting objects in a video provided by the above embodiment of the present application can classify the target object according to the third features of its location areas after the location areas of the target object in the image frames are determined, thereby extending the functionality of object detection in video.
- the video or video sub-segment to be detected includes n time-series consecutive image frames, n>1, and n is an integer.
- In some embodiments, the first feature extraction unit 602 may be configured to: extract the third features of the n image frames in temporal order; and, for the i-th image frame, encode its third feature together with the third features of the i-1 image frames preceding it, until the encoding of the third feature of the n-th image frame is completed, where 1 ≤ i ≤ n.
- the first category determining unit 603 may further include a first decoding result determining module not shown in FIG. 6 and a first category determining module.
- the first decoding result determining module is configured to determine a decoding result of the third feature of the at least one image frame according to the extracted third feature and the encoding result of the third feature of the nth image frame.
- the first category determining module is configured to respectively determine a category of the target object in the at least one image frame according to a decoding result of the third feature of the at least one image frame.
- In some embodiments, the first decoding result determining module may be configured to: decode the encoding results of the third features of the n image frames in reverse temporal order; and, for the j-th image frame, determine the decoding result of its third feature according to the third feature of the j-th image frame and the encoding result of the third feature of the n-th image frame, until the decoding of the third features of the n image frames is completed.
- Each block of the flowcharts or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function.
- the functions noted in the blocks may also occur in a different order than that illustrated in the drawings. For example, two successively represented blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
- each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts can be implemented in a dedicated hardware-based system that performs the specified function or operation. Or it can be implemented by a combination of dedicated hardware and computer instructions.
- the units involved in the embodiments of the present application may be implemented by software or by hardware.
- The described units may also be provided in a processor; for example, they may be described as: a processor including a detected image frame determining unit, a first location area determining unit, a feature extraction unit, a motion information prediction unit, and a location area determining unit. In some cases, the names of these units do not constitute a limitation on the units themselves; for example, the detected image frame determining unit may also be described as "a unit for determining at least one image frame in the video to be detected as a detected image frame".
- The embodiment of the present application further provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, or a server, and which includes a processor and a memory, where the memory is configured to store at least one executable instruction that causes the processor to perform the operations corresponding to the method for detecting objects in a video according to any of the above embodiments of the present application.
- The embodiment of the present application further provides a computer program, including computer-readable code; when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the steps of the method for detecting objects in a video according to any of the foregoing embodiments of the present application.
- The embodiment of the present application further provides a computer-readable storage medium configured to store computer-readable instructions that, when executed, implement the operations of the steps of the method for detecting objects in a video according to any of the above embodiments of the present application.
- Referring to FIG. 7, the computer system 700 includes one or more processors and a communication unit. The one or more processors include, for example, one or more central processing units (CPUs) 701 and/or one or more graphics processing units (GPUs) 713; the processors may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 702 or executable instructions loaded from a storage portion 708 into a random access memory (RAM) 703. The communication unit 712 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card.
- The processor may communicate with the ROM 702 and/or the RAM 703 to execute executable instructions, connect to the communication unit 712 via a bus 704, and communicate with other target devices via the communication unit 712, thereby completing the operations corresponding to the method for detecting objects in a video provided by any of the embodiments of the present application, for example: determining at least one image frame in the video to be detected as a detected image frame; acquiring first location areas corresponding to at least one target object contained in the detected image frame; respectively extracting the first features of the first location areas in each detected image frame and the second features, in the first location areas, of at least one subsequent image frame in the video temporally consecutive to each detected image frame; predicting, according to the extracted first features and second features, the motion information of the at least one target object in the at least one subsequent image frame; and determining, at least according to the first location areas of the at least one target object in the at least one detected image frame and the prediction result of the motion information in the at least one subsequent image frame, the location area of the at least one target object in the at least one subsequent image frame. For another example: determining the location area of at least one target object in at least one image frame included in the video or video sub-segment to be detected; extracting the third features of the at least one target object in the location areas in the at least one image frame; and determining, according to the extracted third features, the category of the target object in the at least one image frame.
- In addition, the RAM 703 may also store various programs and data required for the operation of the device.
- The CPU 701, the ROM 702, and the RAM 703 are connected to one another through the bus 704. In the presence of the RAM 703, the ROM 702 is an optional module: the RAM 703 stores executable instructions, or executable instructions are written into the ROM 702 at runtime, and the executable instructions cause the CPU 701 to perform the operations corresponding to the above-described method.
- An input/output (I/O) interface 705 is also coupled to bus 704.
- the communication portion 712 may be integrated or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) and linked on the bus 704.
- the following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, etc.; an output portion 707 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, and a speaker; a storage portion 708 including a hard disk or the like And a communication portion 709 including a network interface card such as a LAN card, a modem, or the like.
- the communication section 709 performs communication processing via a network such as the Internet.
- Driver 710 is also connected to I/O interface 705 as needed.
- a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like, is mounted on the drive 710 as needed so that a computer program read therefrom is installed into the storage portion 708 as needed.
- It should be noted that the architecture shown in FIG. 7 is only an optional implementation. In specific practice, the number and types of the components in FIG. 7 may be selected, reduced, increased, or replaced according to actual needs; different functional components may be implemented separately or in an integrated manner; for example, the GPU 713 and the CPU 701 may be provided separately, or the GPU 713 may be integrated on the CPU 701, and the communication unit may be provided separately or integrated on the CPU 701 or the GPU 713; and so on.
- An embodiment of the present application includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium; the computer program comprises program code for executing the method illustrated in the flowcharts, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: determining at least one image frame in the video to be detected as a detected image frame; acquiring first location areas corresponding to at least one target object contained in the detected image frame; respectively extracting the first features of the first location areas in each detected image frame and the second features, in the first location areas, of at least one subsequent image frame temporally consecutive to each detected image frame; predicting, according to the extracted first features and second features, the motion information of the at least one target object in the at least one subsequent image frame; and determining, at least according to the first location areas and the prediction result of the motion information, the location area of the at least one target object in the at least one subsequent image frame.
- the computer program can be downloaded and installed from the network via communication portion 709, and/or installed from removable media 711.
- When the computer program is executed by the CPU 701, the above-described functions defined in the method of the embodiments of the present application are performed.
- The methods and apparatuses of the present application may be implemented in many ways, for example, in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of the steps of the method is for illustration only, and the steps of the method of the present application are not limited to the order described above unless otherwise specifically stated. In addition, in some embodiments, the present application may also be implemented as programs recorded in a recording medium, these programs including machine-readable instructions for implementing the method according to the present application; thus, the present application also covers a recording medium storing programs for executing the method according to the present application.
Abstract
Embodiments of the present application disclose a method, an apparatus, and an electronic device for detecting objects in a video: a detected image frame is determined and the first location areas of the target objects it contains are acquired; first features are extracted from the detected image frames and second features from temporally subsequent image frames over the same first location areas; the motion information of the targets in the subsequent frames is predicted from these features; and the targets' location areas in the subsequent frames are determined from the first location areas and the predicted motion information. This embodiment effectively improves computational efficiency.
Description
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 21, 2017 under Application No. CN201710093583.0 and entitled "Method, Apparatus, and Electronic Device for Detecting Objects in Video", the entire contents of which are incorporated by reference in the embodiments of the present application.
本申请实施例涉及物体检测领域,涉及视频中物体检测领域,尤其涉及一种用于检测视频中物体的方法、装置和电子设备。
对视频中物体的检测技术是对静态图像中物体检测技术在视频领域的扩展,该技术需要在视频的每一帧图像中检测一个或多个相同或不同的物体。
发明内容
本申请实施例提出了一种用于检测视频中物体的技术方案。
根据本申请实施例的一个方面,提供了一种用于检测视频中物体的方法,上述方法包括:确定待检测的视频中至少一图像帧为检测图像帧;获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域;分别提取各上述检测图像帧中各上述第一位置区域的第一特征和上述视频中相对各上述检测图像帧时序连续的至少一在后图像帧在各上述第一位置区域的第二特征;根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息;至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
在一些实施例中,上述确定待检测的视频中至少一图像帧为检测图像帧,包括:将上述待检测的视频的第一图像帧作为上述检测图像帧。
在一些实施例中,上述确定待检测的视频中至少一图像帧为检测图像帧,包括:将上述待检测的视频的任一关键帧作为上述检测图像帧。
在一些实施例中,上述确定待检测的视频中至少一图像帧为检测图像帧,包括:将所述待检测的视频中至少一已知所述至少一个目标物体的位置区域的图像帧作为所述检测图像帧。
在一些实施例中,上述待检测的视频包括多个时序连续的视频子段,至少两个时序相邻的视频子段包括至少一共同图像帧;上述确定待检测的视频中至少一图像帧为检测图像帧,包括:将上述至少一共同图像帧作为上述检测图像帧。
在一些实施例中,每一上述视频子段中包括m个时序连续的图像帧;以及上述确定待检测的视频中至少一图像帧为检测图像帧,包括:将时序在前的m-1个图像帧作为上述检测图像帧。
在一些实施例中,上述获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:在上述检测图像帧中标注各上述目标物体对应的第一位置区域。
在一些实施例中,上述获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:根据所述检测图像帧中已知的所述至少一目标物体的位置区域确定所述第一位置区域。
在一些实施例中,上述获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:根据任两个时序相邻的视频子段中时序在前的视频子段中所述至少一共同图像帧中所述至少一目标物体的位置区域,确定时序在后的视频子段中所述检测图像帧的第一位置区域。
在一些实施例中,上述获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:根据所述至少一目标物体在所述检测图像帧中的位置的外接矩形区域或外接轮廓区域,确定所述第一位置区域。
在一些实施例中,上述所述根据提取的所述第一特征和所述第二特征,预测所述至少一目标 物体分别在所述至少一在后图像帧中的运动信息,包括:根据所述至少一目标物体在任一所述检测图像帧中的第一特征及在任一在后图像帧中的第二特征,预测所述至少一目标物体在所述任一在后图像帧中的运动信息。
在一些实施例中,上述根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息,包括:对于每个视频子段,根据时序在前的m-1个图像帧的第一特征、与所述第一特征对应的第一预设权重以及时序在后的第m个图像帧的第二特征、与所述第二特征对应的第二预设权重,预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,m为整数,且m>1。
在一些实施例中,上述所述根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息,包括:根据提取的所述第一特征和所述第二特征,利用预先训练的第一神经网络预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,其中,所述预先训练的第一神经网络的权重矩阵包括所述第一预设权重以及所述第二预设权重。
在一些实施例中,响应于m大于2,上述预先训练的第一神经网络通过以下训练步骤得到:将预先训练的第二神经网络的权重矩阵分为第三权重和第四权重;将上述第三权重确定为上述m个图像帧中的第1个图像帧的特征的上述第一预设权重的初始值;将上述第四权重确定为第t个图像帧的特征的上述第二预设权重的初始值,其中,2≤t≤m,且t为正整数;上述预先训练的第二神经网络通过以下训练步骤得到:分别提取已标注的训练用视频中时序相邻的两个样本图像帧中上述目标物体的特征;根据提取的特征预测上述目标物体在时序在后的样本图像帧中的运动信息;根据上述运动信息的预测结果和上述训练用视频的标注信息,调整第二神经网络的权重矩阵,直至满足上述第二神经网络预定的训练完成条件。
在一些实施例中,上述根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息,包括:根据所述第一特征和所述第二特征,确定所述至少一在后图像帧在所述第一位置区域中的所述至少一目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息;至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息。
在一些实施例中,上述相对位置变化信息包括:上述在后图像帧中的上述第一位置区域中心点在水平方向上较上述检测图像帧中的上述第一位置区域中心点的移动量、上述在后图像帧中的上述第一位置区域中心点在竖直方向上较上述检测图像帧中的上述第一位置区域中心点的移动量。
在一些实施例中,上述相对位置变化信息包括:上述在后图像帧中的上述第一位置区域在水平方向上较上述检测图像帧中的上述第一位置区域的变化量、上述在后图像帧中的上述第一位置区域在竖直方向上较上述检测图像帧中的上述第一位置区域的变化量。
在一些实施例中,上述至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域,包括:根据所述第一位置区域、所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
在一些实施例中,上述至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息,包括:
根据所述至少一在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,和所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,预测预测所述至少一目标物体在 所述至少一在后图像帧中的运动信息;其中,所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在水平方向的移动量确定;所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在竖直方向的移动量确定。
在一些实施例中,上述至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息,包括:根据所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息;
其中,所述在后图像帧中所述第一位置区域在水平方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在水平方向的变化量确定;所述在后图像帧中所述第一位置区域在竖直方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在竖直方向的变化量确定。
在一些实施例中,上述至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域,包括:将所述第一位置区域作为所述至少一目标物体在所述至少一在后图像帧中的第二位置区域;根据所述在后图像帧在所述第一位置区域中的目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息,更新所述第二位置区域,得到所述至少一目标物体在所述至少一在后图像帧中的位置区域。
在一些实施例中,上述方法还包括:响应于所述至少一目标物体在所述待检测的视频或所述视频子段中的至少一图像帧中的位置区域确定完成,提取所述至少一目标物体在所述待检测的视频或所述视频子段的至少一图像帧中的位置区域中的第三特征;根据提取的第三特征,分别确定所述至少一图像帧中的目标物体的类别。
在一些实施例中,每个上述待检测的视频或每一上述视频子段包括n个时序连续的图像帧,n>1,且n为整数;以及上述所述提取所述至少一目标物体在所述待检测的视频或所述视频子段的至少一图像帧中的位置区域中的第三特征,包括:按照时序顺序提取上述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
在一些实施例中,上述根据提取的第三特征,分别确定所述至少一图像帧中的目标物体的类别,包括:根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一图像帧的第三特征的解码结果;根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
在一些实施例中,上述根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一图像帧的第三特征的解码结果,包括:按照时序倒序,对上述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至上述n个图像帧的第三特征解码完成。
根据本申请实施例的另一方面,提供了一种用于检测视频中物体的方法,上述方法包括:确定至少一目标物体在待检测的视频或所述视频子段包括的至少一图像帧中的位置区域;提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征;根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别。
在一些实施例中,所述待检测的视频或所述视频子段包括n个时序连续的图像帧,n>1,且 n为整数;所述提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征,包括:按照时序顺序提取上述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
在一些实施例中,上述根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别,包括:根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一个图像帧的第三特征的解码结果;根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
在一些实施例中,上述根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一个图像帧的第三特征的解码结果,包括:按照时序倒序,对上述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至上述n个图像帧的第三特征解码完成。
根据本申请实施例的又一方面,提供了一种用于检测视频中物体的装置,上述装置包括:检测图像帧确定单元,用于确定待检测的视频中至少一图像帧为检测图像帧;第一位置区域确定单元,用于获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域;特征提取单元,用于分别提取各上述检测图像帧中各上述第一位置区域的第一特征和上述视频中相对各上述检测图像帧时序连续的至少一在后图像帧在各上述第一位置区域的第二特征;运动信息预测单元,用于根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息;位置区域确定单元,用于至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
在一些实施例中,上述检测图像帧确定单元用于:将上述待检测的视频的第一图像帧作为上述检测图像帧。
在一些实施例中,上述检测图像帧确定单元用于:将上述待检测的视频的任一关键帧作为上述检测图像帧。
在一些实施例中,上述检测图像帧确定单元用于:将所述待检测的视频中至少一已知所述至少一目标物体的位置区域的图像帧作为所述检测图像帧。
在一些实施例中,上述待检测的视频包括多个时序连续的视频子段,至少两个时序相邻的视频子段包括至少一共同图像帧;上述检测图像帧确定单元用于:将上述至少一共同图像帧作为上述检测图像帧。
在一些实施例中,每一上述视频子段中包括时序连续的m个图像帧;上述检测图像帧确定单元用于:将时序在前的m-1个图像帧作为上述检测图像帧。
在一些实施例中,上述第一位置区域确定单元用于:在所述检测图像帧中标注所述至少一目标物体对应的第一位置区域。
在一些实施例中,上述第一位置区域确定单元用于:根据所述检测图像帧中已知的所述至少一目标物体的位置区域确定所述第一位置区域。
在一些实施例中,上述第一位置区域确定单元用于:根据任两个时序相邻的视频子段中时序在前的视频子段中所述至少一共同图像帧中所述至少一目标物体的位置区域,确定时序在后的视频子段中所述检测图像帧的第一位置区域。
在一些实施例中,上述第一位置区域确定单元用于:根据所述至少一目标物体在任一所述检测图像帧中的第一特征及在任一在后图像帧中的第二特征,预测所述至少一目标物体在所述任一在后图像帧中的运动信息。
在一些实施例中,上述运动信息预测单元用于:根据所述至少一目标物体在任一所述检测图像帧中的第一特征及在任一在后图像帧中的第二特征,预测所述至少一目标物体在所述任一在后图像帧中的运动信息。
在一些实施例中,上述运动信息预测单元用于:对于每个视频子段,根据时序在前的m-1个图像帧的第一特征、与所述第一特征对应的第一预设权重以及时序在后的第m个图像帧的第二特征、与所述第二特征对应的第二预设权重,预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,m为整数,且m>1。
在一些实施例中,上述运动信息预测单元用于:根据提取的所述第一特征和所述第二特征,利用预先训练的第一神经网络预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,其中,所述预先训练的第一神经网络的权重矩阵包括所述第一预设权重以及所述第二预设权重。
在一些实施例中,响应于m大于2,所述预先训练的第一神经网络通过以下第一训练模块得到,所述第一训练模块用于:将预先训练的第二神经网络的权重矩阵分为第三权重和第四权重;将所述第三权重确定为所述m个图像帧中的第1个图像帧的的特征所述第一预设权重的初始值;将第四权重确定为第t个图像帧的特征的所述第二预设权重的初始值,其中,2≤t≤m,且t为正整数;上述预先训练的第二神经网络通过第二训练模块得到,上述第二训练模块用于:分别提取已标注的训练用视频中时序相邻的两个样本图像帧中上述目标物体的特征;根据提取的特征预测上述目标物体在时序在后的样本图像帧中的运动信息;根据上述运动信息的预测结果和上述训练用视频的标注信息,调整第二神经网络的权重矩阵,直至满足上述第二神经网络预定的训练完成条件。
在一些实施例中,上述运动信息预测单元包括:相对变化信息确定模块,用于根据所述第一特征和所述第二特征,确定所述至少一在后图像帧在所述第一位置区域中的所述至少一目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息;预测模块,用于至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息。
在一些实施例中,上述相对位置变化信息包括:上述在后图像帧中的上述第一位置区域中心点在水平方向上较上述检测图像帧中的上述第一位置区域中心点的移动量、上述在后图像帧中的上述第一位置区域中心点在竖直方向上较上述检测图像帧中的上述第一位置区域中心点的移动量。
在一些实施例中,上述相对位置变化信息包括:上述在后图像帧中的上述第一位置区域在水平方向上较上述检测图像帧中的上述第一位置区域的变化量、上述在后图像帧中的上述第一位置区域在竖直方向上较上述检测图像帧中的上述第一位置区域的变化量。
在一些实施例中,上述位置区域确定单元包括:位置区域确定模块,用于根据所述第一位置区域、所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
在一些实施例中,上述预测模块用于:根据所述至少一在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,和所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,预测预测所述至少一目标物体在所述至少一在后图像帧中的运动信息;其中,所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在水平方向的移动量确定;所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在竖直方向的移动量确定。
在一些实施例中,上述预测模块用于:根据所述在后图像帧中的所述第一位置区域在水平方 向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息;其中,所述在后图像帧中所述第一位置区域在水平方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在水平方向的变化量确定;所述在后图像帧中所述第一位置区域在竖直方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在竖直方向的变化量确定。
在一些实施例中,上述位置区域确定单元用于:将所述第一位置区域作为所述至少一目标物体在所述至少一在后图像帧中的第二位置区域;根据所述在后图像帧在所述第一位置区域中的目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息,更新所述第二位置区域,得到所述至少一目标物体在所述至少一在后图像帧中的位置区域。
在一些实施例中,上述装置还包括:第三特征提取单元,用于响应于所述至少一目标物体在所述待检测的视频或所述视频子段中的至少一图像帧中的位置区域确定完成,提取所述至少一目标物体在所述待检测的视频或所述视频子段的至少一图像帧中的位置区域中的第三特征;类别确定单元,用于根据提取的第三特征,分别确定所述至少一图像帧中的目标物体的类别。
在一些实施例中,每个上述待检测的视频或每一上述视频子段包括n个时序连续的图像帧,n>1,且n为整数;以及上述第三特征提取单元用于:按照时序顺序提取上述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
在一些实施例中,上述类别确定单元包括:解码结果确定模块,用于根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一图像帧的第三特征的解码结果;类别确定模块,用于根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
在一些实施例中,上述解码结果确定模块用于:按照时序倒序,对上述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至上述n个图像帧的第三特征解码完成。
根据本申请实施例的再一方面,提供了一种用于检测视频中物体的装置,上述装置包括:第二位置区域确定单元,用于确定至少一目标物体在待检测的视频或所述视频子段包括的至少一图像帧中的位置区域;第一特征提取单元,用于提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征;第一类别确定单元,用于根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别。
在一些实施例中,所述待检测的视频或所述视频子段包括n个时序连续的图像帧,n>1,且n为整数;上述第一特征提取单元用于:按照时序顺序提取上述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
在一些实施例中,上述第一类别确定单元包括:第一解码结果确定模块,用于根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一个图像帧的第三特征的解码结果;第一类别确定模块,用于根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
在一些实施例中,上述第一解码结果确定模块用于:按照时序倒序,对上述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至上述n个图像帧的第三特征解码完成。
根据本申请实施例的再一方面,提供了一种电子设备,包括:处理器和存储器;存储器,用于存储至少一可执行指令,所述可执行指令使所述处理器执行权利要求1~29任一项所述方法对应的操作。
根据本申请实施例的再一方面,提供了一种电子设备,包括:存储器,存储可执行指令;一个或多个处理器,与存储器通信以执行可执行指令从而完成以下操作:确定至少一目标物体在待检测的视频包括的各图像帧中的位置区域;提取各上述目标物体在上述待检测的视频或上述视频子段的各图像帧中的位置区域中的第三特征;根据提取的各第三特征,分别确定各图像帧中的目标物体的类别。
根据本申请实施例的再一方面,提供了一种计算机程序,包括计算机可读代码,当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现本申请任一实施例所述方法中各步骤的指令。
根据本申请实施例的再一方面,提供了一种计算机可读存储介质,用于存储计算机可读取的指令,所述指令被执行时实现本申请任一实施例所述方法中各步骤的操作。
本申请实施例提供的用于检测视频中物体的方法和装置,首先确定待检测的视频中的一个或多个图像帧为检测图像帧,然后获取检测图像帧中包含的各个目标物体对应的第一位置区域,再分别提取各个检测图像帧在上述第一位置区域的第一特征和与各检测图像帧时序连续的一个或多个在后图像帧在上述第一位置区域的第二特征,根据提取的各第一特征和各第二特征,预测上述各个目标物体在各在后图像帧中的运动信息,最后根据上述第一位置区域和预测结果,确定各个目标物体在各在后图像帧中的位置区域。这样,通过确定各个目标物体在待检测的视频中各个图像帧中的位置区域,就可以实现对视频中的目标物体的检测,有效地提高了计算效率。
下面通过附图和实施例,对本发明的技术方案做进一步的详细描述。
构成说明书的一部分的附图描述了本发明的实施例,并且连同描述一起用于解释本发明的原理。通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本申请实施例的其它特征、目的和优点将会变得更明显:
图1是根据本申请的用于检测视频中物体的方法的一个实施例的流程图;
图1a是本申请实施例的用于检测视频中物体的方法的检测结果与现有技术的检测结果的对比示意图;
图2是根据本申请的用于检测视频中物体的方法的另一个实施例的流程图;
图2a是图2所示实施例中利用四维权重矩阵初始化16维权重矩阵的示意图;
图2b是图2所示实施例中利用5帧预测模型构建20帧预测模型的示意图;
图3是根据本申请的用于检测视频中物体的方法的又一个实施例的流程图;
图4是根据本申请的用于检测视频中物体的方法的又一个实施例的流程图;
图4a是图4所示流程对应的工作关系示意图;
图5是根据本申请的用于检测视频中物体的装置的一个实施例的结构示意图;
图6是根据本申请的用于检测视频中物体的装置的另一个实施例的结构示意图;
图7是适于用来实现本申请实施例的终端设备或服务器的计算机系统的结构示意图。
下面结合附图和实施例对本申请实施例作进一步的详细说明。可以理解的是,此处所描述的实施例仅仅用于解释相关发明,而非对该发明的限定。应注意到:除非另外说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本发明的范围。
另外还需要说明的是,为了便于描述,附图中仅示出了与有关发明相关的部分。同时,应当明白,为了便于描述,附图中所示出的各个部分的尺寸并不是按照实际的比例关系绘制的。
对于相关领域普通技术人员已知的技术、方法和设备可能不作详细讨论,但在适当情况下,所述技术、方法和设备应当被视为说明书的一部分。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步讨论。
本发明实施例可以应用于终端设备、计算机系统、服务器等电子设备,其可与众多其它通用或专用计算系统环境或配置一起操作。适于与终端设备、计算机系统、服务器等电子设备一起使用的众所周知的终端设备、计算系统、环境和/或配置的例子包括但不限于:个人计算机系统、服务器计算机系统、瘦客户机、厚客户机、手持或膝上设备、基于微处理器的系统、机顶盒、可编程消费电子产品、网络个人电脑、小型计算机系统、大型计算机系统和包括上述任何系统的分布式云计算技术环境,等等。
终端设备、计算机系统、服务器等电子设备可以在由计算机系统执行的计算机系统可执行指令(诸如程序模块)的一般语境下描述。通常,程序模块可以包括例程、程序、目标程序、组件、逻辑、数据结构等等,它们执行特定的任务或者实现特定的抽象数据类型。计算机系统/服务器可以在分布式云计算环境中实施,分布式云计算环境中,任务是由通过通信网络链接的远程处理设备执行的。在分布式云计算环境中,程序模块可以位于包括存储设备的本地或远程计算系统存储介质上。
需要说明的是,在不冲突的情况下,本申请中的各实施例及各实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本申请实施例。
参考图1,示出了根据本申请的用于检测视频中的物体的方法的一个实施例的流程100。本实施例的用于检测视频中的物体的方法,包括以下步骤:
步骤101,确定待检测的视频中至少一图像帧为检测图像帧,获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域。
在本申请各实施例中,待检测的视频中可以包括多个时序连续的图像帧,本申请各实施例的方法运行于其上的电子设备(如终端或服务器)可以确定上述待检测的视频中的一个或多个图像帧为检测图像帧。上述检测图像帧为一个时,其可以包含多个目标物体,且上述多个目标物体可以为同一种类的目标物体,也可以为不同种类的目标物体。上述检测图像帧为多个时,各检测图像帧之间可以是时序连续的,也可以是时序离散的。并且,各检测图像帧所包含的目标物体的数量和/或种类可以相同,也可以不同。上述目标物体可以是预设的各种类别的物体,例如可以包括飞机、自行车、汽车等各种交通工具,还可以包括鸟类、狗、狮子等各种动物。
在确定了检测图像帧后,可以利用各种图像处理方法来获取各检测图像帧中包含的各目标物体对应的第一位置区域,例如可以利用静态区域提议方法对各检测图像帧进行检测。
在一个可选示例中,该步骤101可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的检测图像帧确定单元501和第一位置区域确定单元502执行。
步骤102,分别提取各检测图像帧中各第一位置区域的第一特征和上述视频中相对各检测图像帧时序连续的至少一在后图像帧在各第一位置区域的第二特征。
在确定了各检测图像帧后,需要同时确定与各检测图像帧时序连续的至少一在后图像帧。这样,如果各检测图像帧为时序连续的,其与至少一在后图像帧结合,仍然为时序连续的一组图像;如果各检测图像帧为时序离散的,每个检测图像帧后都存在至少一在后图像帧,则上述待检测的视频中包括多个离散的图像组合,每个图像组合包括至少两个图像帧。
由于在待检测的视频中,时序相邻的两图像帧间的时间间隔很小,目标物体在此时序相邻的两图像帧中的位置区域也很近,从而能够更容易地在时序连续的多个图像帧中预测目标物体的位置区域,提高预测的准确性。而对于离散的多个检测图像帧,由于各检测图像帧之间的时间间隔较大,避免了时序连续的多个检测图像帧由于位置区域相近造成的检测资源浪费的现象,提高了有效检测率。
本实施例中,在确定了检测图像帧和至少一在后图像帧后,可以分别提取各检测图像帧在上述第一位置区域的第一特征和各在后图像帧在上述第一位置区域的第二特征。在提取上述第一特征和第二特征时,例如可以利用卷积神经网络的卷积层来实现。
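作为示意,下面给出一种提取位置区域特征的代码草图(基于PyTorch,骨干网络的选择、输出尺寸与spatial_scale等均为示例性假设,并非本申请实施例限定的实现):

```python
import torch
import torchvision

# 用卷积层提取整帧特征,再在第一位置区域上取区域特征(骨干网络为示例假设)
backbone = torchvision.models.resnet18(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # 去掉池化层和全连接层

def extract_region_features(frame, boxes):
    """frame: [3, H, W] 图像张量; boxes: [K, 4] 各第一位置区域 (x1, y1, x2, y2)"""
    feat_map = feature_extractor(frame.unsqueeze(0))            # [1, C, H/32, W/32]
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], 1)    # 第一列为 batch 索引
    # 在各位置区域上做 RoIAlign,得到固定尺寸的区域特征
    return torchvision.ops.roi_align(feat_map, rois, output_size=(7, 7),
                                     spatial_scale=1 / 32.0)    # [K, C, 7, 7]

# 检测图像帧的第一特征与在后图像帧的第二特征可用同一方式分别提取
```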
步骤103,根据提取的各第一特征和各第二特征,预测上述至少一目标物体分别在至少一在后图像帧中的运动信息。
在提取了检测图像帧的各第一特征和各在后图像帧的各第二特征后,可以利用提取的各第一特征和各第二特征来预测上述至少一目标物体分别在至少一在后图像帧中的运动信息。上述运动信息可以包括但不限于以下至少之一:各目标物体的运动趋势、相对于检测图像帧移动的距离等信息。
在一个可选示例中,该步骤103可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的运动信息预测单元504执行。
步骤104,至少根据上述至少一目标物体在上述至少一检测图像帧中的第一位置区域及上述至少一目标物体在至少一在后图像帧中的运动信息的预测结果,确定上述至少一目标物体在至少一在后图像帧中的位置区域。
本实施例中,根据目标物体在至少一检测图像帧中的第一位置区域以及其在至少一在后图像帧中的运动信息的预测结果,可以确定各目标物体在至少一在后图像帧中的位置区域。在确定了各目标物体在上述至少一检测图像帧和至少一在后图像帧中的位置区域后,可基于获取的位置区域进行进一步的应用,例如可以根据位置区域实现对各目标物体的检测。
在一个可选示例中,该步骤104可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的位置区域确定单元505执行。
可以理解的是,在各目标物体在各图像帧中的位置区域确定完成后,时序连续的各图像帧的位置区域连通可以形成贯穿于整个待检测的视频或视频子段的管状区域,此管状区域中既包含了目标物体的运动位置的信息,又包含了目标物体在每个图像帧中运动的时间信息,即各图像帧中的运动信息具有时间相关性。
本申请的上述实施例提供的用于检测视频中物体的方法,既能够保留目标物体的运动的时间相关性,又能保证视频中物体检测的多样性。参见图1a,图1a中示出了4行图像,其中(a)行为待检测的视频中的原始图像帧;(b)行为利用静态区域提议方法得到的检测结果;(c)行为利用以物体的准确位置为目标的回归方法得到的检测结果;(d)行为利用本申请实施例的用于检测视频中物体的方法得到的检测结果,可知该检测结果中既保留了检测的多样性,又保留了时间相关性。
本申请的上述实施例提供的用于检测视频中物体的方法,首先确定待检测的视频中的一个或多个图像帧为检测图像帧,然后获取检测图像帧中包含的各个目标物体对应的第一位置区域,再分别提取各个检测图像帧在上述第一位置区域的第一特征和与各检测图像帧时序连续的一个或多个在后图像帧在上述第一位置区域的第二特征,根据提取的第一特征和第二特征,预测上述至少一目标物体在至少一在后图像帧中的运动信息,最后根据上述第一位置区域和预测结果,确定上述至少一目标物体在至少一在后图像帧中的位置区域。这样,通过预测各目标物体在上述至少一在后图像帧中的运动信息,并在上述运动信息预测完成后确定各个目标物体在待检测的视频中各个图像帧中的位置区域,就可以实现对视频中的目标物体的检测,在提高了计算效率的同时,保留了目标物体运动的时间信息,同时保证了检测结果的多样性。
在本实施例的一些可选的实现方式中,在上述步骤101中,可以通过以下方式来确定待检测的视频中至少一图像帧为检测图像帧:将待检测的视频的第一图像帧作为检测图像帧。
在本实现方式中,可以将待检测的视频中的第一个图像帧作为检测图像帧,这样可以依次对待检测的视频中的各图像帧中的目标物体进行检测,既可以保证检测的全面性,又不会降低检测的准确性。
在本实施例的一些可选的实现方式中,在上述步骤101中,还可以通过以下方式来确定待检测的视频中至少一图像帧为检测图像帧:将待检测的视频的任一关键帧作为检测图像帧。
本实现方式中,上述关键帧可以是某一类目标物体第一次出现的图像帧,可以是目标物体最完整的图像帧(此处的完整是指目标物体的整体全部出现在图像帧中),也可以是出现目标物体数量最多的图像帧,还可以是出现目标物体种类最多的图像帧。可以理解的是,可以对待检测的视频中的各图像帧进行遍历,确定目标物体的数量和/或种类和/或完整性,从而确定各图像帧中关键帧的位置和数量。
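例如,在已有每帧检测结果的前提下,可按目标数量或种类选取关键帧,以下为一个最小化的示意草图(detections 的来源与结构均为假设):

```python
def select_key_frames(detections):
    """detections: 每帧的检测结果列表,每项形如 [(类别, 置信度), ...](结构为示例假设)"""
    best_by_count = max(range(len(detections)), key=lambda i: len(detections[i]))
    best_by_class = max(range(len(detections)),
                        key=lambda i: len({cls for cls, _ in detections[i]}))
    # 分别返回目标数量最多、目标种类最多的帧索引,可按需要取其一作为关键帧
    return best_by_count, best_by_class
```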
在本实施例的一些可选的实现方式中,在上述步骤101中,还可以通过以下方式来确定待检测的视频中至少一图像帧为检测图像帧:将待检测的视频中至少一已知上述至少一目标物体的位置区域的图像帧作为检测图像帧。
本实现方式中,如果待检测的视频中存在一个或多个图像帧,并且已知该一个或多个图像帧中各目标物体所在的位置区域,则将此一个或多个图像帧作为检测图像帧。这样,无需再对检测图像帧中的目标物体进行检测,可以进一步提高计算效率。
在本实施例的一些可选的实现方式中,可以将上述待检测的视频分为多个时序连续的视频子段,并且定义至少两个时序相邻的视频子段共有至少一个图像帧。则上述步骤101中,还可以通过以下方式来确定待检测的视频中至少一图像帧为检测图像帧:将上述共有的至少一个图像帧作为检测图像帧。
本实现方式中,对于两个时序相邻的视频子段,如果时序在前的视频子段中的各图像帧中的目标物体的位置区域都已确定,对于时序在后的视频子段来说,选择共有的图像帧作为检测图像帧,无需再对检测图像帧中的目标物体进行检测,可以进一步提高计算效率。
在本实施例的一些可选的实现方式中,可以定义每一上述视频子段包括m个图像帧,则上述步骤101中,还可以通过以下方式来确定待检测的视频中至少一图像帧为检测图像帧:将时序在前的m-1个图像帧作为检测图像帧。
本实现方式中,可以将每个视频子段的前m-1个图像帧作为检测图像帧,结合最后一个图像帧即第m个图像帧中的特征来预测第m个图像帧中目标物体的位置区域。这样,可以提高检测的准确性。
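上述相邻子段共享一帧、前m-1帧作为检测图像帧的组织方式可用如下草图示意(切分步长按上文推断,帧数假设与子段长度恰好匹配):

```python
def split_into_subsegments(num_frames, m):
    """将帧索引 0..num_frames-1 切成长度为 m 的子段,相邻子段共享 1 帧"""
    segments = []
    start = 0
    while start + m <= num_frames:
        seg = list(range(start, start + m))
        segments.append({"detect": seg[:-1], "target": seg[-1]})  # 前 m-1 帧为检测帧
        start += m - 1  # 子段最后一帧与下一子段的第一帧重合
    return segments

# 例如 num_frames=21, m=5 时得到 5 个子段,对应正文中串联多个 5 帧模型的情形
```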
在本实施例的一些可选的实现方式中,上述步骤101中可以通过以下方式实现获取检测图像帧所包含的至少一目标物体对应的第一位置区域:在检测图像帧中标注上述至少一目标物体对应的第一位置区域。
本实现方式中,在确定了各检测图像帧后,可以对上述检测图像帧中包含的目标物体进行标注,通过标注的区域来确定各目标物体的第一位置区域。
在本实施例的一些可选的实现方式中,上述步骤101中可以通过以下方式实现获取检测图像帧所包含的至少一目标物体对应的第一位置区域:根据检测图像帧中已知的上述至少一目标物体的位置区域确定第一位置区域。
本实现方式中,如果待检测的视频中存在一个或多个图像帧,并且已知该一个或多个图像帧中各目标物体所在的位置区域,则可以将已知的位置区域确定为第一位置区域。
在本实施例的一些可选的实现方式中,上述步骤101中可以通过以下方式实现获取检测图像帧所包含的至少一目标物体对应的第一位置区域:根据任两个时序相邻的视频子段中时序在前的视频子段中至少一共同图像帧中上述至少一目标物体的位置区域,确定时序在后的视频子段中检测图像帧的第一位置区域。
本实现方式中,如果时序在前的视频子段中的各图像帧中的目标物体的位置区域都已确定,对于时序在后的视频子段来说,选择共有的图像帧作为检测图像帧,无需再对检测图像帧中的目标物体进行检测,可以进一步提高计算效率。
在本实施例的一些可选的实现方式中,上述步骤101中可以通过以下方式实现获取检测图像帧所包含的至少一目标物体对应的第一位置区域:根据上述至少一目标物体在检测图像帧中的位置的外接矩形区域或外接轮廓区域,确定第一位置区域。
本实现方式中,在对检测图像帧中的目标物体进行标注时,可以采用但不限于目标物体所在位置的外接矩形或其它外接轮廓对目标物体进行标注,则此时可以确定上述外接矩形区域或外接轮廓区域为第一位置区域。在利用外接矩形对目标物体进行标注时,可以采用但不限于目标物体的最小外接矩形对目标物体进行标注。
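在采用外接矩形标注时,可由目标掩码直接计算轴对齐的外接矩形,示意如下(掩码的来源为假设,旋转的最小外接矩形等变体此处从略):

```python
import numpy as np

def min_bounding_rect(mask):
    """mask: [H, W] 的 0/1 掩码,返回目标的轴对齐外接矩形 (x1, y1, x2, y2)"""
    ys, xs = np.nonzero(mask)          # 目标像素的行、列坐标
    return xs.min(), ys.min(), xs.max(), ys.max()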
继续参考图2,其示出了根据本申请的用于检测视频中物体的方法的另一个实施例的流程200。如图2所示,本实施例的用于检测视频中物体的方法中在预测目标物体在各在后图像帧中的运动信息时,可以通过以下步骤来实现:
步骤201,提取每一视频子段中时序在前的m-1个图像帧在各第一位置区域的第一特征和时序在后的第m个图像帧在各第一位置区域的第二特征。
本实施例中,定义每一视频子段中包括m个图像帧,并将时序在前的m-1个图像帧作为检测图像帧,将第m个图像帧作为在后图像帧。在提取特征时,分别提取时序在前的m-1个图像帧在各第一位置区域的第一特征以及时序在后的第m个图像帧在各第一位置区域的第二特征。
在一个可选示例中,该步骤201可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的特征提取单元503执行。
步骤202,对于每个视频子段,根据提取的第一特征、与各第一特征对应的第一预设权重以及提取的第二特征、与上述第二特征对应的第二预设权重,预测上述至少一目标物体在时序在后的第m个图像帧中的运动信息。
其中,m为大于1的整数。
在得到上述第一特征和第二特征后,可基于第一预设权重对各第一特征进行加权处理,基于第二预设权重对第二特征进行加权处理。
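此处的加权处理可用如下线性加权草图示意(线性求和只是对权重矩阵作用方式的一种简化假设,特征与权重的形状均为示例):

```python
import torch

def fuse_features(first_feats, w1, second_feat, w2):
    """first_feats: [m-1, D] 各检测帧的第一特征; second_feat: [D] 第 m 帧的第二特征
    w1: [m-1] 第一预设权重; w2: 标量第二预设权重(形状均为示例假设)"""
    fused = (w1.unsqueeze(1) * first_feats).sum(0) + w2 * second_feat
    return fused  # 融合特征,后续可经回归层输出运动信息
```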
在一个可选示例中,该步骤202可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的运动信息预测单元504执行。
在本实施例的一些可选的实现方式中,在利用上述各第一特征和第二特征进行运动信息的预测时,可以利用预先训练的第一神经网络来预测,上述预先训练的第一神经网络的网络参数包括权重矩阵,该权重矩阵包括上述第一预设权重和第二预设权重。
在本实施例的一些可选的实现方式中,上述预先训练的第一神经网络由图2中未示出的以下训练步骤得到:
将预先训练的第二神经网络的权重矩阵分为第三权重和第四权重;将第三权重确定为m个图像帧中的第1个图像帧的特征的第一预设权重的初始值;将第四权重确定为第t个图像帧的特征的第二预设权重的初始值,其中,2≤t≤m,且m和t均为正整数。
当待检测的视频的时间窗包括多个图像帧或视频子段包括多个图像帧时,利用上述第三权重初始化时序连续的m个图像帧中的第一个图像帧的第一特征的权重,利用上述第四权重分别初始化时序连续的m个图像帧中的第2~第m个图像帧的第二特征的权重,即为上述第一预设权重和上述第二预设权重设置初始值,得到的初始第一神经网络的权重矩阵中包括上述第一预设权重的初始值和第二预设权重的初始值。通过训练上述初始第一神经网络,上述权重由初始值调整为第一预设权重和第二预设权重,同时得到带有新的(m-1)²维权重矩阵的第一神经网络,就可以同时预测第2~第m个图像帧中目标物体的运动信息,有效地提高了运算效率。
以待检测的视频子段分别包括2个图像帧和5个图像帧的情形为例,参见图2a和图2b。图2a中,用于检测包括2个图像帧的视频子段的第二神经网络(也可称为2帧预测模型,如第二卷积神经网络)的权重矩阵包括分别对应2个图像帧所提取特征的两个权重部分:权重A(对应上述第三权重)和权重B(对应上述第四权重)。第二神经网络可以结合时序在前的一帧图像的第一特征和时序在后的一帧图像的第二特征以及上述权重A和权重B,来预测时序在后的一帧图像中的目标物体的运动信息。为了提高运算效率,可以利用第二神经网络的权重矩阵包含的两个权重部分构建用于检测包括多个图像帧的视频子段的第一神经网络(如第一卷积神经网络)的权重矩阵。图2a中,右侧虚线框中为用于检测包括5个图像帧的视频子段的第一神经网络(也可称为5帧预测模型,如第一卷积神经网络)的权重矩阵,其中权重A为时序连续的5个图像帧中的第1个图像帧的特征的权重的初始值,权重B分别为时序连续的5个图像帧中的第2个图像帧、第3个图像帧、第4个图像帧以及第5个图像帧的特征的权重的初始值。基于已标注的训练用视频训练第一神经网络,根据每次训练过程中的检测结果反复调整第一神经网络的权重矩阵,直至满足训练完成条件,训练完成的权重矩阵即包括上述第一预设权重和第二预设权重。带有上述权重矩阵的训练后的第一神经网络可以同时预测目标物体在第2个图像帧、第3个图像帧、第4个图像帧以及第5个图像帧中的位置区域。由此,上述2帧预测模型可以结合第1个图像帧和第2个图像帧的特征,预测目标物体在第2个图像帧中的位置区域;上述5帧预测模型可以结合第1~第5个图像帧的特征,预测目标物体分别在第2~5个图像帧中的位置区域。该方案有利于提高神经网络模型的训练速度,提高运算效率。
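用2帧预测模型的权重A、B为m帧预测模型设置初始权重的过程,可示意如下(权重块的张量形状与拼接方式均为示例性假设):

```python
import torch

def init_m_frame_weights(weight_a, weight_b, m):
    """weight_a: 第1帧特征对应的权重块; weight_b: 在后帧特征对应的权重块(假设二者形状相同)
    返回 m 帧模型权重矩阵的初始值:第1帧用 A 初始化,第2~m帧均用 B 初始化"""
    blocks = [weight_a.clone()] + [weight_b.clone() for _ in range(m - 1)]
    return torch.stack(blocks)  # [m, ...] 作为第一神经网络权重矩阵的初始值,训练中再调整
```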
为了进一步地提高运算效率,可以利用上述5帧预测模型构建长度更长的预测模型,以同时预测目标物体在更多个图像帧中的位置区域。如图2b所示,可以利用5个上述5帧预测模型构建20帧预测模型,由于每个5帧预测模型的最后一个图像帧用于作为下一个5帧预测模型的第1个图像帧,因此,5个上述5帧预测模型可以构建20帧预测模型,等等。
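串联短预测模型得到长预测模型的过程可示意如下,其中 predict_segment 为假设的 m 帧预测接口,输入子段图像帧和子段首帧的位置区域,返回第2~m帧的位置区域:

```python
def predict_long_sequence(frames, init_boxes, predict_segment, m=5):
    """frames: 时序连续的图像帧列表; init_boxes: 第1帧中各目标的位置区域
    每个 m 帧子段的最后一帧作为下一子段的第1帧(检测帧),其位置区域直接复用"""
    boxes = init_boxes
    all_boxes = [boxes]
    for start in range(0, len(frames) - m + 1, m - 1):  # 假设帧数与子段长度匹配
        seg = frames[start:start + m]
        seg_boxes = predict_segment(seg, boxes)   # 子段内第2~m帧的位置区域
        all_boxes.extend(seg_boxes)
        boxes = seg_boxes[-1]                     # 末帧位置区域作为下一子段的起点
    return all_boxes
```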
可以理解的是,上述过程是一个初始化的过程,实际在应用上述第一神经网络时,需要首先对第二神经网络进行训练,则预先训练的第二神经网络通过以下训练步骤得到:
分别提取已标注的训练用视频中时序相邻的两个样本图像帧中目标物体的特征;根据提取的特征预测目标物体在时序在后的样本图像帧中的运动信息;根据运动信息的预测结果和训练用视频的标注信息,调整第二神经网络的权重矩阵,直至满足第二神经网络预定的训练完成条件。
在训练上述第二神经网络时,先获取已标注的训练用视频,可以理解的是,上述训练用视频包括多个图像帧,每个图像帧中的目标物体均已被标注,这样每个图像帧可以作为样本图像帧。然后提取上述训练用视频中时序相邻的两个样本图像帧中目标物体的特征,根据提取的特征预测目标物体在时序在后的样本图像帧中的运动信息,可以理解的是,根据此处的运动信息就可以确定目标物体在时序在后的样本图像帧中的位置区域,将此位置区域与已标注的位置区域同时输入第二神经网络,并调整第二神经网络的参数,直到满足第二神经网络的训练完成条件。上述训练完成条件可以是任何可以停止第二神经网络训练的条件,例如上述条件可以是预测确定的位置区域与标注的位置区域之间的误差小于预设值等等。
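第二神经网络的训练流程可概括为如下草图(损失函数选用 smooth L1、以误差阈值作为训练完成条件,均为示例性假设):

```python
import torch

def train_pair_model(model, optimizer, pairs, max_epochs=100, eps=1e-3):
    """pairs: [(前帧特征, 后帧特征, 标注的运动信息), ...](结构为示例假设)"""
    for epoch in range(max_epochs):
        total = 0.0
        for feat_prev, feat_next, gt_motion in pairs:
            pred = model(feat_prev, feat_next)    # 预测时序在后的样本帧中目标的运动信息
            loss = torch.nn.functional.smooth_l1_loss(pred, gt_motion)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / len(pairs) < eps:              # 预测位置与标注位置的误差小于预设值时停止
            break
```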
本申请的上述实施例提供的用于检测视频中物体的方法,在提取检测图像帧的第一特征以及在后图像帧的第二特征后,通过初始化并调整上述第一特征的权重和第二特征的权重,并结合上述调整后的权重,可以实现对上述在后图像帧中的目标物体的运动信息的更精准的预测。
继续参考图3,其示出了根据本申请的用于检测视频中物体的方法的又一个实施例的流程300。如图3所示,本实施例的用于检测视频中物体的方法中在预测目标物体在各在后图像帧中的运动信息时,可以通过以下步骤来实现:
步骤301,根据上述第一特征和第二特征,确定上述至少一在后图像帧在各第一位置区域中的上述至少一目标物体相对检测图像帧在第一位置区域中的目标物体的相对变化信息。
在提取了第一特征和第二特征后,可以利用预先训练的回归网络基于上述提取的特征,确定在后图像帧在各第一位置区域中的目标物体相对于检测图像帧在第一位置区域中的目标物体的相对位置信息。
在本实施例的一些可选的实现方式中,上述相对位置信息可以包括:在后图像帧中的第一位置区域中心点在水平方向上较检测图像帧中的第一位置区域中心点的移动量Δx、在后图像帧中的第一位置区域中心点在竖直方向上较检测图像帧中的第一位置区域中心点的移动量Δy。
当上述目标物体在检测图像帧中的第一位置区域和在在后图像帧中的第一位置区域为矩形、椭圆、圆形或其它规则的图形时,可以确定其中心点。对于同一目标物体,可以通过上述第一位置区域的中心点在水平方向上的移动量Δx确定该目标物体在水平方向上的移动距离。同理,可以通过上述第一位置区域的中心点在竖直方向上的移动量Δy确定该目标物体在竖直方向上的移动距离。
在本实施例的一些可选的实现方式中,上述相对位置信息还可以包括:在后图像帧中的第一位置区域在水平方向上较检测图像帧中的第一位置区域的变化量Δw、在后图像帧中的第一位置区域在竖直方向上较检测图像帧中的第一位置区域的变化量Δh。
本实现方式中,对于同一目标物体,可以通过确定上述第一位置区域在水平方向上的宽度变化量Δw确定该目标物体所在的位置区域在水平方向上的宽度。同理,可以通过确定上述第一位置区域在竖直方向上的高度变化量Δh确定该目标物体所在的位置区域在竖直方向上的高度。
在本实施例的一些可选的实现方式中,上述相对变化信息可以根据以下公式来确定:
Δx=(x_t-x_1)/w_1;Δy=(y_t-y_1)/h_1;Δw=log(w_t/w_1);Δh=log(h_t/h_1)。
即根据第一位置区域的中心点在水平方向上的移动量Δx以及其在检测图像帧中沿水平方向的坐标x_1和宽度w_1,确定第一位置区域的中心点在在后图像帧中沿水平方向的坐标x_t;根据第一位置区域的中心点在竖直方向的移动量Δy以及其在检测图像帧中沿竖直方向的坐标y_1和高度h_1,确定第一位置区域的中心点在在后图像帧中沿竖直方向的坐标y_t;根据第一位置区域在水平方向的宽度变化量Δw以及其在检测图像帧中沿水平方向的宽度w_1,确定第一位置区域在在后图像帧中沿水平方向的宽度w_t;根据第一位置区域在竖直方向的高度变化量Δh以及其在检测图像帧中沿竖直方向的高度h_1,确定第一位置区域在在后图像帧中沿竖直方向的高度h_t。
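上述公式与经典的边界框回归参数化一致,其编码与解码可用如下草图实现(坐标约定为中心点加宽高,仅为示意):

```python
import math

def encode_delta(box1, box_t):
    """box = (x, y, w, h),x、y 为中心点坐标;返回相对变化信息 (Δx, Δy, Δw, Δh)"""
    x1, y1, w1, h1 = box1
    xt, yt, wt, ht = box_t
    return ((xt - x1) / w1, (yt - y1) / h1,
            math.log(wt / w1), math.log(ht / h1))

def decode_delta(box1, delta):
    """由检测图像帧中的第一位置区域和相对变化信息恢复在后图像帧中的位置区域"""
    x1, y1, w1, h1 = box1
    dx, dy, dw, dh = delta
    return (x1 + dx * w1, y1 + dy * h1,
            w1 * math.exp(dw), h1 * math.exp(dh))
```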
在一个可选示例中,该步骤301可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的相对变化信息确定模块执行。
步骤302,至少根据上述至少一目标物体的相对变化信息,预测上述至少一目标物体在上述至少一在后图像帧中的运动信息。
本实施例中,可以根据得到的上述相对变化信息,来预测上述至少一目标物体在上述至少一在后图像帧中的运动信息。
在一个可选示例中,该步骤302可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的预测模块执行。
在本实施例的一些可选的实现方式中,可以根据第一位置区域的中心点在水平方向上的移动量Δx以及第一位置区域的中心点在竖直方向的移动量Δy,预测各目标物体在各在后图像帧中的运动信息。
在本实施例的一些可选的实现方式中,可以根据第一位置区域在水平方向的宽度变化量Δw以及第一位置区域在竖直方向的高度变化量Δh,预测各目标物体在各在后图像帧中的运动信息。
在本实施例的一些可选的实现方式中,在得到上述相对变化信息后,还可以通过以下方式来确定上述至少一目标物体在至少一在后图像帧中的位置区域:将上述第一位置区域作为至少一目标物体在至少一在后图像帧中的第二位置区域,根据上述相对变化信息,更新上述第二位置区域,得到上述至少一目标物体在至少一在后图像帧中的位置区域。
本实现方式中,在确定了第一位置区域在在后图像帧中沿水平方向的坐标x_t、在在后图像帧中沿竖直方向的坐标y_t、在在后图像帧中沿水平方向的宽度w_t以及在在后图像帧中沿竖直方向的高度h_t后,可以根据上述相对变化信息更新上述第二位置区域的位置,并将更新后的第二位置区域作为上述至少一目标物体在至少一在后图像帧中的位置区域。
本申请的上述实施例提供的用于检测视频中物体的方法,通过确定上述各相对变化信息,能够准确的确定上述至少一目标物体在至少一在后图像帧中的位置区域,保证了目标物体检测的准确性。
采用本申请实施例提供的任一种检测视频中物体的方法,获得视频包括的至少一图像帧的位置区域之后,可基于获得的各图像帧的位置区域进行分类、图像分割等任务的处理,本申请并不限制相应的实现手段。下文将以分类任务为例进行说明。可以理解,本申请实施例中的分类任务可采用本申请实施例提供的任一种检测视频中物体的方法来确定待检测视频中目标物体在各图像帧中的位置信息,也可采用现有技术的其他方法来检测视频中目标物体在各图像帧中的位置信息,本申请实施例对此并不限制。
参考图4,其示出了根据本申请的用于检测视频中物体的方法的又一个实施例的流程400。如图4所示,本实施例的用于检测视频中物体的方法包括以下步骤(以下步骤可以在图1所示的流程后执行,也可在采用与图1不同的方法获得视频中目标物体在各图像帧中的位置区域之后执行,本申请实施例并不限制):
步骤401,提取至少一目标物体在待检测的视频或视频子段的至少一图像帧中的位置区域中的第三特征。
本步骤可在响应于对至少一目标物体在待检测的视频或视频子段中的至少一图像帧中的位置区域确定完成时执行,相应的确定方法可采用本申请实施例提供的任一种检测方法进行,也可采用其他方法确定,如标注方式、逐图像帧静态检测方式等等,本申请实施例并不限制。
本实施例中,在目标物体在图像帧中的位置区域确定完成后,时序连续的各图像帧的位置区域连通可以形成贯穿于整个待检测的视频或视频子段的管状区域,然后可以提取上述位置区域的第三特征。可以理解的是,由于已经确定上述位置区域中包含目标物体,所以提取的第三特征为各目标物体的特征。
在本实施例的一些可选的实现方式中,设定每个待检测的视频或每个视频子段包括n个时序连续的图像帧,其中,n为大于1的整数。上述步骤401可以按照以下方式来实现:按照时序顺序提取n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
本实现方式中,按照从第1个图像帧到第n个图像帧的顺序,依次提取n个图像帧的第三特征,然后对于每个图像帧,都对该图像帧的第三特征和在该图像帧之前的各图像帧的第三特征进行编码,直到对第n个图像帧的第三特征编码完成。
在编码时,可以采用但不限于可编码的长短期记忆(Long short-term memory,LSTM)单元。其在编码时可以读入管状区域的特征,从而可以对管状区域的外观和管状区域所包含的目标物体的运动信息进行编码,得到每个图像帧的第三特征的编码信息。
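第三特征的时序编码可用LSTM按如下方式示意(特征维度D等均为示例性假设):

```python
import torch

D = 512                                  # 第三特征维度(示例假设)
encoder = torch.nn.LSTM(input_size=D, hidden_size=D, batch_first=True)

def encode_tube_features(tube_feats):
    """tube_feats: [n, D],按时序排列的 n 个图像帧在管状区域中的第三特征
    第 i 步的输出同时编码了第 1~i 帧的特征;最后一步即整个管状区域的编码结果"""
    outputs, (h_n, c_n) = encoder(tube_feats.unsqueeze(0))   # outputs: [1, n, D]
    return outputs.squeeze(0), h_n[-1, 0]                    # 各步编码结果、整段编码结果
```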
在一个可选示例中,该步骤401可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一特征提取单元602执行。
步骤402,根据提取的第三特征,分别确定上述至少一图像帧中的目标物体的类别。
根据提取的至少一目标物体的第三特征,可以确定至少一目标物体的类别。可以理解的是,不同图像帧中的目标物体的类别可以相同,也可以不相同。
在一个可选示例中,该步骤402可以由处理器调用存储器存储的相应指令执行,也可以由被处理器运行的第一类别确定单元603执行。
在本实施例的一些可选的实现方式中,上述步骤402可以进一步通过图4中未示出的以下步骤来实现:根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定上述至少一图像帧的第三特征的解码结果;根据至少一图像帧的第三特征的解码结果,分别确定至少一图像帧中的目标物体的类别。
在提取了上述至少一图像帧的第三特征,并完成了对第n个图像帧的第三特征的编码后,对上述至少一图像帧的第三特征及上述第n个图像帧的编码结果进行解码,然后根据解码结果,确定各图像帧中的目标物体的类别。
在解码时,可以采用可解码的LSTM单元实现上述解码。
在本实施例的一些可选的实现方式中,在解码时可以根据图4中未示出的以下步骤实现对上述至少一图像帧的第三特征的解码:按照时序倒序,对n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至n个图像帧的第三特征解码完成。
在解码时,按照从第n个图像帧到第1个图像帧的顺序,依次对上述至少一图像帧中的每个图像帧的第三特征的编码结果进行解码。对于每个图像帧,根据该图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定该图像帧的解码结果,直到n个图像帧的第三特征的解码完成。可以理解的是,第n个图像帧的第三特征的编码结果即为待检测的视频或视频子段的管状区域的编码结果,在对每个图像帧的第三特征的编码结果进行解码时,结合管状区域的编码结果和该图像帧的第三特征对该图像帧进行解码,得到的解码结果保留了各图像帧中的目标物体之间的时间相关性。
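时序倒序的解码与分类可示意如下(将第n帧的编码结果与各帧第三特征拼接后送入解码LSTM,网络结构与类别数均为示例性假设):

```python
import torch

D = 512
decoder_cell = torch.nn.LSTMCell(input_size=2 * D, hidden_size=D)
classifier = torch.nn.Linear(D, 30)      # 类别数 30 为示例假设

def decode_and_classify(tube_feats, tube_code):
    """tube_feats: [n, D] 各帧第三特征; tube_code: [D] 第 n 帧(整个管状区域)的编码结果"""
    h = torch.zeros(1, D)
    c = torch.zeros(1, D)
    logits = [None] * len(tube_feats)
    for j in reversed(range(len(tube_feats))):               # 从第 n 帧到第 1 帧
        inp = torch.cat([tube_feats[j], tube_code]).unsqueeze(0)
        h, c = decoder_cell(inp, (h, c))
        logits[j] = classifier(h)[0]                         # 第 j 帧中目标物体的类别打分
    return logits
```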
本实施例的用于检测视频中物体的方法,可以采用图4a所示的结构来完成,图4a是图4所示流程对应的工作关系示意图。图4a中,首先在确定了各目标物体在第1个图像帧的第一位置区域后,形成多个管状区域,然后预测各目标物体在各图像帧中的位置区域,对上述形成的管状区域进行调整。在调整完成后,提取每个图像帧在上述管状区域的特征,得到每个图像帧的特征后,按照从第1个图像帧~最后一个图像帧的顺序依次对各图像帧中的特征进行编码,然后得到整个管状区域的编码结果。在编码时,可以采用可编码的LSTM。然后对得到的编码结果进行解码,在解码时,可以结合每个图像帧在管状区域的特征以及得到的管状区域的编码结果,按照从最后一个图像帧~第1个图像帧的顺序依次对各图像帧中的特征进行解码。在解码时,可以采用可解码的LSTM。在解码后,可以根据解码结果,对视频中上述至少一图像帧包含的目标物体进行分类。
本申请的上述实施例提供的用于检测视频中物体的方法,在确定了上述至少一目标物体在至少一图像帧中的位置区域后,可以编码各位置区域的第三特征,得到至少一目标物体在整个管状区域内的综合特征;再根据解码结果实现对至少一目标物体中各目标物体的分类时,综合考虑了各目标物体在整个管状区域内的全部特征。对管状区域的编码结果进行解码时,可以按照从最后一个图像帧到第1个图像帧的顺序依次对上述至少一图像帧中各图像帧的特征进行解码,也可以按照从第1个图像帧到最后一个图像帧的顺序依次解码;但采用从最后一个图像帧到第1个图像帧的顺序进行解码,可以保证每个图像帧中各目标物体的检测类别都是根据各目标物体在整个管状区域内的全部特征确定的,从而提高对视频中物体分类的准确度。
本申请上述任一实施例提供的方法可以由任意适当的具有数据处理能力的设备执行,包括但不限于:终端设备和服务器等。或者,本申请上述任一实施例提供的方法可以由处理器执行,如处理器通过调用存储器存储的相应指令来执行本申请上述任一实施例提供的方法。下文不再赘述。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
继续参见图5,其示出了根据本申请的用于检测视频中物体的装置的结构示意图。如图5所示,本实施例的用于检测视频中物体的装置500包括:检测图像帧确定单元501、第一位置区域确定单元502、特征提取单元503、运动信息预测单元504以及位置区域确定单元505。
其中,检测图像帧确定单元501,用于确定待检测的视频中至少一图像帧为检测图像帧。
第一位置区域确定单元502,用于获取上述检测图像帧所包含的至少一目标物体对应的第一位置区域。
特征提取单元503,用于分别提取上述各检测图像帧中各第一位置区域的第一特征和视频中相对上述各检测图像帧时序连续的至少一在后图像帧在各上述第一位置区域的第二特征。
运动信息预测单元504,用于根据提取的上述第一特征和第二特征,预测上述至少一目标物体分别在上述至少一在后图像帧中的运动信息。
位置区域确定单元505,用于至少根据上述至少一目标物体在上述至少一检测图像帧中的第一位置区域及上述至少一目标物体在上述至少一在后图像帧中的运动信息的预测结果,确定上述至少一目标物体在上述至少一在后图像帧中的位置区域。
本申请的上述实施例提供的用于检测视频中物体的装置,首先确定待检测的视频中的一个或多个图像帧为检测图像帧,然后获取检测图像帧中包含的目标物体对应的第一位置区域,再分别提取上述至少一检测图像帧在第一位置区域的第一特征和与各检测图像帧时序连续的一个或多个在后图像帧在上述第一位置区域的第二特征,根据提取的第一特征和第二特征,预测上述至少一目标物体在上述至少一在后图像帧中的运动信息,最后根据上述第一位置区域和预测结果,确定至少一目标物体在上述至少一在后图像帧中的位置区域。这样,通过确定目标物体在待检测的视频中至少一图像帧中的位置区域,就可以实现对视频中的目标物体的检测,有效地提高了计算效率。
在本实施例的一些可选的实现方式中,上述检测图像帧确定单元501可用于:将上述待检测的视频的第一图像帧作为上述检测图像帧。
在本实施例的一些可选的实现方式中,上述检测图像帧确定单元501可用于:将上述待检测的视频的任一关键帧作为上述检测图像帧。
在本实施例的一些可选的实现方式中,上述检测图像帧确定单元501可用于:将上述待检测的视频中至少一已知上述至少一目标物体的位置区域的图像帧作为上述检测图像帧。
在本实施例的一些可选的实现方式中,上述待检测的视频包括多个时序连续的视频子段,至少两个时序相邻的视频子段包括至少一共同图像帧。则上述检测图像帧确定单元501可用于:将上述至少一共同图像帧作为上述检测图像帧。
在本实施例的一些可选的实现方式中,每一上述视频子段中包括时序连续的m个图像帧。则上述检测图像帧确定单元501可用于:将时序在前的m-1个图像帧作为上述检测图像帧。
在本实施例的一些可选的实现方式中,上述第一位置区域确定单元502可用于:在上述检测图像帧中标注各上述目标物体对应的第一位置区域。
在本实施例的一些可选的实现方式中,上述第一位置区域确定单元502可用于:根据上述检测图像帧中已知的各上述目标物体的位置区域确定上述第一位置区域。
在本实施例的一些可选的实现方式中,上述第一位置区域确定单元502可用于:根据任两个时序相邻的视频子段中时序在前的视频子段中上述至少一共同图像帧中上述至少一目标物体的位置区域,确定时序在后的视频子段中上述检测图像帧的第一位置区域。
在本实施例的一些可选的实现方式中,上述第一位置区域确定单元502可用于:根据各上述目标物体在上述检测图像帧中的位置的外接矩形区域或外接轮廓区域,确定上述第一位置区域。
在本实施例的一些可选的实现方式中,上述运动信息预测单元504可用于:根据各上述目标物体在任一上述检测图像帧中的第一特征及上述目标物体在任一在后图像帧中的第二特征,预测上述至少一目标物体在上述至少一在后图像帧中的运动信息。
在本实施例的一些可选的实现方式中,上述运动信息预测单元504可用于:对于每个视频子段,根据时序在前的m-1个图像帧的各第一特征、与上述各第一特征对应的第一预设权重以及时序在后的第m个图像帧的第二特征、与上述第二特征对应的第二预设权重,预测上述至少一目标物体在上述时序在后的第m个图像帧中的运动信息,m为整数,且m>1。
在本实施例的一些可选的实现方式中,上述运动信息预测单元504可用于:根据提取的上述第一特征和上述第二特征,利用预先训练的第一神经网络预测上述至少一目标物体分别在上述至少一在后图像帧中的运动信息,其中,上述预先训练的第一神经网络的权重矩阵包括上述第一预设权重以及上述第二预设权重。
在本实施例的一些可选的实现方式中,在m大于2时,上述预先训练的第一神经网络通过以下第一训练模块得到,上述第一训练模块用于:
将预先训练的第二神经网络的权重矩阵分为第三权重和第四权重;将上述第三权重确定为上述m个图像帧中的第1个图像帧的特征的上述第一预设权重的初始值;将第四权重确定为第t个图像帧的特征的上述第二预设权重的初始值,其中,2≤t≤m,且t为正整数。
上述预先训练的第二神经网络通过第二训练模块得到,上述第二训练模块用于:分别提取已标注的训练用视频中时序相邻的两个样本图像帧中上述目标物体的特征;根据提取的特征预测上述目标物体在时序在后的样本图像帧中的运动信息;根据上述运动信息的预测结果和上述训练用视频的标注信息,调整第二神经网络的权重矩阵,直至满足上述第二神经网络预定的训练完成条件。
在本实施例的一些可选的实现方式中,上述运动信息预测单元504还可以包括图5中未示出的相对变化信息确定模块和预测模块。
其中,相对变化信息确定模块,用于根据上述第一特征和上述第二特征,确定上述至少一在后图像帧在上述第一位置区域中的目标物体相对上述检测图像帧在上述第一位置区域中的目标物体的相对变化信息。
预测模块,用于至少根据上述至少一目标物体的相对变化信息,预测上述至少一目标物体在上述至少一在后图像帧中的运动信息。
在本实施例的一些可选的实现方式中,上述相对位置变化信息包括:上述在后图像帧中的上述第一位置区域中心点在水平方向上较上述检测图像帧中的上述第一位置区域中心点的移动量、上述在后图像帧中的上述第一位置区域中心点在竖直方向上较上述检测图像帧中的上述第一位置区域中心点的移动量。
在本实施例的一些可选的实现方式中,上述相对位置变化信息包括:上述在后图像帧中的上述第一位置区域在水平方向上较上述检测图像帧中的上述第一位置区域的变化量、上述在后图像帧中的上述第一位置区域在竖直方向上较上述检测图像帧中的上述第一位置区域的变化量。
在本实施例的一些可选的实现方式中,上述位置区域确定单元505还可以包括图5中未示出的位置区域确定模块,用于根据上述第一位置区域、上述在后图像帧中的上述第一位置区域中心点在水平方向上较上述检测图像帧中的上述第一位置区域中心点的移动量、上述在后图像帧中的上述第一位置区域中心点在竖直方向上较上述检测图像帧中的上述第一位置区域中心点的移动量、上述在后图像帧中的上述第一位置区域在水平方向上较上述检测图像帧中的上述第一位置区域的变化量和上述在后图像帧中的上述第一位置区域在竖直方向上较上述检测图像帧中的上述第一位置区域的变化量,确定上述目标物体在上述至少一在后图像帧中的位置区域。
在本实施例的一些可选的实现方式中,上述预测模块可用于:根据上述在后图像帧中的上述第一位置区域中心点在水平方向上较上述检测图像帧中的上述第一位置区域中心点的移动量,和上述在后图像帧中的上述第一位置区域中心点在竖直方向上较上述检测图像帧中的上述第一位置区域中心点的移动量,预测上述目标物体在上述在后图像帧中的运动信息。
其中,上述在后图像帧中的上述第一位置区域中心点在水平方向上较上述检测图像帧中的上述第一位置区域中心点的移动量根据上述在后图像帧中上述目标物体的第二特征较与其对应的上述目标物体的第一特征在水平方向的移动量确定。
上述在后图像帧中的上述第一位置区域中心点在竖直方向上较上述检测图像帧中的上述第一位置区域中心点的移动量根据上述在后图像帧中目标物体的第二特征较与其对应的上述目标物体的第一特征在竖直方向的移动量确定。
在本实施例的一些可选的实现方式中,上述预测模块可用于:根据上述在后图像帧中的上述第一位置区域在水平方向上较上述检测图像帧中的上述第一位置区域的变化量和上述在后图像帧中的上述第一位置区域在竖直方向上较上述检测图像帧中的上述第一位置区域的变化量,预测上述目标物体在上述在后图像帧中的运动信息。
其中,上述在后图像帧中上述第一位置区域在水平方向上较上述检测图像帧中上述第一位置区域的变化量根据上述在后图像帧中上述目标物体的第二特征较与其对应的目标物体的第一特征在水平方向的变化量确定。
上述在后图像帧中上述第一位置区域在竖直方向上较上述检测图像帧中上述第一位置区域的变化量根据上述在后图像帧中上述目标物体的第二特征较与其对应的目标物体的第一特征在竖直方向的变化量确定。
在本实施例的一些可选的实现方式中,上述位置区域确定单元505可用于:将上述第一位置区域作为上述目标物体在上述在后图像帧中的第二位置区域;根据上述在后图像帧在上述第一位置区域中的目标物体相对上述检测图像帧在上述第一位置区域中的目标物体的相对变化信息,更新上述第二位置区域,得到上述目标物体在上述在后图像帧中的位置区域。
在本实施例的一些可选的实现方式中,上述用于检测视频中物体的装置500还可以包括图5中未示出的第三特征提取单元和类别确定单元。
第三特征提取单元,用于响应于上述至少一目标物体在上述待检测的视频或上述视频子段中的图像帧中的位置区域确定完成,提取上述至少一目标物体在上述待检测的视频或上述视频子段的图像帧中的位置区域中的第三特征。
类别确定单元,用于根据提取的第三特征,分别确定图像帧中的目标物体的类别。
在本实施例的一些可选的实现方式中,每个上述待检测的视频或每一上述视频子段包括n个时序连续的图像帧,n>1,且n为整数。上述第三特征提取单元还可以用于:按照时序顺序提取上述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
在本实施例的一些可选的实现方式中,上述类别确定单元可包括图5中未示出的解码结果确定模块和类别确定模块。
其中,解码结果确定模块,用于根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定上述至少一图像帧的第三特征的解码结果。
类别确定模块,用于根据上述至少一图像帧的第三特征的解码结果,分别确定上述至少一图像帧中的目标物体的类别。
在本实施例的一些可选的实现方式中,上述解码结果确定模块可用于:按照时序倒序,对上述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至上述n个图像帧的第三特征解码完成。
图6示出了根据本申请实施例的用于检测视频中物体的装置的结构示意图。如图6所示,本实施例的用于检测视频中物体的装置600包括:第二位置区域确定单元601、第一特征提取单元602以及第一类别确定单元603。
其中,第二位置区域确定单元601,用于确定至少一目标物体在待检测的视频或视频子段包括的至少一图像帧中的位置区域。
第一特征提取单元602,用于提取上述至少一目标物体在上述待检测的视频或上述视频子段的至少一图像帧中的位置区域中的第三特征。
第一类别确定单元603,用于根据提取的第三特征,分别确定上述至少一图像帧中的目标物体的类别。
本申请的上述实施例提供的用于检测视频中物体的装置,在确定了目标物体在图像帧中的位置区域后,可以根据位置区域的第三特征实现对目标物体的分类,扩展了对视频中物体检测的功能。
在本实施例的一些可选的实现方式中,上述待检测的视频或视频子段包括n个时序连续的图像帧,n>1,且n为整数。则上述第一特征提取单元602可以用于:按照时序顺序提取上述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
在本实施例的一些可选的实现方式中,上述第一类别确定单元603还可以包括图6中未示出的第一解码结果确定模块以及第一类别确定模块。
其中,第一解码结果确定模块,用于根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定上述至少一图像帧的第三特征的解码结果。
第一类别确定模块,用于根据上述至少一图像帧的第三特征的解码结果,分别确定上述至少一图像帧中的目标物体的类别。
在本实施例的一些可选的实现方式中,上述第一解码结果确定模块可以用于:按照时序倒序,对上述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至上述n个图像帧的第三特征解码完成。
附图中的流程图和框图,图示了按照本申请各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本申请实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。所描述的单元也可以设置在处理器中,例如,可以描述为:一种处理器包括检测图像帧确定单元、第一位置区域确定单元、特征提取单元、运动信息预测单元及位置区域确定单元。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定,例如,检测图像帧确定单元还可以被描述为“确定待检测的视频中至少一图像帧为检测图像帧的单元”。
本申请实施例还提供了一种电子设备,例如可以是移动终端、个人计算机(PC)、平板电脑、服务器等,其包括处理器和存储器;其中:存储器,用于存储至少一可执行指令,该可执行指令使处理器执行本申请上述任一实施例所述用于检测视频中物体的方法对应的操作。
另外,本申请实施例还提供了一种计算机程序,包括计算机可读代码,当该计算机可读代码在设备上运行时,该设备中的处理器执行用于实现本申请上述任一实施例所述用于检测视频中物体的方法中各步骤的指令。
另外,本申请实施例还提供了一种计算机可读存储介质,用于存储计算机可读取的指令,该指令被执行时实现本申请上述任一所述用于检测视频中物体的方法中各步骤的操作。
下面参考图7,其示出了适于用来实现本申请实施例的终端设备或服务器的电子设备700的结构示意图:如图7所示,计算机系统700包括一个或多个处理器、通信部等,上述一个或多个处理器例如:一个或多个中央处理单元(CPU)701,和/或一个或多个图像处理器(GPU)713等,处理器可以根据存储在只读存储器(ROM)702中的可执行指令或者从存储部分708加载到随机访问存储器(RAM)703中的可执行指令而执行各种适当的动作和处理。通信部712可包括但不限于网卡,上述网卡可包括但不限于IB(Infiniband)网卡。
处理器可与ROM 702和/或RAM 703通信以执行可执行指令,通过总线704与通信部712相连、并经通信部712与其他目标设备通信,从而完成本申请实施例提供的任一用于检测视频中物体的方法对应的操作,例如,确定待检测的视频中至少一图像帧为检测图像帧;获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域;分别提取各所述检测图像帧中各所述第一位置区域的第一特征和所述视频中相对各所述检测图像帧时序连续的至少一在后图像帧在各所述第一位置区域的第二特征;根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息;至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。或者例如,确定至少一目标物体在待检测的视频或所述视频子段包括的至少一图像帧中的位置区域;提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征;根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别。
此外,在RAM 703中,还可存储有装置操作所需的各种程序和数据。CPU 701、ROM 702以及RAM 703通过总线704彼此相连。在有RAM 703的情况下,ROM 702为可选模块。RAM 703存储可执行指令,或在运行时向ROM 702中写入可执行指令,可执行指令使CPU 701执行上述通信方法对应的操作。输入/输出(I/O)接口705也连接至总线704。通信部712可以集成设置,也可以设置为具有多个子模块(例如多个IB网卡),并在总线704链接上。
以下部件连接至I/O接口705:包括键盘、鼠标等的输入部分706;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分707;包括硬盘等的存储部分708;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分709。通信部分709经由诸如因特网的网络执行通信处理。驱动器710也根据需要连接至I/O接口705。可拆卸介质711,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器710上,以便于从其上读出的计算机程序根据需要被安装入存储部分708。
需要说明的,如图7所示的架构仅为一种可选实现方式,在实践过程中,可根据实际需要对上述图7的部件数量和类型进行选择、删减、增加或替换;在不同功能部件设置上,也可采用分离设置或集成设置等实现方式,例如GPU 713和CPU 701可分离设置或者可将GPU 713集成在CPU 701上,通信部可分离设置,也可集成设置在CPU 701或GPU 713上,等等。这些可替换的实施方式均落入本申请公开的保护范围。
特别地,根据本申请的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本申请的实施例包括一种计算机程序产品,其包括有形地包含在机器可读介质上的计算机程序,计算机程序包含用于执行流程图所示的方法的程序代码,程序代码可包括与本申请实施例提供的方法步骤对应的指令,例如,确定待检测的视频中至少一图像帧为检测图像帧;获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域;分别提取各所述检测图像帧中各所述第一位置区域的第一特征和所述视频中相对各所述检测图像帧时序连续的至少一在后图像帧在各所述第一位置区域的第二特征;根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息;至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。或者例如,确定至少一目标物体在待检测的视频或所述视频子段包括的至少一图像帧中的位置区域;提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征;根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别。在这样的实施例中,该计算机程序可以通过通信部分709从网络上被下载和安装,和/或从可拆卸介质711被安装。在该计算机程序被CPU 701执行时,执行本申请实施例的方法中限定的上述功能。
可能以许多方式来实现本申请的方法和装置、设备。例如,可通过软件、硬件、固件或者软件、硬件、固件的任何组合来实现本申请的方法和装置、设备。用于方法的步骤的上述顺序仅是为了进行说明,本申请的方法的步骤不限于以上描述的顺序,除非以其它方式特别说明。此外,在一些实施例中,还可将本申请实施为记录在记录介质中的程序,这些程序包括用于实现根据本申请的方法的机器可读指令。因而,本申请还覆盖存储用于执行根据本申请的方法的程序的记录介质。
本申请的描述是为了示例和描述起见而给出的,而并不是无遗漏的或者将本申请限于所公开的形式。很多修改和变化对于本领域的普通技术人员而言是显然的。选择和描述实施例是为了更好说明本申请的原理和实际应用,并且使本领域的普通技术人员能够理解本申请从而设计适于特定用途的带有各种修改的各种实施例。
Claims (60)
- 一种用于检测视频中物体的方法,其特征在于,所述方法包括:确定待检测的视频中至少一图像帧为检测图像帧;获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域;分别提取各所述检测图像帧中各所述第一位置区域的第一特征和所述视频中相对各所述检测图像帧时序连续的至少一在后图像帧在各所述第一位置区域的第二特征;根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息;至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
- 根据权利要求1所述的方法,其特征在于,所述确定待检测的视频中至少一图像帧为检测图像帧,包括:将所述待检测的视频的第一图像帧作为所述检测图像帧。
- 根据权利要求1所述的方法,其特征在于,所述确定待检测的视频中至少一图像帧为检测图像帧,包括:将所述待检测的视频的任一关键帧作为所述检测图像帧。
- 根据权利要求1所述的方法,其特征在于,所述确定待检测的视频中至少一图像帧为检测图像帧,包括:将所述待检测的视频中至少一已知所述至少一个目标物体的位置区域的图像帧作为所述检测图像帧。
- 根据权利要求1所述的方法,其特征在于,所述待检测的视频包括多个时序连续的视频子段,至少两个时序相邻的视频子段包括至少一共同图像帧;所述确定待检测的视频中至少一图像帧为检测图像帧,包括:将所述至少一共同图像帧作为所述检测图像帧。
- 根据权利要求5所述的方法,其特征在于,每一所述视频子段中包括m个时序连续的图像帧;所述确定待检测的视频中至少一图像帧为检测图像帧,包括:将时序在前的m-1个图像帧作为所述检测图像帧。
- 根据权利要求1-6任一项所述的方法,其特征在于,所述获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:在所述检测图像帧中标注各所述目标物体对应的第一位置区域。
- 根据权利要求1-6任一项所述的方法,其特征在于,所述获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:根据所述检测图像帧中已知的所述至少一目标物体的位置区域确定所述第一位置区域。
- 根据权利要求5或6所述的方法,其特征在于,所述获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:根据任两个时序相邻的视频子段中时序在前的视频子段中所述至少一共同图像帧中所述至少一目标物体的位置区域,确定时序在后的视频子段中所述检测图像帧的第一位置区域。
- 根据权利要求1-6任一项所述的方法,其特征在于,所述获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域,包括:根据所述至少一目标物体在所述检测图像帧中的位置的外接矩形区域或外接轮廓区域,确定所述第一位置区域。
- 根据权利要求1-10任一项所述的方法,其特征在于,所述根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息,包括:根据所述至少一目标物体在任一所述检测图像帧中的第一特征及在任一在后图像帧中的第二特征,预测所述至少一目标物体在所述任一在后图像帧中的运动信息。
- 根据权利要求5-10任一项所述的方法,其特征在于,所述根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息,包括:对于每个视频子段,根据时序在前的m-1个图像帧的第一特征、与所述第一特征对应的第一预设权重以及时序在后的第m个图像帧的第二特征、与所述第二特征对应的第二预设权重,预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,m为整数,且m>1。
- 根据权利要求12所述的方法,其特征在于,所述根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息,包括:根据提取的所述第一特征和所述第二特征,利用预先训练的第一神经网络预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,其中,所述预先训练的第一神经网络的权重矩阵包括所述第一预设权重以及所述第二预设权重。
- 根据权利要求13所述的方法,其特征在于,响应于m大于2,所述预先训练的第一神经网络通过以下训练步骤得到:将预先训练的第二神经网络的权重矩阵分为第三权重和第四权重;将所述第三权重确定为所述m个图像帧中的第1个图像帧的特征的所述第一预设权重的初始值;将所述第四权重确定为第t个图像帧的特征的所述第二预设权重的初始值,其中,2≤t≤m,且t为正整数;所述预先训练的第二神经网络通过以下训练步骤得到:分别提取已标注的训练用视频中时序相邻的两个样本图像帧中所述目标物体的特征;根据提取的特征预测所述目标物体在时序在后的样本图像帧中的运动信息;根据所述运动信息的预测结果和所述训练用视频的标注信息,调整第二神经网络的权重矩阵,直至满足所述第二神经网络预定的训练完成条件。
- 根据权利要求1-14任一项所述的方法,其特征在于,所述根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息,包括:根据所述第一特征和所述第二特征,确定所述至少一在后图像帧在所述第一位置区域中的所述至少一目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息;至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息。
- 根据权利要求15所述的方法,其特征在于,所述相对位置变化信息包括:所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量。
- 根据权利要求15或16所述的方法,其特征在于,所述相对位置变化信息包括:所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量、所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量。
- 根据权利要求15-17任一项所述的方法,其特征在于,所述至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域,包括:根据所述第一位置区域、所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
- 根据权利要求16-18任一项所述的方法,其特征在于,所述至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息,包括:根据所述至少一在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,和所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息;其中,所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在水平方向的移动量确定;所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在竖直方向的移动量确定。
- 根据权利要求19所述的方法,其特征在于,所述至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息,包括:根据所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息;其中,所述在后图像帧中所述第一位置区域在水平方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在水平方向的变化量确定;所述在后图像帧中所述第一位置区域在竖直方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在竖直方向的变化量确定。
- 根据权利要求11-20任一项所述的方法,其特征在于,所述至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域,包括:将所述第一位置区域作为所述至少一目标物体在所述至少一在后图像帧中的第二位置区域;根据所述在后图像帧在所述第一位置区域中的目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息,更新所述第二位置区域,得到所述至少一目标物体在所述至少一在后图像帧中的位置区域。
- 根据权利要求1-21任一项所述的方法,其特征在于,所述方法还包括:响应于所述至少一目标物体在所述待检测的视频或所述视频子段中的至少一图像帧中的位置区域确定完成,提取所述至少一目标物体在所述待检测的视频或所述视频子段的至少一图像帧中的位置区域中的第三特征;根据提取的第三特征,分别确定所述至少一图像帧中的目标物体的类别。
- 根据权利要求22所述的方法,其特征在于,每个所述待检测的视频或每一所述视频子段包括n个时序连续的图像帧,n>1,且n为整数;以及所述提取所述至少一目标物体在所述待检测的视频或所述视频子段的至少一图像帧中的位置区域中的第三特征,包括:按照时序顺序提取所述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
- 根据权利要求23所述的方法,其特征在于,所述根据提取的第三特征,分别确定所述至少一图像帧中的目标物体的类别,包括:根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一图像帧的第三特征的解码结果;根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
- 根据权利要求24所述的方法,其特征在于,所述根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一图像帧的第三特征的解码结果,包括:按照时序倒序,对所述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至所述n个图像帧的第三特征解码完成。
- 一种用于检测视频中物体的方法,其特征在于,所述方法包括:确定至少一目标物体在待检测的视频或所述视频子段包括的至少一图像帧中的位置区域;提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征;根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别。
- 根据权利要求26所述的方法,其特征在于,所述待检测的视频或所述视频子段包括n个时序连续的图像帧,n>1,且n为整数;所述提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征,包括:按照时序顺序提取所述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
- 根据权利要求27所述的方法,其特征在于,所述根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别,包括:根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一个图像帧的第三特征的解码结果;根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
- 根据权利要求27或28所述的方法,其特征在于,所述根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一个图像帧的第三特征的解码结果,包括:按照时序倒序,对所述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至所述n个图像帧的第三特征解码完成。
- 一种用于检测视频中物体的装置,其特征在于,所述装置包括:检测图像帧确定单元,用于确定待检测的视频中至少一图像帧为检测图像帧;第一位置区域确定单元,用于获取所述检测图像帧所包含的至少一目标物体对应的第一位置区域;特征提取单元,用于分别提取各所述检测图像帧中各所述第一位置区域的第一特征和所述视频中相对各所述检测图像帧时序连续的至少一在后图像帧在各所述第一位置区域的第二特征;运动信息预测单元,用于根据提取的所述第一特征和所述第二特征,预测所述至少一目标物体分别在所述至少一在后图像帧中的运动信息;位置区域确定单元,用于至少根据所述至少一目标物体在至少一检测图像帧中的所述第一位置区域及在所述至少一在后图像帧中的运动信息的预测结果,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
- 根据权利要求30所述的装置,其特征在于,所述检测图像帧确定单元用于:将所述待检测的视频的第一图像帧作为所述检测图像帧。
- 根据权利要求30所述的装置,其特征在于,所述检测图像帧确定单元用于:将所述待检测的视频的任一关键帧作为所述检测图像帧。
- 根据权利要求30所述的装置,其特征在于,所述检测图像帧确定单元用于:将所述待检测的视频中至少一已知所述至少一目标物体的位置区域的图像帧作为所述检测图像帧。
- 根据权利要求30-33任一项所述的装置,其特征在于,所述待检测的视频包括多个时序连续的视频子段,至少两个时序相邻的视频子段包括至少一共同图像帧;所述检测图像帧确定单元用于:将所述至少一共同图像帧作为所述检测图像帧。
- 根据权利要求34所述的装置,其特征在于,每一所述视频子段中包括时序连续的m个图像帧;所述检测图像帧确定单元用于:将时序在前的m-1个图像帧作为所述检测图像帧。
- 根据权利要求30-35任一项所述的装置,其特征在于,所述第一位置区域确定单元用于:在所述检测图像帧中标注所述至少一目标物体对应的第一位置区域。
- 根据权利要求30-36任一项所述的装置,其特征在于,所述第一位置区域确定单元用于:根据所述检测图像帧中已知的所述至少一目标物体的位置区域确定所述第一位置区域。
- 根据权利要求34或35所述的装置,其特征在于,所述第一位置区域确定单元用于:根据任两个时序相邻的视频子段中时序在前的视频子段中所述至少一共同图像帧中所述至少一目标物体的位置区域,确定时序在后的视频子段中所述检测图像帧的第一位置区域。
- 根据权利要求30-38任一项所述的装置,其特征在于,所述第一位置区域确定单元用于:根据所述至少一目标物体在所述检测图像帧中的位置的外接矩形区域或外接轮廓区域,确定所述第一位置区域。
- 根据权利要求30-39任一项所述的装置,其特征在于,所述运动信息预测单元用于:根据所述至少一目标物体在任一所述检测图像帧中的第一特征及在任一在后图像帧中的第二特征,预测所述至少一目标物体在所述任一在后图像帧中的运动信息。
- 根据权利要求34-39任一项所述的装置,其特征在于,所述运动信息预测单元用于:对于每个视频子段,根据时序在前的m-1个图像帧的第一特征、与所述第一特征对应的第一预设权重以及时序在后的第m个图像帧的第二特征、与所述第二特征对应的第二预设权重,预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,m为整数,且m>1。
- 根据权利要求41所述的装置,其特征在于,所述运动信息预测单元用于:根据提取的所述第一特征和所述第二特征,利用预先训练的第一神经网络预测所述至少一目标物体在所述时序在后的第m个图像帧中的运动信息,其中,所述预先训练的第一神经网络的权重矩阵包括所述第一预设权重以及所述第二预设权重。
- 根据权利要求42所述的装置,其特征在于,响应于m大于2,所述预先训练的第一神经网络通过以下第一训练模块得到,所述第一训练模块用于:将预先训练的第二神经网络的权重矩阵分为第三权重和第四权重;将所述第三权重确定为所述m个图像帧中的第1个图像帧的特征的所述第一预设权重的初始值;将第四权重确定为第t个图像帧的特征的所述第二预设权重的初始值,其中,2≤t≤m,且t为正整数;所述预先训练的第二神经网络通过第二训练模块得到,所述第二训练模块用于:分别提取已标注的训练用视频中时序相邻的两个样本图像帧中所述目标物体的特征;根据提取的特征预测所述目标物体在时序在后的样本图像帧中的运动信息;根据所述运动信息的预测结果和所述训练用视频的标注信息,调整第二神经网络的权重矩阵,直至满足所述第二神经网络预定的训练完成条件。
- 根据权利要求30-43任一项所述的装置,其特征在于,所述运动信息预测单元包括:相对变化信息确定模块,用于根据所述第一特征和所述第二特征,确定所述至少一在后图像帧在所述第一位置区域中的所述至少一目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息;预测模块,用于至少根据所述至少一目标物体的相对变化信息,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息。
- 根据权利要求44所述的装置,其特征在于,所述相对位置变化信息包括:所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量。
- 根据权利要求44或45所述的装置,其特征在于,所述相对位置变化信息包括:所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量、所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量。
- 根据权利要求44-46任一项所述的装置,其特征在于,所述位置区域确定单元包括:位置区域确定模块,用于根据所述第一位置区域、所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量、所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,确定所述至少一目标物体在所述至少一在后图像帧中的位置区域。
- 根据权利要求45-47任一项所述的装置,其特征在于,所述预测模块用于:根据所述至少一在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,和所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息;其中,所述在后图像帧中的所述第一位置区域中心点在水平方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在水平方向的移动量确定;所述在后图像帧中的所述第一位置区域中心点在竖直方向上较所述检测图像帧中的所述第一位置区域中心点的移动量根据所述在后图像帧中所述目标物体的第二特征较与其对应的所述目标物体的第一特征在竖直方向的移动量确定。
- 根据权利要求48所述的装置,其特征在于,所述预测模块用于:根据所述在后图像帧中的所述第一位置区域在水平方向上较所述检测图像帧中的所述第一位置区域的变化量和所述在后图像帧中的所述第一位置区域在竖直方向上较所述检测图像帧中的所述第一位置区域的变化量,预测所述至少一目标物体在所述至少一在后图像帧中的运动信息;其中,所述在后图像帧中所述第一位置区域在水平方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在水平方向的变化量确定;所述在后图像帧中所述第一位置区域在竖直方向上较所述检测图像帧中所述第一位置区域的变化量根据所述在后图像帧中所述目标物体的第二特征较与其对应的目标物体的第一特征在竖直方向的变化量确定。
- 根据权利要求40-49任一项所述的装置,其特征在于,所述位置区域确定单元用于:将所述第一位置区域作为所述至少一目标物体在所述至少一在后图像帧中的第二位置区域;根据所述在后图像帧在所述第一位置区域中的目标物体相对所述检测图像帧在所述第一位置区域中的目标物体的相对变化信息,更新所述第二位置区域,得到所述至少一目标物体在所述至少一在后图像帧中的位置区域。
- 根据权利要求30-50任一项所述的装置,其特征在于,所述装置还包括:第三特征提取单元,用于响应于所述至少一目标物体在所述待检测的视频或所述视频子段中的至少一图像帧中的位置区域确定完成,提取所述至少一目标物体在所述待检测的视频或所述视频子段的至少一图像帧中的位置区域中的第三特征;类别确定单元,用于根据提取的第三特征,分别确定所述至少一图像帧中的目标物体的类别。
- 根据权利要求51所述的装置,其特征在于,每个所述待检测的视频或每一所述视频子段包括n个时序连续的图像帧,n>1,且n为整数;以及所述第三特征提取单元用于:按照时序顺序提取所述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
- 根据权利要求52所述的装置,其特征在于,所述类别确定单元包括:解码结果确定模块,用于根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一图像帧的第三特征的解码结果;类别确定模块,用于根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
- 根据权利要求53所述的装置,其特征在于,所述解码结果确定模块用于:按照时序倒序,对所述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至所述n个图像帧的第三特征解码完成。
- 一种用于检测视频中物体的装置,其特征在于,所述装置包括:第二位置区域确定单元,用于确定至少一目标物体在待检测的视频或视频子段包括的至少一图像帧中的位置区域;第一特征提取单元,用于提取所述至少一目标物体在所述至少一图像帧中的位置区域中的第三特征;第一类别确定单元,用于根据提取的第三特征,分别确定至少一个图像帧中的目标物体的类别。
- 根据权利要求55所述的装置,其特征在于,所述待检测的视频或所述视频子段包括n个时序连续的图像帧,n>1,且n为整数;所述第一特征提取单元用于:按照时序顺序提取所述n个图像帧的第三特征;对于第i个图像帧,对其第三特征和该图像帧之前的i-1个图像帧的第三特征进行编码,直至对第n个图像帧的第三特征编码完成,其中,1≤i≤n。
- 根据权利要求56所述的装置,其特征在于,所述第一类别确定单元包括:第一解码结果确定模块,用于根据提取的第三特征和第n个图像帧的第三特征的编码结果,确定所述至少一个图像帧的第三特征的解码结果;第一类别确定模块,用于根据所述至少一图像帧的第三特征的解码结果,分别确定所述至少一图像帧中的目标物体的类别。
- 根据权利要求56或57所述的装置,其特征在于,所述第一解码结果确定模块用于:按照时序倒序,对所述n个图像帧的第三特征的编码结果进行解码;对于第j个图像帧,根据第j个图像帧的第三特征和第n个图像帧的第三特征的编码结果,确定第j个图像帧的第三特征的解码结果,直至所述n个图像帧的第三特征解码完成。
- 一种电子设备,其特征在于,包括:处理器和存储器;存储器,用于存储至少一可执行指令,所述可执行指令使所述处理器执行权利要求1~29任一项所述方法对应的操作。
- 一种计算机程序,包括计算机可读代码,其特征在于,当所述计算机可读代码在设备上运行时,所述设备中的处理器执行用于实现权利要求1~29任一项所述方法中各步骤的指令。
- 一种计算机可读存储介质,用于存储计算机可读取的指令,其特征在于,所述指令被执行时实现权利要求1~29任一项所述方法中各步骤的操作。