CN106354816A - Video image processing method and video image processing device - Google Patents

Video image processing method and video image processing device

Info

Publication number
CN106354816A
CN106354816A (application CN201610765659.5A)
Authority
CN
China
Prior art keywords
video
image
destination object
feature
retrieved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610765659.5A
Other languages
Chinese (zh)
Other versions
CN106354816B (en)
Inventor
邹博
刘玉洁
周玲武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201610765659.5A priority Critical patent/CN106354816B/en
Publication of CN106354816A publication Critical patent/CN106354816A/en
Application granted granted Critical
Publication of CN106354816B publication Critical patent/CN106354816B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The invention provides a video image processing method and a video image processing device. The method includes: acquiring a video image sequence; identifying a target object in video frames of the video image sequence; tracking the target object and determining a motion trajectory of the target object; obtaining structured video information on the basis of the target object and its motion trajectory; and, on the basis of the structured video information, performing target object retrieval and/or video synopsis (condensation) on the video image sequence. The method and device allow targets to be located quickly, accelerating video investigation and thereby the solving of cases.

Description

Video image processing method and device
Technical field
The present invention relates to the technical field of image processing, and more particularly to a video image processing method and device.
Background technology
With the maturing of video surveillance systems, video image investigation has become the fourth major technology for solving criminal cases, after forensic science, operational techniques, and network investigation. However, current video image investigation relies mainly on manpower: large numbers of investigators must search for targets frame by frame in the video. This mode of investigation consumes considerable manpower and time; in other words, the existing approach is laborious, slow, and yields poor results.
Summary of the invention
In view of this, the present invention provides a video image processing method and device, in order to solve the problem that prior-art video image investigation is laborious and time-consuming and therefore slows the solving of cases. The technical scheme is as follows:
A video image processing method, the method including:
acquiring a video image sequence;
identifying a target object in video frames of the video image sequence;
tracking the target object and determining a motion trajectory of the target object;
obtaining structured video information on the basis of the target object and its motion trajectory;
performing target object retrieval on the basis of the structured video information and/or performing video synopsis on the video image sequence.
Identifying the target object in each video frame of the video image sequence includes:
identifying the target object in each video frame of the video image sequence on the basis of a deep convolutional neural network.
Tracking the target object includes:
tracking the target object with the Lucas-Kanade optical flow tracking algorithm, on the basis of optical flow points extracted on the target object in the video frames.
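The Lucas-Kanade step solves, over a small window, the least-squares system Ix·u + Iy·v = -It for the displacement (u, v). A minimal single-window sketch in NumPy (illustrative only; a production tracker such as OpenCV's `calcOpticalFlowPyrLK` adds image pyramids and iterative refinement):

```python
import numpy as np

def lucas_kanade_window(prev, curr):
    """Estimate one (u, v) displacement for a whole window by solving
    the normal equations of Ix*u + Iy*v = -It (brightness constancy)."""
    Ix = (np.roll(prev, -1, axis=1) - np.roll(prev, 1, axis=1)) / 2.0
    Iy = (np.roll(prev, -1, axis=0) - np.roll(prev, 1, axis=0)) / 2.0
    It = curr - prev
    # Drop the border where the rolled gradients wrap around.
    Ix, Iy, It = Ix[2:-2, 2:-2], Iy[2:-2, 2:-2], It[2:-2, 2:-2]
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic check: a smooth pattern shifted by one pixel in x and in y.
y, x = np.mgrid[0:64, 0:64].astype(float)
frame0 = np.sin(0.2 * x) + np.cos(0.15 * y)
frame1 = np.sin(0.2 * (x - 1)) + np.cos(0.15 * (y - 1))  # moved by (+1, +1)
u, v = lucas_kanade_window(frame0, frame1)
print(u, v)  # both close to 1.0
```

In the patent's scheme the window would be restricted to the flow points on the target object rather than the whole frame.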
The structured video information includes a text description of the target object and/or image feature information of the target object, the text description including attribute information and motion information of the target object.
Performing target object retrieval on the basis of the structured video information then includes:
when a retrieval instruction carrying text to be retrieved is received, searching the text descriptions of the target objects on the basis of the text to be retrieved; or, when a retrieval instruction carrying an image to be retrieved is received, searching the image feature information of the target objects on the basis of the image to be retrieved; or, when a retrieval instruction carrying event information to be retrieved is received, searching the text descriptions on the basis of the event to be retrieved and a pre-built event model; thereby obtaining a retrieval result;
outputting the target object information associated with the retrieval result.
The image feature information of the target object includes a deep convolutional feature and a local feature.
Searching the image feature information of the target objects on the basis of the image to be retrieved to obtain a retrieval result includes:
matching the deep convolutional feature of the image to be retrieved against the image feature information of the target objects according to a first matching rule, to obtain a candidate feature set;
matching the deep convolutional feature and the local feature of the image to be retrieved within the candidate feature set according to a second matching rule, to obtain target image features as the retrieval result.
Matching the deep convolutional feature of the image to be retrieved against the image feature information of the target objects according to the first matching rule, to obtain the candidate feature set, includes:
obtaining the deep convolutional feature and the local feature of the image to be retrieved, and binary-encoding the deep convolutional feature of the image to be retrieved to obtain the binary code of the image to be retrieved;
matching the binary code of the image to be retrieved against the binary code corresponding to each deep convolutional feature in the image feature information of the target objects; determining as target binary codes those whose matching degree with the binary code of the image to be retrieved exceeds a first preset value; and taking the target deep convolutional features and target local features corresponding to the target binary codes as the candidate feature set.
Matching the deep convolutional feature and the local feature of the image to be retrieved within the candidate feature set according to the second matching rule, to obtain the target image features as the retrieval result, includes:
matching the deep convolutional feature of the image to be retrieved against each deep convolutional feature in the candidate feature set, matching the local feature of the image to be retrieved against each local feature in the candidate feature set, and taking as the retrieval result those image features whose combined matching degree over the deep convolutional feature and the corresponding local feature exceeds a second preset value.
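This coarse-to-fine scheme can be sketched in NumPy: the coarse stage hashes each deep feature to a sign-based binary code and filters by Hamming similarity; the fine stage re-ranks the surviving candidates by a weighted cosine similarity of the real-valued deep and local features. The sign hashing, the similarity measures, the weight, and both thresholds are illustrative assumptions — the patent only specifies binary coding, two matching rules, and two preset values:

```python
import numpy as np

def binarize(feat):
    # Sign-based binary code of a real-valued deep feature (assumed scheme).
    return (feat > 0).astype(np.uint8)

def hamming_sim(a, b):
    return 1.0 - np.count_nonzero(a != b) / a.size

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(q_deep, q_local, gallery, t1=0.7, t2=0.8, w=0.5):
    """Coarse Hamming filter on binary codes, then fine combined
    (deep + local) cosine matching on the candidate set."""
    q_code = binarize(q_deep)
    candidates = [g for g in gallery
                  if hamming_sim(q_code, binarize(g["deep"])) > t1]
    results = []
    for g in candidates:
        score = w * cosine(q_deep, g["deep"]) + (1 - w) * cosine(q_local, g["local"])
        if score > t2:
            results.append((g["id"], score))
    return sorted(results, key=lambda r: -r[1])

rng = np.random.default_rng(0)
q_deep, q_local = rng.normal(size=128), rng.normal(size=64)
gallery = [
    {"id": "match", "deep": q_deep + 0.1 * rng.normal(size=128),
     "local": q_local + 0.1 * rng.normal(size=64)},
    {"id": "other", "deep": rng.normal(size=128),
     "local": rng.normal(size=64)},
]
print(retrieve(q_deep, q_local, gallery))  # the near-duplicate ranks first
```

The point of the coarse stage is speed: Hamming comparisons on compact codes prune most of the gallery before the costlier real-valued matching runs.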
Outputting the target object information associated with the retrieval result is specifically:
outputting the target object image associated with each image feature whose combined matching degree over the deep convolutional feature and the corresponding local feature exceeds the second preset value, the target object image being an image of the target object extracted in advance from the video frame in which the target object appears.
The structured video information includes an image density of each video frame in the video image sequence, the image density characterizing the presence of target objects in the video frame.
Obtaining the structured information on the basis of the target object and its motion trajectory then includes:
determining structural information of the monitored area in the video image sequence from the positions of the target objects in the video frames and their motion trajectories, the structural information of the monitored area including information on the regions of the monitored area in which target objects appear;
determining the image density of each video frame in the video image sequence on the basis of the structural information of the monitored area.
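The patent does not fix a formula for image density; one plausible reading, assumed here purely for illustration, is the fraction of the frame covered by the union of the target bounding boxes:

```python
import numpy as np

def image_density(frame_shape, boxes):
    """Fraction of the frame covered by the union of target boxes.
    boxes: list of (x1, y1, x2, y2) in pixel coordinates."""
    h, w = frame_shape
    mask = np.zeros((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = True   # the union mask handles overlapping targets
    return mask.mean()

# Two overlapping 50x50 targets in a 100x200 frame:
# union area = 2500 + 2500 - 625 = 4375 of 20000 pixels.
d = image_density((100, 200), [(0, 0, 50, 50), (25, 25, 75, 75)])
print(d)  # 0.21875
```

A frame with no targets has density 0, so a threshold on this quantity separates busy stretches of video from empty ones.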
Performing video synopsis on the video image sequence on the basis of the structured video information includes:
segmenting the video image sequence on the basis of the image density of each video frame, and determining the video segments to be condensed from among the segments;
performing video synopsis on the segments to be condensed, and merging the condensed segments with the segments that were not condensed, to obtain the condensed video image sequence.
Segmenting the video image sequence on the basis of the image density, and determining the segments to be condensed from among the segments, includes:
dividing the video image sequence into a plurality of video segments by comparing the image density of each video frame with a preset image density threshold, each video segment including a plurality of consecutive video frames;
determining as segments to be condensed those video segments in which the image density of every video frame exceeds the image density threshold.
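A sketch of this thresholding step in plain Python (the threshold value is arbitrary): consecutive frames on the same side of the threshold form one segment, and the above-threshold runs are the ones selected for synopsis.

```python
from itertools import groupby

def split_by_density(densities, threshold):
    """Split a per-frame density sequence into runs of consecutive frames
    that are all above or all below the threshold. Returns (segments,
    to_condense): lists of (start, end) frame ranges, end exclusive."""
    segments, to_condense = [], []
    idx = 0
    for above, run in groupby(densities, key=lambda d: d > threshold):
        n = len(list(run))
        seg = (idx, idx + n)
        segments.append(seg)
        if above:                  # every frame in this run exceeds the threshold
            to_condense.append(seg)
        idx += n
    return segments, to_condense

densities = [0.1, 0.1, 0.4, 0.5, 0.6, 0.1, 0.7, 0.8]
segs, dense = split_by_density(densities, threshold=0.3)
print(segs)   # [(0, 2), (2, 5), (5, 6), (6, 8)]
print(dense)  # [(2, 5), (6, 8)]
```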
Performing video synopsis on a segment to be condensed includes:
determining, by means of a space-time synopsis model, an optimal shift strategy for moving at least one target object in the segment along the time dimension and the space dimension;
fusing the images on the basis of the optimal shift strategy, to obtain the condensed video segment.
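The space-time model itself is not detailed here; the greedy sketch below captures only the temporal part of the idea, under stated assumptions: each target is an object "tube" (a run of per-frame boxes), every tube is shifted as early as possible, and a tube is delayed only while it would collide spatially with an already-placed tube in some shared frame. A real synopsis model optimizes shifts jointly; this is a stand-in.

```python
def boxes_overlap(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def condense(tubes):
    """Greedy temporal-shift synopsis. Each tube is a list of per-frame
    boxes (x1, y1, x2, y2). Returns each tube's new start frame and the
    condensed segment length."""
    placed = {}                     # frame index -> boxes already placed there
    starts = []
    for tube in tubes:
        start = 0
        while any(boxes_overlap(box, other)
                  for t, box in enumerate(tube)
                  for other in placed.get(start + t, [])):
            start += 1              # delay the tube until no spatial collision
        starts.append(start)
        for t, box in enumerate(tube):
            placed.setdefault(start + t, []).append(box)
    length = max(s + len(tube) for s, tube in zip(starts, tubes))
    return starts, length

# Two spatially disjoint tubes run concurrently; a third that overlaps the
# first in space is delayed until the first has finished.
t1 = [(0, 0, 10, 10)] * 5           # originally frames 0-4
t2 = [(50, 50, 60, 60)] * 5         # disjoint from t1
t3 = [(5, 5, 15, 15)] * 3           # collides with t1's boxes
starts, length = condense([t1, t2, t3])
print(starts, length)  # [0, 0, 5] 8
```

Image fusion would then paste each shifted tube's pixels onto a common background to render the condensed segment.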
A video image processing device, the device including: a video acquisition module, a target recognition module, a target tracking module, a structured-information acquisition module, and a processing module;
the video acquisition module being configured to acquire a video image sequence;
the target recognition module being configured to identify a target object in video frames of the video image sequence acquired by the video acquisition module;
the target tracking module being configured to track the target object identified by the target recognition module and to determine the motion trajectory of the target object;
the information acquisition module being configured to obtain structured video information on the basis of the target object identified by the target recognition module and the motion trajectory determined by the target tracking module;
the processing module being configured to perform target object retrieval on the basis of the structured video information obtained by the information acquisition module and/or to perform video synopsis on the video image sequence.
The target recognition module is specifically configured to identify the target object in each video frame of the video image sequence on the basis of a deep convolutional neural network.
The target tracking module is specifically configured to track the target object with the Lucas-Kanade optical flow tracking algorithm, on the basis of optical flow points extracted on the target object in the video frames.
The structured video information includes a text description of the target object and/or image feature information of the target object, the text description including attribute information and motion information of the target object;
the processing module includes a retrieval module and an output module;
the retrieval module being configured to: when a retrieval instruction carrying text to be retrieved is received, search the text descriptions of the target objects on the basis of the text to be retrieved; or, when a retrieval instruction carrying an image to be retrieved is received, search the image feature information of the target objects on the basis of the image to be retrieved; or, when a retrieval instruction carrying event information to be retrieved is received, search the text descriptions on the basis of the event to be retrieved and a pre-built event model; thereby obtaining a retrieval result;
the output module being configured to output the target object information associated with the retrieval result of the retrieval module.
The image feature information of the target object includes a deep convolutional feature and a local feature;
the retrieval module includes a coarse matching module and a fine matching module;
the coarse matching module being configured to match the deep convolutional feature of the image to be retrieved against the image feature information of the target objects according to a first matching rule, to obtain a candidate feature set;
the fine matching module being configured to match the deep convolutional feature and the local feature of the image to be retrieved within the candidate feature set according to a second matching rule, to obtain target image features as the retrieval result.
The coarse matching module includes a feature acquisition and processing submodule and a coarse matching submodule;
the feature acquisition and processing submodule being configured to obtain the deep convolutional feature and the local feature of the image to be retrieved and to binary-encode the deep convolutional feature of the image to be retrieved to obtain its binary code, and further configured to binary-encode each deep convolutional feature in the image feature information of the target objects to obtain the binary code corresponding to each such feature;
the coarse matching submodule being configured to match the binary code of the image to be retrieved against the binary code corresponding to each deep convolutional feature in the image feature information of the target objects, to determine as target binary codes those whose matching degree with the binary code of the image to be retrieved exceeds a first preset value, and to take the target deep convolutional features and target local features corresponding to the target binary codes as the candidate feature set;
the fine matching module being specifically configured to match the deep convolutional feature of the image to be retrieved against each deep convolutional feature in the candidate feature set, to match the local feature of the image to be retrieved against each local feature in the candidate feature set, and to take as the retrieval result those image features whose combined matching degree over the deep convolutional feature and the corresponding local feature exceeds a second preset value;
the output module then being specifically configured to output the target object image associated with each image feature whose combined matching degree exceeds the second preset value, the target object image being an image of the target object extracted in advance from the video frame in which the target object appears.
The structured video information includes an image density of each video frame in the video image sequence, the image density characterizing the presence of target objects in the video frame;
the information acquisition module includes a monitored-area structure determination submodule and an image density determination submodule;
the monitored-area structure determination submodule being configured to determine structural information of the monitored area in the video image sequence from the positions of the target objects in the video frames and their motion trajectories, the structural information including information on the regions of the monitored area in which target objects appear;
the image density determination submodule being configured to determine the image density of each video frame in the video image sequence on the basis of the structural information determined by the monitored-area structure determination submodule.
The processing module includes a video preprocessing module and a video synopsis module;
the video preprocessing module being configured to segment the video image sequence on the basis of the image density and to determine the segments to be condensed from among the segments;
the video synopsis module being configured to perform video synopsis on the segments to be condensed and to merge the condensed segments with the segments that were not condensed, to obtain the condensed video image sequence.
The video preprocessing module includes a video segmentation submodule and a to-be-condensed segment determination submodule;
the video segmentation submodule being configured to divide the video image sequence into a plurality of video segments by comparing the image density of each video frame with a preset image density threshold;
the to-be-condensed segment determination submodule being configured to determine as segments to be condensed those video segments in which the image density of every video frame exceeds the image density threshold.
The video synopsis module includes an optimal synopsis strategy determination submodule and an image fusion submodule;
the optimal synopsis strategy determination submodule being configured to determine, by means of a space-time synopsis model, an optimal shift strategy for moving at least one target object in the segment to be condensed along the time dimension and the space dimension;
the image fusion submodule being configured to fuse the images on the basis of the optimal shift strategy, to obtain the condensed video segment.
The above technical scheme has the following advantages:
With the video image processing method and device provided by the present invention, target objects in a video image sequence can be identified and tracked, and structured video information can then be obtained from the target objects and their motion trajectories. Once the structured video information is available, retrieval can be performed against it, so targets can be located quickly. In addition, video synopsis can be performed on the basis of the structured video information; because the condensed video retains all the target information of the original video in far fewer frames, targets can also be located quickly by browsing it. In other words, if the user already knows some information about the target object, that information can be used directly for retrieval; if the user knows nothing about the target object, the condensed video can be browsed directly. Either way the target is found quickly: the present invention raises the speed of target investigation, and thereby the speed at which cases are solved.
Brief description of the drawings
To explain the embodiments of the present invention or the prior art more clearly, the accompanying drawings required for their description are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; persons of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the video image processing method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram, in the video image processing method provided by an embodiment of the present invention, of performing target detection and generating a series of target candidate boxes in a video frame;
Fig. 3 is a schematic diagram, in the video image processing method provided by an embodiment of the present invention, of the optical flow points extracted on the target object in a video frame;
Fig. 4 is a schematic flowchart, in the video image processing method provided by an embodiment of the present invention, of a specific implementation of searching the image feature information of the target objects on the basis of an image to be retrieved to obtain a retrieval result;
Fig. 5 is a schematic flowchart, in the video image processing method provided by an embodiment of the present invention, of obtaining structured video information on the basis of the target object and its motion trajectory;
Fig. 6 is a schematic flowchart, in the video image processing method provided by an embodiment of the present invention, of segmenting the video image sequence on the basis of image density and determining the segments to be condensed from among the segments;
Fig. 7 is a schematic flowchart, in the video image processing method provided by an embodiment of the present invention, of performing video synopsis on a segment to be condensed;
Fig. 8 is a schematic structural diagram of the video image processing device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a video image processing method. Referring to Fig. 1, which shows a schematic flowchart of the method, it may include:
Step s101: acquire a video image sequence.
Step s102: identify a target object in the video frames of the video image sequence.
Step s103: track the target object and determine the motion trajectory of the target object.
Step s104: obtain structured video information on the basis of the target object and its motion trajectory.
Step s105: perform target object retrieval on the basis of the structured video information and/or perform video synopsis on the video image sequence.
With this method, target objects in the video image sequence can be identified and tracked, and structured video information can then be obtained from the target objects and their motion trajectories. Once the structured video information is available, retrieval can be performed against it, so targets can be located quickly. In addition, video synopsis can be performed on the basis of the structured video information; because the condensed video retains the information of the original video in far fewer frames, suspicious targets can also be located quickly by browsing it. That is, if the user already knows concrete information about the target object, that information can be used directly for retrieval; if not, the condensed video can be browsed directly. Either way the target object is found quickly from the video: the embodiment raises the speed of target investigation, and thereby the speed at which cases are solved, with a better user experience.
Traditional target recognition methods mostly use background modeling: a background model of the image is first built, each image is then compared with the model, and foreground targets are determined from the comparison. However, such methods adapt poorly to low contrast and changing illumination, often producing false detections on moving targets and missed detections on stationary ones. In view of these problems, and in order to improve the accuracy of subsequent retrieval, the present invention identifies the target object in the video frames on the basis of a deep convolutional neural network. That is, in the above embodiment, identifying the target object in the video frames of the video image sequence proceeds as follows:
first, target detection is performed with a detection model based on a deep convolutional neural network, generating a series of target candidate boxes in the video frame, as shown in Fig. 2; then target recognition is performed with a classification model based on a deep convolutional neural network, and the target candidate boxes are corrected.
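The propose-then-classify-and-refine flow can be illustrated with a deliberately simplified stand-in: sliding-window proposals scored by mean intensity in place of the trained detection CNN, and a centroid-based box correction in place of the classification network's refinement. Only the pipeline shape matches the description; every scoring detail here is a placeholder.

```python
import numpy as np

def propose(img, size=20, stride=10):
    """Sliding-window candidate boxes (stand-in for the detection CNN)."""
    h, w = img.shape
    return [(x, y, x + size, y + size)
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]

def score(img, box):
    x1, y1, x2, y2 = box
    return img[y1:y2, x1:x2].mean()   # placeholder for a CNN class score

def refine(img, box):
    """Recenter the box on the bright-pixel centroid (placeholder for the
    classification network's box correction)."""
    x1, y1, x2, y2 = box
    patch = img[y1:y2, x1:x2]
    ys, xs = np.nonzero(patch > patch.mean())
    if len(xs) == 0:
        return box
    cx, cy = x1 + int(xs.mean()), y1 + int(ys.mean())
    half = (x2 - x1) // 2
    return (cx - half, cy - half, cx + half, cy + half)

img = np.zeros((60, 60))
img[18:38, 22:42] = 1.0               # one bright 20x20 "target"
boxes = propose(img)
best = max(boxes, key=lambda b: score(img, b))
corrected = refine(img, best)
print(best, "->", corrected)
```

In the patented method both stages are learned networks; the refinement step in particular is what "corrects" the candidate boxes after classification.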
It should be noted that the deep convolutional neural network needs to be trained before it is used for recognition. In one possible implementation, for public-security surveillance environments and targets (vehicles/people), the AlexNet network structure may be chosen for training: pre-training is performed on the ImageNet 2012 data set, and on that basis the network is fine-tuned on public-security surveillance samples. Because different vehicle types differ greatly from one another, the training samples are divided into six broad classes: cars, buses, trucks, tricycles, non-motorized vehicles (motorcycles/electric bicycles/bicycles), and pedestrians.
In addition, to improve subsequent recognition speed, the target detection network model and the target classification network model may share convolutional features.
After the target object is identified, it is tracked. When existing optical flow tracking algorithms extract optical flow points, they generally extract them over the entire image; but target tracking is concerned only with the target object, and optical flow points in irrelevant regions interfere with tracking it. To improve the speed and accuracy of tracking, the present invention tracks the target object with an improved Lucas-Kanade optical flow tracking algorithm, i.e. the tracking is based on the optical flow points extracted on the target object (the foreground) in the video frames.
Specifically, optical flow points are first extracted over the whole video frame; optical flow points are then extracted from the foreground image (i.e., the target objects in the video frame); finally, on the basis of the points extracted from the foreground, the whole-frame optical flow points that do not lie in the foreground are filtered out, as shown in Fig. 3. The foreground image is obtained by differencing two adjacent video frames; in the second image of Fig. 3, the white region is the foreground and the black region is the background.
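A minimal NumPy sketch of this filtering step under the frame-differencing assumption: the foreground mask is where adjacent frames differ by more than a threshold, and only flow points falling inside that mask are kept. The threshold and the point format are illustrative choices.

```python
import numpy as np

def foreground_mask(prev, curr, thresh=0.1):
    """Frame-differencing foreground: pixels that changed noticeably."""
    return np.abs(curr.astype(float) - prev.astype(float)) > thresh

def filter_flow_points(points, mask):
    """Keep only optical flow points (x, y) that lie inside the foreground."""
    return [(x, y) for x, y in points if mask[y, x]]

prev = np.zeros((40, 40))
curr = np.zeros((40, 40))
curr[10:20, 10:20] = 1.0                  # a moving object appears here
mask = foreground_mask(prev, curr)
points = [(12, 15), (30, 30), (15, 11)]   # two on the object, one background
print(filter_flow_points(points, mask))   # [(12, 15), (15, 11)]
```

Discarding the background points both shrinks the tracking problem and removes the interference described above.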
In the above embodiments, the video structured information obtained from the target object and its motion trajectory may include text information and/or image feature information of the target object. The text information of the target object may include attribute information and motion information of the target object.
For example, if the target object is a vehicle, its attribute information may include the vehicle category, color, license plate number, brand and model, and so on, while its motion information may be the vehicle's direction of motion, its position in the video frame, and so on.
After the above video structured information is obtained, retrieval can be performed on the basis of it, so that the target object can be located quickly. There are several ways to retrieve a target object on the basis of the video structured information.
In one possible implementation, retrieval is text-based: when a retrieval instruction for text information to be retrieved is received, the text information of the target objects is searched on the basis of the text information to be retrieved to obtain a retrieval result, and the target object information associated with the retrieval result is output. It should be noted that, in this embodiment, the text information of all target objects may be assembled into a text information base, and text retrieval is performed within this base.
Preferably, the target object information is a target object image, i.e., an image of the target object extracted in advance from the video frame in which the target object is located; the target object image is associated with the text information and/or image feature information in the video structured information.
Specifically, the text information to be retrieved is obtained from user input, target text information matching it is looked up in the text information of the target objects, and the target object image associated with the target text information is output.
For example, a user enters a license plate number in the search interface; if the target information of a target object contains this license plate number, the image of the vehicle associated with it can be displayed directly, so that the target is located quickly.
In another possible implementation, retrieval is rule-based: when a retrieval instruction for event information to be retrieved is received, the text information of the target objects is searched on the basis of the event to be retrieved and a pre-built event model to obtain a retrieval result, and the target object information associated with the retrieval result is output.
Specifically, the event information to be retrieved is obtained from user input; text information related to the event information to be retrieved is looked up in the text information; target text information, whose corresponding event information is the event information to be retrieved, is determined from the related text information; and the target object image associated with the target text information is output.
The event information may be, for example, region intrusion, tripwire crossing, or loitering. The text information related to the event information to be retrieved may be the position of the target object in the video frame or the motion trajectory of the target object; from the change of position or from the motion trajectory it can be determined which event the target object underwent, and if a target object underwent the same event as the event to be retrieved, its image is output.
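One of the event rules named above — region intrusion — can be sketched as follows; the rectangular region, the coordinates, and the simple outside-to-inside test are assumptions for illustration, not the patent's event model:

```python
def in_region(point, region):
    """Axis-aligned rectangular region given as (x_min, y_min, x_max, y_max)."""
    x, y = point
    x0, y0, x1, y1 = region
    return x0 <= x <= x1 and y0 <= y <= y1

def detect_intrusion(trajectory, region):
    """Region-intrusion rule: the track moves from outside the region to inside it."""
    for prev, cur in zip(trajectory, trajectory[1:]):
        if not in_region(prev, region) and in_region(cur, region):
            return True
    return False

restricted = (10, 10, 20, 20)
track = [(0, 0), (5, 8), (12, 15)]   # enters the region on the last step
print(detect_intrusion(track, restricted))  # True
```

A tripwire rule would test consecutive positions against a line segment instead of a rectangle, and a loitering rule would test how long the trajectory stays inside the region.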
In yet another possible implementation, retrieval is image-based: when a retrieval instruction for an image to be retrieved is received, the image feature information base of the target objects is searched on the basis of the image to be retrieved to obtain a retrieval result, and the target object information associated with the retrieval result is output. It should be noted that, in this embodiment, the image feature information of all target objects may be assembled into an image feature information base, and image retrieval is performed within this base using the features of the image to be retrieved.
Specifically, the image to be retrieved is first obtained from user input, and its image features are extracted as the image features to be retrieved; target image features matching the image features to be retrieved are then looked up in the image feature information, and the target object image associated with the target image features is output. For the user, image-based retrieval only requires entering the image to be retrieved in the search interface to obtain the target object images matching the clue image.
In a preferred implementation, the image features include a deep convolutional feature and a local feature. In this embodiment, the deep convolutional feature of the image to be retrieved may first be matched against the image feature information of the target objects by a first matching rule to obtain a candidate feature set; the deep convolutional feature and the local feature of the image to be retrieved are then matched within the candidate feature set by a second matching rule to obtain target image features as the retrieval result. The final output is the information of the target object associated with the target image features.
Referring to Fig. 4, which shows a schematic flowchart of a specific implementation of searching the image feature information of the target objects on the basis of the image to be retrieved and obtaining the retrieval result, the process may include:
Step s401: obtain the deep convolutional feature and the local feature of the image to be retrieved, and binary-encode the deep convolutional feature of the image to be retrieved to obtain its binary-coded feature.
It should be noted that the deep convolutional feature is a high-level feature extracted by a CNN (deep convolutional neural network), while the local feature attends to local attributes of the image and serves as an auxiliary supplement to the deep convolutional feature. Here the SURF feature, which is fast and robust, is used as the local feature and is denoted f_surf.
In a preferred implementation, the deep convolutional feature may be reduced in dimensionality by PCA to remove redundant features; the deep convolutional feature after redundancy removal is used as the final deep convolutional feature for subsequent matching and is denoted f_cnn+pca. The deep convolutional feature is binary-encoded by LSH to generate the binary-coded feature, denoted f_cnnh.
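A minimal sketch of the two steps just described — PCA dimensionality reduction followed by LSH binary coding — is given below. The feature dimensions, the bit count, and the random-hyperplane variant of LSH are assumptions, and the input features are random stand-ins for real CNN features:

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_reduce(features, n_components):
    """PCA by eigendecomposition of the covariance of the feature set (f_cnn+pca)."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]  # leading components
    return centered @ top

def lsh_binary_code(features, n_bits, dim):
    """Random-hyperplane LSH: the sign of each projection gives one bit (f_cnnh)."""
    planes = rng.standard_normal((dim, n_bits))
    return (features @ planes > 0).astype(np.uint8)

feats = rng.standard_normal((10, 64))    # stand-in for 10 CNN feature vectors
reduced = pca_reduce(feats, 8)           # redundancy removed
codes = lsh_binary_code(reduced, 32, 8)  # 32-bit binary codes
print(reduced.shape, codes.shape)
```

The point of the binary codes is that the coarse matching of step s402 can compare them by Hamming distance far faster than comparing the real-valued features.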
Step s402: match the binary-coded feature of the image to be retrieved against the binary-coded feature corresponding to each deep convolutional feature in the image feature information of the target objects; determine the binary-coded features whose matching degree with the binary-coded feature of the image to be retrieved exceeds a first preset value as target binary-coded features; and take the target deep convolutional features and local features corresponding to the target coded features as the candidate feature set.
When binary-coded features are matched, the matching degree can be characterized by similarity, which can be obtained by computing the Hamming distance between the two binary-coded features.
It should be noted that, because the image features are associated with target object images, determining the candidate feature set is equivalent to determining a candidate image set {o_1, o_2, ..., o_n}:

o_i = 1 if s_i > θ_h, and o_i = 0 if s_i ≤ θ_h,

where o_i indicates whether the i-th image to be matched is retained, s_i is the similarity between the image to be retrieved and the i-th image to be matched, and θ_h is the similarity threshold.
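Under the assumption that similarity is one minus the normalized Hamming distance (the text only states that similarity is obtained from the Hamming distance), the coarse-matching indicator o_i can be sketched as:

```python
import numpy as np

def hamming_similarity(code_a, code_b):
    """Similarity in [0, 1]: fraction of matching bits between two binary codes."""
    return 1.0 - np.count_nonzero(code_a != code_b) / code_a.size

def candidate_set(query_code, gallery_codes, theta_h):
    """Indicator o_i = 1 when s_i > theta_h, as in the coarse-matching formula."""
    sims = [hamming_similarity(query_code, g) for g in gallery_codes]
    return [1 if s > theta_h else 0 for s in sims], sims

query = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
gallery = np.array([
    [1, 0, 1, 1, 0, 0, 1, 1],   # 7 of 8 bits agree with the query
    [0, 1, 0, 0, 1, 1, 0, 1],   # 0 of 8 bits agree
], dtype=np.uint8)
flags, sims = candidate_set(query, gallery, theta_h=0.5)
print(flags)  # [1, 0]
```

Only the images flagged 1 proceed to the exact matching of step s403; the rest are discarded by this cheap bit comparison.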
The above matching of binary-coded features is a coarse matching process; after coarse matching is completed, exact matching is further performed on the basis of the deep convolutional features and the local features.
Step s403: match the deep convolutional feature of the image to be retrieved against each deep convolutional feature in the candidate feature set, and match the local feature of the image to be retrieved against each local feature in the candidate feature set; take the image features whose combined matching degree of deep convolutional feature and corresponding local feature exceeds a second preset value as the retrieval result.
The exact matching process computes similarity using the Euclidean distance; specifically, the similarity is computed by the following formula:

s(k) = α × s_cnn+pca(k) + β × s_surf(k)

where α and β are the weights of the similarities computed from the deep convolutional feature and the local feature, respectively, s_cnn+pca(k) is the similarity computed from the deep convolutional feature, and s_surf(k) is the similarity computed from the local feature.
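The weighted fusion formula above can be sketched directly; the weight values α = 0.7 and β = 0.3 are assumed for illustration (the patent does not fix them):

```python
def fused_similarity(s_cnn_pca, s_surf, alpha=0.7, beta=0.3):
    """s(k) = alpha * s_cnn+pca(k) + beta * s_surf(k); the weights are assumed."""
    return alpha * s_cnn_pca + beta * s_surf

# A candidate with strong CNN-feature similarity and weaker SURF similarity.
s = fused_similarity(0.9, 0.6)
print(round(s, 2))  # 0.81
```

Candidates whose fused score exceeds the second preset value become the retrieval result; sorting by this score also gives the display order described below.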
The target object information associated with the retrieval result is then output; specifically, the target object images associated with the image features whose combined matching degree of deep convolutional feature and corresponding local feature exceeds the second preset value are output.
When the target object images are output, if multiple target object images meet the condition, they may be displayed in order of decreasing similarity.
The above describes one implementation for improving investigation speed from images: the target is obtained by retrieval on target information. The premise of obtaining the target in this way is that the keywords for retrieval are known in advance, i.e., the user already knows part of the information about the target under investigation. Sometimes, however, the investigator may know nothing about the target, i.e., there are no keywords for retrieval; in that case, the video image sequence can only be browsed frame by frame. Considering that a video image sequence generally comprises many video frames, and that many of these frames may contain no information of interest to the user, in order to improve investigation speed, the embodiment of the present invention condenses the video frame sequence on the basis of the video structured information, so that the condensed video frame sequence has fewer frames yet contains a large amount of information.
In this embodiment, the video structured information may include the image density of each video frame in the video image sequence, where the image density characterizes the presence of target objects in the video frame.
Referring to Fig. 5, which shows a schematic flowchart of obtaining the video structured information from the target object and its motion trajectory in the above embodiment, the process may include:
Step s501: determine the structural information of the monitored area in the video image sequence from the position of the target object in the video frame and the motion trajectory of the target object.
The structural information of the monitored area includes information about the areas in which target objects appear within the monitored area.
Step s502: determine the image density of each video frame in the video image sequence on the basis of the structural information of the monitored area.
After the above video structured information is obtained, the video image sequence can be condensed on the basis of it. In order to improve the speed of video condensation, the embodiment of the present invention first segments the target video image sequence on the basis of the image density of each video frame and determines the video segments to be condensed from the segments; it then condenses the video segments to be condensed and merges the condensed video segments with the other video segments that were not condensed, obtaining the condensed video image sequence.
Further, referring to Fig. 6, which shows a schematic flowchart of segmenting the video image sequence on the basis of image density and determining the video segments to be condensed from the segments, the process may include:
Step s601: divide the video image sequence into multiple video segments by applying a preset image density threshold to the image density of each video frame.
Step s602: determine the video segments in which the image density of every video frame is below the image density threshold as the video segments to be condensed.
For example, suppose the video image sequence includes 100 video frames; the image density of each of the first 30 video frames is below the set image density threshold, the image density of each of the 31st to 70th video frames is above the threshold, and the image density of each of the 71st to 100th video frames is again below the threshold. The video image sequence can then be divided into three video segments: frames 1-30 form the first segment, frames 31-70 the second, and frames 71-100 the third. Since the image densities in frames 1-30 and frames 71-100 are below the set threshold, the two segments of frames 1-30 and frames 71-100 are determined as the video segments to be condensed.
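The segmentation of the 100-frame example can be sketched as follows; the density values and the threshold are assumed, and the function simply groups consecutive frames by whether their density is below the threshold:

```python
def segment_by_density(densities, threshold):
    """Split a per-frame density sequence into runs that are uniformly below
    (to be condensed) or not below the threshold.

    Returns (start, end, condense) tuples with inclusive 1-based frame indices.
    """
    segments = []
    start = 0
    for i in range(1, len(densities) + 1):
        boundary = i == len(densities) or \
            (densities[i] < threshold) != (densities[start] < threshold)
        if boundary:
            segments.append((start + 1, i, densities[start] < threshold))
            start = i
    return segments

# Mirrors the example above: sparse, dense, sparse runs with threshold 5.
densities = [1] * 30 + [9] * 40 + [2] * 30
print(segment_by_density(densities, 5))
# [(1, 30, True), (31, 70, False), (71, 100, True)]
```

Segments flagged True go to the condensation of Fig. 7; segments flagged False are kept as-is and merged back afterwards.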
After the video segments to be condensed are determined, they are condensed. Referring to Fig. 7, which shows a schematic flowchart of condensing a video segment to be condensed, the process may include:
Step s701: determine, by means of a spatio-temporal condensation model, the optimal shift strategy for moving at least one target object in the video segment to be condensed along the time dimension and the spatial dimension.
Step s702: perform image fusion on the basis of the optimal shift strategy to obtain the condensed video segment.
The spatio-temporal condensation model provided by the embodiment of the present invention performs maximal condensation in the two dimensions of time and space without losing any target and while preserving the original temporal order of the targets; the condensed video is free of collisions and flicker and has a good visual effect.
Specifically, the energy function of the spatio-temporal condensation model is characterized as:

E(M) = min{ Σ e_a(b) + [ α Σ e_c(b, b') + β Σ e_t(b, b') ] }, (b, b' ∈ B)

where b is an image sequence containing a first target object and b' is an image sequence containing a second target object. Σ e_a(b) is the activity energy loss term: if a target of large area is not mapped into the condensed video, the corresponding penalty value is large, and otherwise small; in other words, targets of large area should preferentially be retained in the condensed video. e_c(b, b') is the collision-conflict penalty term, the inner product over the period in which the two tracks conflict. Condensing a video redistributes the original targets along the time axis and in space, which inevitably produces crossings, collisions, and occlusions between tracks; if two target sequences share a time period and their tracks cross, the penalty term is the inner-product operation over the corresponding overlapping region. e_t(b, b') is the temporal-order penalty term, whose purpose is to preserve, as far as possible, the temporal order of events in the original video: if, in the original video, two persons walk one behind the other or talk while walking side by side, this relative relationship should reasonably be kept in the condensed video. Here e_t(b, b') = exp(-(d(b, b') / ω)), where d(b, b') is the Euclidean distance between the center pixels of the two tracks' shared period, and ω is a custom parameter that adjusts the event ordering. It should be noted that the above optimal shift strategy is the strategy of moving the corresponding target objects in time and space for which the value of the energy function of the spatio-temporal condensation model is minimal.
Corresponding to the above method, the embodiment of the present invention further provides a video image processing device. Referring to Fig. 8, which shows a schematic structural diagram of the device, it may include: a video acquiring module 801, a target recognition module 802, a target tracking module 803, an information obtaining module 804, and a processing module 805.
The video acquiring module 801 is configured to obtain a video image sequence.
The target recognition module 802 is configured to identify target objects in the video frames of the video image sequence obtained by the video acquiring module 801.
The target tracking module 803 is configured to track the target objects identified by the target recognition module 802 and determine their motion trajectories.
The information obtaining module 804 is configured to obtain video structured information on the basis of the target objects identified by the target recognition module 802 and the motion trajectories determined by the target tracking module 803.
The processing module 805 is configured to perform target object retrieval and/or condense the video image sequence on the basis of the video structured information obtained by the information obtaining module 804.
The video image processing device provided by the present invention can identify and track the target objects in a video image sequence and then obtain video structured information on the basis of the target objects and their motion trajectories. After the video structured information is obtained, retrieval can be performed on the basis of it, so that a target can be located quickly; in addition, video condensation can be performed on the basis of the video structured information, and since the condensed video contains the information of the raw video in fewer frames, a target can also be located quickly by browsing the condensed video. That is, if a user knows some information about the target object in advance, this information can be used directly for retrieval, so that the target object is located quickly; if the user knows nothing about the target object, the condensed video can be browsed directly, so that the target object is likewise located quickly. The video image processing device provided by the embodiment of the present invention can thus locate a target object in a video quickly, i.e., the embodiment of the present invention improves the speed of target investigation and hence the speed of case detection, giving a better user experience.
In the video image processing device provided by the above embodiment, the target recognition module 802 is specifically configured to identify the target objects in each video frame of the video image sequence on the basis of a deep convolutional neural network.
In the video image processing device provided by the above embodiment, the target tracking module 803 is specifically configured to track the target objects using the Lucas-Kanade optical-flow tracking algorithm on the basis of the optical-flow points extracted from the target objects in the video frames.
In the video image processing device provided by the above embodiment, the video structured information obtained by the information obtaining module 804 includes the text information and/or image feature information of the target objects, the text information of a target object including its attribute information and motion information.
The processing module 805 then comprises a retrieval module and an output module.
The retrieval module is configured to: when a retrieval instruction for text information to be retrieved is received, search the text information of the target objects on the basis of the text information to be retrieved; or, when a retrieval instruction for an image to be retrieved is received, search the image feature information of the target objects on the basis of the image to be retrieved; or, when a retrieval instruction for event information to be retrieved is received, search the text information on the basis of the event to be retrieved and a pre-built event model; thereby obtaining a retrieval result.
The output module is configured to output the target object information associated with the retrieval result of the retrieval module.
In the above embodiment, the image feature information of a target object includes a deep convolutional feature and a local feature associated with the deep convolutional feature.
The retrieval module may include a coarse matching module and an exact matching module.
The coarse matching module is configured to match the deep convolutional feature of the image to be retrieved against the image feature information of the target objects by a first matching rule to obtain a candidate feature set.
The exact matching module is configured to match the deep convolutional feature and the local feature of the image to be retrieved within the candidate feature set by a second matching rule to obtain target image features as the retrieval result.
Further, the coarse matching module includes a feature acquisition and processing submodule and a coarse matching submodule.
The feature acquisition and processing submodule is configured to obtain the deep convolutional feature and the local feature of the image to be retrieved and binary-encode the deep convolutional feature of the image to be retrieved to obtain its binary-coded feature; it is further configured to binary-encode each deep convolutional feature in the image feature information of the target objects to obtain the binary-coded feature corresponding to each deep convolutional feature in the image feature information of the target objects.
The coarse matching submodule is configured to match the binary-coded feature of the image to be retrieved against the binary-coded feature corresponding to each deep convolutional feature in the image feature information of the target objects, determine the binary-coded features whose matching degree with the binary-coded feature of the image to be retrieved exceeds a first preset value as target binary-coded features, and take the target deep convolutional features and target local features corresponding to the target coded features as the candidate feature set.
The exact matching module is specifically configured to match the deep convolutional feature of the image to be retrieved against each deep convolutional feature in the candidate feature set, match the local feature of the image to be retrieved against each local feature in the candidate feature set, and take the image features whose combined matching degree of deep convolutional feature and corresponding local feature exceeds a second preset value as the retrieval result.
The output module is then specifically configured to output the target object images associated with the image features whose combined matching degree of deep convolutional feature and corresponding local feature exceeds the second preset value, a target object image being an image of the target object extracted in advance from the video frame in which the target object is located.
In the video image processing device provided by the above embodiment, the video structured information obtained by the information obtaining module 804 includes the image density of each video frame in the video image sequence, the image density characterizing the presence of target objects in the video frame.
The information obtaining module then includes a monitored-area structure determination submodule and an image density determination submodule.
The monitored-area structure determination submodule is configured to determine the structural information of the monitored area in the video image sequence from the position of the target object in the video frame and the motion trajectory of the target object, the structural information of the monitored area including information about the areas in which the target object appears within the monitored area.
The image density determination submodule is configured to determine the image density of each video frame in the video image sequence on the basis of the structural information of the monitored area determined by the monitored-area structure determination submodule.
In the video image processing device provided by the above embodiment, the processing module includes a video pre-processing module and a video condensation module.
The video pre-processing module is configured to segment the video image sequence on the basis of image density and determine the video segments to be condensed from the segments.
The video condensation module is configured to condense the video segments to be condensed and merge the condensed video segments with the other video segments that were not condensed, obtaining the condensed video image sequence.
Further, the video pre-processing module comprises a video segmentation submodule and a to-be-condensed segment determination submodule.
The video segmentation submodule is configured to divide the video image sequence into multiple video segments by applying a preset image density threshold to the image density of each video frame.
The to-be-condensed segment determination submodule is configured to determine the video segments in which the image density of every video frame is below the image density threshold as the video segments to be condensed.
Further, the video condensation module includes an optimal condensation strategy determination submodule and an image fusion submodule.
The optimal condensation strategy determination submodule is configured to determine, by means of a spatio-temporal condensation model, the optimal shift strategy for moving at least one target object in the video segment to be condensed along the time dimension and the spatial dimension.
The image fusion submodule is configured to perform image fusion on the basis of the optimal shift strategy to obtain the condensed video segment.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts the embodiments may be referred to one another.
It should be understood that, in the several embodiments provided in this application, the disclosed method, device, and equipment may be implemented in other ways. For example, the device embodiments described above are merely schematic; the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it contributing to the prior art, or part of the technical solution, may be embodied in the form of a software product; this computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (20)

1. A video image processing method, characterized in that the method comprises:
obtaining a video image sequence;
identifying a target object in the video frames of the video image sequence;
tracking the target object and determining the motion trajectory of the target object;
obtaining video structured information on the basis of the target object and the motion trajectory of the target object; and
performing target object retrieval and/or condensing the video image sequence on the basis of the video structured information.
2. The video image processing method according to claim 1, characterized in that identifying a target object in each video frame of the video image sequence comprises:
identifying the target object in each video frame of the video image sequence on the basis of a deep convolutional neural network.
3. The video image processing method according to claim 1, characterized in that tracking the target object comprises:
tracking the target object with a Lucas-Kanade optical flow tracking algorithm, based on optical flow points extracted from the target object in the video frame images.
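The claim names the Lucas-Kanade optical flow algorithm. As a rough illustration of its core step (not the patent's implementation, which would typically use a pyramidal variant such as OpenCV's), the sketch below solves the standard least-squares system Ix·dx + Iy·dy = −It over a small window around one tracked point; the window size and gradient scheme are arbitrary choices here.

```python
import numpy as np

def lucas_kanade_step(prev, curr, pt, win=7):
    """Estimate the flow (dx, dy) at integer point pt=(x, y) between two
    grayscale frames, by least squares over a win x win window."""
    x, y = pt
    r = win // 2
    P = prev.astype(float)
    C = curr.astype(float)
    # Spatial gradients (central differences) and temporal gradient.
    Ix = (np.roll(P, -1, axis=1) - np.roll(P, 1, axis=1)) / 2.0
    Iy = (np.roll(P, -1, axis=0) - np.roll(P, 1, axis=0)) / 2.0
    It = C - P
    sl = np.s_[y - r:y + r + 1, x - r:x + r + 1]
    # Brightness constancy: Ix*dx + Iy*dy = -It, stacked over the window.
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow  # array [dx, dy]
```

Repeating this per optical flow point per frame pair, and chaining the displacements, yields the movement trajectory the claim refers to.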
4. The video image processing method according to claim 1, characterized in that the video structured information comprises: text information and/or image feature information of the target object, the text information of the target object comprising attribute information and movement information of the target object;
performing target object retrieval based on the video structured information then comprises:
when a retrieval instruction for text information to be retrieved is received, retrieving in the text information of the target object based on the text information to be retrieved; or, when a retrieval instruction for an image to be retrieved is received, retrieving in the image feature information of the target object based on the image to be retrieved; or, when a retrieval instruction for event information to be retrieved is received, retrieving in the text information based on the event to be retrieved and a pre-built event model, to obtain a retrieval result; and
outputting target object information associated with the retrieval result.
5. The video image processing method according to claim 4, characterized in that the image feature information of the target object comprises deep convolution features and local features;
retrieving in the image feature information of the target object based on the image to be retrieved to obtain a retrieval result comprises:
matching a deep convolution feature of the image to be retrieved against the image feature information of the target object according to a first matching rule, to obtain a candidate feature set; and
matching the deep convolution feature and a local feature of the image to be retrieved within the candidate feature set according to a second matching rule, to obtain target image features as the retrieval result.
6. The video image processing method according to claim 5, characterized in that matching the deep convolution feature of the image to be retrieved against the image feature information of the target object according to the first matching rule to obtain a candidate feature set comprises:
obtaining the deep convolution feature and the local feature of the image to be retrieved, and binary-coding the deep convolution feature of the image to be retrieved to obtain a binary code feature of the image to be retrieved; and
matching the binary code feature of the image to be retrieved against the binary code feature corresponding to each deep convolution feature in the image feature information of the target object, determining binary code features whose matching degree with the binary code feature of the image to be retrieved exceeds a first preset value as target binary code features, and taking the target deep convolution features and target local features corresponding to the target code features as the candidate feature set;
matching the deep convolution feature and the local feature of the image to be retrieved within the candidate feature set according to the second matching rule to obtain target image features as the retrieval result comprises:
matching the deep convolution feature of the image to be retrieved against each deep convolution feature in the candidate feature set, matching the local feature of the image to be retrieved against each local feature in the candidate feature set, and taking image features whose combined matching degree of the deep convolution feature and the corresponding local feature exceeds a second preset value as the retrieval result;
outputting the target object information associated with the retrieval result is specifically:
outputting the target object image associated with the image features whose combined matching degree of the deep convolution feature and the corresponding local feature exceeds the second preset value, the target object image being an image of the target object extracted in advance from the video frame image in which the target object is located.
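The coarse-to-fine matching of claims 5 and 6 can be sketched as: a cheap binary-code comparison to build the candidate set (first matching rule), then full-precision matching on the candidates (second matching rule). The claims fix neither the binary coding scheme nor the similarity measures, so the sign/mean thresholding, Hamming-style bit agreement, and cosine similarity below are stand-ins, with `first_preset` and `second_preset` playing the roles of the claims' preset values; the local-feature component of the combined matching degree is omitted for brevity.

```python
import numpy as np

def binarize(feat):
    # Stand-in binary coding: threshold each dimension at the vector mean.
    # The claims require *a* binary coding of the deep convolution feature
    # but do not fix the scheme.
    return feat > feat.mean()

def coarse_to_fine(query_feat, gallery_feats, first_preset=0.5, second_preset=0.8):
    """Two-stage retrieval over a gallery of deep convolution features.
    Returns the gallery indices accepted as the retrieval result."""
    q_code = binarize(query_feat)
    # Coarse stage: fraction of agreeing bits (1 - normalized Hamming distance).
    candidates = [
        i for i, g in enumerate(gallery_feats)
        if (binarize(g) == q_code).mean() > first_preset
    ]
    # Fine stage: cosine similarity on the real-valued deep features.
    results = []
    for i in candidates:
        g = gallery_feats[i]
        cos = float(query_feat @ g /
                    (np.linalg.norm(query_feat) * np.linalg.norm(g)))
        if cos > second_preset:
            results.append(i)
    return results
```

The point of the two stages is cost: bit comparisons prune most of the gallery before the more expensive real-valued matching runs.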
7. The video image processing method according to claim 1, characterized in that the video structured information comprises: an image density of each video frame image in the video image sequence, the image density characterizing the presence of target objects in the video frame image;
obtaining the structured information based on the target object and the movement trajectory of the target object then comprises:
determining structure information of a monitored area in the video image sequence from the position of the target object in the video frame images and the movement trajectory of the target object, the structure information of the monitored area comprising area information of where the target object appears in the monitored area; and
determining the image density of each video frame image in the video image sequence based on the structure information of the monitored area.
8. The video image processing method according to claim 7, characterized in that performing video condensation on the video image sequence based on the video structured information comprises:
segmenting the video image sequence based on the image density of each video frame image, and determining video segments to be condensed from the segments; and
performing video condensation on the video segments to be condensed, and merging the condensed video segments with the video segments not subjected to video condensation, to obtain a condensed video image sequence.
9. The video image processing method according to claim 8, characterized in that segmenting the video image sequence based on the image density and determining video segments to be condensed from the segments comprises:
dividing the video image sequence into a plurality of video segments according to the image density of each video frame image using a preset image density threshold, each video segment comprising a plurality of consecutive video frame images; and
determining video segments in which the image density of every video frame image exceeds the image density threshold as the video segments to be condensed.
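The thresholding in claim 9 amounts to a run-length split of the per-frame density curve: maximal runs of frames whose density exceeds the preset threshold become the segments to be condensed, and everything else is left alone. A minimal sketch (the strict `>` comparison follows the claim's "greater than" wording):

```python
def segments_to_condense(densities, thresh):
    """Split a per-frame image-density sequence at the threshold and return
    (start, end) index ranges (end exclusive) in which every frame's
    density exceeds thresh -- the segments selected for condensation."""
    segs, start = [], None
    for i, d in enumerate(densities):
        if d > thresh:
            if start is None:       # a dense run begins
                start = i
        elif start is not None:     # a dense run just ended
            segs.append((start, i))
            start = None
    if start is not None:           # run extends to the end of the sequence
        segs.append((start, len(densities)))
    return segs
```

For example, densities `[0, 3, 4, 1, 5, 6, 7, 0]` with threshold 2 split into the two to-be-condensed ranges `(1, 3)` and `(4, 7)`.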
10. The video image processing method according to claim 8, characterized in that performing video condensation on the video segments to be condensed comprises:
determining, by a space-time condensation model, an optimal shift strategy for moving at least one target object in the video segment to be condensed in the time dimension and the space dimension; and
performing image fusion based on the optimal shift strategy to obtain the condensed video segment.
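Claim 10 leaves the space-time condensation model unspecified; published video-synopsis work typically poses the shift strategy as an energy minimization over object "tubes" (one tracked object's sequence of appearances). Purely as a toy illustration of shifting tubes along the time dimension, the greedy packer below overlays tubes from different spatial regions and serializes tubes that share a region; it is not the patent's model, and `region` is a hypothetical stand-in for a real spatial-overlap test.

```python
def greedy_shift(tubes):
    """Toy shift strategy. tubes: list of (duration, region) pairs, one per
    tracked object. Tubes in different spatial regions may be overlaid in
    time; tubes sharing a region are played back to back. Returns the start
    frame assigned to each tube in the condensed segment."""
    next_free = {}   # region -> first frame not yet occupied in that region
    starts = []
    for duration, region in tubes:
        start = next_free.get(region, 0)
        starts.append(start)
        next_free[region] = start + duration
    return starts
```

With tubes `[(5, "left"), (3, "right"), (4, "left")]` the two left-region tubes are serialized (starts 0 and 5) while the right-region tube is overlaid at start 0, so the condensed segment lasts 9 frames instead of the sum of the durations; image fusion would then composite the shifted tubes onto a common background.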
11. A video image processing device, characterized in that the device comprises: a video obtaining module, a target identification module, a target tracking module, an information obtaining module, and a processing module;
the video obtaining module is configured to obtain a video image sequence;
the target identification module is configured to identify a target object in video image frames of the video image sequence obtained by the video obtaining module;
the target tracking module is configured to track the target object identified by the target identification module and determine a movement trajectory of the target object;
the information obtaining module is configured to obtain video structured information based on the target object identified by the target identification module and the movement trajectory of the target object determined by the target tracking module; and
the processing module is configured to perform target object retrieval and/or perform video condensation on the video image sequence based on the video structured information obtained by the information obtaining module.
12. The video image processing device according to claim 11, characterized in that the target identification module is specifically configured to identify the target object in each video frame image of the video image sequence based on a deep convolutional neural network.
13. The video image processing device according to claim 11, characterized in that the target tracking module is specifically configured to track the target object with a Lucas-Kanade optical flow tracking algorithm, based on optical flow points extracted from the target object in the video frame images.
14. The video image processing device according to claim 11, characterized in that the video structured information comprises: text information and/or image feature information of the target object, the text information of the target object comprising attribute information and movement information of the target object;
the processing module comprises: a retrieval module and an output module;
the retrieval module is configured to: when a retrieval instruction for text information to be retrieved is received, retrieve in the text information of the target object based on the text information to be retrieved; or, when a retrieval instruction for an image to be retrieved is received, retrieve in the image feature information of the target object based on the image to be retrieved; or, when a retrieval instruction for event information to be retrieved is received, retrieve in the text information based on the event to be retrieved and a pre-built event model, to obtain a retrieval result; and
the output module is configured to output target object information associated with the retrieval result of the retrieval module.
15. The video image processing device according to claim 14, characterized in that the image feature information of the target object comprises deep convolution features and local features;
the retrieval module comprises: a coarse matching module and a fine matching module;
the coarse matching module is configured to match a deep convolution feature of the image to be retrieved against the image feature information of the target object according to a first matching rule, to obtain a candidate feature set; and
the fine matching module is configured to match the deep convolution feature and a local feature of the image to be retrieved within the candidate feature set according to a second matching rule, to obtain target image features as the retrieval result.
16. The video image processing device according to claim 15, characterized in that the coarse matching module comprises: a feature obtaining and processing submodule and a coarse matching submodule;
the feature obtaining and processing submodule is configured to obtain the deep convolution feature and the local feature of the image to be retrieved, binary-code the deep convolution feature of the image to be retrieved to obtain a binary code feature of the image to be retrieved, and further binary-code each deep convolution feature in the image feature information of the target object to obtain the binary code feature corresponding to each deep convolution feature in the image feature information of the target object;
the coarse matching submodule is configured to match the binary code feature of the image to be retrieved against the binary code feature corresponding to each deep convolution feature in the image feature information of the target object, determine binary code features whose matching degree with the binary code feature of the image to be retrieved exceeds a first preset value as target binary code features, and take the target deep convolution features and target local features corresponding to the target code features as the candidate feature set;
the fine matching module is specifically configured to match the deep convolution feature of the image to be retrieved against each deep convolution feature in the candidate feature set, match the local feature of the image to be retrieved against each local feature in the candidate feature set, and take image features whose combined matching degree of the deep convolution feature and the corresponding local feature exceeds a second preset value as the retrieval result; and
the output module is specifically configured to output the target object image associated with the image features whose combined matching degree of the deep convolution feature and the corresponding local feature exceeds the second preset value, the target object image being an image of the target object extracted in advance from the video frame image in which the target object is located.
17. The video image processing device according to claim 11, characterized in that the video structured information comprises: an image density of each video frame image in the video image sequence, the image density characterizing the presence of target objects in the video frame image;
the information obtaining module comprises: a monitored area structure determining submodule and an image density determining submodule;
the monitored area structure determining submodule is configured to determine structure information of a monitored area in the video image sequence from the position of the target object in the video frame images and the movement trajectory of the target object, the structure information of the monitored area comprising area information of where the target object appears in the monitored area; and
the image density determining submodule is configured to determine the image density of each video frame image in the video image sequence based on the structure information of the monitored area determined by the monitored area structure determining submodule.
18. The video image processing device according to claim 17, characterized in that the processing module comprises: a video preprocessing module and a video condensation module;
the video preprocessing module is configured to segment the video image sequence based on the image density and determine video segments to be condensed from the segments; and
the video condensation module is configured to perform video condensation on the video segments to be condensed, and merge the condensed video segments with the video segments not subjected to video condensation, to obtain a condensed video image sequence.
19. The video image processing device according to claim 18, characterized in that the video preprocessing module comprises: a video segmentation submodule and a to-be-condensed video segment determining submodule;
the video segmentation submodule is configured to divide the video image sequence into a plurality of video segments according to the image density of each video frame image using a preset image density threshold; and
the to-be-condensed video segment determining submodule is configured to determine video segments in which the image density of every video frame image exceeds the image density threshold as the video segments to be condensed.
20. The video image processing device according to claim 18, characterized in that the video condensation module comprises: an optimal condensation strategy determining submodule and an image fusion submodule;
the optimal condensation strategy determining submodule is configured to determine, by a space-time condensation model, an optimal shift strategy for moving at least one target object in the video segment to be condensed in the time dimension and the space dimension; and
the image fusion submodule is configured to perform image fusion based on the optimal shift strategy to obtain the condensed video segment.
CN201610765659.5A 2016-08-30 2016-08-30 video image processing method and device Active CN106354816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610765659.5A CN106354816B (en) 2016-08-30 2016-08-30 video image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610765659.5A CN106354816B (en) 2016-08-30 2016-08-30 video image processing method and device

Publications (2)

Publication Number Publication Date
CN106354816A true CN106354816A (en) 2017-01-25
CN106354816B CN106354816B (en) 2019-12-13

Family

ID=57856028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610765659.5A Active CN106354816B (en) 2016-08-30 2016-08-30 video image processing method and device

Country Status (1)

Country Link
CN (1) CN106354816B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930061A (en) * 2012-11-28 2013-02-13 安徽水天信息科技有限公司 Video abstraction method and system based on moving target detection
US8582807B2 (en) * 2010-03-15 2013-11-12 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN105512684A (en) * 2015-12-09 2016-04-20 江苏大为科技股份有限公司 Vehicle logo automatic identification method based on principal component analysis convolutional neural network


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG HAO: "Computer Information Retrieval", 30 November 2001, Northwest University Press *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664844A (en) * 2017-03-28 2018-10-16 爱唯秀股份有限公司 The image object semantics of convolution deep neural network identify and tracking
CN107038713A (en) * 2017-04-12 2017-08-11 南京航空航天大学 A kind of moving target method for catching for merging optical flow method and neutral net
CN107346415A (en) * 2017-06-08 2017-11-14 小草数语(北京)科技有限公司 Method of video image processing, device and monitoring device
US11398084B2 (en) 2017-06-30 2022-07-26 Hangzhou Hikvision Digital Technology Co., Ltd. Method, apparatus and application system for extracting a target feature
CN109215055A (en) * 2017-06-30 2019-01-15 杭州海康威视数字技术股份有限公司 A kind of target's feature-extraction method, apparatus and application system
CN107506370A (en) * 2017-07-07 2017-12-22 大圣科技股份有限公司 Multi-medium data depth method for digging, storage medium and electronic equipment
CN107633480A (en) * 2017-09-14 2018-01-26 光锐恒宇(北京)科技有限公司 A kind of image processing method and device
CN107730560A (en) * 2017-10-17 2018-02-23 张家港全智电子科技有限公司 A kind of target trajectory extracting method based on sequence of video images
CN107705324A (en) * 2017-10-20 2018-02-16 中山大学 A kind of video object detection method based on machine learning
CN109803067A (en) * 2017-11-16 2019-05-24 富士通株式会社 Video concentration method, video enrichment facility and electronic equipment
CN108875517A (en) * 2017-12-15 2018-11-23 北京旷视科技有限公司 Method for processing video frequency, device and system and storage medium
CN109993032A (en) * 2017-12-29 2019-07-09 杭州海康威视数字技术股份有限公司 A kind of shared bicycle target identification method, device and camera
CN109993032B (en) * 2017-12-29 2021-09-17 杭州海康威视数字技术股份有限公司 Shared bicycle target identification method and device and camera
CN108304808B (en) * 2018-02-06 2021-08-17 广东顺德西安交通大学研究院 Monitoring video object detection method based on temporal-spatial information and deep network
CN108304808A (en) * 2018-02-06 2018-07-20 广东顺德西安交通大学研究院 A kind of monitor video method for checking object based on space time information Yu depth network
CN110659384A (en) * 2018-06-13 2020-01-07 杭州海康威视数字技术股份有限公司 Video structured analysis method and device
WO2020038243A1 (en) * 2018-08-21 2020-02-27 腾讯科技(深圳)有限公司 Video abstract generating method and apparatus, computing device, and storage medium
US11347792B2 (en) 2018-08-21 2022-05-31 Tencent Technology (Shenzhen) Company Limited Video abstract generating method, apparatus, and storage medium
CN109508408A (en) * 2018-10-25 2019-03-22 北京陌上花科技有限公司 A kind of video retrieval method and computer readable storage medium based on frame density
CN109657546A (en) * 2018-11-12 2019-04-19 平安科技(深圳)有限公司 Video behavior recognition methods neural network based and terminal device
CN111435370A (en) * 2019-01-11 2020-07-21 富士通株式会社 Information processing apparatus, method, and machine-readable storage medium
CN109977816A (en) * 2019-03-13 2019-07-05 联想(北京)有限公司 A kind of information processing method, device, terminal and storage medium
CN110008859A (en) * 2019-03-20 2019-07-12 北京迈格威科技有限公司 The dog of view-based access control model only recognition methods and device again
CN110175263A (en) * 2019-04-08 2019-08-27 浙江大华技术股份有限公司 A kind of method of positioning video frame, the method and terminal device for saving video
CN110188617A (en) * 2019-05-05 2019-08-30 深圳供电局有限公司 A kind of machine room intelligent monitoring method and system
CN110264496A (en) * 2019-06-03 2019-09-20 深圳市恩钛控股有限公司 Video structural processing system and method
CN110225310A (en) * 2019-06-24 2019-09-10 浙江大华技术股份有限公司 Computer readable storage medium, the display methods of video and device
CN110363171A (en) * 2019-07-22 2019-10-22 北京百度网讯科技有限公司 The method of the training method and identification sky areas of sky areas prediction model
CN110751065A (en) * 2019-09-30 2020-02-04 北京旷视科技有限公司 Training data acquisition method and device
CN111696136A (en) * 2020-06-09 2020-09-22 电子科技大学 Target tracking method based on coding and decoding structure
CN111898416A (en) * 2020-06-17 2020-11-06 绍兴埃瓦科技有限公司 Video stream processing method and device, computer equipment and storage medium
CN112036306A (en) * 2020-08-31 2020-12-04 公安部第三研究所 System and method for realizing target tracking based on monitoring video analysis
CN112422898B (en) * 2020-10-27 2022-06-17 中电鸿信信息科技有限公司 Video concentration method introducing deep behavior understanding
CN112422898A (en) * 2020-10-27 2021-02-26 中电鸿信信息科技有限公司 Video concentration method introducing deep behavior understanding
CN113515649A (en) * 2020-11-19 2021-10-19 阿里巴巴集团控股有限公司 Data structuring method, system, device, equipment and storage medium
CN113515649B (en) * 2020-11-19 2024-03-01 阿里巴巴集团控股有限公司 Data structuring method, system, device, equipment and storage medium
CN112579811A (en) * 2020-12-11 2021-03-30 公安部第三研究所 Target image retrieval and identification system, method, device, processor and computer-readable storage medium for video detection
CN113365104A (en) * 2021-06-04 2021-09-07 中国建设银行股份有限公司 Video concentration method and device
CN113365104B (en) * 2021-06-04 2022-09-09 中国建设银行股份有限公司 Video concentration method and device
CN113641852A (en) * 2021-07-13 2021-11-12 彩虹无人机科技有限公司 Unmanned aerial vehicle photoelectric video target retrieval method, electronic device and medium
CN113949823A (en) * 2021-09-30 2022-01-18 广西中科曙光云计算有限公司 Video concentration method and device
CN113792150A (en) * 2021-11-15 2021-12-14 湖南科德信息咨询集团有限公司 Man-machine cooperative intelligent demand identification method and system
CN113792150B (en) * 2021-11-15 2022-02-11 湖南科德信息咨询集团有限公司 Man-machine cooperative intelligent demand identification method and system
CN114998810A (en) * 2022-07-11 2022-09-02 北京烽火万家科技有限公司 AI video deep learning system based on neural network

Also Published As

Publication number Publication date
CN106354816B (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN106354816A (en) Video image processing method and video image processing device
Wang et al. Detection and localization of image forgeries using improved mask regional convolutional neural network
Zapletal et al. Vehicle re-identification for automatic video traffic surveillance
CN108334881B (en) License plate recognition method based on deep learning
Kuo et al. Deep aggregation net for land cover classification
CN109063649B (en) Pedestrian re-identification method based on twin pedestrian alignment residual error network
Nguyen et al. Anomaly detection in traffic surveillance videos with gan-based future frame prediction
CN103136516A (en) Face recognition method and system fusing visible light and near-infrared information
CN110751018A (en) Group pedestrian re-identification method based on mixed attention mechanism
Passos et al. A review of deep learning‐based approaches for deepfake content detection
Farag A lightweight vehicle detection and tracking technique for advanced driving assistance systems
CN111833380B (en) Multi-view image fusion space target tracking system and method
Zhu et al. Towards automatic wild animal detection in low quality camera-trap images using two-channeled perceiving residual pyramid networks
Yu et al. Manipulation classification for jpeg images using multi-domain features
CN107092935A (en) A kind of assets alteration detection method
Hu et al. Spatial-temporal fusion convolutional neural network for simulated driving behavior recognition
Shahbaz et al. Deep atrous spatial features-based supervised foreground detection algorithm for industrial surveillance systems
Xia et al. Abnormal event detection method in surveillance video based on temporal CNN and sparse optical flow
CN112395953A (en) Road surface foreign matter detection system
CN105760885A (en) Bloody image detection classifier implementing method, bloody image detection method and bloody image detection system
Zou et al. Deep learning-based pavement cracks detection via wireless visible light camera-based network
Sun et al. NTT_CQUPT@ TRECVID2019 ActEV: Activities in Extended Video.
Vijay Gopal et al. A deep learning approach to image splicing using depth map
CN112184566A (en) Image processing method and system for removing attached water mist droplets
Sahoo et al. Depth estimated history image based appearance representation for human action recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant