CN108846384A - Multi-task collaborative recognition method and system fusing video perception - Google Patents
- Publication number
- CN108846384A (application CN201810744934.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- task
- video
- collaboration
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides a multi-task collaborative recognition method and system fusing video perception, belonging to the technical field of multi-source heterogeneous video data processing and recognition. Drawing on biological perception mechanisms, it studies a shared semantic description based on the collaborative features of multi-source heterogeneous video data and obtains a generic feature description of that data. Using context-aware computing theory, it establishes a feature-association learning and task-prediction mechanism for task collaboration, realizing a context-aware task interaction prediction mechanism. Combining long-term dependencies, it proposes a context-collaborative visual multi-task deep recognition model with long-term memory, addressing the poor generalization, low robustness, and high computational complexity of video multi-task recognition. The invention proposes an intelligent, generalizable, mobile method for describing video commonality together with a multi-task deep collaborative recognition model, which can promote the development of intelligent information push, personalized control services, and related applications of multi-source heterogeneous video data in smart cities.
Description
Technical field
The present invention relates to the technical field of multi-source heterogeneous video data processing and recognition, and in particular to a multi-task collaborative recognition method and system fusing video perception.
Background art
Supported by the development of big data, cloud computing, and intelligent terminals, artificial intelligence built on deep neural networks is entering a new era of comprehensive development. Facing the urgent need to store and process massive data at ultra-high speed, with mobility and generality, special-purpose artificial intelligence based on single-modality, single-task models has become an important bottleneck restricting the field's development.
Traditional single-task recognition cannot meet the generality requirements of the artificial-intelligence era. Taking the most representative tasks in smart-city construction as an example, face video recognition, human behavior recognition, and vehicle classification are often required simultaneously, while video capture cameras vary widely in type and specification, so the video data is massive and multi-source. Regular, isomorphic video feature description methods and efficient collaborative recognition mechanisms are needed to accurately recognize targets, scenes, behaviors, and abnormal events. A visual recognition mechanism oriented to multi-task deep collaboration can therefore lay an important theoretical foundation for future intelligent information push and personalized control services.
So-called multi-task deep collaborative recognition for multi-source video perception extracts the generic features of multi-source heterogeneous video data based on biological perception mechanisms, performs feature-association learning and task prediction using context-aware theory, and builds a deep collaborative recognition network with long-term memory, realizing context-level multi-task collaborative perception and recognition. For example, in a video clip of "Xiao Ming greets me in a restaurant", multiple visual tasks are recognized simultaneously: the scene (restaurant), the target (Xiao Ming), the behavior (greeting), and the expression (smiling). This avoids building a separate recognition model for each task and outputting results independently, which wastes computing resources, struggles with massive data, and falls short of practical requirements.
In current visual recognition technology, feature extraction based on deep learning shows excellent performance on single recognition tasks such as scene, target, behavior, and expression recognition. However, for massive multi-source data, as user scale grows, scenes change, and time passes, new problems arise:
Generalization bottleneck: data distributions differ significantly across task modalities. Small-scale data tasks are prone to overfitting, while massive-data tasks face high training and labeling costs, so balanced generalization cannot be achieved across tasks, and model performance degrades markedly under changing environments or scenes;
Efficiency bottleneck: deep network models are complex, with huge parameter counts. Although generative adversarial networks, capsule networks, and similar approaches have made good attempts at reducing data requirements and resource consumption, it remains difficult to distribute resources rapidly, evenly, and efficiently across different recognition tasks and heterogeneous network structures;
Migration bottleneck: when scenes change, models cannot make associative predictions from historical data, nor establish selective long-term memory and forgetting mechanisms for context-adaptive transfer learning. For example, when Xiao Ming walks from a classroom to a restaurant, the recognition of the target behavior should migrate from studying to eating.
Therefore, deep collaborative recognition modeling with task-collaborative interaction prediction and context-level collaboration in visual multi-task learning has become a key unsolved problem in current intelligent visual perception.
With the continued growth of communication bandwidth and transmission speed, video data volume increases exponentially, putting immense pressure on limited computing and storage resources. Traditional knowledge-based single-task processing assigns different data to different tasks and finds a separate feature description method for each, leading to low resource utilization and poor data descriptiveness. In conventional visual perception, a learning model generally no longer changes once established; in the intelligent-perception era, however, as scenes evolve over time and space, the original model becomes sub-optimal. Potential association relationships exist between different modalities within a scene. For scenes changing with space and time, a task-modality association-mining mechanism under context-aware perception can learn feature associations under scene change, achieving dynamic prediction and self-labeling of collaborative tasks on massive data and keeping the learned recognition model adaptively optimal. Single-modality recognition cannot perform effective long-term memory reasoning over learned features, nor recognize multiple visual tasks simultaneously as scenes change dynamically. When a sudden task arises or a new target to be recognized appears, the model cannot handle it in an interoperable way, nor balance network lightweighting against high utilization.
Summary of the invention
The purpose of the present invention is to provide a generalizable collaborative feature description mechanism for multi-source heterogeneous data, so that video information obtained from different data sources effectively complements itself, evolving the traditional single-source fixed model into a multi-source elastic model that removes data redundancy while retaining shared semantic information; and to establish a multi-task recognition method and system fusing video perception with a high dynamic admission rate, high resource utilization, and a low network consumption rate, thereby solving the technical problems described in the background art above.
To achieve the above goals, the present invention adopts the following technical solutions. In one aspect, the present invention provides a multi-task recognition method fusing video perception, comprising the following steps:
Step S110: combining biological perception mechanisms, extract the generic features of multi-source heterogeneous video data through a shared semantic mechanism based on the collaborative features of that data;
Step S120: using context-aware computing theory, establish a feature-association learning mechanism for task collaboration, continuously learn with the generic features of the multi-source heterogeneous video data as prior knowledge, and generate a context-aware task interaction prediction model;
Step S130: for long-duration input video streams, establish a long-term-dependency generative memory model from the context-aware task interaction prediction model, build an autonomous semi-supervised continual recognition system based on deep cooperative dynamics, and realize multi-task recognition.
Further, in step S110, the shared semantic mechanism of the collaborative features of the multi-source heterogeneous video data comprises: establishing a three-level feature collaboration mechanism, namely primitive collaboration based on the multi-source heterogeneous video data, dictionary collaboration based on time synchronization, and topic collaboration based on semantic similarity; and, combined with the attributes of the multi-source heterogeneous video data, building a feature collaboration model for that data and determining regular, dimensionally consistent shared semantic association relationships; wherein,
Primitive collaboration based on multi-source heterogeneous video data: train video image primitives with independent component analysis, match each primitive in turn with Gabor functions, estimate the scale and orientation corresponding to each video image primitive, extract the primitive features of the video image, and realize efficient space-time coding of the video image's internal structure;
Dictionary collaboration based on time synchronization: use locally linear coding, with local distance as the regularization term on the sparse basis functions, to compute the best response signal of the original dictionary; use that best response signal to compute a feasible dictionary search direction and complete one dictionary update; establish a coded concept stream for each data channel as the reference semantic coding of complex events; dynamically time-align the newly input low-level feature stream with the reference semantic coding; generate a time translation function; and realize dictionary semantic alignment;
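The dynamic time alignment between a newly input feature stream and the reference semantic coding can be illustrated with classic dynamic time warping. A minimal sketch, with 1-D toy sequences standing in for the coded concept streams (the patent does not specify the alignment algorithm, so DTW here is an assumption):

```python
import numpy as np

def dtw_align(a, b):
    """Dynamic time warping cost between two 1-D concept-code streams."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

ref   = [0, 1, 2, 3]   # reference semantic coding
fast  = [0, 2, 3]      # the same pattern, played faster
other = [3, 3, 3, 0]   # an unrelated stream
print(dtw_align(ref, fast) < dtw_align(ref, other))  # -> True
```

The warp path recovered from `D` plays the role of the "time translation function" mentioned above: it says which reference frame each input frame aligns to.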
Topic collaboration based on semantic similarity: use latent semantic analysis to build a co-occurrence matrix between the dictionary and the video image primitive features; embody the semantic concept corresponding to each topic with hidden nodes; realize the description of the mapping relationships among vocabulary, topic nodes, and scenes through probabilistic inference; and, taking the video's conditional probability under the topic distribution as the category similarity, compute the likelihood function of the probability between true vocabulary and scene and the prediction probability.
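The co-occurrence-plus-latent-topic idea can be illustrated with a truncated SVD over a toy word-by-primitive co-occurrence matrix. This is a stand-in for the probabilistic latent semantic analysis named above, and the matrix values are invented purely for illustration:

```python
import numpy as np

# Toy co-occurrence matrix: rows = dictionary words, columns = primitive features.
# The first two words co-occur with one group of primitives, the last two with another.
C = np.array([[4., 5., 0., 1.],
              [5., 4., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])

# Truncated SVD gives each word a coordinate in a latent "topic" space.
U, s, Vt = np.linalg.svd(C)
topics = U[:, :2] * s[:2]   # 2-D latent embedding of the four words

def similarity(i, j):
    a, b = topics[i], topics[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words from the same scene should land closer in topic space.
print(similarity(0, 1) > similarity(0, 2))  # -> True
```

The hidden SVD dimensions play the role of the topic nodes; in the probabilistic version, topic-conditional word distributions replace the singular vectors.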
Further, establishing the multi-source heterogeneous video data feature collaboration model and determining the regular shared semantic association relationships comprises: assuming C classes of heterogeneous channel features, denote the feature matrix of the n<sub>i</sub> training samples of the i-th channel as X<sub>i</sub>, with data-noise part E and rotation factor Γ, and establish an optimization function under orthogonality constraints, in which λ denotes the sharing-matrix coefficient, <sup>T</sup> denotes matrix transposition, Y<sub>i</sub> denotes the i-th feature class label, F denotes the Frobenius norm, Θ<sub>i</sub><sup>T</sup> the transpose of the projection matrix Θ<sub>i</sub>, α, β, μ<sub>1</sub> and μ<sub>2</sub> multiplier factors, and rank(X) the rank of the feature matrix X;
obtain the low-dimensional manifold subspaces {Θ<sub>i</sub>} of the general semantic features, the semantic sharing matrix W<sub>0</sub> under a unified framework, and the specific feature module matrices {W<sub>i</sub>}, using least squares to solve the joint optimum of the prediction loss function R<sub>1</sub>(W<sub>0</sub>, {W<sub>i</sub>}, {Θ<sub>i</sub>}), the reconstruction loss function R<sub>2</sub>({Θ<sub>i</sub>}), and the regularization function R<sub>3</sub>(W<sub>0</sub>, {W<sub>i</sub>});
and project newly input multi-source heterogeneous video data into the feature subspace to extract high-level, dimensionally consistent generic feature descriptions, establishing the shared semantic association relationships.
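The joint optimization itself is not reproduced in this text, but its intent, that heterogeneous channels sharing a semantic subspace recover consistent weights on the shared coordinates, can be illustrated with plain ridge regression as a stand-in for the regularized joint solver (the data and the solver choice here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "channels" observe the same 3-D latent signal plus private noise dimensions.
shared = rng.normal(size=(100, 3))
X1 = np.hstack([shared, rng.normal(size=(100, 2))])
X2 = np.hstack([shared, rng.normal(size=(100, 4))])
y = shared @ np.array([1.0, -2.0, 0.5])   # labels depend only on the shared part

def ridge(X, y, lam=1e-2):
    """Closed-form ridge regression (stand-in for the regularized joint solver)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

W1, W2 = ridge(X1, y), ridge(X2, y)
# Both channels recover essentially the same weights on the shared coordinates,
# which is the "shared semantic" part that W0 captures in the patent's model.
print(np.allclose(W1[:3], W2[:3], atol=0.1))  # -> True
```

The patent's formulation additionally couples the channels through rank and orthogonality constraints; this sketch only shows why a shared component is recoverable per channel at all.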
Further, in step S120, using context-aware computing theory, establishing the feature-association learning mechanism of task collaboration and generating the context-aware task interaction prediction model comprises:
constructing a mapping function between visual labels and generic features under a low-rank constraint, realizing feature-label collaboration; introducing the nuclear norm to model label correlation and feature correlation, while introducing a graph regularization term to retain the intrinsic structure of the existing data, realizing label prediction for unlabeled features; and establishing an unconstrained function of the form min<sub>g</sub> Q(Y, g(X)) + λΦ(g) + γΛ(g),
where g is the mapping function of feature-association learning; the data fidelity term Q(·) minimizes the loss between the given labels and the task prediction obtained by the function g, fitting the given labels; Φ(g) and Λ(g) are regularization terms based on prior assumptions; and λ and γ are regularization parameters;
The context-aware task interaction prediction model comprises an interactive environment, an environment model, and a loss model.
The environment model learns the dynamic environmental change of the input features; the loss model estimates the environment model's loss and predicts the visual region, target, and task that will need attention next.
The interactive environment defines a state space composed of the generic feature descriptions at times t and t-1; the current state at time t specifies the recognition task a<sub>t</sub> and predicts the task state to be recognized at the next time t+1.
The environment model takes the historical information h ∈ H, a generic-feature history mapping function ξ: H → X, and a ground-truth-label history mapping function η: H → Y, and learns the environment-model mapping ξ(h) → η(h). Denote the environment model ω, with ω(ξ(h)) ∈ Y; at each task prediction, a loss model L<sub>wm</sub>(ω(ξ(h)), η(h)) is introduced. Task prediction involves H = {h = (s<sub>t-k</sub>, a<sub>t-k</sub>, ···, s<sub>t</sub>, a<sub>t</sub>, s<sub>t+1</sub>)}, with ξ(h) = (s<sub>t-k</sub>, a<sub>t-k</sub>, ···, s<sub>t</sub>, a<sub>t</sub>) and η(h) = s<sub>t+1</sub>. An inverse-dynamics prediction mechanism and a softmax cross-entropy loss predict the future state; a neural network model ω<sub>φ</sub> based on stochastic gradient descent encodes the states into a low-dimensional latent space with shared weights to complete visual-attention region extraction and state prediction.
The loss model, given the state s<sub>t</sub> and the suggested next task, predicts the probability distribution over the environment model's R<sub>l</sub> tasks; a softmax cross-entropy loss function encodes the state of the next task as the penalty term.
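The softmax cross-entropy loss used by both the environment model and the loss model has a standard form. A minimal sketch with a hypothetical three-task setup (the task names and logit values are invented for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(logits, target):
    """Softmax cross-entropy between predicted next-task logits and the true task id."""
    return -np.log(softmax(logits)[target])

# Hypothetical 3-task setup: e.g. scene, face, and action recognition.
logits_good = np.array([0.1, 3.0, 0.2])   # model confident the next task is task 1
logits_bad  = np.array([3.0, 0.1, 0.2])   # model confident in the wrong task
true_task = 1
print(cross_entropy(logits_good, true_task) < cross_entropy(logits_bad, true_task))  # -> True
```

Minimizing this loss over task-prediction histories is what drives the environment model toward the correct next-task distribution.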
Further, in step S130, establishing, for long-duration input video streams, the long-term-dependency generative memory model from the context-aware task interaction prediction model comprises:
enhancing the temporal generative model with an external memory system, so that effective information of the generic feature descriptions is stored from the early part of the sequence, and building a sustainable generative memory model over the stored information. Specifically,
the generative memory model comprises the collaborative generic feature description set e<sub>≤T</sub> = {e<sub>1</sub>, e<sub>2</sub>, ···, e<sub>T</sub>} and the task-collaboration latent variable set z<sub>≤T</sub> = {z<sub>1</sub>, z<sub>2</sub>, ···, z<sub>T</sub>}. A translation mapping h<sub>t</sub> = f<sub>h</sub>(h<sub>t-1</sub>, e<sub>t</sub>, z<sub>t</sub>) updates the deterministic hidden state variable h<sub>t</sub> at each time point; a prior mapping function f<sub>z</sub>(h<sub>t-1</sub>) describes the nonlinear dependence between past observations and latent variables and provides the latent distribution parameters; a nonlinear observation mapping function f<sub>e</sub>(z<sub>t</sub>, h<sub>t-1</sub>) provides the likelihood function dependent on the latent variable and the state. The temporal variational autoencoder is modified with external memory, generating a memory context Ψ<sub>t</sub> at each time point. The prior and posterior are, respectively:
prior: p<sub>θ</sub>(z<sub>t</sub> | z<sub>&lt;t</sub>, e<sub>&lt;t</sub>) = N(z<sub>t</sub> | f<sub>z</sub><sup>μ</sup>(Ψ<sub>t-1</sub>), f<sub>z</sub><sup>σ</sup>(Ψ<sub>t-1</sub>))
posterior: q<sub>φ</sub>(z<sub>t</sub> | z<sub>&lt;t</sub>, e<sub>≤t</sub>) = N(z<sub>t</sub> | f<sub>q</sub><sup>μ</sup>(Ψ<sub>t-1</sub>, e<sub>t</sub>), f<sub>q</sub><sup>σ</sup>(Ψ<sub>t-1</sub>, e<sub>t</sub>))
where f<sub>z</sub><sup>μ</sup> and f<sub>z</sub><sup>σ</sup> are the translation mapping functions for the latent variable z's parameters μ and σ, and f<sub>q</sub><sup>μ</sup> and f<sub>q</sub><sup>σ</sup> the corresponding mappings of the posterior q. The prior is a diagonal Gaussian distribution of the memory context through the prior mapping f<sub>z</sub>, and the diagonal Gaussian approximate posterior depends, through the posterior mapping function f<sub>q</sub>, on the associated memory context Ψ<sub>t-1</sub> and the current observation e<sub>t</sub>.
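The diagonal-Gaussian prior and posterior above can be sketched with toy linear maps standing in for the translation functions f<sub>z</sub> and f<sub>q</sub>. The weights, dimensions, and the exponential parameterization of σ here are illustrative assumptions, not the patent's parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4  # latent dimension (illustrative)

# Toy stand-ins for the translation maps f_z and f_q: fixed random linear layers.
Wz_mu, Wz_sig = rng.normal(size=(D, D)), rng.normal(size=(D, D))
Wq = rng.normal(size=(2 * D, 2 * D))

def prior(psi_prev):
    """p(z_t | Psi_{t-1}): diagonal Gaussian from the memory context alone."""
    mu = Wz_mu @ psi_prev
    sigma = np.exp(0.5 * (Wz_sig @ psi_prev))   # exp keeps sigma positive
    return mu, sigma

def posterior(psi_prev, e_t):
    """q(z_t | Psi_{t-1}, e_t): also conditions on the current observation e_t."""
    h = Wq @ np.concatenate([psi_prev, e_t])
    return h[:D], np.exp(0.5 * h[D:])

def sample(mu, sigma):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    return mu + sigma * rng.normal(size=mu.shape)

psi, e = rng.normal(size=D), rng.normal(size=D)
mu_p, sig_p = prior(psi)           # prior depends only on the memory context
z = sample(*posterior(psi, e))     # posterior also sees the observation
print(z.shape)  # -> (4,)
```

In training, the KL divergence between this posterior and prior is the regularizer of the variational objective; the sketch only shows the conditioning structure of the two distributions.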
Further, establishing the autonomous semi-supervised continual recognition system based on deep cooperative dynamics and realizing multi-task recognition comprises:
a deep collaborative recognition algorithm based on the generative memory model, which uses the evolutionary process of a collaboration potential-energy function to introduce the memory model into the dynamic process of co-evolution, reduces solving the prototype mode and the accompanying mode to a nonlinear optimization problem, and obtains the optimized contract-network weights.
A long short-term memory network f<sub>rnn</sub> propagates the state history h<sub>t</sub>; the external memory M<sub>t</sub> is generated from the latent variable of the previous moment and the external context information c<sub>t</sub>. The generative model's state update is (h<sub>t</sub>, M<sub>t</sub>) = f<sub>rnn</sub>(h<sub>t-1</sub>, M<sub>t-1</sub>, z<sub>t-1</sub>, c<sub>t</sub>).
To form task-recognition instructions retrieved from the memory M<sub>t</sub>, a set of key values is introduced. Cosine similarity compares each key k<sub>t</sub><sup>r</sup> with every row of the memory M<sub>t-1</sub>, generating task attention weights; the retrieved memory φ<sub>t</sub><sup>r</sup> is obtained as the attention-weighted sum over M<sub>t-1</sub>, realizing multi-task recognition. Here, k<sub>t</sub><sup>r</sup> is the r-th key-value function of the propagated state history, f<sub>att</sub> is the attention mechanism function producing the memory weight w<sub>t</sub><sup>r,i</sup> of the i-th row for the r-th key at time t, φ<sub>t</sub><sup>r</sup> is the result of the retrieval equation, ⊙ denotes element-wise multiplication, the retrieved memory yields a learned relational bias, and σ(·) is the sigmoid function, forming the gating mechanism that informs memory storage and retrieval. Finally, Ψ<sub>t</sub> = [φ<sub>t</sub><sup>1</sup>, φ<sub>t</sub><sup>2</sup>, ···, φ<sub>t</sub><sup>R</sup>, h<sub>t</sub>] is the output of the generative memory model.
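The cosine-similarity memory read described above, comparing a key against each row of M<sub>t-1</sub> and returning an attention-weighted sum, can be sketched as follows. Using a softmax over the raw cosine scores is an assumption; the patent does not specify the normalization:

```python
import numpy as np

def cosine(key, M):
    """Cosine similarity between a key vector and every row of memory M."""
    return (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def read_memory(key, M):
    """Attention-weighted read: phi = sum_i w_i * M[i], w = softmax(cosine scores)."""
    w = softmax(cosine(key, M))
    return w @ M, w

rng = np.random.default_rng(2)
M = rng.normal(size=(5, 8))               # memory M_{t-1}: 5 slots of width 8
key = M[3] + 0.01 * rng.normal(size=8)    # a key close to slot 3
phi, w = read_memory(key, M)
print(w.argmax())  # -> 3
```

Stacking R such reads φ<sub>t</sub><sup>1</sup>, ..., φ<sub>t</sub><sup>R</sup> with the state h<sub>t</sub> gives the memory context Ψ<sub>t</sub> that feeds the prior and posterior of the generative model.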
In another aspect, the present invention also provides a multi-task collaborative recognition system fusing video perception, comprising a generic feature extraction module, a collaborative feature learning module, and a deep collaborative recognition module.
The generic feature extraction module combines biological perception mechanisms and extracts the generic features of multi-source heterogeneous video data through the shared semantic mechanism based on the collaborative features of that data.
The collaborative feature learning module uses context-aware computing theory to establish the feature-association learning mechanism of task collaboration, continuously learns with the generic features of the multi-source heterogeneous video data as prior knowledge, and generates the context-aware task interaction prediction model.
The deep collaborative recognition module, for long-duration input video streams, establishes the long-term-dependency generative memory model from the context-aware task interaction prediction model, builds the autonomous semi-supervised continual recognition system based on deep cooperative dynamics, and realizes multi-task recognition.
Further, the generic feature extraction module comprises a primitive collaboration submodule, a dictionary collaboration submodule, and a topic collaboration submodule.
The primitive collaboration submodule trains video image primitives with independent component analysis, matches each primitive in turn with Gabor functions, estimates the scale and orientation corresponding to each video image primitive, extracts the primitive features of the video image, and realizes efficient space-time coding of the video image's internal structure.
The dictionary collaboration submodule uses locally linear coding, with local distance as the regularization term on the sparse basis functions, to compute the best response signal of the original dictionary; uses that best response signal to compute a feasible dictionary search direction and complete one dictionary update; establishes a coded concept stream for each data channel as the reference semantic coding of complex events; dynamically time-aligns the newly input low-level feature stream with the reference semantic coding; generates a time translation function; and realizes dictionary semantic alignment.
The topic collaboration submodule uses latent semantic analysis to build a co-occurrence matrix between the dictionary and the video image primitive features, embodies the semantic concept corresponding to each topic with hidden nodes, realizes the description of the mapping relationships among vocabulary, topic nodes, and scenes through probabilistic inference, and, taking the video's conditional probability under the topic distribution as the category similarity, computes the likelihood function of the probability between true vocabulary and scene and the prediction probability.
Further, the collaborative feature learning module comprises a feature-association learning submodule and a context-aware task interaction prediction submodule.
The feature-association learning submodule constructs a mapping function between visual labels and generic features under a low-rank constraint, realizing feature-label collaboration.
The context-aware task interaction prediction submodule, using the learned feature-association relationships together with the prior knowledge of visual perception, applies the task-collaboration processing mechanism based on the environment model and the loss function to dynamically and adaptively adjust the tasks to be recognized according to scene changes, completing visual-attention region perception and the dynamic adjustment of task-demand prediction.
Further, the deep collaborative recognition module comprises a long-term-dependency generative memory model submodule and a multi-task deep collaborative recognition submodule.
The long-term-dependency generative memory model submodule, for long-duration input video streams, establishes the long-term-dependency generative memory model from the context-aware task interaction prediction model.
The multi-task deep collaborative recognition submodule establishes the autonomous semi-supervised continual recognition system based on deep cooperative dynamics and realizes multi-task recognition.
Beneficial effects of the present invention: complete and effective extraction of multi-source heterogeneous video data information can be achieved; the time-synchronized dictionary collaboration mechanism reduces temporal uncertainty between visual semantics and improves the model's generalization to scene change; tasks to be recognized can be adjusted dynamically and adaptively according to scene changes, completing visual-attention region perception and the dynamic adjustment of task-demand prediction; an external-memory generative model built around long-range data dependencies enhances network learning performance, reduces model parameter and computation complexity with smaller data storage capacity, promptly extracts useful information, applies to different types of video sequences, and solves the problem that complex, long-range sequence data cannot be selectively memorized and forgotten; recognition features are selected autonomously, recognition of unlabeled data is improved, and the accuracy and robustness of multi-task recognition are continuously enhanced.
Additional aspects and advantages of the present invention will be set forth in part in the following description; they will become apparent from that description, or be learned through practice of the invention.
Description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a block diagram of the multi-task recognition principle of the multi-task collaborative recognition system fusing video perception according to an embodiment of the present invention.
Fig. 2 is a block diagram of the generic feature extraction module of the multi-task collaborative recognition system fusing video perception according to an embodiment of the present invention.
Fig. 3 is a block diagram of the collaborative feature learning module of the multi-task collaborative recognition system fusing video perception according to an embodiment of the present invention.
Fig. 4 is a block diagram of the deep collaborative recognition module of the multi-task collaborative recognition system fusing video perception according to an embodiment of the present invention.
Fig. 5 is a block diagram of the task prediction model of the multi-task collaborative recognition system fusing video perception according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of the externally-dependent generative memory model of the multi-task collaborative recognition system fusing video perception according to an embodiment of the present invention.
Fig. 7 is a frame structure diagram of the multi-source heterogeneous video data multi-task collaborative recognition verification platform according to an embodiment of the present invention.
Specific embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the drawings, where identical or similar reference labels throughout denote identical or similar elements, or modules with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the invention indicates the presence of the stated features, integers, steps, operations, elements, and/or modules, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, and/or groups thereof.
It should be noted that, unless otherwise expressly specified or limited, terms such as "connected" and "fixed" in embodiments of the present invention are to be understood broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediary; or an internal communication between two elements or an interaction relationship between two elements, unless specifically limited otherwise. Those skilled in the art can understand the specific meanings of the above terms in embodiments of the present invention as the case may be.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meanings as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless defined as here, will not be interpreted in an idealized or overly formal sense.
To facilitate understanding of the embodiments of the present invention, specific embodiments are further explained below with reference to the drawings; the embodiments do not constitute a limitation of the embodiments of the present invention.
Those of ordinary skill in the art should understand that the drawings are schematic diagrams of one embodiment, and the parts or devices in the drawings are not necessarily required to implement the present invention.
Embodiment one
As shown in Figure 1, Embodiment One of the present invention provides a multi-task collaborative recognition system that fuses video perception. The system includes a generic feature extraction module, a collaborative feature learning module, and a deep collaborative recognition module.

The generic feature extraction module combines biological perception mechanisms to study the shared semantic description of multi-source heterogeneous video data under feature collaboration, obtaining a generic feature description of the multi-source heterogeneous video data.

The collaborative feature learning module uses context-adaptive computing theory to establish task-cooperative feature association learning and task prediction mechanisms, realizing a context-aware task interaction prediction mechanism.

The deep collaborative recognition module combines long-term dependencies to propose a context-collaborative visual multi-task deep collaborative recognition model, realizing a multi-task deep collaborative recognition model with long-term memory and addressing the poor generalization, low robustness, and high computational complexity of video multi-task recognition.
As shown in Figure 2, in Embodiment One the generic feature extraction module includes a primitive collaboration submodule, a time-synchronized dictionary collaboration submodule, and a shared-semantic topic collaboration submodule. Specifically:

Primitive collaboration submodule: multi-source data processing must accurately detect and track the changes of targets and scenes in the spatio-temporal domain, and the visual cortex cells of biological perception exhibit sparsity. To combine the sparse differences of neural signal processing while achieving information completeness for multi-source heterogeneous data, the intrinsic characteristics of multi-source heterogeneous video must be exploited to study a primitive synergy mechanism with scale, translation, and rotation invariance, realizing complete and effective extraction of multi-source heterogeneous data information.

Time-synchronized dictionary collaboration submodule: multi-source heterogeneous data contain latent semantic information, and achieving time synchronization between low-level primitive features and high-level semantics is the primary problem in bridging the semantic gap. It is therefore necessary to combine latent semantic feature representation to establish a time-synchronized dictionary synergy mechanism, reducing the temporal uncertainty between vision and semantics.

Shared-semantic topic collaboration submodule: data from different platforms and modalities in social, information, and physical spaces carry rich natural and social attributes and differ in feature dimension and data distribution, yet the synchronously acquired multi-source video contains a potentially large amount of semantic association information. It is therefore necessary to study the topic and dictionary semantic relation mechanisms of the visual information of different modalities, propose a semantically similar topic collaboration feature analysis method, and establish a dimension-regularized shared-semantic-association generic feature description method.
As shown in Figure 3, in Embodiment One the collaborative feature learning module includes a feature association learning submodule under context-adaptive theory and a visual attention region extraction and task prediction submodule. Specifically:

Feature association learning submodule under context-adaptive theory: in visual perception recognition tasks with scene changes, there is high correlation among the features of different scenes, yet the correlation of the same feature with other features differs considerably across recognition tasks; likewise, there are correlations among the labels of the same feature across different tasks, and large differences among the correlations of different labels. Therefore, when constructing the mapping functions between labels and feature spaces for different tasks, the label and feature correlations must be modeled while preserving the intrinsic structure of the existing data (labeled and unlabeled), improving the model's generalization to scene changes.

Visual attention region extraction and task prediction submodule: after learning the associations between tasks, features, and labels, the visual attention regions must be confirmed and the tasks to be recognized predicted; for example, in a classroom, identity and expression recognition are the main recognition tasks, whereas in outdoor scenes target and behavior recognition are the main tasks. Therefore, using the learned feature association relations together with prior knowledge of visual perception, a task cooperation mechanism based on an environment model and a loss function is proposed, dynamically and adaptively adjusting the tasks to be recognized according to scene changes and completing the dynamic adjustment of visual attention region perception and task demand prediction.
As shown in Figure 4, in Embodiment One the deep collaborative recognition module includes a long-term-dependency generative memory model submodule and a multi-task deep collaborative recognition submodule. Specifically:

Long-term-dependency generative memory model submodule: for long-range, multi-sequence input feature streams, a learning mechanism without memory capacity must constantly label new input data and relearn the network model to perform the recognition task, which is an enormous waste of computation, storage, and human resources. Therefore, an external-memory generative model must be established in combination with long-range data dependencies, enhancing network learning performance, reducing model parameter computation complexity with a smaller data storage footprint, and extracting useful information on demand for different types of video sequences, solving the problem that complex long-range sequence data cannot be selectively memorized and forgotten.

Multi-task deep collaborative recognition submodule: for continuously input unlabeled feature streams, jointly optimal features with minimal intra-class distance and maximal inter-class distance must be learned accurately and efficiently for multi-task recognition; since class annotation information cannot be supplied manually for unlabeled data, losses in recognition performance are otherwise inevitable. Therefore, combining cooperative dynamics principles with long-term memory, a deep continual-learning multi-task recognition mechanism under context collaboration is established, realizing autonomous selection of recognition features, improving the recognition of unlabeled data, and continuously raising the accuracy and robustness of multi-task recognition.
Embodiment two
Embodiment Two of the present invention provides a multi-task recognition method for fused multi-source video perception data using the system described in Embodiment One. The method comprises the following steps:

First, combining biological perception mechanisms, the shared semantic description of multi-source heterogeneous video data under feature collaboration is studied to obtain a generic feature description of the multi-source heterogeneous video data.

Then, using context-adaptive computing theory, task-cooperative feature association learning and task prediction mechanisms are established, realizing a context-aware task interaction prediction mechanism.

Finally, combining long-term dependencies, a context-collaborative visual multi-task deep collaborative recognition model is proposed, realizing a multi-task deep collaborative recognition model with long-term memory and addressing the poor generalization, low robustness, and high computational complexity of video multi-task recognition.
In recent years, the research achievements in feature description, interaction prediction, and collaborative recognition obtained in visual perception multi-task recognition, especially in multi-task scenarios such as face recognition, expression analysis, and behavior understanding, have advanced generic feature description methods for massive multi-source video data, task interaction prediction under context-adaptive perception, and deep collaborative recognition models with long-term memory under continuous video stream input, introducing frontier theories such as shared semantic association description, context-aware feature learning, and semi-supervised continual collaborative recognition, and improving the generalization robustness and long-term intelligence of multi-source visual multi-task collaborative recognition.

In Embodiment Two, addressing the large volume of video data, data are maximally compressed from the standpoint of visual perception mechanisms while the discriminative information of multi-source heterogeneous data is retained. Facing complex and changing scenes, a feature-correlated task interaction prediction mechanism is studied in combination with context-adaptive computing theory, improving the generalization of feature learning. For continuously input video streams, a semi-supervised continual deep collaborative recognition model is introduced to realize the dynamic multi-task recognition demand of temporal memory; a multi-source visual multi-task collaborative recognition verification platform is built to verify the generalization and robustness of the theoretical methods while continuously improving the performance of the proposed methods.
As shown in Figure 2, the generic feature extraction step requires establishing a three-level synergy mechanism of primitive collaboration, dictionary collaboration, and topic collaboration.

Biological perception theory holds that interaction among visual elements is interconnection behavior among visual cells; visual behavior is the processing of the visual cortex neural network and can be divided into a feature layer, a task layer, and a context layer. Therefore, to solve multi-task collaborative recognition of multi-source video information, feature collaboration must come first, i.e., extracting a generic feature description of multi-source heterogeneous data. Although visual perception data differ in source, structure, and storage format, they all contain visual information and semantic information. The key to feature collaboration is how to associate video images with semantics efficiently, realizing a generic feature description mechanism in which the "semantic similarity" of human cognition is consistent with the "visual similarity" of data processing, while respecting the completeness of visual perception primitives, the sparsity of visual cortex simple-cell signal responses, and the low-dimensional manifold of task scenes.
Multi-source heterogeneous video primitive collaboration aims to make the extracted features both satisfy the sparse differences of neural signals and effectively capture the various possible signals in natural scenes. Although traditional global or local image feature representations can locally handle the scale and rotation invariance of video data, they cannot cope with the individual appearance differences generated by targets themselves among similar objects; they are therefore only suitable for single-source, single-task processing mechanisms, cannot provide a complete and effective description space, and are powerless for higher visual perception tasks.
Primitive collaboration first obtains a group of sparse, independent filters for detecting the likelihood that features with different descriptive power appear in the video. Sparsity is usually evaluated with the 0-norm or 1-norm of the primitive coefficients; independence requires that the correlation among primitive vectors be as small as possible. Video image primitives are trained with independent component analysis, and each primitive is matched with a Gabor function to estimate its corresponding scale and orientation. To a certain extent, primitive collaboration reveals the neural processing of the primary visual cortex and realizes efficient spatio-temporal coding of the internal structure of natural video images.
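The Gabor-matching step described above, estimating each trained primitive's orientation and scale, can be sketched numerically. The following is an illustrative example, not the patent's implementation: it builds a synthetic Gabor primitive and recovers its orientation and wavelength from the peak of its 2-D spectrum (all function names and parameter values are assumptions).

```python
import numpy as np

def gabor(size, theta, wavelength, sigma):
    """A synthetic Gabor primitive: a sinusoid at angle theta under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

def estimate_orientation_scale(primitive):
    """Estimate a primitive's dominant orientation and wavelength from its FFT peak."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(primitive)))
    half = primitive.shape[0] // 2
    spectrum[half, half] = 0.0                 # suppress the DC bin
    iy, ix = np.unravel_index(np.argmax(spectrum), spectrum.shape)
    fy, fx = iy - half, ix - half              # peak frequency in bins
    theta = np.arctan2(fy, fx) % np.pi         # orientation is defined modulo pi
    wavelength = primitive.shape[0] / np.hypot(fx, fy)
    return theta, wavelength

primitive = gabor(size=31, theta=np.pi / 4, wavelength=8.0, sigma=6.0)
theta_hat, wavelength_hat = estimate_orientation_scale(primitive)
```

In a full pipeline the primitives would come from independent component analysis of video patches rather than being generated synthetically; the spectral-peak estimate would then assign each learned filter its approximate scale and direction.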
Dictionary collaboration assumes, based on the appearance consistency of similar targets, that latent semantics refer to local descriptions labeled by an unsupervised clustering process, using appearance similarity as the decision condition for target attributes. Dictionary collaboration is a typical latent semantic feature representation method based on the bag-of-words model. Using locality-constrained linear coding, with local distance as the regularizer of the sparse basis functions, the best response signal of the original dictionary is computed; this optimized signal is then used to compute a feasible dictionary search direction, completing one dictionary update and establishing one coding concept stream for each data channel. As the semantic coding of complex events, all newly input low-level feature streams undergo dynamic time alignment against the reference semantic coding, and a time-shift function is generated to realize dictionary semantic alignment. The method requires large differences between strongly responding dictionary atoms and the others, so that sampled video block representations can be efficiently distinguished, guaranteeing the consistency of similar video primitive sets.
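The locality-constrained coding step can be illustrated with the closed-form solver of the standard LLC scheme (Wang et al., 2010), which matches the description of using local distance to regularize the code; the atom counts and conditioning constant below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def llc_encode(x, dictionary, k=5, beta=1e-4):
    """Locality-constrained linear coding: encode x with its k nearest
    dictionary atoms under a sum-to-one constraint (analytic solution)."""
    dists = np.linalg.norm(dictionary - x, axis=1)
    idx = np.argsort(dists)[:k]          # k nearest atoms form the local basis
    z = dictionary[idx] - x              # shift atoms to the sample's origin
    c = z @ z.T                          # local covariance of the shifted basis
    c += np.eye(k) * beta * np.trace(c)  # conditioning term for stability
    w = np.linalg.solve(c, np.ones(k))
    w /= w.sum()                         # enforce sum-to-one (shift invariance)
    code = np.zeros(len(dictionary))
    code[idx] = w
    return code

rng = np.random.default_rng(0)
dictionary = rng.random((32, 8))         # 32 atoms of dimension 8
x = rng.random(8)
code = llc_encode(x, dictionary, k=5)
```

Each video block would be encoded this way against the current dictionary; the resulting responses then drive the dictionary-update and temporal-alignment steps described above.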
Topic collaboration defines latent semantic topic analysis over the dictionary descriptions and scene co-occurrences, mapping the different appearances under different context environments to certain latent low-dimensional topics and realizing the topic collaborative analysis process, which includes both many-to-many and many-to-one mappings from visual features to category labels. The present invention adopts latent semantic analysis: a co-occurrence matrix between the dictionary and the video images is constructed, the semantic concepts corresponding to topics are embodied by hidden nodes, the mapping relations among vocabulary, topic nodes, and scenes are realized by probabilistic inference, the conditional probability of video under the topic distribution is computed as the category similarity, and the likelihood function between the probability of true vocabulary-scene pairs and the predicted probability completes the construction of the projection matrix.
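The latent semantic analysis over the dictionary-image co-occurrence matrix can be sketched with a plain pLSA EM loop; the dimensions and counts below are toy values, not the patent's data.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Probabilistic latent semantic analysis via EM on a word-document
    co-occurrence matrix; returns P(word|topic) and P(topic|doc)."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = counts.shape
    p_w_z = rng.random((n_words, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, n_docs)); p_z_d /= p_z_d.sum(0)
    for _ in range(n_iter):
        # E-step: posterior over topics for each (word, doc) pair
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]        # (W, Z, D)
        joint /= joint.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate conditionals from expected counts
        ec = counts[:, None, :] * joint                      # expected counts
        p_w_z = ec.sum(2); p_w_z /= p_w_z.sum(0, keepdims=True) + 1e-12
        p_z_d = ec.sum(0); p_z_d /= p_z_d.sum(0, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Toy co-occurrence: dictionary words 0-2 appear in scenes 0-1, words 3-5 in scenes 2-3.
counts = np.zeros((6, 4))
counts[:3, :2] = 5.0
counts[3:, 2:] = 5.0
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

Here the hidden topic nodes play the role of the semantic concepts, and the learned conditionals give the vocabulary-topic-scene mapping used as the category similarity.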
In Embodiment Two, according to the semantic similarity between videos of different channels, in order to effectively quantify the shared semantic information of different dimensions across channels and overcome the influence of noise, occlusion, illumination, and the like on feature recognition, the most visually discriminative generic feature description is extracted, increasing intra-task inter-class distance and reducing intra-class distance, and a heterogeneous data feature collaboration model is established. Suppose there are C classes of heterogeneous channel features; each feature type X_i is denoted as the feature matrix of n_i training samples, the data noise part is E, and Γ is the rotation factor. The semantically shared heterogeneous feature collaboration model under multiple tasks aims to learn a projection matrix Θ_i for each X_i, projecting the heterogeneous features to an equal intrinsic dimensionality and reducing data redundancy. Under the orthogonality constraint, the optimization function is expressed as:

min over {Θ_i}, W_0, {W_i} of R_1(W_0, {W_i}, {Θ_i}) + R_2({Θ_i}) + R_3(W_0, {W_i}), subject to Θ_iᵀΘ_i = I (i = 1, ···, C)

The heterogeneous feature collaboration model aims to obtain the low-dimensional manifold subspaces {Θ_i} of generic semantic features, and, under a unified framework, the semantic sharing matrix W_0 and the feature-specific matrices {W_i}; least squares is used to solve for the joint optimal solution of the prediction loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}), and the regularization function R_3(W_0, {W_i}). By projecting newly input data into the feature subspace, a high-level generic feature description of equal dimensionality is extracted, establishing shared semantic association relations.
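A minimal stand-in for learning the per-channel projections {Θ_i} is sketched below. It uses per-channel PCA as a surrogate for the full joint objective (the real model would also solve for W_0 and {W_i} under the prediction and regularization terms), so the function names, dimensions, and use of PCA are illustrative assumptions.

```python
import numpy as np

def channel_projections(channels, dim):
    """Learn one orthogonal projection per heterogeneous channel (via PCA,
    a simple surrogate for the reconstruction term R2), mapping every
    channel to a common intrinsic dimensionality."""
    thetas = []
    for x in channels:                   # x: (d_i, n_i) feature matrix
        xc = x - x.mean(axis=1, keepdims=True)
        u, _, _ = np.linalg.svd(xc, full_matrices=False)
        thetas.append(u[:, :dim])        # orthonormal columns: Theta_i^T Theta_i = I
    return thetas

# Three heterogeneous channels with different feature dimensions, 40 samples each.
channels = [np.random.default_rng(i).random((d, 40)) for i, d in enumerate([20, 35, 50])]
thetas = channel_projections(channels, dim=8)
projected = [t.T @ x for t, x in zip(thetas, channels)]   # all now 8 x 40
```

After projection the channels share one dimensionality, which is the precondition for learning the shared semantic matrix over all of them.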
In conclusion primitive collaboration belongs to feature extraction phases, it is therefore intended that acquisition to the greatest extent may be used in the embodiment of the present invention two
It can sparse complete characteristic response signal;Dictionary collaboration belongs to the feature coding stage, it is therefore intended that local video block feature into
Row unsupervised learning obtains the semantic dictionary with local holding capacity;Theme collaboration belongs to the Feature Semantics stage, it is therefore intended that
Its hiding Semantic mapping space is solved by probabilistic framework, and all kinds of similarity in analysis space is realized feature collaboration, established
Perceive the similar generic features describing framework of semanteme of identification mission environment.
The learning ability of human vision is realized through signal response and transmission among visual cells, and a specific visual task requires the mutual synergy of a large number of visual cells. Owing to the parallelism, hierarchy, and feedback among human visual cells, signal transmission also carries different collaborative meanings, and the difference in transmission modes of synergistic signals is the difficulty of its learning task. In visual multi-task perception recognition, scenes are complex and changeable, and collaborative recognition needs to intelligently predict the multiple visual tasks to be recognized; a task prediction mechanism based on generic feature association learning in a context-adaptive environment is therefore proposed, realizing task-layer co-evolution and solving the problem of harmonious connection between visual perception and the natural environment.
As shown in Figure 3, in Embodiment Two the collaborative feature learning includes label collaboration for feature association learning and task prediction under context-adaptive evolution. Specifically:

Label collaboration for feature association learning: in practical application scenes, features are strongly correlated with one another, yet the correlations of the same feature differ greatly across recognition tasks; annotation information closely related to a recognition task is strongly correlated, yet the correlations among different annotations again differ greatly. Synergetics holds that the label assignment process of a sample depends not only on the features of the sample itself but, more importantly, on the spatio-temporal data distribution relations provided by samples in its neighborhood. The same target may correspond to multiple labels, and feature learning must also account for the ambiguity of samples in high-dimensional space, which are often located on the decision boundaries of multiple task classifications.
Feature-label collaboration is realized by constructing the mapping function between visual labels and generic features under a low-rank constraint. The nuclear norm is introduced to model label correlation and feature correlation, while a graph regularizer is introduced to retain the intrinsic structure of the existing data (labeled and unlabeled), realizing label prediction for unlabeled features, improving the generalization ability of the feature learning model, overcoming semantic ambiguity, keeping the model as simple as possible, and reducing computational complexity. The following unconstrained function can thus be established:

min over g of Q(Y, g(X)) + λΦ(g) + γΛ(g)

where g is the mapping function of feature association learning; the data fidelity term Q(·) evaluates and minimizes the loss between the given labels and the task prediction results obtained through g, fitting the given labels; Φ(g) and Λ(g) are regularization terms based on prior assumptions, the former maintaining the low-rank constraint in practical applications and the latter retaining the intrinsic structure; and λ and γ are regularization parameters balancing the contributions of the three terms in the model.
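Low-rank penalties such as the Φ(g) term above are commonly handled with singular value thresholding, the proximal operator of the nuclear norm. A sketch follows; the threshold value and matrix shapes are illustrative.

```python
import numpy as np

def svt(matrix, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm,
    shrinking every singular value by tau and zeroing those below it."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ np.diag(np.maximum(s - tau, 0.0)) @ vt

rng = np.random.default_rng(0)
w = rng.random((6, 2)) @ rng.random((2, 8)) + 0.01 * rng.random((6, 8))  # nearly rank 2
w_low = svt(w, tau=0.1)   # the small noise singular values are removed
```

Inside a proximal-gradient loop, a gradient step on the fidelity term Q would alternate with this thresholding step to keep the learned mapping low-rank.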
As shown in Figure 5, in Embodiment Two a task prediction method under context-adaptive evolution is proposed, using a context-adaptive computing theory close to the human cognitive process to form a task evolution prediction model mechanism based on association feature learning. The model consists of an interactive environment, an environment model, and a loss model: the environment model learns the dynamic environmental changes of the input features, and the loss model estimates the environment model's loss and predicts the visual regions, targets, and tasks to be recognized that will require attention in the future. Specifically:

Interactive environment: the state space is composed of the generic feature descriptions at time t and time t-1; the current state at time t specifies the recognition task a_t, and the task state to be recognized at the next time t+1 is predicted.

Environment model: given the history information H, the generic-feature-to-history mapping function ξ: H → X and the ground-truth-label-to-history mapping function η: H → Y are used to learn the environment model mapping ξ(h) → η(h). Denote the environment model by ω, with ω(ξ(h)) ∈ Y. At each task prediction, a loss model L_wm(ω(ξ(h)), η(h)) is introduced. Task prediction involves H = {h = (s_{t-k}, a_{t-k}, ···, s_t, a_t, s_{t+1})}, ξ(h) = (s_{t-k}, a_{t-k}, ···, s_t, a_t) and η(h) = s_{t+1}; an inverse dynamics prediction mechanism and a softmax cross-entropy loss predict the future state, and a stochastic-gradient-descent-based neural network model ω_φ encodes all states into a low-dimensional latent space with shared weights to complete visual attention region extraction and state prediction.

Loss model: given the state s_t and a suggested next task, the loss model predicts the probability distribution over candidate task occurrences for the environment model, and a softmax cross-entropy loss function encoding the state of the suggested task serves as the penalty term, improving the accuracy of task prediction.
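The softmax cross-entropy penalty used by the loss model can be written in a few lines; the logits below are illustrative task scores, not values from the patent.

```python
import numpy as np

def softmax_cross_entropy(logits, target):
    """Cross-entropy between the softmax of predicted task scores and the
    observed task/state index, as the loss model's penalty term."""
    z = logits - logits.max()                 # stabilize the exponentials
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

logits = np.array([2.0, 0.5, -1.0])           # scores for three candidate tasks
loss_likely = softmax_cross_entropy(logits, target=0)    # suggestion matches the score
loss_unlikely = softmax_cross_entropy(logits, target=2)  # suggestion contradicts it
```

A suggested task that the environment model already scores highly incurs a small penalty, while an unlikely suggestion incurs a large one, which is what steers the task prediction.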
By combining label association learning with context-adaptive task prediction evolution, a progressive task prediction mechanism is established; through the layer-by-layer storage of environment-aware transfer knowledge, a valuable loss model is built, the excitation and inhibition elements of recognition tasks in co-evolution are defined, and the recognition tasks currently to be handled are decided, solving the problem of transferring knowledge from simulated environments to real environments and improving the generalization and stability of feature association learning and task prediction.

Associated feature learning is used to realize automatic high-level semantic annotation; according to context-adaptive perception theory and in combination with complex and changeable application environments, challenging region-of-interest extraction and multi-task prediction mechanisms under context-adaptive collaboration are proposed. On this basis, according to the combined constraints of context-adaptive perceptibility, low-rank restriction, attention regionality, and task relevance, optimal association learning over the generic feature description is realized, improving the model's generalization to massive data and multiple tasks. Through prior assumptions, posterior inference, and association optimization design, the theoretical research of the relevant schemes is completed, and the simulation verification of the new schemes is further completed with tools such as algorithm simulation platforms.
As shown in Figure 4, Embodiment Two organically combines biological neural networks with synergetic pattern recognition: visual perception is used to give targets an effective semantic generic feature description, the structural information of the target scene is considered for task prediction, context-layer collaborative analysis of visual tasks is realized, simultaneous learning of the prototype pattern (the task to be recognized) and the accompanying pattern (the single-task recognition result) is achieved, pattern correlation is effectively reduced, and a reduced deep collaborative recognition method is proposed.

Synergetics describes the distribution of target state evolution as a thermodynamic potential function and holds that the signal self-organization process of the human brain memory system is exactly the human associative memory process. In general, the long-range dependencies of a continuously input video stream, observed over past time intervals, allow the recognizable elements of a long time series to be separated from the unrecognizable ones; the uncertainty of the unrecognizable elements is labeled, and rapid recognition can aid the prediction of new future elements. This research uses an external memory system to enhance the temporal generative model, storing memorized feature-description effective information from the early stage of a sequence and efficiently establishing a sustainable generative memory model over the stored information.
The generative memory model comprises the generic feature description set of feature collaboration e_{≤T} = {e_1, e_2, ···, e_T} and the task-collaborative latent variable set z_{≤T} = {z_1, z_2, ···, z_T}. A transition mapping h_t = f_h(h_{t-1}, e_t, z_t) corrects the deterministic hidden state variable h_t at each time point; the prior mapping function f_z(h_{t-1}) describes the nonlinear dependence between past observations and latent variables and provides the latent-variable distribution parameters; and the nonlinear observation mapping function f_e(z_t, h_{t-1}) provides the likelihood function depending on the latent variable and the state. This research modifies the temporal variational autoencoder with external memory, generating a memory text Ψ_t at each time point, whose prior and posterior probabilities are expressed as follows:

Prior: p_θ(z_t | z_{<t}, e_{<t}) = N(z_t | f_z^μ(Ψ_{t-1}), f_z^σ(Ψ_{t-1}))

Posterior: q_φ(z_t | z_{<t}, e_{≤t}) = N(z_t | f_q^μ(Ψ_{t-1}, e_t), f_q^σ(Ψ_{t-1}, e_t))

where the prior is a diagonal Gaussian distribution function of the memory text depending on the prior mapping f_z, and the diagonal Gaussian approximate posterior depends on the memory text Ψ_{t-1} and the current observation e_t associated through the posterior mapping function f_q.
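The diagonal-Gaussian parameterization of the prior and posterior can be sketched as follows, with linear maps standing in for f_z^μ and f_z^σ and a softplus keeping the standard deviation positive; all shapes and weights are illustrative assumptions.

```python
import numpy as np

def diag_gaussian_params(context, w_mu, w_sigma):
    """Map a memory context Psi to the mean and (positive) std of a diagonal
    Gaussian, as the prior N(z_t | f_z_mu(Psi), f_z_sigma(Psi)) requires."""
    mu = context @ w_mu
    sigma = np.log1p(np.exp(context @ w_sigma))   # softplus keeps std positive
    return mu, sigma

def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
psi = rng.random(16)                               # toy memory context Psi_{t-1}
w_mu, w_sigma = rng.random((16, 4)), rng.random((16, 4))
mu, sigma = diag_gaussian_params(psi, w_mu, w_sigma)
z = sample_latent(mu, sigma, rng)
```

The posterior differs only in that its input concatenates the memory context with the current observation e_t before the two maps are applied.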
As shown in Figure 6, a stochastic computation graph is used as the processing procedure of the memory temporal generative model. In order to give the structure greater versatility and flexibility for different perception tasks, a memory and controller architecture with high-level semantics is introduced to stably store information for future extraction and to perform the corresponding computations to extract usable information on demand.
Deep collaborative recognition improves the collaborative prototype-pattern modification method: from the perspective of simultaneous learning of the prototype pattern and the accompanying pattern, a deep collaborative recognition algorithm based on the generative memory model is proposed using the evolution of the synergetic potential function; the memory model is directly introduced into the dynamic process of co-evolution, and solving the prototype and accompanying patterns is reduced to solving a nonlinear optimization problem, obtaining better contracted network weights. A long short-term memory network f_rnn advances the state history h_t, and the external memory M_t is generated using the latent variable from the previous moment and the external text information c_t; the generative model is as follows:

State update: (h_t, M_t) = f_rnn(h_{t-1}, M_{t-1}, z_{t-1}, c_t)

To form task recognition instructions derived from the memory M_t, the network generates a set of key values; cosine similarity is used to compare each key with every row of the memory M_{t-1}, generating a set of task weights, and the retrieved memory φ_t^r is obtained as the sum of M_{t-1} weighted by the attention weights, realizing a dynamic, sustainable multi-task recognition mechanism.
Key value: k̂_t^r = f_k^r(h_t), r = 1, ···, R

Task weighting: a_t^r(i) = softmax_i(cos(k̂_t^r, M_{t-1}(i)))

Retrieval memory: φ_t^r = Σ_i a_t^r(i) · M_{t-1}(i)

Recognition generation: e_t ~ f_e(z_t, Ψ_t)
where k̂ is the retrieval bias value learned from the memory, σ(·) is the sigmoid function, the external memory M_t stores the latent variables z_t, and the controller forms the recognition mechanism Ψ_t = [φ_t^1, φ_t^2, ···, φ_t^R, h_t] informed by memory storage and retrieval. This is the output of the generative memory model; for visual multi-task collaborative recognition with unknown task content and number, it realizes unsupervised adaptive recognition of continuously input video streams. For new recognition tasks, the hidden-layer states of previously trained models are retained during training and combined with the task cooperation hierarchy; based on the reward bias of each hidden layer in the earlier feature-collaboration network, a context-collaborative deep collaborative recognition mechanism is realized, giving the model prior knowledge of long-term dependencies, forming a complete policy for the recognition tasks, and improving the robustness of recognition.
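The cosine-similarity read described above follows the standard content-based addressing pattern; the sketch below is written under that assumption, with the memory size and key construction chosen for illustration only.

```python
import numpy as np

def retrieve(key, memory):
    """Content-based read: compare a key against every memory row by cosine
    similarity, turn the scores into softmax attention weights, and return
    the attention-weighted sum of rows (the retrieved memory phi_t^r)."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    sims = memory @ key / norms                     # cosine similarity per row
    w = np.exp(sims - sims.max()); w /= w.sum()     # softmax attention weights
    return w @ memory, w

rng = np.random.default_rng(1)
memory = rng.standard_normal((8, 16))               # 8 memory rows of dimension 16
key = memory[3] + 0.05 * rng.standard_normal(16)    # a noisy copy of row 3
read, weights = retrieve(key, memory)
```

A key resembling a stored row concentrates the attention on that row, so the controller can recall the matching task evidence from M_{t-1} without any explicit index.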
As shown in Figure 7, Embodiment Two of the present invention provides a multi-task collaborative recognition verification platform that fuses video perception.

With the continuous input of large-scale multi-source heterogeneous video data and the constant growth of context-aware recognition tasks, large amounts of data storage and computing resources are needed. Using distributed, multi-node, multi-GPU intelligent cooperative processing mechanisms from high-performance computing, a multi-source visual multi-task collaborative recognition verification platform is built. Its purpose is to conduct research on multi-source-data multi-visual-task collaborative recognition and to assess the theoretical research results involved in the platform; it is intended to provide researchers with an extensible framework for realizing and testing visual collaborative recognition models, to provide a basic test environment for models together with system performance analysis methods and indices for the related data, and to provide AI developers with tools integrated with the fundamental research. In future smart city construction, it can serve as a valuable research verification platform for the further research and development of intelligent information push, personalized control services, and the like for multi-source heterogeneous data.

Combined with the above intelligent verification demonstration platform, the output of the multi-task collaborative recognition results collected from visual perception data is realized, providing a standard platform for subsequent in-depth research and functionalization. The test method considers features such as the efficiency, dynamics, and intelligence of visual perception multi-task collaborative analysis and, in combination with software engineering design specifications, an easily extensible verification demonstration system is designed using object-oriented programming methodology.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The above are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that can easily be conceived by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A multitask recognition method fusing video perception, characterized by comprising the following steps:
Step S110: combining biological perception mechanisms, extract the generic features of multi-source heterogeneous video data based on a shared semantic mechanism of multi-source heterogeneous video data feature collaboration;
Step S120: using context-aware computing theory, establish a task-collaboration feature association learning mechanism, continuously learn from the generic features of the multi-source heterogeneous video data as prior knowledge, and generate a context-aware task interaction prediction model;
Step S130: for a long-duration input video stream, establish a long-term-dependency generation memory model in combination with the context-aware task interaction prediction model, and establish a deep autonomous semi-supervised continuous recognition system based on collaborative dynamics to realize multitask recognition.
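The three steps of claim 1 can be sketched as a minimal pipeline. This is an illustrative sketch only: the random projection standing in for the shared semantic subspace, the softmax task scorer, and the running-average memory are placeholder assumptions, not the patent's actual models.

```python
import numpy as np

def extract_generic_features(frames):
    """Step S110 (sketch): flatten heterogeneous frames and map them through a
    common linear projection standing in for the shared semantic subspace."""
    X = np.asarray(frames, dtype=float).reshape(len(frames), -1)
    X = X - X.mean(axis=0)                       # center the data
    rng = np.random.default_rng(0)
    P = rng.standard_normal((X.shape[1], 8))     # placeholder projection
    return X @ P

def predict_tasks(features, prior=None):
    """Step S120 (sketch): score candidate tasks from features; the prior
    knowledge biases the scores toward previously learned associations."""
    scores = features.mean(axis=1, keepdims=True) + np.arange(3)
    if prior is not None:
        scores = scores + prior
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)      # softmax over tasks

def recognize(stream):
    """Step S130 (sketch): run the stream through both stages, keeping a
    running memory of past predictions as crude long-term context."""
    memory = None
    for frames in stream:
        feats = extract_generic_features(frames)
        probs = predict_tasks(feats, prior=memory)
        memory = probs.mean(axis=0)              # carry context forward
    return memory
```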
2. The multitask recognition method fusing video perception according to claim 1, characterized in that in step S110, the shared semantic mechanism of multi-source heterogeneous video data feature collaboration comprises:
establishing a three-level feature collaboration mechanism consisting of primitive collaboration based on multi-source heterogeneous video data, dictionary collaboration based on time synchronization, and topic collaboration based on semantic similarity; combining the attributes of the multi-source heterogeneous video data to establish a multi-source heterogeneous video data feature collaboration model; and determining dimension-regularized shared semantic association relationships; wherein
the primitive collaboration based on multi-source heterogeneous video data comprises: training video image primitives using independent component analysis, matching the video image primitives one by one using Gabor functions, estimating the scale and direction corresponding to each video image primitive, extracting the primitive features of the video images, and realizing efficient space-time-domain coding of the internal structure of the video images;
the dictionary collaboration based on time synchronization comprises: applying locally linear coding with local distance as the regularization term of the sparse basis functions, computing the best response signal of the original dictionary, using the best response signal to compute a feasible dictionary search direction, and completing one dictionary update; establishing one coded concept stream for each data channel as the reference semantic coding of complex events; performing dynamic time alignment between newly input low-level feature streams and the reference semantic coding; and generating a time translation function to realize dictionary semantic alignment;
the topic collaboration based on semantic similarity comprises: using latent semantic analysis to construct a co-occurrence matrix between the dictionary and the video image primitive features, embodying the semantic concept corresponding to each topic with hidden nodes, realizing the description of the mapping relationships among vocabulary, topic nodes, and scenes by probabilistic inference, and computing the video conditional probability under the topic distribution as the category-specific similarity, i.e. the likelihood function of the true probability and the predicted probability between concept vocabulary and scenes.
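The primitive-collaboration step (primitives matched against Gabor functions to estimate scale and direction) can be illustrated with a small Gabor filter bank. The bank's scales and orientations, the fixed wavelength, and the raw inner-product response are illustrative assumptions, not the patent's actual parameters.

```python
import numpy as np

def gabor_kernel(size, sigma, theta, wavelength):
    """Real part of a Gabor filter at one scale (sigma) and orientation
    (theta); wavelength sets the stripe period."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate the grid
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))  # Gaussian envelope
    return g * np.cos(2 * np.pi * xr / wavelength)

def match_primitive(patch, sigmas=(1.0, 2.0),
                    thetas=(0.0, np.pi/4, np.pi/2, 3*np.pi/4)):
    """Return the (sigma, theta) whose Gabor filter responds most strongly
    to the patch, i.e. the estimated scale and direction of the primitive."""
    best, best_resp = None, -np.inf
    for s in sigmas:
        for t in thetas:
            k = gabor_kernel(patch.shape[0], s, t, wavelength=4.0)
            resp = abs(np.sum(patch * k))        # inner-product response
            if resp > best_resp:
                best, best_resp = (s, t), resp
    return best
```

A patch that is itself a Gabor pattern should be matched by the filter with the same scale and orientation (a matched-filter effect).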
3. The multitask recognition method fusing video perception according to claim 2, characterized in that establishing the multi-source heterogeneous video data feature collaboration model and determining the dimension-regularized shared semantic association relationships comprises:
assuming there are C classes of heterogeneous channel features, where for each class i (i = 1, …, C) a feature matrix of n_i training samples is given, the data noise part is E, and Γ is a rotation factor, establishing the optimization function under an orthogonality constraint:
where λ denotes the sharing-matrix coefficient, the superscript T denotes matrix transposition, Y_i denotes the label of the i-th feature class, F denotes the Frobenius norm, Θ_i^T denotes the transpose of the projection matrix Θ_i, α, β, μ_1 and μ_2 are multiplier factors, and rank(X) is the rank of the feature matrix X;
obtaining the low-dimensional manifold subspaces {Θ_i} of the generic semantic features, the semantic sharing matrix W_0 under a unified framework, and the feature-specific module matrices {W_i}; using the least-squares method to solve for the joint optimal solution of the prediction loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}), and the regularization function R_3(W_0, {W_i});
projecting newly input multi-source heterogeneous video data into the feature space to extract a dimension-regularized high-level generic feature description, thereby establishing the shared semantic association relationships.
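The projection of heterogeneous channels into low-dimensional subspaces {Θ_i} and the least-squares fit of a shared matrix W_0 can be sketched roughly as follows. The SVD-derived subspace and the single ridge term standing in for the regularization function are simplifying assumptions; the patent's full objective also carries reconstruction, rank, and orthogonality terms not reproduced here.

```python
import numpy as np

def fit_shared_subspace(channels, labels, dim=4, lam=0.1):
    """Project each channel X_i into a low-dimensional subspace Theta_i
    (leading right singular vectors) and fit one shared weight matrix W0
    on the concatenated projections by regularized least squares."""
    thetas, projected = [], []
    for X in channels:
        # SVD of the centered channel gives its manifold basis
        _, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
        theta = vt[:dim].T                       # shape: (features, dim)
        thetas.append(theta)
        projected.append(X @ theta)
    Z = np.hstack(projected)                     # common feature description
    Y = np.asarray(labels, dtype=float)
    # ridge-regularized least squares: (Z'Z + lam*I) W0 = Z'Y
    w0 = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ Y)
    return thetas, w0
```

New data from each channel would then be projected through its Θ_i and scored with W_0, mirroring the projection step in the claim.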
4. The multitask recognition method fusing video perception according to claim 3, characterized in that in step S120, using context-aware computing theory to establish the task-collaboration feature association learning mechanism and generate the context-aware task interaction prediction model comprises:
constructing the mapping function between visual labels and generic features under a low-rank constraint to realize feature-label collaboration; introducing kernel functions to model label correlation and feature correlation, while introducing a graph regularization term to retain the intrinsic structure of the existing data, realizing label prediction for unlabeled features; establishing the following unconstrained function:
where g is the mapping function of feature association learning, the data fidelity term Q(·) minimizes the loss function evaluating the error between the given labels and the task prediction results obtained by g, the fitting term fits the given labels, Φ(g) and Λ(g) are regularization terms based on prior assumptions, and λ and γ are regularization parameters;
the context-aware task interaction prediction model comprises an interactive environment, an environment model, and a loss model;
the environment model is used to learn the dynamic environmental changes of the input features, and the loss model is used to estimate the environment model's loss and to predict the visual regions, targets, and tasks that will need attention and recognition in the future;
the interactive environment comprises: defining the state space as composed of the generic feature descriptions at time t and time t−1, the current state at time t specifying the recognition task a_t, and predicting the task state to be recognized at the next time t+1;
the environment model comprises: given the historical information, learning the environment model mapping ξ(h) → η(h) from the generic-feature history mapping function ξ: H → X and the ground-truth-label history mapping function η: H → Y; denoting the environment model as ω with ω(ξ(h)) ∈ Y, and introducing the loss model L_wm(ω(ξ(h)), η(h)) at each task prediction; task prediction involves H = {h = (s_{t−k}, a_{t−k}, …, s_t, a_t, s_{t+1})}, ξ(h) = (s_{t−k}, a_{t−k}, …, s_t, a_t), and η(h) = s_{t+1}; an inverse-dynamics prediction mechanism and a softmax cross-entropy loss predict future states, and a neural network model ω_φ trained by stochastic gradient descent encodes all states, with shared weights, into a low-dimensional latent space to complete visual-attention region extraction and state prediction;
the loss model comprises: given the state s_t and a suggested next task, predicting for the environment model R_l the probability distribution over which task occurs, with a softmax cross-entropy loss function encoding the state of the next task as the penalty term.
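The loss model's softmax cross-entropy penalty over next-task predictions can be sketched as below. The logit vector is assumed to come from the environment model; here it is simply an input array.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def loss_model(task_logits, observed_next_task):
    """Softmax cross-entropy penalty: how surprising the observed next
    task is under the model's predicted task distribution."""
    p = softmax(task_logits)
    return -np.log(p[observed_next_task])

def predict_next_task(task_logits):
    """The task the environment model considers most likely next."""
    return int(np.argmax(softmax(task_logits)))
```

A task the model rates as likely incurs a small penalty; an unlikely one incurs a large penalty, which is what drives the dynamic adjustment of attention.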
5. The multitask recognition method fusing video perception according to claim 4, characterized in that in step S130, for a long-duration input video stream, establishing the long-term-dependency generation memory model in combination with the context-aware task interaction prediction model comprises:
enhancing the sequential generation model with an external memory system, storing the effective information of the generic feature descriptions from the early stage of the sequence, and establishing a sustainable generation memory model over the stored information; specifically,
the generation memory model comprises the feature-collaboration generic feature description set e_{≤T} = {e_1, e_2, …, e_T} and the task-collaboration latent variable set z_{≤T} = {z_1, z_2, …, z_T}; the translation mapping h_t = f_h(h_{t−1}, e_t, z_t) corrects the deterministic hidden state variable h_t at each time point; the prior mapping function f_z(h_{t−1}) describes the nonlinear dependence between past observations and the latent variables and provides the latent-variable distribution parameters; the nonlinear observation mapping function f_e(z_t, h_{t−1}) provides the likelihood function depending on the latent variables and the state; an external memory model corrects the sequential variational autoencoder, generating a memory text Ψ_t at each time point; the prior and posterior are respectively expressed as follows:
Prior: p_θ(z_t | z_{<t}, e_{<t}) = N(z_t | f_z^μ(Ψ_{t−1}), f_z^σ(Ψ_{t−1}))
Posterior: q_φ(z_t | z_{<t}, e_{≤t}) = N(z_t | f_q^μ(Ψ_{t−1}, e_t), f_q^σ(Ψ_{t−1}, e_t))
where f_z^μ is the translation mapping function for the mean μ of latent variable z, f_z^σ is the translation mapping function for its standard deviation σ, f_q^μ is the translation mapping function for the mean μ of the posterior q, and f_q^σ is the translation mapping function for its standard deviation σ; the prior is a diagonal Gaussian distribution over the memory text through the prior mapping f_z, and the diagonal-Gaussian approximate posterior depends on the memory text Ψ_{t−1} associated through the posterior mapping function f_q and on the current observation e_t.
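The structure above, a diagonal Gaussian prior computed from the memory context and a diagonal Gaussian posterior that also conditions on the current observation, can be sketched with placeholder mappings. The tanh/exp functions below merely stand in for the learned mappings f_z and f_q; the KL term is the standard posterior-vs-prior penalty of a variational objective.

```python
import numpy as np

def diag_gauss_kl(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2) ), the
    posterior-vs-prior term of the variational objective."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

def latent_step(memory, obs, rng):
    """One time step: the prior depends only on the memory context; the
    posterior also conditions on the current observation. Sample z by
    reparameterization (placeholder mappings throughout)."""
    mu_p = np.tanh(memory)                        # stand-in for f_z^mu
    sig_p = np.exp(-np.abs(memory)) + 0.1         # stand-in for f_z^sigma
    mu_q = np.tanh(memory + obs)                  # stand-in for f_q^mu
    sig_q = 0.5 * np.ones_like(mu_q)              # stand-in for f_q^sigma
    z = mu_q + sig_q * rng.standard_normal(mu_q.shape)   # reparameterize
    return z, diag_gauss_kl(mu_q, sig_q, mu_p, sig_p)
```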
6. The multitask recognition method fusing video perception according to claim 5, characterized in that establishing the deep autonomous semi-supervised continuous recognition system based on collaborative dynamics to realize multitask recognition comprises:
in a deep collaborative recognizer based on the generation memory model, using the evolution process of a collaboration potential-energy function, introducing the memory model into the dynamic process of co-evolution, reducing the solving of the prototype mode and its companion mode to a nonlinear optimization problem, and obtaining the optimized contraction-network weights;
a long short-term memory network f_rnn is used to advance the state history h_t, and the external memory M_t is generated from the hidden variable of the previous moment and the external text information c_t; the generation model is as follows:
State update: (h_t, M_t) = f_rnn(h_{t−1}, M_{t−1}, z_{t−1}, c_t)
to form a task recognition instruction derived from the memory M_t, a set of key values is introduced; cosine similarity is used to compare the key value with each row of the memory M_{t−1}, generating task attention weights; the retrieved memory is obtained as the attention-weighted sum of the rows of M_{t−1}, realizing multitask recognition; wherein:
Key value:
Task weighting:
Retrieval memory:
Recognition generation:
where the key value function of the r-th item is used to advance the state history, f_att is the attention mechanism function, the memory weight of the i-th point of the r-th item at time t weights the retrieval, the retrieval-memory equation yields the retrieved result, ⊙ denotes element-wise multiplication, the associated bias value is learned from the retrieved memory, and σ(·) is the sigmoid function; together these form the expression mechanism informing memory storage and retrieval, and the result serves as the output of the generation memory model.
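The cosine-similarity retrieval over memory rows with attention weights can be sketched as follows. The softmax normalization of the similarities is an assumption about how the task attention weights are formed; the patent's exact weighting formula is not reproduced in the published text.

```python
import numpy as np

def cosine_sim(key, memory):
    """Cosine similarity between the key and each row of the memory."""
    return (memory @ key) / (np.linalg.norm(memory, axis=1)
                             * np.linalg.norm(key) + 1e-8)

def retrieve(memory, key):
    """Compare the key against every memory row, turn the similarities
    into attention weights, and return the weighted sum of rows as the
    retrieved memory."""
    sims = cosine_sim(key, memory)
    e = np.exp(sims - sims.max())
    w = e / e.sum()                      # task attention weights
    return w @ memory, w                 # retrieved memory and its weights
```

A key that matches one memory row closely concentrates the attention weight on that row, so the retrieved memory approximates the stored content for that task.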
7. A multitask collaborative recognition system fusing video perception, characterized by comprising: a generic feature extraction module, a collaborative feature learning module, and a deep collaborative recognition module;
the generic feature extraction module is used to combine biological perception mechanisms and, based on the shared semantic mechanism of multi-source heterogeneous video data feature collaboration, extract the generic features of multi-source heterogeneous video data;
the collaborative feature learning module is used to establish the task-collaboration feature association learning mechanism using context-aware computing theory, continuously learn from the generic features of the multi-source heterogeneous video data as prior knowledge, and generate a context-aware task association prediction model;
the deep collaborative recognition module is used, for a long-duration input video stream, to establish a long-term-dependency generation memory model in combination with the context-aware task association prediction model, and to establish a deep autonomous semi-supervised continuous recognition system based on collaborative dynamics, realizing multitask recognition.
8. The multitask collaborative recognition system fusing video perception according to claim 7, characterized in that the generic feature extraction module comprises a primitive collaboration submodule, a dictionary collaboration submodule, and a topic collaboration submodule;
the primitive collaboration submodule is used to train video image primitives using independent component analysis, match the video image primitives one by one using Gabor functions, estimate the scale and direction corresponding to each video image primitive, extract the primitive features of the video images, and realize efficient space-time-domain coding of the internal structure of the video images;
the dictionary collaboration submodule is used to apply locally linear coding with local distance as the regularization term of the sparse basis functions, compute the best response signal of the original dictionary, use the best response signal to compute a feasible dictionary search direction, and complete one dictionary update; establish one coded concept stream for each data channel as the reference semantic coding of complex events; perform dynamic time alignment between newly input low-level feature streams and the reference semantic coding; and generate a time translation function to realize dictionary semantic alignment;
the topic collaboration submodule is used to construct, using latent semantic analysis, a co-occurrence matrix between the dictionary and the video image primitive features, embody the semantic concept corresponding to each topic with hidden nodes, realize the description of the mapping relationships among vocabulary, topic nodes, and scenes by probabilistic inference, and compute the video conditional probability under the topic distribution as the category-specific similarity, i.e. the likelihood function of the true probability and the predicted probability between concept vocabulary and scenes.
9. The multitask collaborative recognition system fusing video perception according to claim 8, characterized in that the collaborative feature learning module comprises a feature association learning submodule and a context-aware task interaction prediction submodule;
the feature association learning submodule is used to construct the mapping function between visual labels and generic features under a low-rank constraint, realizing feature-label collaboration;
the context-aware task interaction prediction submodule is used to combine the learned feature association relationships with the prior knowledge of visual perception and, through a task-collaboration processing mechanism based on the environment model and the loss function, dynamically and adaptively adjust the tasks to be recognized according to scene changes, completing the perception of visual-attention regions and the dynamic adjustment of task-requirement prediction.
10. The multitask collaborative recognition system fusing video perception according to claim 9, characterized in that the deep collaborative recognition module comprises a long-term-dependency generation memory model submodule and a multitask deep collaborative recognition submodule;
the long-term-dependency generation memory model submodule is used, for a long-duration input video stream, to establish the long-term-dependency generation memory model in combination with the context-aware task interaction prediction model;
the multitask deep collaborative recognition submodule is used to establish the deep autonomous semi-supervised continuous recognition system based on collaborative dynamics, realizing multitask recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810744934.4A CN108846384A (en) | 2018-07-09 | 2018-07-09 | Merge the multitask coordinated recognition methods and system of video-aware |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108846384A true CN108846384A (en) | 2018-11-20 |
Family
ID=64195944
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810744934.4A Withdrawn CN108846384A (en) | 2018-07-09 | 2018-07-09 | Merge the multitask coordinated recognition methods and system of video-aware |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846384A (en) |
Cited By (44)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376964A (en) * | 2018-12-10 | 2019-02-22 | 杭州世平信息科技有限公司 | A kind of criminal case charge prediction technique based on Memory Neural Networks |
CN109376963A (en) * | 2018-12-10 | 2019-02-22 | 杭州世平信息科技有限公司 | A kind of criminal case charge law article unified prediction neural network based |
CN109492059A (en) * | 2019-01-03 | 2019-03-19 | 北京理工大学 | A kind of multi-source heterogeneous data fusion and Modifying model process management and control method |
CN109687845A (en) * | 2018-12-25 | 2019-04-26 | 苏州大学 | A kind of sparse regularization multitask sef-adapting filter network of the cluster of robust |
CN109711411A (en) * | 2018-12-10 | 2019-05-03 | 浙江大学 | A kind of image segmentation and identification method based on capsule neuron |
CN109784399A (en) * | 2019-01-11 | 2019-05-21 | 中国人民解放军海军航空大学 | Based on the multi-source image target association method for improving dictionary learning |
CN109919177A (en) * | 2019-01-23 | 2019-06-21 | 西北工业大学 | Feature selection approach based on stratification depth network |
CN109933788A (en) * | 2019-02-14 | 2019-06-25 | 北京百度网讯科技有限公司 | Type determines method, apparatus, equipment and medium |
CN109977194A (en) * | 2019-03-20 | 2019-07-05 | 华南理工大学 | Text similarity computing method, system, equipment and medium based on unsupervised learning |
CN109992703A (en) * | 2019-01-28 | 2019-07-09 | 西安交通大学 | A kind of credibility evaluation method of the differentiation feature mining based on multi-task learning |
CN110020626A (en) * | 2019-04-09 | 2019-07-16 | 中通服公众信息产业股份有限公司 | A kind of multi-source heterogeneous data personal identification method based on attention mechanism |
CN110147711A (en) * | 2019-02-27 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Video scene recognition methods, device, storage medium and electronic device |
CN110245267A (en) * | 2019-05-17 | 2019-09-17 | 天津大学 | Multi-user's video flowing deep learning is shared to calculate multiplexing method |
CN110309861A (en) * | 2019-06-10 | 2019-10-08 | 浙江大学 | A kind of multi-modal mankind's activity recognition methods based on generation confrontation network |
CN110378190A (en) * | 2019-04-23 | 2019-10-25 | 南京邮电大学 | Video content detection system and detection method based on topic identification |
CN110688916A (en) * | 2019-09-12 | 2020-01-14 | 武汉理工大学 | Video description method and device based on entity relationship extraction |
CN110928889A (en) * | 2019-10-23 | 2020-03-27 | 深圳市华讯方舟太赫兹科技有限公司 | Training model updating method, device and computer storage medium |
CN110956105A (en) * | 2019-11-20 | 2020-04-03 | 北京影谱科技股份有限公司 | Gesture recognition method based on semantic probability network |
CN111160443A (en) * | 2019-12-25 | 2020-05-15 | 浙江大学 | Activity and user identification method based on deep multitask learning |
CN111242318A (en) * | 2020-01-13 | 2020-06-05 | 拉扎斯网络科技(上海)有限公司 | Business model training method and device based on heterogeneous feature library |
CN111488840A (en) * | 2020-04-15 | 2020-08-04 | 桂林电子科技大学 | Human behavior classification method based on multi-task learning model |
CN112100256A (en) * | 2020-08-06 | 2020-12-18 | 北京航空航天大学 | Data-driven urban accurate depth image system and method |
CN112527993A (en) * | 2020-12-17 | 2021-03-19 | 浙江财经大学东方学院 | Cross-media hierarchical deep video question-answer reasoning framework |
CN112766470A (en) * | 2019-10-21 | 2021-05-07 | 地平线(上海)人工智能技术有限公司 | Feature data processing method, instruction sequence generation method, device and equipment |
CN113110517A (en) * | 2021-05-24 | 2021-07-13 | 郑州大学 | Multi-robot collaborative search method based on biological elicitation in unknown environment |
CN113128669A (en) * | 2021-04-08 | 2021-07-16 | 中国科学院计算技术研究所 | Neural network model for semi-supervised learning and semi-supervised learning method |
CN113220911A (en) * | 2021-05-25 | 2021-08-06 | 中国农业科学院农业信息研究所 | Agricultural multi-source heterogeneous data analysis and mining method and application thereof |
CN113268818A (en) * | 2021-07-19 | 2021-08-17 | 中国空气动力研究与发展中心计算空气动力研究所 | Pneumatic global optimization method based on topological mapping generation, storage medium and terminal |
CN113285721A (en) * | 2021-06-10 | 2021-08-20 | 北京邮电大学 | Reconstruction and prediction algorithm for sparse mobile sensing data |
CN113411765A (en) * | 2021-05-22 | 2021-09-17 | 西北工业大学 | Mobile intelligent terminal energy consumption optimization method based on multi-sensor cooperative sensing |
CN113438204A (en) * | 2021-05-06 | 2021-09-24 | 中国地质大学(武汉) | Multi-node cooperative identification response method based on block chain |
CN113505611A (en) * | 2021-07-09 | 2021-10-15 | 中国人民解放军战略支援部队信息工程大学 | Training method and system for obtaining better speech translation model in generation of confrontation |
CN113537355A (en) * | 2021-07-19 | 2021-10-22 | 金鹏电子信息机器有限公司 | Multi-element heterogeneous data semantic fusion method and system for security monitoring |
CN113780578A (en) * | 2021-09-08 | 2021-12-10 | 北京百度网讯科技有限公司 | Model training method and device, electronic equipment and readable storage medium |
CN113822048A (en) * | 2021-09-16 | 2021-12-21 | 电子科技大学 | Social media text denoising method based on space-time burst characteristics |
CN113949880A (en) * | 2021-09-02 | 2022-01-18 | 北京大学 | Extremely-low-bit-rate man-machine collaborative image coding training method and coding and decoding method |
CN114694177A (en) * | 2022-03-10 | 2022-07-01 | 电子科技大学 | Fine-grained character attribute identification method based on multi-scale features and attribute association mining |
CN114783022A (en) * | 2022-04-08 | 2022-07-22 | 马上消费金融股份有限公司 | Information processing method and device, computer equipment and storage medium |
CN114898319A (en) * | 2022-05-25 | 2022-08-12 | 山东大学 | Vehicle type recognition method and system based on multi-sensor decision-level information fusion |
CN115632684A (en) * | 2022-12-21 | 2023-01-20 | 香港中文大学(深圳) | Transmission strategy design method of perception and communication integrated system |
CN115985402A (en) * | 2023-03-20 | 2023-04-18 | 北京航空航天大学 | Cross-modal data migration method based on normalized flow theory |
CN116503029A (en) * | 2023-06-27 | 2023-07-28 | 北京中电科卫星导航系统有限公司 | Module data cooperative processing method and system for automatic driving |
CN117292274A (en) * | 2023-11-22 | 2023-12-26 | 成都信息工程大学 | Hyperspectral wet image classification method based on zero-order learning of deep semantic dictionary |
CN111815030B (en) * | 2020-06-11 | 2024-02-06 | 浙江工商大学 | Multi-target feature prediction method based on small amount of questionnaire survey data |
Cited By (67)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109376963B (en) * | 2018-12-10 | 2022-04-08 | 杭州世平信息科技有限公司 | Criminal case and criminal name and criminal law joint prediction method based on neural network |
CN109376963A (en) * | 2018-12-10 | 2019-02-22 | 杭州世平信息科技有限公司 | A kind of criminal case charge law article unified prediction neural network based |
CN109711411A (en) * | 2018-12-10 | 2019-05-03 | 浙江大学 | A kind of image segmentation and identification method based on capsule neuron |
CN109376964A (en) * | 2018-12-10 | 2019-02-22 | 杭州世平信息科技有限公司 | A kind of criminal case charge prediction technique based on Memory Neural Networks |
CN109376964B (en) * | 2018-12-10 | 2021-11-12 | 杭州世平信息科技有限公司 | Criminal case criminal name prediction method based on memory neural network |
CN109687845A (en) * | 2018-12-25 | 2019-04-26 | 苏州大学 | A kind of sparse regularization multitask sef-adapting filter network of the cluster of robust |
CN109492059A (en) * | 2019-01-03 | 2019-03-19 | 北京理工大学 | A kind of multi-source heterogeneous data fusion and Modifying model process management and control method |
CN109492059B (en) * | 2019-01-03 | 2020-10-27 | 北京理工大学 | Multi-source heterogeneous data fusion and model correction process control method |
CN109784399A (en) * | 2019-01-11 | 2019-05-21 | 中国人民解放军海军航空大学 | Based on the multi-source image target association method for improving dictionary learning |
CN109919177B (en) * | 2019-01-23 | 2022-03-29 | 西北工业大学 | Feature selection method based on hierarchical deep network |
CN109919177A (en) * | 2019-01-23 | 2019-06-21 | 西北工业大学 | Feature selection approach based on stratification depth network |
CN109992703A (en) * | 2019-01-28 | 2019-07-09 | 西安交通大学 | A kind of credibility evaluation method of the differentiation feature mining based on multi-task learning |
CN109992703B (en) * | 2019-01-28 | 2022-03-01 | 西安交通大学 | Reliability evaluation method for differentiated feature mining based on multi-task learning |
CN109933788B (en) * | 2019-02-14 | 2023-05-23 | 北京百度网讯科技有限公司 | Type determining method, device, equipment and medium |
CN109933788A (en) * | 2019-02-14 | 2019-06-25 | 北京百度网讯科技有限公司 | Type determines method, apparatus, equipment and medium |
CN110147711A (en) * | 2019-02-27 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Video scene recognition methods, device, storage medium and electronic device |
CN110147711B (en) * | 2019-02-27 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Video scene recognition method and device, storage medium and electronic device |
CN109977194A (en) * | 2019-03-20 | 2019-07-05 | 华南理工大学 | Text similarity computing method, system, equipment and medium based on unsupervised learning |
CN109977194B (en) * | 2019-03-20 | 2021-08-10 | 华南理工大学 | Text similarity calculation method, system, device and medium based on unsupervised learning |
CN110020626A (en) * | 2019-04-09 | 2019-07-16 | 中通服公众信息产业股份有限公司 | A kind of multi-source heterogeneous data personal identification method based on attention mechanism |
CN110378190B (en) * | 2019-04-23 | 2022-10-04 | 南京邮电大学 | Video content detection system and detection method based on topic identification |
CN110378190A (en) * | 2019-04-23 | 2019-10-25 | 南京邮电大学 | Video content detection system and detection method based on topic identification |
CN110245267B (en) * | 2019-05-17 | 2023-08-11 | 天津大学 | Multi-user video stream deep learning sharing calculation multiplexing method |
CN110245267A (en) * | 2019-05-17 | 2019-09-17 | 天津大学 | Multi-user's video flowing deep learning is shared to calculate multiplexing method |
CN110309861A (en) * | 2019-06-10 | 2019-10-08 | 浙江大学 | A kind of multi-modal mankind's activity recognition methods based on generation confrontation network |
CN110688916A (en) * | 2019-09-12 | 2020-01-14 | 武汉理工大学 | Video description method and device based on entity relationship extraction |
CN112766470B (en) * | 2019-10-21 | 2024-05-07 | 地平线(上海)人工智能技术有限公司 | Feature data processing method, instruction sequence generating method, device and equipment |
CN112766470A (en) * | 2019-10-21 | 2021-05-07 | 地平线(上海)人工智能技术有限公司 | Feature data processing method, instruction sequence generation method, device and equipment |
CN110928889A (en) * | 2019-10-23 | 2020-03-27 | 深圳市华讯方舟太赫兹科技有限公司 | Training model updating method, device and computer storage medium |
CN110956105A (en) * | 2019-11-20 | 2020-04-03 | 北京影谱科技股份有限公司 | Gesture recognition method based on semantic probability network |
CN111160443B (en) * | 2019-12-25 | 2023-05-23 | 浙江大学 | Activity and user identification method based on deep multitasking learning |
CN111160443A (en) * | 2019-12-25 | 2020-05-15 | 浙江大学 | Activity and user identification method based on deep multitask learning |
CN111242318B (en) * | 2020-01-13 | 2024-04-26 | 拉扎斯网络科技(上海)有限公司 | Service model training method and device based on heterogeneous feature library |
CN111242318A (en) * | 2020-01-13 | 2020-06-05 | 拉扎斯网络科技(上海)有限公司 | Business model training method and device based on heterogeneous feature library |
CN111488840A (en) * | 2020-04-15 | 2020-08-04 | 桂林电子科技大学 | Human behavior classification method based on multi-task learning model |
CN111815030B (en) * | 2020-06-11 | 2024-02-06 | 浙江工商大学 | Multi-target feature prediction method based on small amount of questionnaire survey data |
CN112100256B (en) * | 2020-08-06 | 2023-05-26 | 北京航空航天大学 | Data-driven urban precise depth portrait system and method |
CN112100256A (en) * | 2020-08-06 | 2020-12-18 | 北京航空航天大学 | Data-driven urban accurate depth image system and method |
CN112527993A (en) * | 2020-12-17 | 2021-03-19 | 浙江财经大学东方学院 | Cross-media hierarchical deep video question-answer reasoning framework |
CN113128669A (en) * | 2021-04-08 | 2021-07-16 | 中国科学院计算技术研究所 | Neural network model for semi-supervised learning and semi-supervised learning method |
CN113438204A (en) * | 2021-05-06 | 2021-09-24 | 中国地质大学(武汉) | Multi-node cooperative identification response method based on block chain |
CN113411765A (en) * | 2021-05-22 | 2021-09-17 | 西北工业大学 | Mobile intelligent terminal energy consumption optimization method based on multi-sensor cooperative sensing |
CN113110517A (en) * | 2021-05-24 | 2021-07-13 | 郑州大学 | Multi-robot collaborative search method based on biological elicitation in unknown environment |
CN113220911B (en) * | 2021-05-25 | 2024-02-02 | 中国农业科学院农业信息研究所 | Agricultural multi-source heterogeneous data analysis and mining method and application thereof |
CN113220911A (en) * | 2021-05-25 | 2021-08-06 | 中国农业科学院农业信息研究所 | Agricultural multi-source heterogeneous data analysis and mining method and application thereof |
CN113285721A (en) * | 2021-06-10 | 2021-08-20 | 北京邮电大学 | Reconstruction and prediction algorithm for sparse mobile sensing data |
CN113505611A (en) * | 2021-07-09 | 2021-10-15 | 中国人民解放军战略支援部队信息工程大学 | Training method and system for obtaining a better speech translation model through generative adversarial training |
CN113537355A (en) * | 2021-07-19 | 2021-10-22 | 金鹏电子信息机器有限公司 | Multi-element heterogeneous data semantic fusion method and system for security monitoring |
CN113268818A (en) * | 2021-07-19 | 2021-08-17 | 中国空气动力研究与发展中心计算空气动力研究所 | Pneumatic global optimization method based on topological mapping generation, storage medium and terminal |
CN113949880A (en) * | 2021-09-02 | 2022-01-18 | 北京大学 | Extremely-low-bit-rate man-machine collaborative image coding training method and coding and decoding method |
CN113780578B (en) * | 2021-09-08 | 2023-12-12 | 北京百度网讯科技有限公司 | Model training method, device, electronic equipment and readable storage medium |
CN113780578A (en) * | 2021-09-08 | 2021-12-10 | 北京百度网讯科技有限公司 | Model training method and device, electronic equipment and readable storage medium |
CN113822048B (en) * | 2021-09-16 | 2023-03-21 | 电子科技大学 | Social media text denoising method based on space-time burst characteristics |
CN113822048A (en) * | 2021-09-16 | 2021-12-21 | 电子科技大学 | Social media text denoising method based on space-time burst characteristics |
CN114694177A (en) * | 2022-03-10 | 2022-07-01 | 电子科技大学 | Fine-grained character attribute identification method based on multi-scale features and attribute association mining |
CN114694177B (en) * | 2022-03-10 | 2023-04-28 | 电子科技大学 | Fine-grained character attribute identification method based on multi-scale feature and attribute association mining |
CN114783022B (en) * | 2022-04-08 | 2023-07-21 | 马上消费金融股份有限公司 | Information processing method, device, computer equipment and storage medium |
CN114783022A (en) * | 2022-04-08 | 2022-07-22 | 马上消费金融股份有限公司 | Information processing method and device, computer equipment and storage medium |
CN114898319A (en) * | 2022-05-25 | 2022-08-12 | 山东大学 | Vehicle type recognition method and system based on multi-sensor decision-level information fusion |
CN114898319B (en) * | 2022-05-25 | 2024-04-02 | 山东大学 | Vehicle type recognition method and system based on multi-sensor decision level information fusion |
CN115632684A (en) * | 2022-12-21 | 2023-01-20 | 香港中文大学(深圳) | Transmission strategy design method for an integrated sensing and communication system |
CN115985402B (en) * | 2023-03-20 | 2023-09-19 | 北京航空航天大学 | Cross-modal data migration method based on normalized flow theory |
CN115985402A (en) * | 2023-03-20 | 2023-04-18 | 北京航空航天大学 | Cross-modal data migration method based on normalized flow theory |
CN116503029A (en) * | 2023-06-27 | 2023-07-28 | 北京中电科卫星导航系统有限公司 | Module data cooperative processing method and system for automatic driving |
CN116503029B (en) * | 2023-06-27 | 2023-09-05 | 北京中电科卫星导航系统有限公司 | Module data cooperative processing method and system for automatic driving |
CN117292274B (en) * | 2023-11-22 | 2024-01-30 | 成都信息工程大学 | Hyperspectral wetland image classification method based on zero-shot learning with a deep semantic dictionary |
CN117292274A (en) * | 2023-11-22 | 2023-12-26 | 成都信息工程大学 | Hyperspectral wetland image classification method based on zero-shot learning with a deep semantic dictionary |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108846384A (en) | Merge the multitask coordinated recognition methods and system of video-aware | |
Qin et al. | A dual-stage attention-based recurrent neural network for time series prediction | |
Kaymak et al. | A brief survey and an application of semantic image segmentation for autonomous driving | |
CN108804715A (en) | Merge multitask coordinated recognition methods and the system of audiovisual perception | |
CN109829541A (en) | Deep neural network incremental training method and system based on learning automaton | |
CN111507378A (en) | Method and apparatus for training image processing model | |
CN111860951A (en) | Rail transit passenger flow prediction method based on dynamic hypergraph convolutional network | |
CN116415654A (en) | Data processing method and related equipment | |
CN109102000A (en) | Image recognition method based on hierarchical feature extraction and multilayer spiking neural networks | |
Alshmrany | Adaptive learning style prediction in e-learning environment using levy flight distribution based CNN model | |
CN112417289B (en) | Information intelligent recommendation method based on deep clustering | |
Chen et al. | Binarized neural architecture search for efficient object recognition | |
Gupta et al. | Rv-gan: Recurrent gan for unconditional video generation | |
Qin et al. | [Retracted] Evaluation of College Students’ Ideological and Political Education Management Based on Wireless Network and Artificial Intelligence with Big Data Technology | |
Gao | Application of convolutional neural network in emotion recognition of ideological and political teachers in colleges and universities | |
CN113657272B (en) | Micro video classification method and system based on missing data completion | |
CN113553918B (en) | Machine ticket issuing character recognition method based on pulse active learning | |
Wu et al. | Short-term memory neural network-based cognitive computing in sports training complexity pattern recognition | |
CN113408721A (en) | Neural network structure searching method, apparatus, computer device and storage medium | |
CN117131933A (en) | Multi-mode knowledge graph establishing method and application | |
CN116737897A (en) | Intelligent building knowledge extraction model and method based on multiple modes | |
Ikram | A benchmark for evaluating Deep Learning based Image Analytics | |
Zhang et al. | A fast evolutionary knowledge transfer search for multiscale deep neural architecture | |
Su et al. | Soft regression of monocular depth using scale-semantic exchange network | |
CN112036546A (en) | Sequence processing method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 2018-11-20 |