CN108804715A - Multitask collaborative recognition method and system fusing audiovisual perception - Google Patents
Multitask collaborative recognition method and system fusing audiovisual perception
- Publication number
- CN108804715A CN108804715A CN201810746362.3A CN201810746362A CN108804715A CN 108804715 A CN108804715 A CN 108804715A CN 201810746362 A CN201810746362 A CN 201810746362A CN 108804715 A CN108804715 A CN 108804715A
- Authority
- CN
- China
- Prior art keywords
- memory
- feature
- data
- heterogeneous data
- source heterogeneous
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Image Analysis (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a multitask collaborative recognition method and system fusing audiovisual perception, belonging to the technical field of multi-source heterogeneous data processing and recognition. The system comprises a generic feature extraction module, a collaborative feature learning module, and a context-adaptive feedback evaluation and recognition module. Based on a time-synchronization matching mechanism for multi-source heterogeneous data, generic features of the data are extracted; a long-term dependency memory model is established and, combined with a collaborative attention mechanism based on external dependency, the generic features are continuously learned as prior knowledge; environment perception parameters are extracted from the multi-source heterogeneous data, and a progressive-network deep collaborative enhanced recognition mechanism is established which, combined with the learned features of the memory model and the task requirements, achieves multitask recognition. By combining context-aware computing theory with environment perception, the present invention judges the weight of the tasks to be recognized through deep reinforcement feedback, adaptively adjusts the priority of those tasks according to environmental changes, and outputs multiple audiovisual perception recognition results simultaneously.
Description
Technical field
The present invention relates to the technical field of multi-source heterogeneous data processing and recognition, and in particular to a multitask collaborative recognition method and system fusing audiovisual perception.
Background technology
After sixty years of ups and downs, artificial intelligence is about to enter an era of full-scale growth, riding the information-technology wave of the Internet, the mobile Internet, and the Internet of Things, centered on deep neural network algorithms and supported by big data, cloud computing, and intelligent terminals. The continuous growth of communication bandwidth and transmission speed has rapidly lowered the threshold for acquiring massive audio/video data. Faced with the urgent demand for ultra-high-speed, mobile, and generalized storage and processing of massive data, the weak artificial intelligence of traditional single-modality, single-task processing has become the main bottleneck restricting the development of the field.
So-called audio-visual media multitask perception recognition refers to extracting the generic features of multi-source heterogeneous audiovisual information based on biological audiovisual perception mechanisms, combining them with long-duration deep hierarchical recursive models, learning the spatio-temporal shared semantic associations provided by long-term memory, and, under a reinforcement feedback mechanism, producing context-adaptive collaborative recognition results for different audiovisual tasks. For example, given a segment of audio-video data in which "Xiaoming, bouncing along, says to the school 'Hello, teacher!'", brain-inspired cognition should recognize multiple audiovisual tasks at once: the scene (school), the target (Xiaoming), the target's behavior (jumping), the target's emotion (happiness), and the target's speech ("Hello, teacher"). This contrasts with conventional methods, which build a separate recognition framework for each recognition task and output each result independently, both wasting computing resources and struggling to handle massive data.
On the one hand, in the big-data era, audio-visual media data from different platforms and terminals in social, information, and physical spaces exhibit massive heterogeneity, and traditional pattern recognition methods based on hand-selected features can no longer meet the demands of multitask collaborative recognition. On the other hand, these multi-source heterogeneous data share the same semantic information and contain rich latent associations. Taking the theme "horse" as an example, images, videos, audio clips, stereo images, and three-dimensional models can all, from complementary and mutually supporting angles, better describe the same semantic concept "horse". To better meet the needs of developing generalized strong artificial intelligence, finding an association-based generic feature description method for semantically related multi-source audio-visual media data has become the premise and foundation for further improving the processing speed, memory capacity, and robustness of intelligent perception recognition, and provides an effective data guarantee for multitask collaborative perception recognition of audio-visual media. Therefore, generic feature description methods for multi-source audio-visual media data sharing upper-layer semantics have become a research hotspot of intelligent perception technology in recent years.
From the perspective of multitask perception recognition, feature learning methods based on deep learning have shown great advantages in processing images, speech, and video. However, for massive multi-source data, with the evolution of user base, regional distribution, and time, some new problems arise:
Deep neural networks require large amounts of training data, which leaves them helpless on small-scale data tasks; facing the high cost of labeling massive training data, they perform poorly on real recognition tasks whose input is a continuous data stream.
Deep neural network models are complex, with huge numbers of parameters; training requires powerful computing facilities, and since different recognition tasks use different convolutional layer structures, rapid and balanced allocation of network resources is difficult to achieve.
Facing complex and diverse scene changes, they cannot establish long-term selective association memory and forgetting mechanisms from the temporal information of the processed data, and thus cannot realize an efficient context-adaptive learning mechanism. For example, in a video segment in which a target walks from a teaching building toward a dining hall, the target's behavior can be inferred to be "going to eat" from earlier recognition memories of the two scenes, and the corresponding dialogue topic changes accordingly.
Therefore, long-duration deep collaborative learning and reinforcement feedback for multitask audiovisual perception recognition have become one of the key problems to be solved urgently in current audiovisual intelligent perception recognition.
Summary of the invention
The purpose of the present invention is to provide a multitask collaborative recognition method and system fusing audiovisual perception that combine context-aware computing theory with environment perception, judge the weight of the tasks to be recognized through deep reinforcement feedback, adaptively adjust the priority of those tasks according to environmental changes, and discriminate among multiple audiovisual perception recognition tasks, thereby solving the technical problems existing in the background art described above.
To achieve the above goals, the present invention adopts the following technical solutions:
In one aspect, the present invention provides a multitask recognition method fusing audiovisual perception, comprising the following steps:
Step S110: Generic feature description of multi-source heterogeneous data: based on a time-synchronization matching mechanism for multi-source heterogeneous data, establish an association description model of multi-source heterogeneous data based on latent high-level shared semantics, and extract the generic features of the multi-source heterogeneous data;
Step S120: Deep collaborative feature learning with long-term memory: establish a long-term dependency memory model and, combined with a collaborative attention mechanism based on external dependency, continuously learn the generic features as prior knowledge, generating a memory model;
Step S130: Task discrimination based on a context-adaptive feedback evaluation mechanism: extract the environment perception parameters from the multi-source heterogeneous data, establish a progressive-network deep collaborative enhanced recognition mechanism, and, combined with the learned features of the memory model and the task requirements, achieve multitask recognition.
Further, in step S110, the time-synchronization matching mechanism for the multi-source heterogeneous data comprises: extracting the low-level feature streams of the multi-source heterogeneous data; establishing a coded concept stream for the data of each channel as the semantic coding of complex events; performing dynamic time warping on the low-level feature streams with reference to the semantic coding; and generating a time translation function to achieve semantic alignment.
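The dynamic time warping step above can be sketched as follows. This is a minimal illustration only: the feature streams here are plain numeric sequences and the pairing cost is a simple absolute difference, standing in for the patent's coded concept streams and semantic coding.

```python
def dtw(a, b):
    """Classic dynamic time warping between two feature streams.

    Returns the minimal cumulative alignment cost; the cost of pairing
    two samples is their absolute difference.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = minimal cost of aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch stream b
                                 D[i][j - 1],      # stretch stream a
                                 D[i - 1][j - 1])  # match both
    return D[n][m]

# Two streams carrying the same "event" at different speeds align at zero cost,
# which is the behavior the time translation function relies on.
audio_stream = [0.0, 1.0, 2.0, 1.0, 0.0]
video_stream = [0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0]
alignment_cost = dtw(audio_stream, video_stream)
```

A warping path that repeats samples of the shorter stream absorbs the speed difference, so `alignment_cost` is zero despite the unequal lengths.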
Further, extracting the low-level feature streams of the multi-source heterogeneous data comprises:
For audio signals, first performing waveform sampling preprocessing, then performing spectral transformation, and building a spectrogram in combination with prosodic features;
For two-dimensional video signals, first performing spectral transformation, then introducing co-occurrence statistical properties to obtain a two-dimensional timing signal with rotation and translation invariance;
For three-dimensional video sequences, introducing a low-level feature extraction technique that performs fast scale-space transformation based on multi-scale theory, then performing spectral transformation and co-occurrence statistics to generate temporal pyramid spectral features.
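The audio branch (spectral transformation into a spectrogram) can be illustrated with a naive framed DFT. This is a sketch under simplifying assumptions: no windowing or prosodic features, and a slow direct DFT rather than an FFT.

```python
import math

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum of one frame (first half of the bins)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def spectrogram(signal, frame_len=64, hop=32):
    """Slice the signal into overlapping frames and stack their spectra."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return [dft_magnitudes(f) for f in frames]

# A sine with 8 cycles per 64-sample frame concentrates its energy in bin 8.
sig = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(sig)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is the magnitude spectrum of one time frame, which is exactly the time-frequency grid a spectrogram displays.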
Further, establishing the association description model of multi-source heterogeneous data based on latent high-level shared semantics and extracting the generic features of the multi-source heterogeneous data comprises:
For each feature type X_i, learning a projection matrix Θ_i that projects the heterogeneous features into a space of equal feature dimensionality; through joint learning of the generic semantic feature subspace {Θ_i}, the shared matrix W_0 under a unified framework, and the specific feature module matrices {W_i}, jointly optimizing the statistical loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}), and the regularization function R_3(W_0, {W_i}), thereby establishing heterogeneous feature learning with shared semantics.
Specifically, for S classes of heterogeneous features, for each i (i = 1, ..., S), let X_i denote the feature matrix of n_i training samples, let E denote the data noise component, and let Γ be the rotation factor; the optimization function established under the orthogonality constraint is:
where λ is the shared-matrix coefficient, ᵀ denotes matrix transposition, Y_i is the label of the i-th feature class, ‖·‖_F is the Frobenius norm, Θ_iᵀ is the transpose of the projection matrix Θ_i, α, β, μ_1, and μ_2 are multiplier factors, rank(X) is the rank of the feature matrix X, and E is the noise matrix.
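The core idea of the per-type projections Θ_i, namely mapping feature types of different dimensionalities into one shared subspace of equal dimensionality, can be sketched as below. The matrices here are fixed illustrative stand-ins, not the result of the joint optimization described above.

```python
def matmul(A, B):
    """Multiply an m x n matrix by an n x p matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Two heterogeneous feature types with different dimensionalities:
# X1 holds 3-dimensional audio-like features, X2 holds 5-dimensional visual-like ones.
X1 = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]          # n1 = 2 samples
X2 = [[1.0, 1.0, 0.0, 0.0, 1.0]]                 # n2 = 1 sample

# Hypothetical projection matrices standing in for learned Theta_i,
# both mapping into the same 2-dimensional shared semantic subspace.
Theta1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                            # 3 -> 2
Theta2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.0, 1.0]]    # 5 -> 2

Z1 = matmul(X1, Theta1)   # both projections now live in the same space,
Z2 = matmul(X2, Theta2)   # so the shared matrix W0 can act on them jointly
```

After projection, samples of both feature types are 2-dimensional vectors in the same subspace, which is the precondition for learning the shared matrix W_0 across types.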
Further, establishing the association description model of multi-source heterogeneous data based on latent high-level shared semantics and extracting the generic features of the multi-source heterogeneous data further comprises:
Integrating the migration of the unlabeled data in the multi-source heterogeneous data into self-taught learning: taking the unlabeled data as the labeled target set of transfer learning, so that the target set and the supplementary set jointly optimize autonomous feature-labeling learning through {Θ_i}, denoting respectively the feature descriptions and label information of the supplementary-set samples and of the target-set samples; the transfer self-labeling learning model is expressed as follows:
where F(·) is the objective function and ρ is a multiplier factor; the transfer self-labeling learning model is solved using a three-stage refinement algorithm to obtain the generic feature description.
Further, the deep collaborative feature learning with long-term memory comprises:
The long-term dependency memory model comprises the generic feature description set e_{≤T} = {e_1, e_2, ..., e_T} and the corresponding latent variable set z_{≤T} = {z_1, z_2, ..., z_T}. A transition map h_t = f_h(h_{t-1}, e_t, z_t) updates the deterministic hidden state variable h_t at each time point; the prior mapping function f_z(h_{t-1}) describes the nonlinear dependence of past observations and latent variables and provides the latent-variable distribution parameters.
The nonlinear observation mapping function f_e(z_t, h_{t-1}) provides the likelihood function depending on the latent variable and the state. An external memory model corrects the sequential variational autoencoder, generating a memory context Ψ_t at each time point; the prior and posterior information are obtained as follows:
Prior: p_θ(z_t | z_{<t}, e_{<t}) = N(z_t | f_z^μ(Ψ_{t-1}), f_z^σ(Ψ_{t-1}))
Posterior: q_φ(z_t | z_{<t}, e_{≤t}) = N(z_t | f_q^μ(Ψ_{t-1}, e_t), f_q^σ(Ψ_{t-1}, e_t))
where f_z^μ is the transition mapping function for the mean μ of the latent variable z, f_z^σ is the transition mapping function for its standard deviation σ, f_q^μ is the transition mapping function for the posterior mean, and f_q^σ is the transition mapping function for the posterior standard deviation. The prior is a diagonal Gaussian distribution over the memory context that depends on the prior map f_z, and the diagonal Gaussian approximate posterior depends on the memory context Ψ_{t-1} and the current observation e_t associated through the posterior mapping function f_q.
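The diagonal-Gaussian prior above can be sketched with the standard reparameterization trick used in variational autoencoders. The affine maps and their weights here are arbitrary illustrative stand-ins for the transition maps f_z^μ and f_z^σ, and a fixed ε replaces a random draw so the result is deterministic.

```python
import math

def linear(v, W, b):
    """A tiny affine map standing in for a transition map."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gaussian_params(context, W_mu, b_mu, W_sigma, b_sigma):
    """Diagonal-Gaussian parameters (mu, sigma) from a memory context Psi.

    softplus keeps every sigma strictly positive, as a valid std-dev must be.
    """
    mu = linear(context, W_mu, b_mu)
    sigma = [math.log1p(math.exp(s)) for s in linear(context, W_sigma, b_sigma)]
    return mu, sigma

def reparameterize(mu, sigma, eps):
    """Sample z = mu + sigma * eps (element-wise), the reparameterization trick."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

psi = [1.0, -0.5]                                        # memory context Psi_{t-1}
W_mu, b_mu = [[0.5, 0.0], [0.0, 1.0]], [0.0, 0.0]
W_sigma, b_sigma = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]  # sigma = softplus(0)

mu, sigma = gaussian_params(psi, W_mu, b_mu, W_sigma, b_sigma)
z = reparameterize(mu, sigma, eps=[1.0, -1.0])           # latent sample z_t
```

Because z is a differentiable function of μ and σ, gradients can flow through the sampling step, which is what makes the variational model trainable end to end.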
Further, the deep collaborative feature learning with long-term memory further comprises:
Using cooperative-mode perception theory, computing the timing memory bias that the generic features generate under the influence of the task, and, from the timing memory bias and the generic features, generating the adaptive perceptual attention time zone relevant to the recognition task;
Using a long short-term memory network (LSTM) f_rnn to advance the state history h_t; the external memory M_t is generated from the latent variable of the previous time point and the external context information c_t. The state update model is as follows:
State update: (h_t, M_t) = f_rnn(h_{t-1}, M_{t-1}, z_{t-1}, c_t)
To form R items of content information retrieved from the memory M_t, a key value k_t^r is introduced; cosine similarity is used to compare k_t^r with each row of the memory M_{t-1}, generating attention weights, and the retrieved memory φ_t^r is obtained as the attention-weighted sum over M_{t-1}:
Key value: k_t^r, generated from the advanced state history
Attention weights: w_t^r = f_att(k_t^r, M_{t-1})
Retrieved memory: φ_t^r = w_t^r · M_{t-1}
Generated memory: each retrieved memory φ_t^r is combined, via the sigmoid gate σ(·) and element-wise multiplication ⊙, with the learned retrieval-memory bias
where the key-value function of the r-th item is applied to the advanced state history, f_att is the attention mechanism function, w_{t,i}^r is the memory weight of the i-th entry for the r-th item at time t, φ_t^r is the result of the retrieval-memory equation, ⊙ denotes element-wise multiplication, the retrieval-memory bias value is learned, and σ(·) is the sigmoid function. This forms an expression mechanism informing memory storage and retrieval, and Ψ_t = [φ_t^1, φ_t^2, ..., φ_t^R, h_t] serves as the output of the generative memory model.
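The cosine-similarity retrieval step can be sketched as standard content-based addressing. This is a minimal illustration for a single key: the memory rows and key vector are made-up values, and a softmax turns the similarity scores into attention weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def retrieve(key, memory):
    """Content-based addressing: score each memory row against the key with
    cosine similarity, softmax the scores into attention weights, and return
    the weighted sum of the rows as the retrieved memory."""
    weights = softmax([cosine(key, row) for row in memory])
    retrieved = [sum(w * row[j] for w, row in zip(weights, memory))
                 for j in range(len(memory[0]))]
    return weights, retrieved

M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # memory M_{t-1}, one entry per row
key = [1.0, 0.05]                          # key k_t^r, nearly parallel to row 0
weights, phi = retrieve(key, M)            # phi plays the role of phi_t^r
best = max(range(len(weights)), key=lambda i: weights[i])
```

The softmax keeps every row's contribution nonzero, so retrieval is a soft blend dominated by the entry most similar to the key rather than a hard lookup.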
Further, extracting the environment perception parameters from the multi-source heterogeneous data and establishing the progressive-network deep collaborative enhanced recognition mechanism comprises:
Obtaining a brightness perception parameter from the normalized distance between the pixel mean of the image/video and standard brightness information; obtaining a loudness perception parameter from the normalized distance between the mean sound intensity of the input audio and standard sound-intensity information; computing a viewing-angle perception parameter from the mean information content of the high-frequency image (the larger this value, the richer the image detail, i.e., the better the viewing angle); computing a sound-field perception parameter from the average energy of the transfer function from the sound source to the ear; and expressing the attention perception parameter through the attention rule parameter of the audiovisual attention time zone in heterogeneous feature learning;
Taking the weighted sum of the brightness, loudness, viewing-angle, sound-field, and attention perception parameters as the context-adaptive decision, establishing the progressive-network deep collaborative enhanced recognition mechanism, and, after layer-by-layer storage of migrated knowledge and extraction of reward features, deciding the recognition task currently to be processed.
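The weighted-sum context decision can be sketched as below. All numeric values (standard brightness 128, standard intensity 70 dB, the raw parameters, and the equal weights) are hypothetical placeholders, not values given by the patent.

```python
def normalized_distance(value, standard):
    """Normalized distance between a measured value and its standard value."""
    return abs(value - standard) / standard

def context_score(params, weights):
    """Weighted sum of the perception parameters: the context-adaptive decision."""
    return sum(weights[k] * params[k] for k in params)

# Hypothetical measurements for one candidate task's input streams.
params = {
    "brightness": 1 - normalized_distance(96.0, 128.0),  # pixel mean vs standard
    "loudness":   1 - normalized_distance(60.0, 70.0),   # sound intensity vs standard
    "view":       0.8,                                   # high-frequency detail content
    "soundfield": 0.6,                                   # transfer-function mean energy
    "attention":  0.9,                                   # attention-rule parameter
}
weights = {"brightness": 0.2, "loudness": 0.2, "view": 0.2,
           "soundfield": 0.2, "attention": 0.2}

score = context_score(params, weights)
```

Computing such a score per candidate task and processing the highest-scoring one first is one plausible reading of how the decision adaptively reorders task priorities as the environment changes.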
In another aspect, the present invention also provides a multitask collaborative recognition system fusing audiovisual perception, characterized in that it comprises a generic feature extraction module, a collaborative feature learning module, and a context-adaptive feedback evaluation and recognition module;
the generic feature extraction module is used to establish, based on a time-synchronization matching mechanism for multi-source heterogeneous data, an association description model of multi-source heterogeneous data based on latent high-level shared semantics, and to extract the generic features of the multi-source heterogeneous data;
the collaborative feature learning module is used to establish a long-term dependency memory model and, combined with a collaborative attention mechanism based on external dependency, to continuously learn the generic features as prior knowledge, generating a memory model;
the context-adaptive feedback evaluation and recognition module is used to extract the environment perception parameters from the multi-source heterogeneous data, to establish a progressive-network deep collaborative enhanced recognition mechanism, and, combined with the learned features of the memory model and the task requirements, to achieve multitask recognition.
Further, the generic feature extraction module comprises a time-synchronization submodule and a shared semantic association feature description submodule;
the time-synchronization submodule is used to combine the low-level features of the multi-source heterogeneous data and, through a probability- and knowledge-driven framework, to establish a time-synchronization alignment mechanism for multi-source heterogeneous data with scale, translation, rotation, and time invariance;
the shared semantic association feature description submodule is used to establish, according to a semantic vectorization mechanism and a multi-source information association mining mechanism, the shared semantic features of the synchronously acquired multi-source heterogeneous data, and to extract generic feature streams.
Further, the collaborative feature learning module comprises a long-term dependency generative memory model submodule and a deep collaborative feature learning model submodule;
the long-term dependency generative memory model submodule is used to store the extracted generic features of the multi-source heterogeneous data as prior knowledge and, combined with long-term data dependency, to establish an external-memory generative model;
the deep collaborative feature learning model submodule is used, combined with the collaborative attention mechanism based on external dependency, to continuously learn the generic features as prior knowledge and to output recognition features as posterior knowledge, generating the memory model.
Further, the context-adaptive feedback evaluation and recognition module comprises a context-adaptive perceptual feedback evaluation system submodule and a deep collaborative enhanced joint recognition mechanism submodule;
the context-adaptive perceptual feedback evaluation system submodule is used to extract the environment perception parameters and, by organically fusing the environment perception parameters with the recognition features, to update the weighted hierarchy of the recognition tasks;
the deep collaborative enhanced joint recognition mechanism submodule is used to extract the generic feature description of the multi-source heterogeneous data according to the environment perception parameters and the weights of the recognition tasks, and to output the recognition results.
Advantageous effects of the present invention: compared with existing multitask collaborative recognition methods fusing audiovisual perception, the present method has better effectiveness and higher efficiency, and can provide valuable research results and theoretical and technical guidance for the further research and development of brain-inspired cognition theory and applications under future strong artificial intelligence. Specifically:
(1) Based on the generic feature description mechanism, the audio-visual media information acquired from different channels is given effective complementary support, evolving from the traditional single-source fixed mode to a multi-source elastic model, which not only effectively removes data redundancy but also learns a generic feature description.
(2) A deep collaborative feature learning mechanism with persistent memory is established for continuously input multi-source data; combined with long-term data dependency, an external-memory generative model is established, and learning network performance is enhanced through the external memory. On the one hand, the model parameter complexity is stabilized with a smaller data memory capacity; on the other hand, useful information can be extracted at once and applied to different types of sequence structures, solving the problem that complex, long sequential data cannot be selectively memorized and forgotten.
(3) Combining context-aware computing theory with environment perception, the weight of the tasks to be recognized is judged through deep reinforcement feedback, the priority of those tasks is adaptively adjusted according to environmental changes, and multiple audiovisual perception recognition results are output simultaneously.
Additional aspects and advantages of the present invention will be set forth in part in the following description; they will become apparent from that description, or be learned through practice of the invention.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a block diagram of the multitask recognition principle of the multitask collaborative recognition system fusing audiovisual perception according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the association feature description model based on shared semantics of the multitask collaborative recognition method fusing audiovisual perception according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the generative memory model combined with external dependency of the multitask collaborative recognition method fusing audiovisual perception according to an embodiment of the present invention.
Fig. 4 is a block diagram of the progressive deep collaborative enhanced feedback recognition mechanism under the context-adaptive framework according to an embodiment of the present invention.
Specific implementation mode
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or modules with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only used to explain the present invention and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include plural forms. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements, and/or modules, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, and/or groups thereof.
It should be noted that, unless otherwise specifically defined or limited in the embodiments of the present invention, terms such as "connected" and "fixed" are to be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediary; it may be an internal communication within two elements or an interaction between two elements, unless specifically limited otherwise. For those skilled in the art, the specific meanings of the above terms in the embodiments of the present invention can be understood according to the specific situation.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meanings as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless defined as herein, will not be interpreted in an idealized or overly formal sense.
To facilitate understanding of the embodiments of the present invention, further explanation is given below by way of specific embodiments in conjunction with the drawings; the embodiments do not constitute a limitation on the embodiments of the present invention.
Those of ordinary skill in the art should understand that the drawings are schematic diagrams of one embodiment; the parts or devices in the drawings are not necessarily essential to implementing the present invention.
Embodiment one
As shown in Fig. 1, embodiment one of the present invention provides a multitask collaborative recognition method and system fusing audiovisual perception.
A multitask collaborative recognition system fusing audiovisual perception disclosed in embodiment one of the present invention comprises:
a generic feature extraction module, used to establish the time-synchronization matching mechanism for multi-source heterogeneous data and to realize the multi-source data association description model based on latent high-level shared semantics, achieving efficient support and maximal information complementarity between data of different channels, and realizing the elimination of data redundancy;
a deep collaborative feature learning module, used to establish the long-term dependency generative memory model and to explore an autonomous semi-supervised continual learning system based on collaborative attention and depth, realizing dynamic self-learning with selective memory and forgetting ability and achieving incremental improvement over the performance of existing learning models;
an intelligent multitask deep collaborative enhanced feedback recognition module, based on the context-aware computing theory of intelligent-agent cooperation, introducing adaptive deep collaborative enhanced feedback combined with a multitask joint recognition mechanism, to solve the theoretical and technical problems of harmonious linking between audiovisual perception and the natural environment.
By studying an intelligent recognition demonstration platform with multi-node, multi-thread, multi-GPU distributed processing and employing bandwidth optimization algorithms, resources are called efficiently, the communication load between computing and storage devices is greatly reduced, and device resources can be extended on demand, providing hardware support for the efficient operation of the system.
The multitask coordinated identifying system of above-mentioned fusion audiovisual perception in preferred generic characteristic extracting module, including is used for
The submodule that multi-source heterogeneous data time synchronizes, including:Multi-source data processing mode requires accurately to examine in time-space domain simultaneously
The change information of target and scene is surveyed and tracks, and the time in actual acquired data between different modalities mismatches, and can not keep away
It causes effective information to lose and judge by accident with exempting from, causes damages to recognition result.Therefore, it is necessary to combine multi-source audio-visual media data
Intrinsic characteristics, by probability and Knowledge driving frame, research have scale, translation, rotation, time invariance isomeric data when
Between synchronization mechanism, reduce multi-data source between time uncertainty.
The multitask coordinated identifying system of above-mentioned fusion audiovisual perception, in preferred generic characteristic extracting module, including it is shared
Semantic association feature description submodule, including:Comprising rich in social activity, information, physical space different platform and modal data
Rich nature and social property, the dimension that has different characteristics and data distribution, but the synchronous multi-source data obtained is but shared
Similar semantic information, a large amount of potential incidence relation is contained.Therefore, it is necessary to explore the semantic vector of different modalities data
Mechanism, multi-source information association mining mechanism study potential shared semantic feature under audio-visual media different channels, it is regular to establish dimension
Association semanteme generalization feature description model.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred deep collaborative feature learning module includes a long-term-dependency generative memory model submodule, including: for long-duration, multi-sequence input feature streams, a learning mechanism without memory capability must continually label newly input data and relearn the network model on each new input, which is a huge waste of computation, storage and human resources and also hinders effective extraction of identification information. Therefore, it is necessary to combine long-term data dependencies to establish an external-memory generative model that enhances learning network performance through external memory: on the one hand it stabilizes model parameter complexity with a smaller data memory capacity, and on the other hand it can immediately extract useful information and apply it to different types of sequence structures, thereby solving the problem that complex, long sequential data cannot be selectively memorized and forgotten.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred deep collaborative feature learning module includes a deep collaborative feature learning model submodule, including: for a continuously input unlabeled feature stream, it is necessary to accurately and efficiently learn joint optimal features that minimize intra-class distance and maximize inter-class distance for multitask identification; since unlabeled data cannot be manually given class annotations, performance loss is otherwise inevitable. Therefore, it is necessary to combine a collaborative attention mechanism possessing long-term memory to establish a deep continual composite feature learning model, realizing autonomous selection of identification features, improving the discriminability of unlabeled data, and achieving dynamic refinement of model increments.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred intelligent multitask deep collaborative enhancement feedback identification module includes an environment-adaptive perceptual feedback evaluation system submodule, including: for scene uncertainty in audiovisual perception, environment perception parameters must be extracted, and the organic fusion of this parameter information provides adaptive feedback evaluation for the multitask identification system, realizing weighted identification of important identification tasks. For example, in a classroom, identifying pupil status and expression are the main identification tasks; in an outdoor scene, identifying targets and behaviors are the main identification tasks; and in a human-computer interaction scene, identifying voice and action are the main identification tasks.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred intelligent multitask deep collaborative enhancement feedback identification module includes a deep collaborative enhancement joint recognition mechanism submodule, including: for the demand of multitask coordinated identification in the current scene, the online input data stream must be processed while outputting multiple audiovisual recognition results. Therefore, it is necessary to establish a strongly generalized intelligent agent that, via feedback parameters and task weights, extracts generic feature descriptions and performs task-enhanced learning on the collaborative feature learning parameters, outputs correct recognition results, and gives the computer a certain "thinking and understanding" ability.
Embodiment 2
Embodiment 2 of the present invention provides a multitask identification method using the above system, the method including: generic feature description of massive multi-source audio-visual media perception data, including establishing a time synchronization matching mechanism for multi-source heterogeneous data and realizing a multi-source data association description model based on potential high-level shared semantics; deep collaborative feature learning with long-term memory for continuously input streaming media data, including establishing a long-term-dependency generative memory model and exploring an autonomous semi-supervised continual learning system based on collaborative attention and depth; and an intelligent multitask deep collaborative enhancement feedback identification model under an environment-adaptive framework, including an environment-adaptive perceptual computing theory based on intelligent agent cooperation, introducing an adaptive deep collaborative enhancement feedback and multitask joint recognition mechanism. Compared with existing multitask coordinated recognition methods fusing audiovisual perception, the present invention has better validity and efficiency, and can provide valuable research results and theoretical and technical guidance for the further research, development and application of cognitive machine theory under future strong artificial intelligence.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the generic feature description of massive multi-source audio-visual media perception data, the heterogeneous data time synchronization mechanism includes: multi-source data processing requires that the change information of targets and scenes be accurately detected and tracked in the time-space domain simultaneously, yet the times of different modalities in actually acquired data are mismatched, which inevitably causes loss of effective information and misjudgment, harming recognition results. Therefore, it is necessary to combine the intrinsic characteristics of multi-source audio-visual media data and, under a probability- and knowledge-driven framework, study a time synchronization mechanism for heterogeneous data possessing scale, translation, rotation and time invariance, so as to reduce the temporal uncertainty between data sources.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the generic feature description of massive multi-source audio-visual media perception data, the shared semantic association feature description model includes: data from different platforms and modalities in social, information and physical space contain rich natural and social attributes, with differing feature dimensions and data distributions, yet the synchronously acquired multi-source data share similar semantic information and contain a large number of potential association relations. Therefore, it is necessary to explore semantic vectorization mechanisms and multi-source information association mining mechanisms for different modality data, study the potential shared semantic features underlying the different channels of audio-visual media, and establish a dimension-regular associated semantic generalized feature description model.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the deep collaborative feature learning with long-term memory for continuously input streaming media data, the long-term-dependency generative memory model includes: for long-duration, multi-sequence input feature streams, a learning mechanism without memory capability must continually label newly input data and relearn the network model on each new input, which is a huge waste of computation, storage and human resources and also hinders effective extraction of identification information. Therefore, it is necessary to combine long-term data dependencies to establish an external-memory generative model that enhances learning network performance through external memory: on the one hand it stabilizes model parameter complexity with a smaller data memory capacity, and on the other hand it can immediately extract useful information and apply it to different types of sequence structures, thereby solving the problem that complex, long sequential data cannot be selectively memorized and forgotten.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the deep collaborative feature learning with long-term memory for continuously input streaming media data, the deep collaborative feature learning model includes: for a continuously input unlabeled feature stream, it is necessary to accurately and efficiently learn joint optimal features that minimize intra-class distance and maximize inter-class distance for multitask identification; since unlabeled data cannot be manually given class annotations, performance loss is otherwise inevitable. Therefore, it is necessary to combine a collaborative attention mechanism possessing long-term memory to establish a deep continual composite feature learning model, realizing autonomous selection of identification features, improving the discriminability of unlabeled data, and achieving dynamic refinement of model increments.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the intelligent multitask deep collaborative enhancement feedback identification model under the environment-adaptive framework, the environment-adaptive perceptual feedback evaluation system includes: for scene uncertainty in audiovisual perception, environment perception parameters must be extracted, and the organic fusion of this parameter information provides adaptive feedback evaluation for the multitask identification system, realizing weighted identification of important identification tasks. For example, in a classroom, identifying pupil status and expression are the main identification tasks; in an outdoor scene, identifying targets and behaviors are the main identification tasks; and in a human-computer interaction scene, identifying voice and action are the main identification tasks.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the intelligent multitask deep collaborative enhancement feedback identification model under the environment-adaptive framework, the deep collaborative enhancement joint recognition mechanism includes: for the demand of multitask coordinated identification in the current scene, the online input data stream must be processed while outputting multiple audiovisual recognition results. Therefore, it is necessary to establish a strongly generalized intelligent agent that, via feedback parameters and task weights, extracts generic feature descriptions and performs task-enhanced learning on the collaborative feature learning parameters, outputs correct recognition results, and gives the computer a certain "thinking and understanding" ability.
Embodiment 3
As shown in Figure 1, Embodiment 3 of the present invention provides a multitask coordinated recognition method fusing audiovisual perception.
First, a generic feature description method for multi-source audio-visual media perception data, established by a migration-based algorithm, is explored.
In order to realize efficient collaborative analysis for different audio-visual tasks, feature descriptions with high robustness and versatility are extracted from multi-source audio-visual perception data as prototype features for subsequent collaborative learning; it is first necessary to analyze the characteristics of audio-visual perception data. Actually acquired audio data are mostly one-dimensional time series whose descriptiveness is mainly embodied in spectrum-time cues, requiring description via a spectrum transform in an auditory-perception-like domain combined with the prosodic information of consecutive audio frames. Visual perception data are mostly two- or three-dimensional images or video sequences whose descriptiveness is mainly embodied in variations of the visual field and spatial domain, requiring many-sided characteristics such as color, depth, scale and rotation to be taken into consideration. The cross-modal shared semantic features of audio-visual perception data need to possess time, scale, rotation and translation invariance.
For the multichannel, multi-scale, multi-modal characteristics of audio-visual perception data, the generalized feature description of the present invention consists of the following key steps: multi-source perceptual low-level feature description, cross-media data time synchronization matching, multi-feature channel association learning model, and migration feature fusion.
Multi-source perceptual low-level feature description:
For the multi-source, cross-media, multichannel feature acquisition of audio-visual perceptual signals, low-level feature descriptions are extracted from audio and video data respectively. For audio signals, waveform sampling preprocessing is first performed, a spectrum transform is then carried out, and, combined with prosodic features, a regularized spectrogram is constructed as the low-level feature. For two-dimensional video signals, a spectrum transform is first performed and co-occurrence statistics are introduced to obtain a two-dimensional temporal signal with rotation and translation invariance. For three-dimensional video sequences, a multi-scale-theory low-level feature extraction technique with fast scale-space transforms is introduced, followed by spectrum transform and co-occurrence statistics, to generate temporal pyramid spectrum features.
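The spectro-temporal low-level description above can be sketched minimally. The following pure-Python example (a naive DFT rather than an optimized FFT, with an assumed Hann window; frame length and hop are illustrative, not values specified by the method) frames a signal and produces per-frame magnitude spectra, the raw material of the spectrogram features described:

```python
import cmath
import math

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    frames = []
    for start in range(0, max(len(x) - frame_len, 0) + 1, hop):
        frames.append(x[start:start + frame_len])
    return frames

def dft_magnitude(frame):
    """Naive DFT magnitude spectrum (first half of the bins)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(x, frame_len=64, hop=32):
    """Low-level spectro-temporal description: Hann-windowed frames -> magnitude spectra."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    spec = []
    for fr in frame_signal(x, frame_len, hop):
        fr = fr + [0.0] * (frame_len - len(fr))  # zero-pad a short tail frame
        fr = [a * w for a, w in zip(fr, window)]
        spec.append(dft_magnitude(fr))
    return spec

# A pure sine at DFT bin 8: each frame's energy should concentrate there.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(signal)
```

In a real pipeline this windowed-transform step would be followed by the auditory-domain conversion and prosodic features that the description combines it with.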
Cross-media data time synchronization matching:
Audio-visual multitask perception requires accurate detection and tracking of targets in the time-space domain, so time alignment between multimedia data streams must be realized. In order to realize nonlinear alignment of heterogeneous data streams, dynamic time warping is first used to achieve optimal alignment of temporal signals. A coded concept stream is established for the data stream of each channel; serving as the semantic coding of complex events, all newly input low-level feature streams are dynamically time-aligned with reference to this semantic coding, and a time translation function is generated to realize semantic alignment.
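As a concrete sketch of the alignment step (a textbook dynamic-time-warping routine, not the patent's full probability- and knowledge-driven mechanism; the two toy feature streams are hypothetical), the following pure-Python example nonlinearly aligns two one-dimensional feature sequences:

```python
def dtw_alignment(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping: returns (total cost, warping path)."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    # Backtrack the optimal warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        moves = []
        if i > 1 and j > 1:
            moves.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 1:
            moves.append((cost[i - 1][j], i - 1, j))
        if j > 1:
            moves.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path

# A feature stream and a time-stretched copy of it align with zero cost.
audio_feats = [0.0, 1.0, 2.0, 3.0, 2.0, 0.0]
video_feats = [0.0, 1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 0.0]
total, path = dtw_alignment(audio_feats, video_feats)
```

The warping path plays the role of the "time translation function" mentioned above: it tells, for each point of one stream, which points of the other stream it corresponds to.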
The multi-feature channel association learning model includes: since similar high-level semantic structure information is shared between the media of different channels, in order to effectively quantify the shared information of features of different dimensions, extract the generic feature descriptions with maximal discriminability across multiple audio-visual tasks, increase inter-class distance and reduce intra-class distance, a joint learning model of heterogeneous features must be established. Suppose there are S classes of heterogeneous features; each feature type X_i is denoted as the feature matrix of n_i training samples, the data noise part is E, and Γ is a rotation factor. The joint heterogeneous feature learning model under the multitask framework is intended to learn a projection matrix Θ_i for each X_i. The heterogeneous feature matrices are projected to an equal feature dimensionality, reducing the redundancy of the multi-feature data; the optimization function under the orthogonality constraint is expressed as:
The heterogeneous feature learning model is intended to jointly learn the general semantic feature subspaces {Θ_i}, the shared matrix W_0 and the specific feature module matrices {W_i} under a unified framework, using least squares to solve for the joint optimal solution of the prediction loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}) and the regularization function R_3(W_0, {W_i}). By projecting newly input data into the feature subspace, high-level generic feature descriptions of equal dimension are extracted and shared semantic association relations are established, as shown in Figure 2.
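As an illustrative sketch of the projection step only (not the patent's actual optimization: the Θ_i below are hand-fixed toy matrices, whereas the described model learns them under an orthogonality constraint), the following pure-Python example shows two heterogeneous feature types of different dimensionality being mapped into a common k-dimensional semantic subspace:

```python
def matmul(A, B):
    """Plain-Python matrix product (rows of A times columns of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Two heterogeneous feature types (hypothetical toy data, samples x dims):
# audio features with d1 = 4 dims, visual features with d2 = 3 dims.
X_audio = [[1.0, 0.0, 2.0, 1.0],
           [0.0, 1.0, 1.0, 2.0]]
X_visual = [[0.5, 1.5, 0.0],
            [1.0, 0.0, 1.0]]

# Hand-fixed projection matrices Theta_i (d_i x k) mapping both feature
# types into the same k = 2 dimensional subspace.
Theta_audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.5]]
Theta_visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

Z_audio = matmul(X_audio, Theta_audio)    # samples x k
Z_visual = matmul(X_visual, Theta_visual)  # samples x k

# After projection, both modalities live in one equal-dimension space,
# where shared semantic associations can be established.
assert len(Z_audio[0]) == len(Z_visual[0]) == 2
```

The point of the sketch is the shape bookkeeping: once every X_i is projected through its own Θ_i, the downstream shared matrix W_0 and per-type matrices {W_i} can operate on a common dimensionality.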
Migration feature fusion learning:
For the problem of limited training samples in massive data, a transfer learning model is introduced to enhance the autonomous annotation learning ability of unlabeled data. The unlabeled data set is taken as the annotation target set of the transfer learning; by providing strong prior information, the target set and the supplement set jointly optimize autonomous feature annotation learning through {Θ_i}. Denoting the sample feature descriptions and annotation information of the supplement set and of the target set, the migration joint learning model is expressed as follows:
where F(·) is the objective function of the model. Using a stage-wise optimization algorithm to solve the three above optimization problems, a unified generic feature description of audio-visual media is obtained.
Under this model, a migration-based algorithm realizes the generic feature description of multi-source audio-visual media perception data. According to the different modalities of the perception data, combined with the application environment of the perception identification task, a generic feature description model based on high-level shared semantics is established. On this basis, according to the comprehensive limitations of constraints such as feature dimensionality, computation delay, time alignment and frame rate, a joint heterogeneous optimization method for multi-source data is used to extract the shared semantic information of different feature information. The theoretical research of the relevant scheme is completed through theoretical modeling, mathematical derivation and optimization algorithm solution, and the simulation verification work of the new scheme is further completed through tools such as a mathematical simulation platform.
After the method described in Embodiment 3 of the present invention completes the generic feature description for multi-source audio-visual media perception data, it continues to explore establishing a sustainable deep collaborative feature learning mechanism using generative memory model dynamics: a temporal generative model enhanced by an external memory system which, under a variational reasoning framework, stores the effective information of memory feature descriptions from the early stage of a sequence and performs efficient, sustainable collaborative reuse of the stored information.
The generic feature description process can well fuse the time-space-domain identification information in audio-visual media perception data. Next, starting from the basic theory of generative memory models and long-term collaborative dependency, and aiming at the compatibility, intelligence and flexibility requirements of audio-visual perception identification tasks, a temporal generative model enhanced by an external memory system and a collaborative feature learning algorithm are studied. In general, for continuously input audiovisual streaming media data, long-range dependencies based on time intervals and past observations separate the predictable and unpredictable elements of a long time series; uncertainty is represented for the unpredictable elements, and rapid identification can assist in predicting new future elements.
The temporal generative model includes a generic feature description set e_{≤T} = {e_1, e_2, ..., e_T} and a corresponding hidden variable set z_{≤T} = {z_1, z_2, ..., z_T}. A transition map h_t = f_h(h_{t-1}, e_t, z_t) corrects the deterministic hidden state variable h_t at each time point; a prior mapping function f_z(h_{t-1}) describes the nonlinear dependence of past observations and hidden variables and provides the distribution parameters of the hidden variables; and a nonlinear observation mapping function f_e(z_t, h_{t-1}) provides the likelihood function depending on hidden variables and state. In the present invention, an external memory model corrects the temporal variational autoencoder, generating a memory text Ψ_t at every time point; the prior and posterior probabilities are expressed as follows:
Prior information: p_θ(z_t | z_{<t}, e_{<t}) = N(z_t | f_z^μ(Ψ_{t-1}), f_z^σ(Ψ_{t-1}))
Posterior information: q_φ(z_t | z_{<t}, e_{≤t}) = N(z_t | f_q^μ(Ψ_{t-1}, e_t), f_q^σ(Ψ_{t-1}, e_t))
where the prior is a diagonal Gaussian distribution function of the memory text depending on the prior mapping f_z, and the diagonal Gaussian approximate posterior distribution depends on the memory text Ψ_{t-1} and the current observation e_t, associated through the posterior mapping function f_q.
As shown in Figure 3, a stochastic computation graph is used as the processing procedure of the memory-sequence generative model. In order to give the structure higher versatility and flexibility for different perception tasks, the present invention introduces a high-level semantic memory and controller architecture to stably store extracted information for the future and to perform the corresponding computation to extract usable information immediately. Specifically, unlike past first-in-first-out memory buffering, an associative adaptation mode page theory close to the human cognitive process is used: salient audio-visual time zones relevant to the task are formed from generic feature descriptions, the timing memory bias generated by generic features under the influence of the task is computed, and task-relevant adaptive perceptual attention time zones are generated from this bias and the generic features. The versatility of this memory structure is embodied in allowing information positions to be read and written at any time.
The controller uses a long short-term memory network (LSTM) f_rnn to advance the state history h_t; the external memory M_t is generated using the hidden variable of the previous moment and external context information c_t. The generation model is as follows:
State update: (h_t, M_t) = f_rnn(h_{t-1}, M_{t-1}, z_{t-1}, c_t)
In order to form R items of content information retrieved from memory M_t, the controller generates a set of key values and, using cosine similarity evaluation, compares each key k_t^r with each row of memory M_{t-1} to generate soft attention weight sets; the retrieved memory φ_t^r is obtained by the weighted sum of the attention weights and memory M_{t-1}:
Key value: k_t^r, generated by the controller from the state history;
Attention mechanism: soft weights a_t^r obtained from the cosine similarities cos(k_t^r, M_{t-1}[i]), sharpened by a retrieval bias;
Retrieval memory: φ_t^r = Σ_i a_{t,i}^r · M_{t-1}[i];
Generate memory: M_t is written with the hidden variable z_t;
where the retrieval bias is a correlation value learned by retrieving memory, and σ(·) is the sigmoid function. Thereby, the external memory M_t is used to store the hidden variables z_t, and the controller forms the representation mechanism informing memory storage and retrieval, Ψ_t = [φ_t^1, φ_t^2, ..., φ_t^R, h_t]. This is the output of the generative memory model; for audio-visual multitask collaborative feature learning in which task definitions and task numbers are unknown, it can realize unsupervised feature learning of continuously input data streams.
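The cosine-similarity read described above can be sketched as follows (toy dimensions; a hand-picked sharpening constant stands in for the learned retrieval bias, and softmax normalization is an assumed, common choice for turning similarities into soft weights):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def retrieve(key, memory, sharpness=5.0):
    """Soft attention read: cosine-match the key against every memory row,
    sharpen and normalize the similarities into weights, and return the
    weighted sum of rows -- the retrieved memory phi."""
    sims = [cosine(key, row) for row in memory]
    weights = softmax([sharpness * s for s in sims])
    phi = [sum(w * row[d] for w, row in zip(weights, memory))
           for d in range(len(memory[0]))]
    return phi, weights

# Toy memory M_{t-1} with three stored rows; the key nearly matches row 1,
# so the read should concentrate its weight there.
M = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
key = [0.1, 0.9, 0.0]
phi, weights = retrieve(key, M)
```

In the described architecture the controller would issue R such keys per step and concatenate the R retrieved vectors with h_t to form Ψ_t.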
Under this model structure, the generative memory model addresses the processing demands of the corresponding multitask concurrent identification; according to the different tasks of audio-visual perception identification, combined with complicated and changeable application environments, a deep collaborative feature learning mechanism is established. On this basis, according to the comprehensive limitations of constraints such as timing memory, long-range dependency and collaborative attention regionality, a time-space-domain association optimal learning method is used to realize a deep collaborative feature learning method with long-term selective memory and forgetting ability. The theoretical research of the relevant scheme is completed through prior hypothesis, posterior reasoning and collaborative optimization solution, and the simulation verification work of the new scheme is further completed through tools such as an algorithm simulation platform.
After the method described in Embodiment 3 of the present invention completes the generic feature description for multi-source audio-visual media perception data and the sustainable deep collaborative feature learning, for the problems that scenes are complicated and changeable in the audio-visual multitask perception identification process and that the intelligent agent needs to handle multiple tasks simultaneously, a collaborative enhancement environment-adaptive computing theory based on audio-visual perceptual parameter feedback is studied, so as to solve the theoretical and technical problem of harmoniously connecting audio-visual perception with the natural environment. It mainly includes the following three parts of research content: 1) environment-adaptive perceptual parameter extraction; 2) the deep collaborative enhancement recognition mechanism of progressive networks; 3) a distributed intelligent demonstration system.
Environment-adaptive perceptual parameter extraction includes: inspired by the ability of organisms to effectively adapt to their environment, environment-adaptive computing theory interacts with the environment through an audio-visual perceptual parameter feedback mechanism and learns the optimal policy for multitask identification by maximizing accumulated reward. The extracted environment-adaptive perceptual parameters are as follows:
Brightness perceptual parameter: obtained by calculating the normalized distance between the pixel average of the image/video and standard brightness information;
Loudness perceptual parameter: obtained by calculating the normalized distance between the sound intensity average of the input audio and standard sound intensity information;
Viewing-angle perceptual parameter: obtained from the average information contained in high-frequency images; the larger the value, the richer the image detail information, i.e. the better the viewing angle;
Sound-field perceptual parameter: calculated from the average energy of the transfer function from the sound source to the ear;
Attention perceptual parameter: represented by the attention rule parameter of the audio-visual attention time zone in collaborative feature learning.
The dynamic changes of complex scenes can cause phenomena such as lighting changes, viewing-angle deflection and sound-field drift that seriously affect the performance of perception identification results. Therefore, environment-adaptive perception decision judgment should not rely on a single perceptual parameter alone; the weighted sum of the above five perceptual parameter values should be fully utilized as the integrated decision for environment-adaptive perception self-adaptive feedback.
The deep collaborative enhancement recognition mechanism of progressive networks includes: the weighted sum of the perceptual parameters serves as the environment-adaptive decision, and a progressive network collaborative recognition mechanism is established. By storing migration knowledge layer by layer and extracting valuable reward features, the mechanism decides the identification task currently to be handled, solving the problem of migrating knowledge from a simulated environment to the true environment.
As shown in Figure 4, a simple progressive network is described, in which a is an adaptive adapter whose role is to keep the hidden-layer activation values of the preceding columns consistent with the dimensionality of the original input. Its composition process is as follows:
the 1st column constructs a deep neural network to train a first task;
in order to train the 2nd task, the activation values of each hidden layer of that network are processed by the adapter and connected to the respective layers of the 2nd-column neural network as additional input;
in order to train the 3rd task, the parameters of the first two columns are fixed, the activation values of each hidden layer of the first two columns are processed by the adapter, and their combination is connected to the respective layers of the 3rd-column neural network as additional input.
If there are more task requirements, this continues by analogy. All of the above networks train their parameters through the UNREAL algorithm.
Migration knowledge is stored in a layer-by-layer progressive manner and valuable reward features are extracted to complete the knowledge migration. For a new task, the hidden-layer states of the previously trained models are retained during training, and the rewards useful to each hidden layer of the previous columns are combined layer by layer in the network, so that the transfer learning possesses a long-term-dependency prior and forms a complete policy for the final goal.
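The column-and-adapter wiring can be illustrated with a toy forward pass (hand-fixed weights, no UNREAL training, and the lateral combination is simplified to an additive adapter; real progressive networks concatenate or otherwise combine the adapted activations):

```python
def dense(x, W, b):
    """One fully connected layer with ReLU."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

class ProgressiveColumn:
    """One column of a progressive network: a stack of layers. Frozen
    earlier columns feed each layer's activation, via an adapter, into
    the corresponding layer of newer columns."""
    def __init__(self, layers, adapters=None):
        self.layers = layers            # list of (W, b) per layer
        self.adapters = adapters or []  # one (W, b) per lateral connection

    def forward(self, x, lateral_acts=None):
        acts, h = [], x
        for depth, (W, b) in enumerate(self.layers):
            h = dense(h, W, b)
            if lateral_acts is not None:
                # The adapter keeps the previous column's activation at
                # the right dimensionality before adding it as extra input.
                aW, ab = self.adapters[depth]
                lateral = dense(lateral_acts[depth], aW, ab)
                h = [hi + li for hi, li in zip(h, lateral)]
            acts.append(h)
        return h, acts

# Column 1 is trained on task 1 and frozen; column 2 reuses its features.
col1 = ProgressiveColumn([([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])])
col2 = ProgressiveColumn([([[0.5, 0.5], [0.5, -0.5]], [0.0, 0.0])],
                         adapters=[([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])])
x = [1.0, 2.0]
_, acts1 = col1.forward(x)
out2, _ = col2.forward(x, lateral_acts=acts1)
```

The point of the structure is visible even in the toy: column 2's output depends both on its own weights and, through the adapter, on what column 1 already learned.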
The distributed intelligent demonstration system includes: the intelligent demonstration system is built using a multi-agent, multi-GPU distributed co-processing system on multi-node high-performance computing. During data training, the intelligent agent composed by each GPU holds one complete copy of the network model, and each iteration is assigned only a subset of the samples. The GPUs average the gradients computed by the different GPUs through mutual communication, apply the average gradient to the weights to obtain new weights, and once a GPU completes its own iteration it must wait for all other GPUs to complete, to ensure the weights are properly updated. This is equivalent to processing SGD on a single GPU, but with the data distributed to multiple GPUs for concurrent computation, thereby gaining computation speed. Here the distributed algorithm is streamlined with high-performance-computing techniques, and a bandwidth-optimized ring is used to solve the inter-GPU communication problem.
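The synchronous data-parallel scheme just described can be sketched in a few lines (a toy one-parameter model in place of a real network; the explicit averaging step stands in for what a bandwidth-optimized ring all-reduce would compute across GPUs):

```python
def local_gradient(weights, batch):
    """Per-worker gradient of mean squared error for the toy model y = w*x."""
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return [g]

def sync_sgd_step(weights, shards, lr=0.1):
    """One synchronous data-parallel step: each 'GPU' computes a gradient
    on its own data shard, the gradients are averaged (the all-reduce),
    and every replica applies the same update to its weight copy."""
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = [sum(g[i] for g in grads) / len(grads)
           for i in range(len(weights))]
    return [w - lr * a for w, a in zip(weights, avg)]

# Data drawn from y = 3x, split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)],
          [(3.0, 9.0), (4.0, 12.0)]]
w = [0.0]
for _ in range(200):
    w = sync_sgd_step(w, shards)
```

Because every replica sees the same averaged gradient, all copies of the model stay identical after each step, which is exactly the property the waiting (synchronization barrier) in the description guarantees.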
In conclusion the multitask coordinated recognition methods for merging audiovisual perception described in the embodiment of the present invention and system, phase
For the prior art, there is better multi-source, Dynamic persistence and space-time transformation.The number when processing multi-source is long
It is especially good according to upper effect.Specifically, having following features:
Multi-source: for the characteristics of multi-source audio-visual media perception data, a generic feature description mechanism is established in which the audio-visual media information obtained from different channels provides effective complementary support, evolving from the traditional single-source fixed mode to a multi-source elastic model that both effectively removes data redundancy and provides versatile feature descriptions for learning.
Dynamic persistence: audiovisual tasks have time-space-domain variation characteristics; conventional methods can only process preset demands, cannot perform effective long-term memory reasoning on the data already learned, and struggle to balance lightweight learning networks with high utilization. Meanwhile, when burst tasks or target data are added, over-fitting and network parameter fragmentation result. Therefore, the deep collaborative feature learning mechanism with lasting audio-visual feature memory, established for continuously input data, has a high dynamic reception rate, high resource utilization and low network consumption rate.
Space-time adaptability: in order to still maintain optimal perception identification performance under the space-time variation of complex scenes, the adaptive feedback mechanism of environment-adaptive perception should be used to realize dynamic computational adjustment to the changing environment under the environment-adaptive framework, so as to achieve the optimal adaptation effect of intelligent multitask coordinated enhancement feedback identification under massive data storage.
Integrating the above research content, a complete intelligent demonstration system is built, realizing output from audio-visual perception data acquisition through to multitask coordinated identification results, and providing a standard platform for subsequent in-depth research and productization. The experimental methodology considers features such as the high efficiency, dynamism and intelligence of audio-visual perception in multitask coordinated analysis and, combined with software engineering design specifications, uses object-oriented programming methods to design an easily extended demonstration system.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be realized by software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The foregoing is merely a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.
Claims (10)
1. A multi-task recognition method fusing audio-visual perception, characterized by comprising the following steps:
Step S110: generic feature description of multi-source heterogeneous data: based on a time-synchronization matching mechanism for multi-source heterogeneous data, establishing a multi-source heterogeneous data association description model based on latent high-level shared semantics, and extracting generic features of the multi-source heterogeneous data;
Step S120: deep collaborative feature learning with long-term memory: establishing a long-term dependency memory model and, in combination with an externally dependent collaborative attention mechanism, performing continuous learning on the generic features of the multi-source heterogeneous data as prior knowledge to generate a memory model;
Step S130: task discrimination based on a context-adaptive feedback evaluation mechanism: extracting environment perception parameters from the multi-source heterogeneous data, establishing a progressive network depth collaborative enhanced recognition mechanism, and, in combination with the learned features of the memory model and the task requirements, realizing multi-task recognition fusing audio-visual perception.
2. The multi-task recognition method fusing audio-visual perception according to claim 1, characterized in that in step S110, the time-synchronization matching mechanism for the multi-source heterogeneous data comprises:
extracting low-level feature streams of the multi-source heterogeneous data; establishing a coded concept stream for the data of each channel as the reference semantic coding of a complex event; and performing dynamic time warping between the low-level feature streams and the reference semantic coding to produce a time-warping function that realizes semantic alignment; wherein
extracting the low-level feature streams of the multi-source heterogeneous data comprises:
for an audio signal, first performing waveform-sampling pre-processing, then performing spectrum conversion, and building a spectrogram in combination with prosodic features;
for a two-dimensional video signal, first performing spectrum conversion, then introducing co-occurrence statistics to obtain a two-dimensional timing signal with rotation and translation invariance;
for a three-dimensional video sequence, introducing a low-level feature abstraction technique that performs fast scale-space transformation based on multi-scale theory, then performing spectrum transformation and co-occurrence statistics to generate temporal pyramid spectrum features.
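As a concrete illustration of the dynamic time warping step described above, the following sketch computes the warping cost between a reference semantic-coding stream and a feature stream and recovers the alignment path (the time-warping function). The function name, 1-D streams, and step penalties are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def dtw_align(ref, query):
    """Dynamic time warping between a reference semantic-coding stream
    and a low-level feature stream (1-D sequences for brevity)."""
    n, m = len(ref), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - query[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # insertion
                                 cost[i, j - 1])      # deletion
    # Backtrack to recover the time-warping function (alignment path).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(cost[n, m]), path[::-1]

# A repeated sample in the query aligns at zero cost.
dist, path = dtw_align([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

The returned path plays the role of the semantic-alignment function: each pair maps a reference index to a feature-stream index.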
3. The multi-task recognition method fusing audio-visual perception according to claim 2, characterized in that establishing the multi-source heterogeneous data association description model based on latent high-level shared semantics and extracting the generic features of the multi-source heterogeneous data comprise:
learning a projection matrix Θ_i for each feature type X_i so as to project the heterogeneous features to an equal feature dimensionality; through joint learning of the generic semantic feature subspaces {Θ_i}, the shared matrix W_0 under a unified framework, and the specific feature module matrices {W_i}, computing the joint optimal solution of the loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}) and the regularization function R_3(W_0, {W_i}), thereby establishing heterogeneous feature learning with shared semantics;
specifically,
for S classes of heterogeneous features, the feature matrix of the n_i training samples is denoted accordingly, the data-noise part is E, and Γ is a rotation factor; the optimization function established under the orthogonality constraint is:
wherein λ denotes the shared-matrix coefficient, T denotes matrix transposition, Y_i denotes the i-th feature class label, F denotes the Frobenius norm, Θ_i^T denotes the transpose of the projection matrix Θ_i, α, β, μ_1 and μ_2 are multiplier factors, rank(X) is the rank of the feature matrix X, and E is the noise matrix;
self-learned labels are integrated as transfer learning, migrating labels from labeled to unlabeled data within the multi-source heterogeneous data; the unlabeled data are denoted as the object set and the labeled data as the supplement set, and the object set performs autonomous feature-label learning through joint optimization with {Θ_i}; denoting the supplement-set sample feature descriptions and label information, and the object-set sample feature descriptions and label information, respectively, the transfer self-labeling learning model is expressed as follows:
wherein F(·) is the objective function and ρ is a multiplier factor; the transfer self-labeling learning model is solved by a three-stage optimization algorithm to obtain the generic feature description.
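The core idea of the claim — projecting heterogeneous feature types into a subspace of equal dimensionality via per-type projection matrices Θ_i — can be sketched as follows. This is a simplified stand-in: each Θ_i here is taken from an SVD of X_i rather than from the joint optimization of {Θ_i}, W_0 and {W_i} described in the claim, and all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_to_shared_space(feature_sets, dim):
    """Project each heterogeneous feature matrix X_i (n x d_i) into a
    common dim-dimensional subspace with its own projection matrix
    Theta_i (here: top right singular vectors of X_i)."""
    projected = []
    for X in feature_sets:
        # Vt rows are right singular vectors; Theta_i is d_i x dim.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        Theta = Vt[:dim].T
        projected.append(X @ Theta)
    return projected

audio = rng.normal(size=(50, 40))    # e.g. 40-D audio features
video = rng.normal(size=(50, 128))   # e.g. 128-D video features
shared = project_to_shared_space([audio, video], dim=16)
```

After projection both modalities live in the same 16-D space, which is the precondition for learning a shared matrix W_0 across them.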
4. The multi-task recognition method fusing audio-visual perception according to claim 3, characterized in that the deep collaborative feature learning with long-term memory comprises:
the long-term dependency memory model comprises a generic feature description set e_{≤T} = {e_1, e_2, …, e_T} and a corresponding latent variable set z_{≤T} = {z_1, z_2, …, z_T}; a transition map h_t = f_h(h_{t-1}, e_t, z_t) corrects the deterministic hidden-state variable h_t at each time point; the prior mapping function f_z(h_{t-1}) describes the nonlinear dependency on past observations and latent variables and provides the latent-variable distribution parameters; the nonlinear observation mapping function f_e(z_t, h_{t-1}) provides the likelihood function depending on the latent variables and the state; the sequential variational autoencoder is modified with an external memory model to generate a memory text ψ_t at each time point, and the prior information and posterior information are obtained as follows:
Prior information:
Posterior information:
wherein f_z^μ is the transition mapping function for the μ parameter of the latent variable z, f_z^σ is the transition mapping function for the σ parameter of the latent variable z, f_q^μ is the transition mapping function for the μ parameter of the posterior probability q, and f_q^σ is the transition mapping function for the σ parameter of the posterior probability q; the prior information is a diagonal Gaussian distribution function of the memory text depending on the prior map f_z, and the diagonal Gaussian approximate posterior distribution depends on the memory text Ψ_{t-1} associated through the posterior mapping function f_q and on the current observation e_t.
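A minimal sketch of the diagonal-Gaussian prior and posterior parameterization described in the claim, assuming small tanh networks for the mapping functions f_z and f_q. The dimensions, weight shapes, and the choice of networks are invented for illustration; only the structure (prior conditioned on h_{t-1}, posterior conditioned additionally on the memory text Ψ_{t-1} and observation e_t) follows the claim:

```python
import numpy as np

rng = np.random.default_rng(0)

H, Z = 8, 4  # hidden-state and latent dimensions (illustrative)
Wmu = rng.normal(size=(H, Z))
Wsig = rng.normal(size=(H, Z))
Wq = rng.normal(size=(H + H + Z, 2 * Z))  # posterior sees more inputs

def prior(h_prev):
    """Diagonal Gaussian prior for z_t given h_{t-1} (role of f_z)."""
    mu = np.tanh(h_prev @ Wmu)
    sigma = np.exp(0.5 * np.tanh(h_prev @ Wsig))  # strictly positive
    return mu, sigma

def posterior(h_prev, memory_text, e_t):
    """Diagonal Gaussian posterior given h_{t-1}, the previous memory
    text Psi_{t-1}, and the current observation e_t (role of f_q)."""
    x = np.concatenate([h_prev, memory_text, e_t])
    out = np.tanh(x @ Wq)
    return out[:Z], np.exp(0.5 * out[Z:])

mu_p, sig_p = prior(np.zeros(H))
mu_q, sig_q = posterior(np.zeros(H), np.zeros(H), np.zeros(Z))
```

With all-zero inputs both distributions collapse to a standard-normal-like parameterization (zero mean, unit scale), which is a convenient sanity check.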
5. The multi-task recognition method fusing audio-visual perception according to claim 4, characterized in that the deep collaborative feature learning with long-term memory further comprises:
using cooperative-mode perception theory, computing the temporal memory bias generated by the generic features under the influence of tasks, and generating, from the temporal memory bias and the generic features, adaptive perception attention time regions relevant to the recognition task;
using a long short-term memory network (LSTM) f_rnn to advance the state history h_t; the external memory M_t is generated from the latent variable of the previous moment and the external text information c_t, with the state update model:
State update: (h_t, M_t) = f_rnn(h_{t-1}, M_{t-1}, z_{t-1}, c_t)
to form the r items of content information retrieved from the memory M_t, a key value is introduced; by cosine-similarity evaluation the key is compared with each row of the memory M_{t-1} to generate attention weights, and the retrieved memory is obtained as the weighted sum of the attention weights and the memory M_{t-1}; wherein,
Key value:
Attention mechanism:
Retrieval memory:
Generate memory:
wherein the key-value function of the r items advances the state history, f_att is the attention-mechanism function, the memory weight is that of the i-th slot of the r items at time t, the retrieval-memory equation yields the retrieved result, ⊙ denotes element-wise multiplication, the offset-related value is learned through memory retrieval, and σ(·) is the sigmoid function; together these form the representation mechanism informing memory storage and retrieval, whose result serves as the output of the generative memory model.
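The cosine-similarity memory read in this claim — compare a key with each memory row, normalize into attention weights, and return the weighted sum of rows as the retrieved memory — can be sketched as follows. The slot count, dimensions, and the use of a softmax for normalization are assumptions for illustration:

```python
import numpy as np

def cosine(key, M):
    """Cosine similarity between a key vector and each row of memory M."""
    return (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_memory(key, M):
    """Attention read: similarity -> attention weights -> weighted sum."""
    w = softmax(cosine(key, M))   # one attention weight per memory slot
    return w @ M, w               # retrieved memory and its weights

# Three 2-D memory slots; the key matches slot 0 most closely.
M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
phi, w = read_memory(np.array([1.0, 0.0]), M)
```

The retrieved vector `phi` corresponds to the weighted sum of attention weights and memory described in the claim.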
6. The multi-task recognition method fusing audio-visual perception according to claim 5, characterized in that extracting the environment perception parameters from the multi-source heterogeneous data and establishing the progressive network depth collaborative enhanced recognition mechanism comprise:
obtaining a brightness perception parameter by computing the normalized distance between the pixel average of the image/video and standard luminance information; obtaining a loudness perception parameter by computing the normalized distance between the average sound intensity of the input audio and standard sound-intensity information; computing a viewpoint perception parameter from the average information content of the high-frequency image, a larger value indicating richer image detail and hence a better viewpoint; computing a sound-field perception parameter from the average energy of the transfer function from the sound source to the ear; and representing an attention perception parameter by the attention-rule parameter of the audio-visual attention time regions in heterogeneous feature learning;
taking the weighted sum of the brightness perception parameter, the loudness perception parameter, the viewpoint perception parameter, the sound-field perception parameter and the attention perception parameter as the context-adaptive decision, and establishing a progressive network depth collaborative enhanced recognition mechanism that stores transferred knowledge layer by layer and extracts reward features to decide the recognition task currently to be processed.
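A toy version of the context-adaptive decision in this claim: each perception parameter is normalized to [0, 1] and the decision score is their weighted sum. The equal weights and the particular brightness normalization are assumptions for illustration, not values from the patent:

```python
import numpy as np

def brightness_param(frame, standard=0.5):
    """Normalized distance between mean pixel intensity and a reference
    luminance; pixel values assumed in [0, 1], larger = closer match."""
    return 1.0 - abs(float(frame.mean()) - standard)

def context_decision(brightness, loudness, viewpoint, soundfield, attention,
                     weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Weighted sum of the five perception parameters, used as the
    context-adaptive decision score (equal weights are an assumption)."""
    params = (brightness, loudness, viewpoint, soundfield, attention)
    return sum(w * p for w, p in zip(weights, params))

frame = np.full((4, 4), 0.5)   # a frame exactly at standard luminance
score = context_decision(brightness_param(frame), 0.8, 0.6, 0.7, 0.9)
```

In the mechanism described above, this score would then drive which recognition task is prioritized for the current input.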
7. A multi-task cooperative recognition system fusing audio-visual perception, characterized by comprising a generic feature extraction module, a collaborative feature learning module, and a context-adaptive feedback evaluation recognition module;
the generic feature extraction module is configured to, based on a time-synchronization matching mechanism for multi-source heterogeneous data, establish a multi-source heterogeneous data association description model based on latent high-level shared semantics and extract generic features of the multi-source heterogeneous data;
the collaborative feature learning module is configured to establish a long-term dependency memory model and, in combination with an externally dependent collaborative attention mechanism, perform continuous learning on the generic features as prior knowledge to generate a memory model;
the context-adaptive feedback evaluation recognition module is configured to extract environment perception parameters from the multi-source heterogeneous data, establish a progressive network depth collaborative enhanced recognition mechanism, and realize multi-task recognition in combination with the learned features of the memory model and the task requirements.
8. The multi-task cooperative recognition system fusing audio-visual perception according to claim 7, characterized in that the generic feature extraction module comprises a time-synchronization submodule and a shared semantic association feature description submodule;
the time-synchronization submodule is configured to combine the low-level features of the multi-source heterogeneous data and, through a probability- and knowledge-driven framework, establish a multi-source heterogeneous data time-synchronization alignment mechanism with scale, translation, rotation, and time invariance;
the shared semantic association feature description submodule is configured to establish, according to a semantic vector mechanism and a multi-source information association mining mechanism, the shared semantic features of the synchronously obtained multi-source heterogeneous data and extract generic feature streams.
9. The multi-task cooperative recognition system fusing audio-visual perception according to claim 8, characterized in that the collaborative feature learning module comprises a long-term dependency generative memory model submodule and a deep collaborative feature learning model submodule;
the long-term dependency generative memory model submodule is configured to extract the generic features of the multi-source heterogeneous data for storage as prior knowledge, and establish an external memory generative model in combination with long-term data dependencies;
the deep collaborative feature learning model submodule is configured to, in combination with the externally dependent collaborative attention mechanism, perform continuous learning on the generic features as prior knowledge, output recognition features as posterior knowledge, and generate the memory model.
10. The multi-task cooperative recognition system fusing audio-visual perception according to claim 9, characterized in that the context-adaptive feedback evaluation recognition module comprises a context-adaptive perception feedback evaluation system submodule and a deep collaborative enhanced joint recognition mechanism submodule;
the context-adaptive perception feedback evaluation system submodule is configured to extract environment perception parameters, organically fuse the environment perception parameters with the recognition features, and realize weighted, updated layering of the recognition tasks;
the deep collaborative enhanced joint recognition mechanism submodule is configured to extract the generic feature descriptions of the multi-source heterogeneous data according to the environment perception parameters and the weights of the recognition tasks, and output recognition results.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810746362.3A CN108804715A (en) | 2018-07-09 | 2018-07-09 | Merge multitask coordinated recognition methods and the system of audiovisual perception |
CN201910312615.0A CN109947954B (en) | 2018-07-09 | 2019-04-18 | Multitask collaborative identification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810746362.3A CN108804715A (en) | 2018-07-09 | 2018-07-09 | Merge multitask coordinated recognition methods and the system of audiovisual perception |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108804715A true CN108804715A (en) | 2018-11-13 |
Family
ID=64074892
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810746362.3A Withdrawn CN108804715A (en) | 2018-07-09 | 2018-07-09 | Merge multitask coordinated recognition methods and the system of audiovisual perception |
CN201910312615.0A Active CN109947954B (en) | 2018-07-09 | 2019-04-18 | Multitask collaborative identification method and system |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910312615.0A Active CN109947954B (en) | 2018-07-09 | 2019-04-18 | Multitask collaborative identification method and system |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN108804715A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726903A (en) * | 2018-12-19 | 2019-05-07 | 中国电子科技集团公司信息科学研究院 | Distributed multi agent Collaborative Decision Making Method based on attention mechanism |
CN110379416A (en) * | 2019-08-15 | 2019-10-25 | 腾讯科技(深圳)有限公司 | A kind of neural network language model training method, device, equipment and storage medium |
CN111145538A (en) * | 2019-12-06 | 2020-05-12 | 齐鲁交通信息集团有限公司 | Stereo perception system suitable for audio and video acquisition, recognition and monitoring on highway |
CN111859267A (en) * | 2020-06-22 | 2020-10-30 | 复旦大学 | Operation method of privacy protection machine learning activation function based on BGW protocol |
CN112257785A (en) * | 2020-10-23 | 2021-01-22 | 中科院合肥技术创新工程院 | Serialized task completion method and system based on memory consolidation mechanism and GAN model |
CN112388627A (en) * | 2019-08-19 | 2021-02-23 | 维布络有限公司 | Method and system for executing tasks in dynamic heterogeneous robot environment |
CN112529184A (en) * | 2021-02-18 | 2021-03-19 | 中国科学院自动化研究所 | Industrial process optimization decision method fusing domain knowledge and multi-source data |
CN112580806A (en) * | 2020-12-29 | 2021-03-30 | 中国科学院空天信息创新研究院 | Neural network continuous learning method and device based on task domain knowledge migration |
NL2026432A (en) * | 2019-09-09 | 2021-05-11 | Shenzhen Demio Tech Co Ltd | Multi-source target tracking method for complex scenes |
CN112883256A (en) * | 2021-01-11 | 2021-06-01 | 北京达佳互联信息技术有限公司 | Multitasking method and device, electronic equipment and storage medium |
CN112951218A (en) * | 2021-03-22 | 2021-06-11 | 百果园技术(新加坡)有限公司 | Voice processing method and device based on neural network model and electronic equipment |
CN113344085A (en) * | 2021-06-16 | 2021-09-03 | 东南大学 | Balanced-bias multi-source data collaborative optimization and fusion method and device |
CN113837121A (en) * | 2021-09-28 | 2021-12-24 | 中国科学技术大学先进技术研究院 | Epidemic prevention robot vision and hearing collaborative perception method and system based on brain-like |
CN116884404A (en) * | 2023-09-08 | 2023-10-13 | 北京中电慧声科技有限公司 | Multitasking voice semantic communication method, device and system |
CN116996844A (en) * | 2023-07-07 | 2023-11-03 | 中国科学院脑科学与智能技术卓越创新中心 | Multi-point communication method and device for describing and predicting event |
CN117194900A (en) * | 2023-09-25 | 2023-12-08 | 中国铁路成都局集团有限公司成都供电段 | Equipment operation lightweight monitoring method and system based on self-adaptive sensing |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110908986B (en) * | 2019-11-08 | 2020-10-30 | 欧冶云商股份有限公司 | Layering method and device for computing tasks, distributed scheduling method and device and electronic equipment |
CN111488840A (en) * | 2020-04-15 | 2020-08-04 | 桂林电子科技大学 | Human behavior classification method based on multi-task learning model |
CN111598107B (en) * | 2020-04-17 | 2022-06-14 | 南开大学 | Multi-task joint detection method based on dynamic feature selection |
CN113282933B (en) * | 2020-07-17 | 2022-03-01 | 中兴通讯股份有限公司 | Federal learning method, device and system, electronic equipment and storage medium |
CN112329948B (en) * | 2020-11-04 | 2024-05-10 | 腾讯科技(深圳)有限公司 | Multi-agent strategy prediction method and device |
CN113377884B (en) * | 2021-07-08 | 2023-06-27 | 中央财经大学 | Event corpus purification method based on multi-agent reinforcement learning |
CN114155496B (en) * | 2021-11-29 | 2024-04-26 | 西安烽火软件科技有限公司 | Vehicle attribute multitasking collaborative recognition method based on self-attention |
WO2024103345A1 (en) * | 2022-11-17 | 2024-05-23 | 中国科学院深圳先进技术研究院 | Multi-task cognitive brain-inspired modeling method |
CN116028620B (en) * | 2023-02-20 | 2023-06-09 | 知呱呱(天津)大数据技术有限公司 | Method and system for generating patent abstract based on multi-task feature cooperation |
CN115985402B (en) * | 2023-03-20 | 2023-09-19 | 北京航空航天大学 | Cross-modal data migration method based on normalized flow theory |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530619B (en) * | 2013-10-29 | 2016-08-31 | 北京交通大学 | Gesture identification method based on a small amount of training sample that RGB-D data are constituted |
US10013640B1 (en) * | 2015-12-21 | 2018-07-03 | Google Llc | Object recognition from videos using recurrent neural networks |
CN105893612A (en) * | 2016-04-26 | 2016-08-24 | 中国科学院信息工程研究所 | Consistency expression method for multi-source heterogeneous big data |
CN106447625A (en) * | 2016-09-05 | 2017-02-22 | 北京中科奥森数据科技有限公司 | Facial image series-based attribute identification method and device |
CN106971200A (en) * | 2017-03-13 | 2017-07-21 | 天津大学 | A kind of iconic memory degree Forecasting Methodology learnt based on adaptive-migration |
CN107563407B (en) * | 2017-08-01 | 2020-08-14 | 同济大学 | Feature representation learning system for multi-modal big data of network space |
CN107506712B (en) * | 2017-08-15 | 2021-05-18 | 成都考拉悠然科技有限公司 | Human behavior identification method based on 3D deep convolutional network |
CN108229066A (en) * | 2018-02-07 | 2018-06-29 | 北京航空航天大学 | A kind of Parkinson's automatic identifying method based on multi-modal hyper linking brain network modelling |
- 2018-07-09: CN application CN201810746362.3A, published as CN108804715A (not active, withdrawn)
- 2019-04-18: CN application CN201910312615.0A, published as CN109947954B (active)
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726903A (en) * | 2018-12-19 | 2019-05-07 | 中国电子科技集团公司信息科学研究院 | Distributed multi agent Collaborative Decision Making Method based on attention mechanism |
CN110379416A (en) * | 2019-08-15 | 2019-10-25 | 腾讯科技(深圳)有限公司 | A kind of neural network language model training method, device, equipment and storage medium |
CN110379416B (en) * | 2019-08-15 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Neural network language model training method, device, equipment and storage medium |
CN112388627A (en) * | 2019-08-19 | 2021-02-23 | 维布络有限公司 | Method and system for executing tasks in dynamic heterogeneous robot environment |
NL2026432A (en) * | 2019-09-09 | 2021-05-11 | Shenzhen Demio Tech Co Ltd | Multi-source target tracking method for complex scenes |
CN111145538A (en) * | 2019-12-06 | 2020-05-12 | 齐鲁交通信息集团有限公司 | Stereo perception system suitable for audio and video acquisition, recognition and monitoring on highway |
CN111859267A (en) * | 2020-06-22 | 2020-10-30 | 复旦大学 | Operation method of privacy protection machine learning activation function based on BGW protocol |
CN111859267B (en) * | 2020-06-22 | 2024-04-26 | 复旦大学 | Operation method of privacy protection machine learning activation function based on BGW protocol |
CN112257785A (en) * | 2020-10-23 | 2021-01-22 | 中科院合肥技术创新工程院 | Serialized task completion method and system based on memory consolidation mechanism and GAN model |
CN112580806A (en) * | 2020-12-29 | 2021-03-30 | 中国科学院空天信息创新研究院 | Neural network continuous learning method and device based on task domain knowledge migration |
CN112883256A (en) * | 2021-01-11 | 2021-06-01 | 北京达佳互联信息技术有限公司 | Multitasking method and device, electronic equipment and storage medium |
CN112883256B (en) * | 2021-01-11 | 2024-05-17 | 北京达佳互联信息技术有限公司 | Multitasking method, apparatus, electronic device and storage medium |
CN112529184A (en) * | 2021-02-18 | 2021-03-19 | 中国科学院自动化研究所 | Industrial process optimization decision method fusing domain knowledge and multi-source data |
CN112529184B (en) * | 2021-02-18 | 2021-07-02 | 中国科学院自动化研究所 | Industrial process optimization decision method fusing domain knowledge and multi-source data |
US11409270B1 (en) | 2021-02-18 | 2022-08-09 | Institute Of Automation, Chinese Academy Of Sciences | Optimization decision-making method of industrial process fusing domain knowledge and multi-source data |
CN112951218B (en) * | 2021-03-22 | 2024-03-29 | 百果园技术(新加坡)有限公司 | Voice processing method and device based on neural network model and electronic equipment |
CN112951218A (en) * | 2021-03-22 | 2021-06-11 | 百果园技术(新加坡)有限公司 | Voice processing method and device based on neural network model and electronic equipment |
CN113344085A (en) * | 2021-06-16 | 2021-09-03 | 东南大学 | Balanced-bias multi-source data collaborative optimization and fusion method and device |
CN113344085B (en) * | 2021-06-16 | 2024-04-26 | 东南大学 | Balance bias multi-source data collaborative optimization and fusion method and device |
CN113837121B (en) * | 2021-09-28 | 2024-03-01 | 中国科学技术大学先进技术研究院 | Epidemic prevention robot visual and visual sense cooperative sensing method and system based on brain-like |
CN113837121A (en) * | 2021-09-28 | 2021-12-24 | 中国科学技术大学先进技术研究院 | Epidemic prevention robot vision and hearing collaborative perception method and system based on brain-like |
CN116996844A (en) * | 2023-07-07 | 2023-11-03 | 中国科学院脑科学与智能技术卓越创新中心 | Multi-point communication method and device for describing and predicting event |
CN116884404B (en) * | 2023-09-08 | 2023-12-15 | 北京中电慧声科技有限公司 | Multitasking voice semantic communication method, device and system |
CN116884404A (en) * | 2023-09-08 | 2023-10-13 | 北京中电慧声科技有限公司 | Multitasking voice semantic communication method, device and system |
CN117194900A (en) * | 2023-09-25 | 2023-12-08 | 中国铁路成都局集团有限公司成都供电段 | Equipment operation lightweight monitoring method and system based on self-adaptive sensing |
Also Published As
Publication number | Publication date |
---|---|
CN109947954A (en) | 2019-06-28 |
CN109947954B (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804715A (en) | Merge multitask coordinated recognition methods and the system of audiovisual perception | |
CN111930992B (en) | Neural network training method and device and electronic equipment | |
CN108846384A (en) | Merge the multitask coordinated recognition methods and system of video-aware | |
CN110532996A (en) | The method of visual classification, the method for information processing and server | |
CN109919078A (en) | A kind of method, the method and device of model training of video sequence selection | |
CN106909938A (en) | Viewing angle independence Activity recognition method based on deep learning network | |
CN106022294A (en) | Intelligent robot-oriented man-machine interaction method and intelligent robot-oriented man-machine interaction device | |
Emmeche | At home in a complex world: Lessons from the frontiers of natural science | |
Han et al. | Internet of emotional people: Towards continual affective computing cross cultures via audiovisual signals | |
CN112905762A (en) | Visual question-answering method based on equal attention-deficit-diagram network | |
Tan et al. | Style interleaved learning for generalizable person re-identification | |
CN113657272B (en) | Micro video classification method and system based on missing data completion | |
CN113423005B (en) | Intelligent music generation method and system based on improved neural network | |
Wang et al. | TC3KD: Knowledge distillation via teacher-student cooperative curriculum customization | |
Ye et al. | [Retracted] IoT‐Based Wearable Sensors and Bidirectional LSTM Network for Action Recognition of Aerobics Athletes | |
Zhang et al. | Local-global graph pooling via mutual information maximization for video-paragraph retrieval | |
Saleem et al. | Stateful human-centered visual captioning system to aid video surveillance | |
CN116244473B (en) | Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation | |
CN116737897A (en) | Intelligent building knowledge extraction model and method based on multiple modes | |
CN108764459B (en) | Target recognition network design method based on semantic definition | |
Usman et al. | Skeleton-based motion prediction: A survey | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
Yadikar et al. | A Review of Knowledge Distillation in Object Detection | |
CN113792626A (en) | Teaching process evaluation method based on teacher non-verbal behaviors | |
Ji et al. | [Retracted] Analysis of the Impact of the Development Level of Aerobics Movement on the Public Health of the Whole Population Based on Artificial Intelligence Technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20181113 |
WW01 | Invention patent application withdrawn after publication |