CN108804715A - Multitask collaborative recognition method and system fusing audiovisual perception - Google Patents
Multitask collaborative recognition method and system fusing audiovisual perception
- Publication number
- CN108804715A CN108804715A CN201810746362.3A CN201810746362A CN108804715A CN 108804715 A CN108804715 A CN 108804715A CN 201810746362 A CN201810746362 A CN 201810746362A CN 108804715 A CN108804715 A CN 108804715A
- Authority
- CN
- China
- Prior art keywords
- memory
- feature
- data
- heterogeneous data
- source heterogeneous
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Image Analysis (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a multitask collaborative recognition method and system fusing audiovisual perception, belonging to the technical field of multi-source heterogeneous data processing and recognition. The system comprises a generic feature extraction module, a collaborative feature learning module, and a context-adaptive feedback evaluation and recognition module. Based on a time-synchronization matching mechanism for multi-source heterogeneous data, generic features of the data are extracted; a long-term dependency memory model is established and, combined with a collaborative attention mechanism based on external dependency, the generic features are continuously learned as prior knowledge; environment perception parameters are extracted from the multi-source heterogeneous data, and a progressive-network deep collaborative enhanced recognition mechanism is established which, combined with the learned features of the memory model and the task requirements, achieves multitask recognition. By combining context-aware computing theory with environment perception, the present invention judges the weight of the tasks to be recognized through deep reinforcement feedback, adaptively adjusts the priority of those tasks according to environmental changes, and outputs multiple audiovisual perception recognition results simultaneously.
Description
Technical field
The present invention relates to the technical field of multi-source heterogeneous data processing and recognition, and in particular to a multitask collaborative recognition method and system fusing audiovisual perception.
Background technology
After sixty years of ups and downs, artificial intelligence is about to enter an era of full-scale growth, riding the information-technology wave of the Internet, the mobile Internet, and the Internet of Things, centered on deep neural network algorithms and supported by big data, cloud computing, and intelligent terminals. The continuous growth of communication bandwidth and transmission speed has rapidly lowered the threshold for acquiring massive audio/video data. Faced with the urgent demand for ultra-high-speed, mobile, and generalized storage and processing of massive data, the weak artificial intelligence of traditional single-modality, single-task processing has become the main bottleneck restricting the development of the field.
So-called audio-visual media multitask perception recognition refers to extracting the generic features of multi-source heterogeneous audiovisual information based on biological audiovisual perception mechanisms, combining them with long-duration deep hierarchical recursive models, learning the spatio-temporal shared semantic associations provided by long-term memory, and, under a reinforcement feedback mechanism, producing context-adaptive collaborative recognition results for different audiovisual tasks. For example, given a segment of audio-video data in which "Xiaoming, bouncing along, says to the school 'Hello, teacher!'", brain-inspired cognition should recognize multiple audiovisual tasks at once: the scene (school), the target (Xiaoming), the target's behavior (jumping), the target's emotion (happiness), and the target's speech ("Hello, teacher"). This contrasts with conventional methods, which build a separate recognition framework for each recognition task and output each result independently, both wasting computing resources and struggling to handle massive data.
On the one hand, in the big-data era, audio-visual media data from different platforms and terminals in social, information, and physical spaces exhibit massive heterogeneity, and traditional pattern recognition methods based on hand-selected features can no longer meet the demands of multitask collaborative recognition. On the other hand, these multi-source heterogeneous data share the same semantic information and contain rich latent associations. Taking the theme "horse" as an example, images, videos, audio clips, stereo images, and three-dimensional models can all, from complementary and mutually supporting angles, better describe the same semantic concept "horse". To better meet the needs of developing generalized strong artificial intelligence, finding an association-based generic feature description method for semantically related multi-source audio-visual media data has become the premise and foundation for further improving the processing speed, memory capacity, and robustness of intelligent perception recognition, and provides an effective data guarantee for multitask collaborative perception recognition of audio-visual media. Therefore, generic feature description methods for multi-source audio-visual media data sharing upper-layer semantics have become a research hotspot of intelligent perception technology in recent years.
From the perspective of multitask perception recognition, feature learning methods based on deep learning have shown great advantages in processing images, speech, and video. However, for massive multi-source data, with the evolution of user base, regional distribution, and time, some new problems arise:
Deep neural networks require large amounts of training data, which leaves them helpless on small-scale data tasks; facing the high cost of labeling massive training data, they perform poorly on real recognition tasks whose input is a continuous data stream.
Deep neural network models are complex, with huge numbers of parameters; training requires powerful computing facilities, and since different recognition tasks use different convolutional layer structures, rapid and balanced allocation of network resources is difficult to achieve.
Facing complex and diverse scene changes, they cannot establish long-term selective association memory and forgetting mechanisms from the temporal information of the processed data, and thus cannot realize an efficient context-adaptive learning mechanism. For example, in a video segment in which a target walks from a teaching building toward a dining hall, the target's behavior can be inferred to be "going to eat" from earlier recognition memories of the two scenes, and the corresponding dialogue topic changes accordingly.
Therefore, long-duration deep collaborative learning and reinforcement feedback for multitask audiovisual perception recognition have become one of the key problems to be solved urgently in current audiovisual intelligent perception recognition.
Summary of the invention
The purpose of the present invention is to provide a multitask collaborative recognition method and system fusing audiovisual perception that combine context-aware computing theory with environment perception, judge the weight of the tasks to be recognized through deep reinforcement feedback, adaptively adjust the priority of those tasks according to environmental changes, and discriminate among multiple audiovisual perception recognition tasks, thereby solving the technical problems existing in the background art described above.
To achieve the above goals, the present invention adopts the following technical solutions:
In one aspect, the present invention provides a multitask recognition method fusing audiovisual perception, comprising the following steps:
Step S110: Generic feature description of multi-source heterogeneous data: based on a time-synchronization matching mechanism for multi-source heterogeneous data, establish an association description model of multi-source heterogeneous data based on latent high-level shared semantics, and extract the generic features of the multi-source heterogeneous data;
Step S120: Deep collaborative feature learning with long-term memory: establish a long-term dependency memory model and, combined with a collaborative attention mechanism based on external dependency, continuously learn the generic features as prior knowledge, generating a memory model;
Step S130: Task discrimination based on a context-adaptive feedback evaluation mechanism: extract the environment perception parameters from the multi-source heterogeneous data, establish a progressive-network deep collaborative enhanced recognition mechanism, and, combined with the learned features of the memory model and the task requirements, achieve multitask recognition.
Further, in step S110, the time-synchronization matching mechanism for the multi-source heterogeneous data comprises: extracting the low-level feature streams of the multi-source heterogeneous data; establishing a coded concept stream for the data of each channel as the semantic coding of complex events; performing dynamic time warping on the low-level feature streams with reference to the semantic coding; and generating a time translation function to achieve semantic alignment.
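The dynamic time warping step above can be sketched as follows. This is a minimal illustration only: the feature streams here are plain numeric sequences and the pairing cost is a simple absolute difference, standing in for the patent's coded concept streams and semantic coding.

```python
def dtw(a, b):
    """Classic dynamic time warping between two feature streams.

    Returns the minimal cumulative alignment cost; the cost of pairing
    two samples is their absolute difference.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = minimal cost of aligning a[:i] with b[:j]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # stretch stream b
                                 D[i][j - 1],      # stretch stream a
                                 D[i - 1][j - 1])  # match both
    return D[n][m]

# Two streams carrying the same "event" at different speeds align at zero cost,
# which is the behavior the time translation function relies on.
audio_stream = [0.0, 1.0, 2.0, 1.0, 0.0]
video_stream = [0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0]
alignment_cost = dtw(audio_stream, video_stream)
```

A warping path that repeats samples of the shorter stream absorbs the speed difference, so `alignment_cost` is zero despite the unequal lengths.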
Further, extracting the low-level feature streams of the multi-source heterogeneous data comprises:
For audio signals, first performing waveform sampling preprocessing, then performing spectral transformation, and building a spectrogram in combination with prosodic features;
For two-dimensional video signals, first performing spectral transformation, then introducing co-occurrence statistical properties to obtain a two-dimensional timing signal with rotation and translation invariance;
For three-dimensional video sequences, introducing a low-level feature extraction technique that performs fast scale-space transformation based on multi-scale theory, then performing spectral transformation and co-occurrence statistics to generate temporal pyramid spectral features.
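The audio branch (spectral transformation into a spectrogram) can be illustrated with a naive framed DFT. This is a sketch under simplifying assumptions: no windowing or prosodic features, and a slow direct DFT rather than an FFT.

```python
import math

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum of one frame (first half of the bins)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

def spectrogram(signal, frame_len=64, hop=32):
    """Slice the signal into overlapping frames and stack their spectra."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return [dft_magnitudes(f) for f in frames]

# A sine with 8 cycles per 64-sample frame concentrates its energy in bin 8.
sig = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(sig)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

Each row of `spec` is the magnitude spectrum of one time frame, which is exactly the time-frequency grid a spectrogram displays.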
Further, establishing the association description model of multi-source heterogeneous data based on latent high-level shared semantics and extracting the generic features of the multi-source heterogeneous data comprises:
For each feature type X_i, learning a projection matrix Θ_i that projects the heterogeneous features into a space of equal feature dimensionality; through joint learning of the generic semantic feature subspace {Θ_i}, the shared matrix W_0 under a unified framework, and the specific feature module matrices {W_i}, jointly optimizing the statistical loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}), and the regularization function R_3(W_0, {W_i}), thereby establishing heterogeneous feature learning with shared semantics.
Specifically, for S classes of heterogeneous features, for each i (i = 1, ..., S), let X_i denote the feature matrix of n_i training samples, let E denote the data noise component, and let Γ be the rotation factor; the optimization function established under the orthogonality constraint is:
where λ is the shared-matrix coefficient, ᵀ denotes matrix transposition, Y_i is the label of the i-th feature class, ‖·‖_F is the Frobenius norm, Θ_iᵀ is the transpose of the projection matrix Θ_i, α, β, μ_1, and μ_2 are multiplier factors, rank(X) is the rank of the feature matrix X, and E is the noise matrix.
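The core idea of the per-type projections Θ_i, namely mapping feature types of different dimensionalities into one shared subspace of equal dimensionality, can be sketched as below. The matrices here are fixed illustrative stand-ins, not the result of the joint optimization described above.

```python
def matmul(A, B):
    """Multiply an m x n matrix by an n x p matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Two heterogeneous feature types with different dimensionalities:
# X1 holds 3-dimensional audio-like features, X2 holds 5-dimensional visual-like ones.
X1 = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]          # n1 = 2 samples
X2 = [[1.0, 1.0, 0.0, 0.0, 1.0]]                 # n2 = 1 sample

# Hypothetical projection matrices standing in for learned Theta_i,
# both mapping into the same 2-dimensional shared semantic subspace.
Theta1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                            # 3 -> 2
Theta2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, 0.5], [0.0, 1.0]]    # 5 -> 2

Z1 = matmul(X1, Theta1)   # both projections now live in the same space,
Z2 = matmul(X2, Theta2)   # so the shared matrix W0 can act on them jointly
```

After projection, samples of both feature types are 2-dimensional vectors in the same subspace, which is the precondition for learning the shared matrix W_0 across types.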
Further, establishing the association description model of multi-source heterogeneous data based on latent high-level shared semantics and extracting the generic features of the multi-source heterogeneous data further comprises:
Integrating the migration of the unlabeled data in the multi-source heterogeneous data into self-taught learning: taking the unlabeled data as the labeled target set of transfer learning, so that the target set and the supplementary set jointly optimize autonomous feature-labeling learning through {Θ_i}, denoting respectively the feature descriptions and label information of the supplementary-set samples and of the target-set samples; the transfer self-labeling learning model is expressed as follows:
where F(·) is the objective function and ρ is a multiplier factor; the transfer self-labeling learning model is solved using a three-stage refinement algorithm to obtain the generic feature description.
Further, the deep collaborative feature learning with long-term memory comprises:
The long-term dependency memory model comprises the generic feature description set e_{≤T} = {e_1, e_2, ..., e_T} and the corresponding latent variable set z_{≤T} = {z_1, z_2, ..., z_T}. A transition map h_t = f_h(h_{t-1}, e_t, z_t) updates the deterministic hidden state variable h_t at each time point; the prior mapping function f_z(h_{t-1}) describes the nonlinear dependence of past observations and latent variables and provides the latent-variable distribution parameters.
The nonlinear observation mapping function f_e(z_t, h_{t-1}) provides the likelihood function depending on the latent variable and the state. An external memory model corrects the sequential variational autoencoder, generating a memory context Ψ_t at each time point; the prior and posterior information are obtained as follows:
Prior: p_θ(z_t | z_{<t}, e_{<t}) = N(z_t | f_z^μ(Ψ_{t-1}), f_z^σ(Ψ_{t-1}))
Posterior: q_φ(z_t | z_{<t}, e_{≤t}) = N(z_t | f_q^μ(Ψ_{t-1}, e_t), f_q^σ(Ψ_{t-1}, e_t))
where f_z^μ is the transition mapping function for the mean μ of the latent variable z, f_z^σ is the transition mapping function for its standard deviation σ, f_q^μ is the transition mapping function for the posterior mean, and f_q^σ is the transition mapping function for the posterior standard deviation. The prior is a diagonal Gaussian distribution over the memory context that depends on the prior map f_z, and the diagonal Gaussian approximate posterior depends on the memory context Ψ_{t-1} and the current observation e_t associated through the posterior mapping function f_q.
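The diagonal-Gaussian prior above can be sketched with the standard reparameterization trick used in variational autoencoders. The affine maps and their weights here are arbitrary illustrative stand-ins for the transition maps f_z^μ and f_z^σ, and a fixed ε replaces a random draw so the result is deterministic.

```python
import math

def linear(v, W, b):
    """A tiny affine map standing in for a transition map."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def gaussian_params(context, W_mu, b_mu, W_sigma, b_sigma):
    """Diagonal-Gaussian parameters (mu, sigma) from a memory context Psi.

    softplus keeps every sigma strictly positive, as a valid std-dev must be.
    """
    mu = linear(context, W_mu, b_mu)
    sigma = [math.log1p(math.exp(s)) for s in linear(context, W_sigma, b_sigma)]
    return mu, sigma

def reparameterize(mu, sigma, eps):
    """Sample z = mu + sigma * eps (element-wise), the reparameterization trick."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

psi = [1.0, -0.5]                                        # memory context Psi_{t-1}
W_mu, b_mu = [[0.5, 0.0], [0.0, 1.0]], [0.0, 0.0]
W_sigma, b_sigma = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]  # sigma = softplus(0)

mu, sigma = gaussian_params(psi, W_mu, b_mu, W_sigma, b_sigma)
z = reparameterize(mu, sigma, eps=[1.0, -1.0])           # latent sample z_t
```

Because z is a differentiable function of μ and σ, gradients can flow through the sampling step, which is what makes the variational model trainable end to end.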
Further, the deep collaborative feature learning with long-term memory further comprises:
Using cooperative-mode perception theory, computing the timing memory bias that the generic features generate under the influence of the task, and, from the timing memory bias and the generic features, generating the adaptive perceptual attention time zone relevant to the recognition task;
Using a long short-term memory network (LSTM) f_rnn to advance the state history h_t; the external memory M_t is generated from the latent variable of the previous time point and the external context information c_t. The state update model is as follows:
State update: (h_t, M_t) = f_rnn(h_{t-1}, M_{t-1}, z_{t-1}, c_t)
To form R items of content information retrieved from the memory M_t, a key value k_t^r is introduced; cosine similarity is used to compare k_t^r with each row of the memory M_{t-1}, generating attention weights, and the retrieved memory φ_t^r is obtained as the attention-weighted sum over M_{t-1}:
Key value: k_t^r, generated from the advanced state history
Attention weights: w_t^r = f_att(k_t^r, M_{t-1})
Retrieved memory: φ_t^r = w_t^r · M_{t-1}
Generated memory: each retrieved memory φ_t^r is combined, via the sigmoid gate σ(·) and element-wise multiplication ⊙, with the learned retrieval-memory bias
where the key-value function of the r-th item is applied to the advanced state history, f_att is the attention mechanism function, w_{t,i}^r is the memory weight of the i-th entry for the r-th item at time t, φ_t^r is the result of the retrieval-memory equation, ⊙ denotes element-wise multiplication, the retrieval-memory bias value is learned, and σ(·) is the sigmoid function. This forms an expression mechanism informing memory storage and retrieval, and Ψ_t = [φ_t^1, φ_t^2, ..., φ_t^R, h_t] serves as the output of the generative memory model.
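The cosine-similarity retrieval step can be sketched as standard content-based addressing. This is a minimal illustration for a single key: the memory rows and key vector are made-up values, and a softmax turns the similarity scores into attention weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def retrieve(key, memory):
    """Content-based addressing: score each memory row against the key with
    cosine similarity, softmax the scores into attention weights, and return
    the weighted sum of the rows as the retrieved memory."""
    weights = softmax([cosine(key, row) for row in memory])
    retrieved = [sum(w * row[j] for w, row in zip(weights, memory))
                 for j in range(len(memory[0]))]
    return weights, retrieved

M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # memory M_{t-1}, one entry per row
key = [1.0, 0.05]                          # key k_t^r, nearly parallel to row 0
weights, phi = retrieve(key, M)            # phi plays the role of phi_t^r
best = max(range(len(weights)), key=lambda i: weights[i])
```

The softmax keeps every row's contribution nonzero, so retrieval is a soft blend dominated by the entry most similar to the key rather than a hard lookup.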
Further, extracting the environment perception parameters from the multi-source heterogeneous data and establishing the progressive-network deep collaborative enhanced recognition mechanism comprises:
Obtaining a brightness perception parameter from the normalized distance between the pixel mean of the image/video and standard brightness information; obtaining a loudness perception parameter from the normalized distance between the mean sound intensity of the input audio and standard sound-intensity information; computing a viewing-angle perception parameter from the mean information content of the high-frequency image (the larger this value, the richer the image detail, i.e., the better the viewing angle); computing a sound-field perception parameter from the average energy of the transfer function from the sound source to the ear; and expressing the attention perception parameter through the attention rule parameter of the audiovisual attention time zone in heterogeneous feature learning;
Taking the weighted sum of the brightness, loudness, viewing-angle, sound-field, and attention perception parameters as the context-adaptive decision, establishing the progressive-network deep collaborative enhanced recognition mechanism, and, after layer-by-layer storage of migrated knowledge and extraction of reward features, deciding the recognition task currently to be processed.
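The weighted-sum context decision can be sketched as below. All numeric values (standard brightness 128, standard intensity 70 dB, the raw parameters, and the equal weights) are hypothetical placeholders, not values given by the patent.

```python
def normalized_distance(value, standard):
    """Normalized distance between a measured value and its standard value."""
    return abs(value - standard) / standard

def context_score(params, weights):
    """Weighted sum of the perception parameters: the context-adaptive decision."""
    return sum(weights[k] * params[k] for k in params)

# Hypothetical measurements for one candidate task's input streams.
params = {
    "brightness": 1 - normalized_distance(96.0, 128.0),  # pixel mean vs standard
    "loudness":   1 - normalized_distance(60.0, 70.0),   # sound intensity vs standard
    "view":       0.8,                                   # high-frequency detail content
    "soundfield": 0.6,                                   # transfer-function mean energy
    "attention":  0.9,                                   # attention-rule parameter
}
weights = {"brightness": 0.2, "loudness": 0.2, "view": 0.2,
           "soundfield": 0.2, "attention": 0.2}

score = context_score(params, weights)
```

Computing such a score per candidate task and processing the highest-scoring one first is one plausible reading of how the decision adaptively reorders task priorities as the environment changes.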
In another aspect, the present invention also provides a multitask collaborative recognition system fusing audiovisual perception, characterized in that it comprises a generic feature extraction module, a collaborative feature learning module, and a context-adaptive feedback evaluation and recognition module;
the generic feature extraction module is used to establish, based on a time-synchronization matching mechanism for multi-source heterogeneous data, an association description model of multi-source heterogeneous data based on latent high-level shared semantics, and to extract the generic features of the multi-source heterogeneous data;
the collaborative feature learning module is used to establish a long-term dependency memory model and, combined with a collaborative attention mechanism based on external dependency, to continuously learn the generic features as prior knowledge, generating a memory model;
the context-adaptive feedback evaluation and recognition module is used to extract the environment perception parameters from the multi-source heterogeneous data, to establish a progressive-network deep collaborative enhanced recognition mechanism, and, combined with the learned features of the memory model and the task requirements, to achieve multitask recognition.
Further, the generic feature extraction module comprises a time-synchronization submodule and a shared semantic association feature description submodule;
the time-synchronization submodule is used to combine the low-level features of the multi-source heterogeneous data and, through a probability- and knowledge-driven framework, to establish a time-synchronization alignment mechanism for multi-source heterogeneous data with scale, translation, rotation, and time invariance;
the shared semantic association feature description submodule is used to establish, according to a semantic vectorization mechanism and a multi-source information association mining mechanism, the shared semantic features of the synchronously acquired multi-source heterogeneous data, and to extract generic feature streams.
Further, the collaborative feature learning module comprises a long-term dependency generative memory model submodule and a deep collaborative feature learning model submodule;
the long-term dependency generative memory model submodule is used to store the extracted generic features of the multi-source heterogeneous data as prior knowledge and, combined with long-term data dependency, to establish an external-memory generative model;
the deep collaborative feature learning model submodule is used, combined with the collaborative attention mechanism based on external dependency, to continuously learn the generic features as prior knowledge and to output recognition features as posterior knowledge, generating the memory model.
Further, the context-adaptive feedback evaluation and recognition module comprises a context-adaptive perceptual feedback evaluation system submodule and a deep collaborative enhanced joint recognition mechanism submodule;
the context-adaptive perceptual feedback evaluation system submodule is used to extract the environment perception parameters and, by organically fusing the environment perception parameters with the recognition features, to update the weighted hierarchy of the recognition tasks;
the deep collaborative enhanced joint recognition mechanism submodule is used to extract the generic feature description of the multi-source heterogeneous data according to the environment perception parameters and the weights of the recognition tasks, and to output the recognition results.
Advantageous effects of the present invention: compared with existing multitask collaborative recognition methods fusing audiovisual perception, the present method has better effectiveness and higher efficiency, and can provide valuable research results and theoretical and technical guidance for the further research and development of brain-inspired cognition theory and applications under future strong artificial intelligence. Specifically:
(1) Based on the generic feature description mechanism, the audio-visual media information acquired from different channels is given effective complementary support, evolving from the traditional single-source fixed mode to a multi-source elastic model, which not only effectively removes data redundancy but also learns a generic feature description.
(2) A deep collaborative feature learning mechanism with persistent memory is established for continuously input multi-source data; combined with long-term data dependency, an external-memory generative model is established, and learning network performance is enhanced through the external memory. On the one hand, the model parameter complexity is stabilized with a smaller data memory capacity; on the other hand, useful information can be extracted at once and applied to different types of sequence structures, solving the problem that complex, long sequential data cannot be selectively memorized and forgotten.
(3) Combining context-aware computing theory with environment perception, the weight of the tasks to be recognized is judged through deep reinforcement feedback, the priority of those tasks is adaptively adjusted according to environmental changes, and multiple audiovisual perception recognition results are output simultaneously.
Additional aspects and advantages of the present invention will be set forth in part in the following description; they will become apparent from that description, or be learned through practice of the invention.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a block diagram of the multitask recognition principle of the multitask collaborative recognition system fusing audiovisual perception according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the association feature description model based on shared semantics of the multitask collaborative recognition method fusing audiovisual perception according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the generative memory model combined with external dependency of the multitask collaborative recognition method fusing audiovisual perception according to an embodiment of the present invention.
Fig. 4 is a block diagram of the progressive deep collaborative enhanced feedback recognition mechanism under the context-adaptive framework according to an embodiment of the present invention.
Specific implementation mode
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or modules with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only used to explain the present invention and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include plural forms. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements, and/or modules, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, and/or groups thereof.
It should be noted that, unless otherwise specifically defined or limited in the embodiments of the present invention, terms such as "connected" and "fixed" are to be understood broadly: a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediary; it may be an internal communication within two elements or an interaction between two elements, unless specifically limited otherwise. For those skilled in the art, the specific meanings of the above terms in the embodiments of the present invention can be understood according to the specific situation.
Those skilled in the art will appreciate that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meanings as commonly understood by those of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless defined as herein, will not be interpreted in an idealized or overly formal sense.
To facilitate understanding of the embodiments of the present invention, further explanation is given below by way of specific embodiments in conjunction with the drawings; the embodiments do not constitute a limitation on the embodiments of the present invention.
Those of ordinary skill in the art should understand that the drawings are schematic diagrams of one embodiment; the parts or devices in the drawings are not necessarily essential to implementing the present invention.
Embodiment one
As shown in Fig. 1, embodiment one of the present invention provides a multitask collaborative recognition method and system fusing audiovisual perception.
A multitask collaborative recognition system fusing audiovisual perception disclosed in embodiment one of the present invention comprises:
a generic feature extraction module, used to establish the time-synchronization matching mechanism for multi-source heterogeneous data and to realize the multi-source data association description model based on latent high-level shared semantics, achieving efficient support and maximal information complementarity between data of different channels, and realizing the elimination of data redundancy;
a deep collaborative feature learning module, used to establish the long-term dependency generative memory model and to explore an autonomous semi-supervised continual learning system based on collaborative attention and depth, realizing dynamic self-learning with selective memory and forgetting ability and achieving incremental improvement over the performance of existing learning models;
an intelligent multitask deep collaborative enhanced feedback recognition module, based on the context-aware computing theory of intelligent-agent cooperation, introducing adaptive deep collaborative enhanced feedback combined with a multitask joint recognition mechanism, to solve the theoretical and technical problems of harmonious linking between audiovisual perception and the natural environment.
By studying an intelligent recognition demonstration platform with multi-node, multi-thread, multi-GPU distributed processing and employing bandwidth optimization algorithms, resources are called efficiently, the communication load between computing and storage devices is greatly reduced, and device resources can be extended on demand, providing hardware support for the efficient operation of the system.
The multitask coordinated identifying system of above-mentioned fusion audiovisual perception in preferred generic characteristic extracting module, including is used for
The submodule that multi-source heterogeneous data time synchronizes, including:Multi-source data processing mode requires accurately to examine in time-space domain simultaneously
The change information of target and scene is surveyed and tracks, and the time in actual acquired data between different modalities mismatches, and can not keep away
It causes effective information to lose and judge by accident with exempting from, causes damages to recognition result.Therefore, it is necessary to combine multi-source audio-visual media data
Intrinsic characteristics, by probability and Knowledge driving frame, research have scale, translation, rotation, time invariance isomeric data when
Between synchronization mechanism, reduce multi-data source between time uncertainty.
The multitask coordinated identifying system of above-mentioned fusion audiovisual perception, in preferred generic characteristic extracting module, including it is shared
Semantic association feature description submodule, including:Comprising rich in social activity, information, physical space different platform and modal data
Rich nature and social property, the dimension that has different characteristics and data distribution, but the synchronous multi-source data obtained is but shared
Similar semantic information, a large amount of potential incidence relation is contained.Therefore, it is necessary to explore the semantic vector of different modalities data
Mechanism, multi-source information association mining mechanism study potential shared semantic feature under audio-visual media different channels, it is regular to establish dimension
Association semanteme generalization feature description model.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred deep collaborative feature learning module includes a long-term-dependency generative memory model submodule, including: for long-duration, multi-sequence input feature streams, a learning mechanism without memory capability must continually label newly input data and relearn the network model on each new input, which is a huge waste of computation, storage and human resources and also hinders effective extraction of identification information. Therefore, it is necessary to combine long-term data dependencies to establish an external-memory generative model that enhances learning network performance through external memory: on the one hand it stabilizes model parameter complexity with a smaller data memory capacity, and on the other hand it can immediately extract useful information and apply it to different types of sequence structures, thereby solving the problem that complex, long sequential data cannot be selectively memorized and forgotten.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred deep collaborative feature learning module includes a deep collaborative feature learning model submodule, including: for a continuously input unlabeled feature stream, it is necessary to accurately and efficiently learn joint optimal features that minimize intra-class distance and maximize inter-class distance for multitask identification; since unlabeled data cannot be manually given class annotations, performance loss is otherwise inevitable. Therefore, it is necessary to combine a collaborative attention mechanism possessing long-term memory to establish a deep continual composite feature learning model, realizing autonomous selection of identification features, improving the discriminability of unlabeled data, and achieving dynamic refinement of model increments.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred intelligent multitask deep collaborative enhancement feedback identification module includes an environment-adaptive perceptual feedback evaluation system submodule, including: for scene uncertainty in audiovisual perception, environment perception parameters must be extracted, and the organic fusion of this parameter information provides adaptive feedback evaluation for the multitask identification system, realizing weighted identification of important identification tasks. For example, in a classroom, identifying pupil status and expression are the main identification tasks; in an outdoor scene, identifying targets and behaviors are the main identification tasks; and in a human-computer interaction scene, identifying voice and action are the main identification tasks.
In the above multitask coordinated identification system fusing audiovisual perception, the preferred intelligent multitask deep collaborative enhancement feedback identification module includes a deep collaborative enhancement joint recognition mechanism submodule, including: for the demand of multitask coordinated identification in the current scene, the online input data stream must be processed while outputting multiple audiovisual recognition results. Therefore, it is necessary to establish a strongly generalized intelligent agent that, via feedback parameters and task weights, extracts generic feature descriptions and performs task-enhanced learning on the collaborative feature learning parameters, outputs correct recognition results, and gives the computer a certain "thinking and understanding" ability.
Embodiment 2
Embodiment 2 of the present invention provides a multitask identification method using the above system, the method including: generic feature description of massive multi-source audio-visual media perception data, including establishing a time synchronization matching mechanism for multi-source heterogeneous data and realizing a multi-source data association description model based on potential high-level shared semantics; deep collaborative feature learning with long-term memory for continuously input streaming media data, including establishing a long-term-dependency generative memory model and exploring an autonomous semi-supervised continual learning system based on collaborative attention and depth; and an intelligent multitask deep collaborative enhancement feedback identification model under an environment-adaptive framework, including an environment-adaptive perceptual computing theory based on intelligent agent cooperation, introducing an adaptive deep collaborative enhancement feedback and multitask joint recognition mechanism. Compared with existing multitask coordinated recognition methods fusing audiovisual perception, the present invention has better validity and efficiency, and can provide valuable research results and theoretical and technical guidance for the further research, development and application of cognitive machine theory under future strong artificial intelligence.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the generic feature description of massive multi-source audio-visual media perception data, the heterogeneous data time synchronization mechanism includes: multi-source data processing requires that the change information of targets and scenes be accurately detected and tracked in the time-space domain simultaneously, yet the times of different modalities in actually acquired data are mismatched, which inevitably causes loss of effective information and misjudgment, harming recognition results. Therefore, it is necessary to combine the intrinsic characteristics of multi-source audio-visual media data and, under a probability- and knowledge-driven framework, study a time synchronization mechanism for heterogeneous data possessing scale, translation, rotation and time invariance, so as to reduce the temporal uncertainty between data sources.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the generic feature description of massive multi-source audio-visual media perception data, the shared semantic association feature description model includes: data from different platforms and modalities in social, information and physical space contain rich natural and social attributes, with differing feature dimensions and data distributions, yet the synchronously acquired multi-source data share similar semantic information and contain a large number of potential association relations. Therefore, it is necessary to explore semantic vectorization mechanisms and multi-source information association mining mechanisms for different modality data, study the potential shared semantic features underlying the different channels of audio-visual media, and establish a dimension-regular associated semantic generalized feature description model.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the deep collaborative feature learning with long-term memory for continuously input streaming media data, the long-term-dependency generative memory model includes: for long-duration, multi-sequence input feature streams, a learning mechanism without memory capability must continually label newly input data and relearn the network model on each new input, which is a huge waste of computation, storage and human resources and also hinders effective extraction of identification information. Therefore, it is necessary to combine long-term data dependencies to establish an external-memory generative model that enhances learning network performance through external memory: on the one hand it stabilizes model parameter complexity with a smaller data memory capacity, and on the other hand it can immediately extract useful information and apply it to different types of sequence structures, thereby solving the problem that complex, long sequential data cannot be selectively memorized and forgotten.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the deep collaborative feature learning with long-term memory for continuously input streaming media data, the deep collaborative feature learning model includes: for a continuously input unlabeled feature stream, it is necessary to accurately and efficiently learn joint optimal features that minimize intra-class distance and maximize inter-class distance for multitask identification; since unlabeled data cannot be manually given class annotations, performance loss is otherwise inevitable. Therefore, it is necessary to combine a collaborative attention mechanism possessing long-term memory to establish a deep continual composite feature learning model, realizing autonomous selection of identification features, improving the discriminability of unlabeled data, and achieving dynamic refinement of model increments.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the intelligent multitask deep collaborative enhancement feedback identification model under the environment-adaptive framework, the environment-adaptive perceptual feedback evaluation system includes: for scene uncertainty in audiovisual perception, environment perception parameters must be extracted, and the organic fusion of this parameter information provides adaptive feedback evaluation for the multitask identification system, realizing weighted identification of important identification tasks. For example, in a classroom, identifying pupil status and expression are the main identification tasks; in an outdoor scene, identifying targets and behaviors are the main identification tasks; and in a human-computer interaction scene, identifying voice and action are the main identification tasks.
In the above multitask coordinated recognition method fusing audiovisual perception, preferably in the intelligent multitask deep collaborative enhancement feedback identification model under the environment-adaptive framework, the deep collaborative enhancement joint recognition mechanism includes: for the demand of multitask coordinated identification in the current scene, the online input data stream must be processed while outputting multiple audiovisual recognition results. Therefore, it is necessary to establish a strongly generalized intelligent agent that, via feedback parameters and task weights, extracts generic feature descriptions and performs task-enhanced learning on the collaborative feature learning parameters, outputs correct recognition results, and gives the computer a certain "thinking and understanding" ability.
Embodiment 3
As shown in Figure 1, Embodiment 3 of the present invention provides a multitask coordinated recognition method fusing audiovisual perception.
First, a generic feature description method for multi-source audio-visual media perception data, established by a migration-based algorithm, is explored.
In order to realize efficient collaborative analysis for different audio-visual tasks, feature descriptions with high robustness and versatility are extracted from multi-source audio-visual perception data as prototype features for subsequent collaborative learning; it is first necessary to analyze the characteristics of audio-visual perception data. Actually acquired audio data are mostly one-dimensional time series whose descriptiveness is mainly embodied in spectrum-time cues, requiring description via a spectrum transform in an auditory-perception-like domain combined with the prosodic information of consecutive audio frames. Visual perception data are mostly two- or three-dimensional images or video sequences whose descriptiveness is mainly embodied in variations of the visual field and spatial domain, requiring many-sided characteristics such as color, depth, scale and rotation to be taken into consideration. The cross-modal shared semantic features of audio-visual perception data need to possess time, scale, rotation and translation invariance.
For the multichannel, multi-scale, multi-modal characteristics of audio-visual perception data, the generalized feature description of the present invention consists of the following key steps: multi-source perceptual low-level feature description, cross-media data time synchronization matching, multi-feature channel association learning model, and migration feature fusion.
Multi-source perceptual low-level feature description:
For the multi-source, cross-media, multichannel feature acquisition of audio-visual perceptual signals, low-level feature descriptions are extracted from audio and video data respectively. For audio signals, waveform sampling preprocessing is first performed, a spectrum transform is then carried out, and, combined with prosodic features, a regularized spectrogram is constructed as the low-level feature. For two-dimensional video signals, a spectrum transform is first performed and co-occurrence statistics are introduced to obtain a two-dimensional temporal signal with rotation and translation invariance. For three-dimensional video sequences, a multi-scale-theory low-level feature extraction technique with fast scale-space transforms is introduced, followed by spectrum transform and co-occurrence statistics, to generate temporal pyramid spectrum features.
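The spectro-temporal low-level description above can be sketched minimally. The following pure-Python example (a naive DFT rather than an optimized FFT, with an assumed Hann window; frame length and hop are illustrative, not values specified by the method) frames a signal and produces per-frame magnitude spectra, the raw material of the spectrogram features described:

```python
import cmath
import math

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    frames = []
    for start in range(0, max(len(x) - frame_len, 0) + 1, hop):
        frames.append(x[start:start + frame_len])
    return frames

def dft_magnitude(frame):
    """Naive DFT magnitude spectrum (first half of the bins)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(x, frame_len=64, hop=32):
    """Low-level spectro-temporal description: Hann-windowed frames -> magnitude spectra."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    spec = []
    for fr in frame_signal(x, frame_len, hop):
        fr = fr + [0.0] * (frame_len - len(fr))  # zero-pad a short tail frame
        fr = [a * w for a, w in zip(fr, window)]
        spec.append(dft_magnitude(fr))
    return spec

# A pure sine at DFT bin 8: each frame's energy should concentrate there.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(signal)
```

In a real pipeline this windowed-transform step would be followed by the auditory-domain conversion and prosodic features that the description combines it with.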
Cross-media data time synchronization matching:
Audio-visual multitask perception requires accurate detection and tracking of targets in the time-space domain, so time alignment between multimedia data streams must be realized. In order to realize nonlinear alignment of heterogeneous data streams, dynamic time warping is first used to achieve optimal alignment of temporal signals. A coded concept stream is established for the data stream of each channel; serving as the semantic coding of complex events, all newly input low-level feature streams are dynamically time-aligned with reference to this semantic coding, and a time translation function is generated to realize semantic alignment.
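As a concrete sketch of the alignment step (a textbook dynamic-time-warping routine, not the patent's full probability- and knowledge-driven mechanism; the two toy feature streams are hypothetical), the following pure-Python example nonlinearly aligns two one-dimensional feature sequences:

```python
def dtw_alignment(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping: returns (total cost, warping path)."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    # Backtrack the optimal warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        moves = []
        if i > 1 and j > 1:
            moves.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 1:
            moves.append((cost[i - 1][j], i - 1, j))
        if j > 1:
            moves.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(moves)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path

# A feature stream and a time-stretched copy of it align with zero cost.
audio_feats = [0.0, 1.0, 2.0, 3.0, 2.0, 0.0]
video_feats = [0.0, 1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 0.0]
total, path = dtw_alignment(audio_feats, video_feats)
```

The warping path plays the role of the "time translation function" mentioned above: it tells, for each point of one stream, which points of the other stream it corresponds to.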
The multi-feature channel association learning model includes: since similar high-level semantic structure information is shared between the media of different channels, in order to effectively quantify the shared information of features of different dimensions, extract the generic feature descriptions with maximal discriminability across multiple audio-visual tasks, increase inter-class distance and reduce intra-class distance, a joint learning model of heterogeneous features must be established. Suppose there are S classes of heterogeneous features; each feature type X_i is denoted as the feature matrix of n_i training samples, the data noise part is E, and Γ is a rotation factor. The joint heterogeneous feature learning model under the multitask framework is intended to learn a projection matrix Θ_i for each X_i. The heterogeneous feature matrices are projected to an equal feature dimensionality, reducing the redundancy of the multi-feature data; the optimization function under the orthogonality constraint is expressed as:
The heterogeneous feature learning model is intended to jointly learn the general semantic feature subspaces {Θ_i}, the shared matrix W_0 and the specific feature module matrices {W_i} under a unified framework, using least squares to solve for the joint optimal solution of the prediction loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}) and the regularization function R_3(W_0, {W_i}). By projecting newly input data into the feature subspace, high-level generic feature descriptions of equal dimension are extracted and shared semantic association relations are established, as shown in Figure 2.
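As an illustrative sketch of the projection step only (not the patent's actual optimization: the Θ_i below are hand-fixed toy matrices, whereas the described model learns them under an orthogonality constraint), the following pure-Python example shows two heterogeneous feature types of different dimensionality being mapped into a common k-dimensional semantic subspace:

```python
def matmul(A, B):
    """Plain-Python matrix product (rows of A times columns of B)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Two heterogeneous feature types (hypothetical toy data, samples x dims):
# audio features with d1 = 4 dims, visual features with d2 = 3 dims.
X_audio = [[1.0, 0.0, 2.0, 1.0],
           [0.0, 1.0, 1.0, 2.0]]
X_visual = [[0.5, 1.5, 0.0],
            [1.0, 0.0, 1.0]]

# Hand-fixed projection matrices Theta_i (d_i x k) mapping both feature
# types into the same k = 2 dimensional subspace.
Theta_audio = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.5]]
Theta_visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

Z_audio = matmul(X_audio, Theta_audio)    # samples x k
Z_visual = matmul(X_visual, Theta_visual)  # samples x k

# After projection, both modalities live in one equal-dimension space,
# where shared semantic associations can be established.
assert len(Z_audio[0]) == len(Z_visual[0]) == 2
```

The point of the sketch is the shape bookkeeping: once every X_i is projected through its own Θ_i, the downstream shared matrix W_0 and per-type matrices {W_i} can operate on a common dimensionality.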
Migration feature fusion learning:
For the problem of limited training samples in massive data, a transfer learning model is introduced to enhance the autonomous annotation learning ability of unlabeled data. The unlabeled data set is taken as the annotation target set of the transfer learning; by providing strong prior information, the target set and the supplement set jointly optimize autonomous feature annotation learning through {Θ_i}. Denoting the sample feature descriptions and annotation information of the supplement set and of the target set, the migration joint learning model is expressed as follows:
where F(·) is the objective function of the model. Using a stage-wise optimization algorithm to solve the three above optimization problems, a unified generic feature description of audio-visual media is obtained.
Under this model, a migration-based algorithm realizes the generic feature description of multi-source audio-visual media perception data. According to the different modalities of the perception data, combined with the application environment of the perception identification task, a generic feature description model based on high-level shared semantics is established. On this basis, according to the comprehensive limitations of constraints such as feature dimensionality, computation delay, time alignment and frame rate, a joint heterogeneous optimization method for multi-source data is used to extract the shared semantic information of different feature information. The theoretical research of the relevant scheme is completed through theoretical modeling, mathematical derivation and optimization algorithm solution, and the simulation verification work of the new scheme is further completed through tools such as a mathematical simulation platform.
After the method described in Embodiment 3 of the present invention completes the generic feature description for multi-source audio-visual media perception data, it continues to explore establishing a sustainable deep collaborative feature learning mechanism using generative memory model dynamics: a temporal generative model enhanced by an external memory system which, under a variational reasoning framework, stores the effective information of memory feature descriptions from the early stage of a sequence and performs efficient, sustainable collaborative reuse of the stored information.
The generic feature description process can well fuse the time-space-domain identification information in audio-visual media perception data. Next, starting from the basic theory of generative memory models and long-term collaborative dependency, and aiming at the compatibility, intelligence and flexibility requirements of audio-visual perception identification tasks, a temporal generative model enhanced by an external memory system and a collaborative feature learning algorithm are studied. In general, for continuously input audiovisual streaming media data, long-range dependencies based on time intervals and past observations separate the predictable and unpredictable elements of a long time series; uncertainty is represented for the unpredictable elements, and rapid identification can assist in predicting new future elements.
The temporal generative model includes a generic feature description set e_{≤T} = {e_1, e_2, ..., e_T} and a corresponding hidden variable set z_{≤T} = {z_1, z_2, ..., z_T}. A transition map h_t = f_h(h_{t-1}, e_t, z_t) corrects the deterministic hidden state variable h_t at each time point; a prior mapping function f_z(h_{t-1}) describes the nonlinear dependence of past observations and hidden variables and provides the distribution parameters of the hidden variables; and a nonlinear observation mapping function f_e(z_t, h_{t-1}) provides the likelihood function depending on hidden variables and state. In the present invention, an external memory model corrects the temporal variational autoencoder, generating a memory text Ψ_t at every time point; the prior and posterior probabilities are expressed as follows:
Prior information: p_θ(z_t | z_{<t}, e_{<t}) = N(z_t | f_z^μ(Ψ_{t-1}), f_z^σ(Ψ_{t-1}))
Posterior information: q_φ(z_t | z_{<t}, e_{≤t}) = N(z_t | f_q^μ(Ψ_{t-1}, e_t), f_q^σ(Ψ_{t-1}, e_t))
where the prior is a diagonal Gaussian distribution function of the memory text depending on the prior mapping f_z, and the diagonal Gaussian approximate posterior distribution depends on the memory text Ψ_{t-1} and the current observation e_t, associated through the posterior mapping function f_q.
As shown in Figure 3, a stochastic computation graph is used as the processing procedure of the memory-sequence generative model. In order to give the structure higher versatility and flexibility for different perception tasks, the present invention introduces a high-level semantic memory and controller architecture to stably store extracted information for the future and to perform the corresponding computation to extract usable information immediately. Specifically, unlike past first-in-first-out memory buffering, an associative adaptation mode page theory close to the human cognitive process is used: salient audio-visual time zones relevant to the task are formed from generic feature descriptions, the timing memory bias generated by generic features under the influence of the task is computed, and task-relevant adaptive perceptual attention time zones are generated from this bias and the generic features. The versatility of this memory structure is embodied in allowing information positions to be read and written at any time.
The controller uses a long short-term memory network (LSTM) f_rnn to advance the state history h_t; the external memory M_t is generated using the hidden variable of the previous moment and external context information c_t. The generation model is as follows:
State update: (h_t, M_t) = f_rnn(h_{t-1}, M_{t-1}, z_{t-1}, c_t)
In order to form R items of content information retrieved from memory M_t, the controller generates a set of key values and, using cosine similarity evaluation, compares each key k_t^r with each row of memory M_{t-1} to generate soft attention weight sets; the retrieved memory φ_t^r is obtained by the weighted sum of the attention weights and memory M_{t-1}:
Key value: k_t^r, generated by the controller from the state history;
Attention mechanism: soft weights a_t^r obtained from the cosine similarities cos(k_t^r, M_{t-1}[i]), sharpened by a retrieval bias;
Retrieval memory: φ_t^r = Σ_i a_{t,i}^r · M_{t-1}[i];
Generate memory: M_t is written with the hidden variable z_t;
where the retrieval bias is a correlation value learned by retrieving memory, and σ(·) is the sigmoid function. Thereby, the external memory M_t is used to store the hidden variables z_t, and the controller forms the representation mechanism informing memory storage and retrieval, Ψ_t = [φ_t^1, φ_t^2, ..., φ_t^R, h_t]. This is the output of the generative memory model; for audio-visual multitask collaborative feature learning in which task definitions and task numbers are unknown, it can realize unsupervised feature learning of continuously input data streams.
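The cosine-similarity read described above can be sketched as follows (toy dimensions; a hand-picked sharpening constant stands in for the learned retrieval bias, and softmax normalization is an assumed, common choice for turning similarities into soft weights):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def retrieve(key, memory, sharpness=5.0):
    """Soft attention read: cosine-match the key against every memory row,
    sharpen and normalize the similarities into weights, and return the
    weighted sum of rows -- the retrieved memory phi."""
    sims = [cosine(key, row) for row in memory]
    weights = softmax([sharpness * s for s in sims])
    phi = [sum(w * row[d] for w, row in zip(weights, memory))
           for d in range(len(memory[0]))]
    return phi, weights

# Toy memory M_{t-1} with three stored rows; the key nearly matches row 1,
# so the read should concentrate its weight there.
M = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
key = [0.1, 0.9, 0.0]
phi, weights = retrieve(key, M)
```

In the described architecture the controller would issue R such keys per step and concatenate the R retrieved vectors with h_t to form Ψ_t.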
Under this model structure, the generative memory model addresses the processing demands of the corresponding multitask concurrent identification; according to the different tasks of audio-visual perception identification, combined with complicated and changeable application environments, a deep collaborative feature learning mechanism is established. On this basis, according to the comprehensive limitations of constraints such as timing memory, long-range dependency and collaborative attention regionality, a time-space-domain association optimal learning method is used to realize a deep collaborative feature learning method with long-term selective memory and forgetting ability. The theoretical research of the relevant scheme is completed through prior hypothesis, posterior reasoning and collaborative optimization solution, and the simulation verification work of the new scheme is further completed through tools such as an algorithm simulation platform.
After the method described in Embodiment 3 of the present invention completes the generic feature description for multi-source audio-visual media perception data and the sustainable deep collaborative feature learning, for the problems that scenes are complicated and changeable in the audio-visual multitask perception identification process and that the intelligent agent needs to handle multiple tasks simultaneously, a collaborative enhancement environment-adaptive computing theory based on audio-visual perceptual parameter feedback is studied, so as to solve the theoretical and technical problem of harmoniously connecting audio-visual perception with the natural environment. It mainly includes the following three parts of research content: 1) environment-adaptive perceptual parameter extraction; 2) the deep collaborative enhancement recognition mechanism of progressive networks; 3) a distributed intelligent demonstration system.
Environment-adaptive perceptual parameter extraction includes: inspired by the ability of organisms to effectively adapt to their environment, environment-adaptive computing theory interacts with the environment through an audio-visual perceptual parameter feedback mechanism and learns the optimal policy for multitask identification by maximizing accumulated reward. The extracted environment-adaptive perceptual parameters are as follows:
Brightness perceptual parameter: obtained by calculating the normalized distance between the pixel average of the image/video and standard brightness information;
Loudness perceptual parameter: obtained by calculating the normalized distance between the sound intensity average of the input audio and standard sound intensity information;
Viewing-angle perceptual parameter: obtained from the average information contained in high-frequency images; the larger the value, the richer the image detail information, i.e. the better the viewing angle;
Sound-field perceptual parameter: calculated from the average energy of the transfer function from the sound source to the ear;
Attention perceptual parameter: represented by the attention rule parameter of the audio-visual attention time zone in collaborative feature learning.
The dynamic changes of complex scenes can cause phenomena such as lighting changes, viewing-angle deflection and sound-field drift that seriously affect the performance of perception identification results. Therefore, environment-adaptive perception decision judgment should not rely on a single perceptual parameter alone; the weighted sum of the above five perceptual parameter values should be fully utilized as the integrated decision for environment-adaptive perception self-adaptive feedback.
The deep collaborative enhancement recognition mechanism of progressive networks includes: the weighted sum of the perceptual parameters serves as the environment-adaptive decision, and a progressive network collaborative recognition mechanism is established. By storing migration knowledge layer by layer and extracting valuable reward features, the mechanism decides the identification task currently to be handled, solving the problem of migrating knowledge from a simulated environment to the true environment.
As shown in Figure 4, a simple progressive network is described, in which a is an adaptive adapter whose role is to keep the hidden-layer activation values of the preceding columns consistent with the dimensionality of the original input. Its composition process is as follows:
the 1st column constructs a deep neural network to train a first task;
in order to train the 2nd task, the activation values of each hidden layer of that network are processed by the adapter and connected to the respective layers of the 2nd-column neural network as additional input;
in order to train the 3rd task, the parameters of the first two columns are fixed, the activation values of each hidden layer of the first two columns are processed by the adapter, and their combination is connected to the respective layers of the 3rd-column neural network as additional input.
If there are more task requirements, this continues by analogy. All of the above networks train their parameters through the UNREAL algorithm.
Migration knowledge is stored in a layer-by-layer progressive manner and valuable reward features are extracted to complete the knowledge migration. For a new task, the hidden-layer states of the previously trained models are retained during training, and the rewards useful to each hidden layer of the previous columns are combined layer by layer in the network, so that the transfer learning possesses a long-term-dependency prior and forms a complete policy for the final goal.
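The column-and-adapter wiring can be illustrated with a toy forward pass (hand-fixed weights, no UNREAL training, and the lateral combination is simplified to an additive adapter; real progressive networks concatenate or otherwise combine the adapted activations):

```python
def dense(x, W, b):
    """One fully connected layer with ReLU."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

class ProgressiveColumn:
    """One column of a progressive network: a stack of layers. Frozen
    earlier columns feed each layer's activation, via an adapter, into
    the corresponding layer of newer columns."""
    def __init__(self, layers, adapters=None):
        self.layers = layers            # list of (W, b) per layer
        self.adapters = adapters or []  # one (W, b) per lateral connection

    def forward(self, x, lateral_acts=None):
        acts, h = [], x
        for depth, (W, b) in enumerate(self.layers):
            h = dense(h, W, b)
            if lateral_acts is not None:
                # The adapter keeps the previous column's activation at
                # the right dimensionality before adding it as extra input.
                aW, ab = self.adapters[depth]
                lateral = dense(lateral_acts[depth], aW, ab)
                h = [hi + li for hi, li in zip(h, lateral)]
            acts.append(h)
        return h, acts

# Column 1 is trained on task 1 and frozen; column 2 reuses its features.
col1 = ProgressiveColumn([([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])])
col2 = ProgressiveColumn([([[0.5, 0.5], [0.5, -0.5]], [0.0, 0.0])],
                         adapters=[([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])])
x = [1.0, 2.0]
_, acts1 = col1.forward(x)
out2, _ = col2.forward(x, lateral_acts=acts1)
```

The point of the structure is visible even in the toy: column 2's output depends both on its own weights and, through the adapter, on what column 1 already learned.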
The distributed intelligent demonstration system includes: the intelligent demonstration system is built using a multi-agent, multi-GPU distributed co-processing system on multi-node high-performance computing. During data training, the intelligent agent composed by each GPU holds one complete copy of the network model, and each iteration is assigned only a subset of the samples. The GPUs average the gradients computed by the different GPUs through mutual communication, apply the average gradient to the weights to obtain new weights, and once a GPU completes its own iteration it must wait for all other GPUs to complete, to ensure the weights are properly updated. This is equivalent to processing SGD on a single GPU, but with the data distributed to multiple GPUs for concurrent computation, thereby gaining computation speed. Here the distributed algorithm is streamlined with high-performance-computing techniques, and a bandwidth-optimized ring is used to solve the inter-GPU communication problem.
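The synchronous data-parallel scheme just described can be sketched in a few lines (a toy one-parameter model in place of a real network; the explicit averaging step stands in for what a bandwidth-optimized ring all-reduce would compute across GPUs):

```python
def local_gradient(weights, batch):
    """Per-worker gradient of mean squared error for the toy model y = w*x."""
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return [g]

def sync_sgd_step(weights, shards, lr=0.1):
    """One synchronous data-parallel step: each 'GPU' computes a gradient
    on its own data shard, the gradients are averaged (the all-reduce),
    and every replica applies the same update to its weight copy."""
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = [sum(g[i] for g in grads) / len(grads)
           for i in range(len(weights))]
    return [w - lr * a for w, a in zip(weights, avg)]

# Data drawn from y = 3x, split across two workers.
shards = [[(1.0, 3.0), (2.0, 6.0)],
          [(3.0, 9.0), (4.0, 12.0)]]
w = [0.0]
for _ in range(200):
    w = sync_sgd_step(w, shards)
```

Because every replica sees the same averaged gradient, all copies of the model stay identical after each step, which is exactly the property the waiting (synchronization barrier) in the description guarantees.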
In conclusion the multitask coordinated recognition methods for merging audiovisual perception described in the embodiment of the present invention and system, phase
For the prior art, there is better multi-source, Dynamic persistence and space-time transformation.The number when processing multi-source is long
It is especially good according to upper effect.Specifically, having following features:
Multi-source: for the characteristics of multi-source audio-visual media perception data, a generic feature description mechanism is established in which the audio-visual media information obtained from different channels provides effective complementary support, evolving from the traditional single-source fixed mode to a multi-source elastic model that both effectively removes data redundancy and provides versatile feature descriptions for learning.
Dynamic persistence: audiovisual tasks have time-space-domain variation characteristics; conventional methods can only process preset demands, cannot perform effective long-term memory reasoning on the data already learned, and struggle to balance lightweight learning networks with high utilization. Meanwhile, when burst tasks or target data are added, over-fitting and network parameter fragmentation result. Therefore, the deep collaborative feature learning mechanism with lasting audio-visual feature memory, established for continuously input data, has a high dynamic reception rate, high resource utilization and low network consumption rate.
Space-time adaptability: in order to still maintain optimal perception identification performance under the space-time variation of complex scenes, the adaptive feedback mechanism of environment-adaptive perception should be used to realize dynamic computational adjustment to the changing environment under the environment-adaptive framework, so as to achieve the optimal adaptation effect of intelligent multitask coordinated enhancement feedback identification under massive data storage.
Integrating the above research content, a complete intelligent demonstration system is built, realizing output from audio-visual perception data acquisition through to multitask coordinated identification results, and providing a standard platform for subsequent in-depth research and productization. The experimental methodology considers features such as the high efficiency, dynamism and intelligence of audio-visual perception in multitask coordinated analysis and, combined with software engineering design specifications, uses object-oriented programming methods to design an easily extended demonstration system.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be realized by software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present invention, or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The foregoing is merely a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.
Claims (10)
1. A multi-task recognition method fusing audio-visual perception, characterized by comprising the following steps:
Step S110: generic feature description of multi-source heterogeneous data: based on a time-synchronization matching mechanism for multi-source heterogeneous data, establishing a multi-source heterogeneous data association description model based on latent high-level shared semantics, and extracting generic features of the multi-source heterogeneous data;
Step S120: deep collaborative feature learning with long-term memory: establishing a long-term dependency memory model and, in combination with an externally dependent collaborative attention mechanism, performing continuous learning on the generic features of the multi-source heterogeneous data as prior knowledge to generate a memory model;
Step S130: task discrimination based on a context-adaptive feedback evaluation mechanism: extracting environment perception parameters from the multi-source heterogeneous data, establishing a progressive network depth collaborative enhanced recognition mechanism, and, in combination with the learned features of the memory model and the task requirements, realizing multi-task recognition fusing audio-visual perception.
2. The multi-task recognition method fusing audio-visual perception according to claim 1, characterized in that in step S110, the time-synchronization matching mechanism for the multi-source heterogeneous data comprises:
extracting low-level feature streams of the multi-source heterogeneous data; establishing a coded concept stream for the data of each channel as the reference semantic coding of a complex event; and performing dynamic time warping between the low-level feature streams and the reference semantic coding to produce a time-warping function that realizes semantic alignment; wherein
extracting the low-level feature streams of the multi-source heterogeneous data comprises:
for an audio signal, first performing waveform-sampling pre-processing, then performing spectrum conversion, and building a spectrogram in combination with prosodic features;
for a two-dimensional video signal, first performing spectrum conversion, then introducing co-occurrence statistics to obtain a two-dimensional timing signal with rotation and translation invariance;
for a three-dimensional video sequence, introducing a low-level feature abstraction technique that performs fast scale-space transformation based on multi-scale theory, then performing spectrum transformation and co-occurrence statistics to generate temporal pyramid spectrum features.
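As a concrete illustration of the dynamic time warping step described above, the following sketch computes the warping cost between a reference semantic-coding stream and a feature stream and recovers the alignment path (the time-warping function). The function name, 1-D streams, and step penalties are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def dtw_align(ref, query):
    """Dynamic time warping between a reference semantic-coding stream
    and a low-level feature stream (1-D sequences for brevity)."""
    n, m = len(ref), len(query)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - query[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # insertion
                                 cost[i, j - 1])      # deletion
    # Backtrack to recover the time-warping function (alignment path).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(cost[n, m]), path[::-1]

# A repeated sample in the query aligns at zero cost.
dist, path = dtw_align([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

The returned path plays the role of the semantic-alignment function: each pair maps a reference index to a feature-stream index.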
3. The multi-task recognition method fusing audio-visual perception according to claim 2, characterized in that establishing the multi-source heterogeneous data association description model based on latent high-level shared semantics and extracting the generic features of the multi-source heterogeneous data comprise:
learning a projection matrix Θ_i for each feature type X_i so as to project the heterogeneous features to an equal feature dimensionality; through joint learning of the generic semantic feature subspaces {Θ_i}, the shared matrix W_0 under a unified framework, and the specific feature module matrices {W_i}, computing the joint optimal solution of the loss function R_1(W_0, {W_i}, {Θ_i}), the reconstruction loss function R_2({Θ_i}) and the regularization function R_3(W_0, {W_i}), thereby establishing heterogeneous feature learning with shared semantics;
specifically,
for S classes of heterogeneous features, the feature matrix of the n_i training samples is denoted accordingly, the data-noise part is E, and Γ is a rotation factor; the optimization function established under the orthogonality constraint is:
wherein λ denotes the shared-matrix coefficient, T denotes matrix transposition, Y_i denotes the i-th feature class label, F denotes the Frobenius norm, Θ_i^T denotes the transpose of the projection matrix Θ_i, α, β, μ_1 and μ_2 are multiplier factors, rank(X) is the rank of the feature matrix X, and E is the noise matrix;
self-learned labels are integrated as transfer learning, migrating labels from labeled to unlabeled data within the multi-source heterogeneous data; the unlabeled data are denoted as the object set and the labeled data as the supplement set, and the object set performs autonomous feature-label learning through joint optimization with {Θ_i}; denoting the supplement-set sample feature descriptions and label information, and the object-set sample feature descriptions and label information, respectively, the transfer self-labeling learning model is expressed as follows:
wherein F(·) is the objective function and ρ is a multiplier factor; the transfer self-labeling learning model is solved by a three-stage optimization algorithm to obtain the generic feature description.
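The core idea of the claim — projecting heterogeneous feature types into a subspace of equal dimensionality via per-type projection matrices Θ_i — can be sketched as follows. This is a simplified stand-in: each Θ_i here is taken from an SVD of X_i rather than from the joint optimization of {Θ_i}, W_0 and {W_i} described in the claim, and all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def project_to_shared_space(feature_sets, dim):
    """Project each heterogeneous feature matrix X_i (n x d_i) into a
    common dim-dimensional subspace with its own projection matrix
    Theta_i (here: top right singular vectors of X_i)."""
    projected = []
    for X in feature_sets:
        # Vt rows are right singular vectors; Theta_i is d_i x dim.
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        Theta = Vt[:dim].T
        projected.append(X @ Theta)
    return projected

audio = rng.normal(size=(50, 40))    # e.g. 40-D audio features
video = rng.normal(size=(50, 128))   # e.g. 128-D video features
shared = project_to_shared_space([audio, video], dim=16)
```

After projection both modalities live in the same 16-D space, which is the precondition for learning a shared matrix W_0 across them.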
4. The multi-task recognition method fusing audio-visual perception according to claim 3, characterized in that the deep collaborative feature learning with long-term memory comprises:
the long-term dependency memory model comprises a generic feature description set e_{≤T} = {e_1, e_2, …, e_T} and a corresponding latent variable set z_{≤T} = {z_1, z_2, …, z_T}; a transition map h_t = f_h(h_{t-1}, e_t, z_t) corrects the deterministic hidden-state variable h_t at each time point; the prior mapping function f_z(h_{t-1}) describes the nonlinear dependency on past observations and latent variables and provides the latent-variable distribution parameters; the nonlinear observation mapping function f_e(z_t, h_{t-1}) provides the likelihood function depending on the latent variables and the state; the sequential variational autoencoder is modified with an external memory model to generate a memory text ψ_t at each time point, and the prior information and posterior information are obtained as follows:
Prior information:
Posterior information:
wherein f_z^μ is the transition mapping function for the μ parameter of the latent variable z, f_z^σ is the transition mapping function for the σ parameter of the latent variable z, f_q^μ is the transition mapping function for the μ parameter of the posterior probability q, and f_q^σ is the transition mapping function for the σ parameter of the posterior probability q; the prior information is a diagonal Gaussian distribution function of the memory text depending on the prior map f_z, and the diagonal Gaussian approximate posterior distribution depends on the memory text Ψ_{t-1} associated through the posterior mapping function f_q and on the current observation e_t.
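A minimal sketch of the diagonal-Gaussian prior and posterior parameterization described in the claim, assuming small tanh networks for the mapping functions f_z and f_q. The dimensions, weight shapes, and the choice of networks are invented for illustration; only the structure (prior conditioned on h_{t-1}, posterior conditioned additionally on the memory text Ψ_{t-1} and observation e_t) follows the claim:

```python
import numpy as np

rng = np.random.default_rng(0)

H, Z = 8, 4  # hidden-state and latent dimensions (illustrative)
Wmu = rng.normal(size=(H, Z))
Wsig = rng.normal(size=(H, Z))
Wq = rng.normal(size=(H + H + Z, 2 * Z))  # posterior sees more inputs

def prior(h_prev):
    """Diagonal Gaussian prior for z_t given h_{t-1} (role of f_z)."""
    mu = np.tanh(h_prev @ Wmu)
    sigma = np.exp(0.5 * np.tanh(h_prev @ Wsig))  # strictly positive
    return mu, sigma

def posterior(h_prev, memory_text, e_t):
    """Diagonal Gaussian posterior given h_{t-1}, the previous memory
    text Psi_{t-1}, and the current observation e_t (role of f_q)."""
    x = np.concatenate([h_prev, memory_text, e_t])
    out = np.tanh(x @ Wq)
    return out[:Z], np.exp(0.5 * out[Z:])

mu_p, sig_p = prior(np.zeros(H))
mu_q, sig_q = posterior(np.zeros(H), np.zeros(H), np.zeros(Z))
```

With all-zero inputs both distributions collapse to a standard-normal-like parameterization (zero mean, unit scale), which is a convenient sanity check.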
5. The multi-task recognition method fusing audio-visual perception according to claim 4, characterized in that the deep collaborative feature learning with long-term memory further comprises:
using cooperative-mode perception theory, computing the temporal memory bias generated by the generic features under the influence of tasks, and generating, from the temporal memory bias and the generic features, adaptive perception attention time regions relevant to the recognition task;
using a long short-term memory network (LSTM) f_rnn to advance the state history h_t; the external memory M_t is generated from the latent variable of the previous moment and the external text information c_t, with the state update model:
State update: (h_t, M_t) = f_rnn(h_{t-1}, M_{t-1}, z_{t-1}, c_t)
to form the r items of content information retrieved from the memory M_t, a key value is introduced; by cosine-similarity evaluation the key is compared with each row of the memory M_{t-1} to generate attention weights, and the retrieved memory is obtained as the weighted sum of the attention weights and the memory M_{t-1}; wherein,
Key value:
Attention mechanism:
Retrieval memory:
Generate memory:
wherein the key-value function of the r items advances the state history, f_att is the attention-mechanism function, the memory weight is that of the i-th slot of the r items at time t, the retrieval-memory equation yields the retrieved result, ⊙ denotes element-wise multiplication, the offset-related value is learned through memory retrieval, and σ(·) is the sigmoid function; together these form the representation mechanism informing memory storage and retrieval, whose result serves as the output of the generative memory model.
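The cosine-similarity memory read in this claim — compare a key with each memory row, normalize into attention weights, and return the weighted sum of rows as the retrieved memory — can be sketched as follows. The slot count, dimensions, and the use of a softmax for normalization are assumptions for illustration:

```python
import numpy as np

def cosine(key, M):
    """Cosine similarity between a key vector and each row of memory M."""
    return (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_memory(key, M):
    """Attention read: similarity -> attention weights -> weighted sum."""
    w = softmax(cosine(key, M))   # one attention weight per memory slot
    return w @ M, w               # retrieved memory and its weights

# Three 2-D memory slots; the key matches slot 0 most closely.
M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
phi, w = read_memory(np.array([1.0, 0.0]), M)
```

The retrieved vector `phi` corresponds to the weighted sum of attention weights and memory described in the claim.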
6. The multi-task recognition method fusing audio-visual perception according to claim 5, characterized in that extracting the environment perception parameters from the multi-source heterogeneous data and establishing the progressive network depth collaborative enhanced recognition mechanism comprise:
obtaining a brightness perception parameter by computing the normalized distance between the pixel average of the image/video and standard luminance information; obtaining a loudness perception parameter by computing the normalized distance between the average sound intensity of the input audio and standard sound-intensity information; computing a viewpoint perception parameter from the average information content of the high-frequency image, a larger value indicating richer image detail and hence a better viewpoint; computing a sound-field perception parameter from the average energy of the transfer function from the sound source to the ear; and representing an attention perception parameter by the attention-rule parameter of the audio-visual attention time regions in heterogeneous feature learning;
taking the weighted sum of the brightness perception parameter, the loudness perception parameter, the viewpoint perception parameter, the sound-field perception parameter and the attention perception parameter as the context-adaptive decision, and establishing a progressive network depth collaborative enhanced recognition mechanism that stores transferred knowledge layer by layer and extracts reward features to decide the recognition task currently to be processed.
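A toy version of the context-adaptive decision in this claim: each perception parameter is normalized to [0, 1] and the decision score is their weighted sum. The equal weights and the particular brightness normalization are assumptions for illustration, not values from the patent:

```python
import numpy as np

def brightness_param(frame, standard=0.5):
    """Normalized distance between mean pixel intensity and a reference
    luminance; pixel values assumed in [0, 1], larger = closer match."""
    return 1.0 - abs(float(frame.mean()) - standard)

def context_decision(brightness, loudness, viewpoint, soundfield, attention,
                     weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Weighted sum of the five perception parameters, used as the
    context-adaptive decision score (equal weights are an assumption)."""
    params = (brightness, loudness, viewpoint, soundfield, attention)
    return sum(w * p for w, p in zip(weights, params))

frame = np.full((4, 4), 0.5)   # a frame exactly at standard luminance
score = context_decision(brightness_param(frame), 0.8, 0.6, 0.7, 0.9)
```

In the mechanism described above, this score would then drive which recognition task is prioritized for the current input.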
7. A multi-task cooperative recognition system fusing audio-visual perception, characterized by comprising a generic feature extraction module, a collaborative feature learning module, and a context-adaptive feedback evaluation recognition module;
the generic feature extraction module is configured to, based on a time-synchronization matching mechanism for multi-source heterogeneous data, establish a multi-source heterogeneous data association description model based on latent high-level shared semantics and extract generic features of the multi-source heterogeneous data;
the collaborative feature learning module is configured to establish a long-term dependency memory model and, in combination with an externally dependent collaborative attention mechanism, perform continuous learning on the generic features as prior knowledge to generate a memory model;
the context-adaptive feedback evaluation recognition module is configured to extract environment perception parameters from the multi-source heterogeneous data, establish a progressive network depth collaborative enhanced recognition mechanism, and realize multi-task recognition in combination with the learned features of the memory model and the task requirements.
8. The multi-task cooperative recognition system fusing audio-visual perception according to claim 7, characterized in that the generic feature extraction module comprises a time-synchronization submodule and a shared semantic association feature description submodule;
the time-synchronization submodule is configured to combine the low-level features of the multi-source heterogeneous data and, through a probability- and knowledge-driven framework, establish a multi-source heterogeneous data time-synchronization alignment mechanism with scale, translation, rotation, and time invariance;
the shared semantic association feature description submodule is configured to establish, according to a semantic vector mechanism and a multi-source information association mining mechanism, the shared semantic features of the synchronously obtained multi-source heterogeneous data and extract generic feature streams.
9. The multi-task cooperative recognition system fusing audio-visual perception according to claim 8, characterized in that the collaborative feature learning module comprises a long-term dependency generative memory model submodule and a deep collaborative feature learning model submodule;
the long-term dependency generative memory model submodule is configured to extract the generic features of the multi-source heterogeneous data for storage as prior knowledge, and establish an external memory generative model in combination with long-term data dependencies;
the deep collaborative feature learning model submodule is configured to, in combination with the externally dependent collaborative attention mechanism, perform continuous learning on the generic features as prior knowledge, output recognition features as posterior knowledge, and generate the memory model.
10. The multi-task cooperative recognition system fusing audio-visual perception according to claim 9, characterized in that the context-adaptive feedback evaluation recognition module comprises a context-adaptive perception feedback evaluation system submodule and a deep collaborative enhanced joint recognition mechanism submodule;
the context-adaptive perception feedback evaluation system submodule is configured to extract environment perception parameters, organically fuse the environment perception parameters with the recognition features, and realize weighted, updated layering of the recognition tasks;
the deep collaborative enhanced joint recognition mechanism submodule is configured to extract the generic feature descriptions of the multi-source heterogeneous data according to the environment perception parameters and the weights of the recognition tasks, and output recognition results.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810746362.3A CN108804715A (en) | 2018-07-09 | 2018-07-09 | Merge multitask coordinated recognition methods and the system of audiovisual perception |
CN201910312615.0A CN109947954B (en) | 2018-07-09 | 2019-04-18 | Multitask collaborative identification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810746362.3A CN108804715A (en) | 2018-07-09 | 2018-07-09 | Merge multitask coordinated recognition methods and the system of audiovisual perception |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108804715A true CN108804715A (en) | 2018-11-13 |
Family
ID=64074892
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810746362.3A Withdrawn CN108804715A (en) | 2018-07-09 | 2018-07-09 | Merge multitask coordinated recognition methods and the system of audiovisual perception |
CN201910312615.0A Active CN109947954B (en) | 2018-07-09 | 2019-04-18 | Multitask collaborative identification method and system |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910312615.0A Active CN109947954B (en) | 2018-07-09 | 2019-04-18 | Multitask collaborative identification method and system |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN108804715A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726903A (en) * | 2018-12-19 | 2019-05-07 | 中国电子科技集团公司信息科学研究院 | Distributed multi agent Collaborative Decision Making Method based on attention mechanism |
CN110379416A (en) * | 2019-08-15 | 2019-10-25 | 腾讯科技(深圳)有限公司 | A kind of neural network language model training method, device, equipment and storage medium |
CN111145538A (en) * | 2019-12-06 | 2020-05-12 | 齐鲁交通信息集团有限公司 | Stereo perception system suitable for audio and video acquisition, recognition and monitoring on highway |
CN111859267A (en) * | 2020-06-22 | 2020-10-30 | 复旦大学 | Operation method of privacy protection machine learning activation function based on BGW protocol |
CN112257785A (en) * | 2020-10-23 | 2021-01-22 | 中科院合肥技术创新工程院 | Serialized task completion method and system based on memory consolidation mechanism and GAN model |
CN112388627A (en) * | 2019-08-19 | 2021-02-23 | 维布络有限公司 | Method and system for executing tasks in dynamic heterogeneous robot environment |
CN112529184A (en) * | 2021-02-18 | 2021-03-19 | 中国科学院自动化研究所 | Industrial process optimization decision method fusing domain knowledge and multi-source data |
CN112580806A (en) * | 2020-12-29 | 2021-03-30 | 中国科学院空天信息创新研究院 | Neural network continuous learning method and device based on task domain knowledge migration |
NL2026432A (en) * | 2019-09-09 | 2021-05-11 | Shenzhen Demio Tech Co Ltd | Multi-source target tracking method for complex scenes |
CN112883256A (en) * | 2021-01-11 | 2021-06-01 | 北京达佳互联信息技术有限公司 | Multitasking method and device, electronic equipment and storage medium |
CN112951218A (en) * | 2021-03-22 | 2021-06-11 | 百果园技术(新加坡)有限公司 | Voice processing method and device based on neural network model and electronic equipment |
CN113344085A (en) * | 2021-06-16 | 2021-09-03 | 东南大学 | Balanced-bias multi-source data collaborative optimization and fusion method and device |
CN113837121A (en) * | 2021-09-28 | 2021-12-24 | 中国科学技术大学先进技术研究院 | Epidemic prevention robot vision and hearing collaborative perception method and system based on brain-like |
CN116884404A (en) * | 2023-09-08 | 2023-10-13 | 北京中电慧声科技有限公司 | Multitasking voice semantic communication method, device and system |
CN116996844A (en) * | 2023-07-07 | 2023-11-03 | 中国科学院脑科学与智能技术卓越创新中心 | Multi-point communication method and device for describing and predicting event |
CN117194900A (en) * | 2023-09-25 | 2023-12-08 | 中国铁路成都局集团有限公司成都供电段 | Equipment operation lightweight monitoring method and system based on self-adaptive sensing |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110908986B (en) * | 2019-11-08 | 2020-10-30 | 欧冶云商股份有限公司 | Layering method and device for computing tasks, distributed scheduling method and device and electronic equipment |
CN111488840A (en) * | 2020-04-15 | 2020-08-04 | 桂林电子科技大学 | Human behavior classification method based on multi-task learning model |
CN111598107B (en) * | 2020-04-17 | 2022-06-14 | 南开大学 | Multi-task joint detection method based on dynamic feature selection |
CN113282933B (en) * | 2020-07-17 | 2022-03-01 | 中兴通讯股份有限公司 | Federal learning method, device and system, electronic equipment and storage medium |
CN112329948B (en) * | 2020-11-04 | 2024-05-10 | 腾讯科技(深圳)有限公司 | Multi-agent strategy prediction method and device |
CN113377884B (en) * | 2021-07-08 | 2023-06-27 | 中央财经大学 | Event corpus purification method based on multi-agent reinforcement learning |
CN114155496B (en) * | 2021-11-29 | 2024-04-26 | 西安烽火软件科技有限公司 | Vehicle attribute multitasking collaborative recognition method based on self-attention |
WO2024103345A1 (en) * | 2022-11-17 | 2024-05-23 | 中国科学院深圳先进技术研究院 | Multi-task cognitive brain-inspired modeling method |
CN116028620B (en) * | 2023-02-20 | 2023-06-09 | 知呱呱(天津)大数据技术有限公司 | Method and system for generating patent abstract based on multi-task feature cooperation |
CN115985402B (en) * | 2023-03-20 | 2023-09-19 | 北京航空航天大学 | Cross-modal data migration method based on normalized flow theory |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103530619B (en) * | 2013-10-29 | 2016-08-31 | 北京交通大学 | Gesture identification method based on a small amount of training sample that RGB-D data are constituted |
US10013640B1 (en) * | 2015-12-21 | 2018-07-03 | Google Llc | Object recognition from videos using recurrent neural networks |
CN105893612A (en) * | 2016-04-26 | 2016-08-24 | 中国科学院信息工程研究所 | Consistency expression method for multi-source heterogeneous big data |
CN106447625A (en) * | 2016-09-05 | 2017-02-22 | 北京中科奥森数据科技有限公司 | Facial image series-based attribute identification method and device |
CN106971200A (en) * | 2017-03-13 | 2017-07-21 | 天津大学 | A kind of iconic memory degree Forecasting Methodology learnt based on adaptive-migration |
CN107563407B (en) * | 2017-08-01 | 2020-08-14 | 同济大学 | Feature representation learning system for multi-modal big data of network space |
CN107506712B (en) * | 2017-08-15 | 2021-05-18 | 成都考拉悠然科技有限公司 | Human behavior identification method based on 3D deep convolutional network |
CN108229066A (en) * | 2018-02-07 | 2018-06-29 | 北京航空航天大学 | A kind of Parkinson's automatic identifying method based on multi-modal hyper linking brain network modelling |
- 2018-07-09: CN application CN201810746362.3A, published as CN108804715A (not active, withdrawn)
- 2019-04-18: CN application CN201910312615.0A, published as CN109947954B (active)
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726903A (en) * | 2018-12-19 | 2019-05-07 | 中国电子科技集团公司信息科学研究院 | Distributed multi agent Collaborative Decision Making Method based on attention mechanism |
CN110379416A (en) * | 2019-08-15 | 2019-10-25 | 腾讯科技(深圳)有限公司 | A kind of neural network language model training method, device, equipment and storage medium |
CN110379416B (en) * | 2019-08-15 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Neural network language model training method, device, equipment and storage medium |
CN112388627A (en) * | 2019-08-19 | 2021-02-23 | 维布络有限公司 | Method and system for executing tasks in dynamic heterogeneous robot environment |
NL2026432A (en) * | 2019-09-09 | 2021-05-11 | Shenzhen Demio Tech Co Ltd | Multi-source target tracking method for complex scenes |
CN111145538A (en) * | 2019-12-06 | 2020-05-12 | 齐鲁交通信息集团有限公司 | Stereo perception system suitable for audio and video acquisition, recognition and monitoring on highway |
CN111859267A (en) * | 2020-06-22 | 2020-10-30 | 复旦大学 | Operation method of privacy protection machine learning activation function based on BGW protocol |
CN111859267B (en) * | 2020-06-22 | 2024-04-26 | 复旦大学 | Operation method of privacy protection machine learning activation function based on BGW protocol |
CN112257785A (en) * | 2020-10-23 | 2021-01-22 | 中科院合肥技术创新工程院 | Serialized task completion method and system based on memory consolidation mechanism and GAN model |
CN112580806A (en) * | 2020-12-29 | 2021-03-30 | 中国科学院空天信息创新研究院 | Neural network continuous learning method and device based on task domain knowledge migration |
CN112883256A (en) * | 2021-01-11 | 2021-06-01 | 北京达佳互联信息技术有限公司 | Multitasking method and device, electronic equipment and storage medium |
CN112883256B (en) * | 2021-01-11 | 2024-05-17 | 北京达佳互联信息技术有限公司 | Multitasking method, apparatus, electronic device and storage medium |
CN112529184A (en) * | 2021-02-18 | 2021-03-19 | 中国科学院自动化研究所 | Industrial process optimization decision method fusing domain knowledge and multi-source data |
CN112529184B (en) * | 2021-02-18 | 2021-07-02 | 中国科学院自动化研究所 | Industrial process optimization decision method fusing domain knowledge and multi-source data |
US11409270B1 (en) | 2021-02-18 | 2022-08-09 | Institute Of Automation, Chinese Academy Of Sciences | Optimization decision-making method of industrial process fusing domain knowledge and multi-source data |
CN112951218B (en) * | 2021-03-22 | 2024-03-29 | 百果园技术(新加坡)有限公司 | Voice processing method and device based on neural network model and electronic equipment |
CN112951218A (en) * | 2021-03-22 | 2021-06-11 | 百果园技术(新加坡)有限公司 | Voice processing method and device based on neural network model and electronic equipment |
CN113344085A (en) * | 2021-06-16 | 2021-09-03 | 东南大学 | Balanced-bias multi-source data collaborative optimization and fusion method and device |
CN113344085B (en) * | 2021-06-16 | 2024-04-26 | 东南大学 | Balance bias multi-source data collaborative optimization and fusion method and device |
CN113837121B (en) * | 2021-09-28 | 2024-03-01 | 中国科学技术大学先进技术研究院 | Epidemic prevention robot visual and visual sense cooperative sensing method and system based on brain-like |
CN113837121A (en) * | 2021-09-28 | 2021-12-24 | 中国科学技术大学先进技术研究院 | Epidemic prevention robot vision and hearing collaborative perception method and system based on brain-like |
CN116996844A (en) * | 2023-07-07 | 2023-11-03 | 中国科学院脑科学与智能技术卓越创新中心 | Multi-point communication method and device for describing and predicting event |
CN116884404B (en) * | 2023-09-08 | 2023-12-15 | 北京中电慧声科技有限公司 | Multitasking voice semantic communication method, device and system |
CN116884404A (en) * | 2023-09-08 | 2023-10-13 | 北京中电慧声科技有限公司 | Multitasking voice semantic communication method, device and system |
CN117194900A (en) * | 2023-09-25 | 2023-12-08 | 中国铁路成都局集团有限公司成都供电段 | Equipment operation lightweight monitoring method and system based on self-adaptive sensing |
Also Published As
Publication number | Publication date |
---|---|
CN109947954A (en) | 2019-06-28 |
CN109947954B (en) | 2021-05-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804715A (en) | Merge multitask coordinated recognition methods and the system of audiovisual perception | |
CN111930992B (en) | Neural network training method and device and electronic equipment | |
CN108846384A (en) | Merge the multitask coordinated recognition methods and system of video-aware | |
CN110532996A (en) | The method of visual classification, the method for information processing and server | |
CN109919078A (en) | A kind of method, the method and device of model training of video sequence selection | |
CN106909938A (en) | Viewing angle independence Activity recognition method based on deep learning network | |
CN106022294A (en) | Intelligent robot-oriented man-machine interaction method and intelligent robot-oriented man-machine interaction device | |
Emmeche | At home in a complex world: Lessons from the frontiers of natural science | |
Han et al. | Internet of emotional people: Towards continual affective computing cross cultures via audiovisual signals | |
CN112905762A (en) | Visual question-answering method based on equal attention-deficit-diagram network | |
Tan et al. | Style interleaved learning for generalizable person re-identification | |
CN113657272B (en) | Micro video classification method and system based on missing data completion | |
CN113423005B (en) | Intelligent music generation method and system based on improved neural network | |
Wang et al. | TC3KD: Knowledge distillation via teacher-student cooperative curriculum customization | |
Ye et al. | [Retracted] IoT‐Based Wearable Sensors and Bidirectional LSTM Network for Action Recognition of Aerobics Athletes | |
Zhang et al. | Local-global graph pooling via mutual information maximization for video-paragraph retrieval | |
Saleem et al. | Stateful human-centered visual captioning system to aid video surveillance | |
CN116244473B (en) | Multi-mode emotion recognition method based on feature decoupling and graph knowledge distillation | |
CN116737897A (en) | Intelligent building knowledge extraction model and method based on multiple modes | |
CN108764459B (en) | Target recognition network design method based on semantic definition | |
Usman et al. | Skeleton-based motion prediction: A survey | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
Yadikar et al. | A Review of Knowledge Distillation in Object Detection | |
CN113792626A (en) | Teaching process evaluation method based on teacher non-verbal behaviors | |
Ji et al. | [Retracted] Analysis of the Impact of the Development Level of Aerobics Movement on the Public Health of the Whole Population Based on Artificial Intelligence Technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20181113 |
WW01 | Invention patent application withdrawn after publication |