CN109685196B - Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion

Info

Publication number
CN109685196B
CN109685196B
Authority
CN
China
Prior art keywords
node
layer
visual
symbol
sample
Prior art date
Legal status
Active
Application number
CN201811527643.6A
Other languages
Chinese (zh)
Other versions
CN109685196A (en)
Inventor
马昕
黄珂
荣学文
宋锐
田新诚
李贻斌
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201811527643.6A priority Critical patent/CN109685196B/en
Publication of CN109685196A publication Critical patent/CN109685196A/en
Application granted granted Critical
Publication of CN109685196B publication Critical patent/CN109685196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/008Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Robotics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an autonomous cognitive development system and method based on an incremental associative neural network and dynamic audio-visual fusion, comprising a three-layer network structure of a sample layer, a symbol layer and an association layer; the three-layer network structure comprises a visual pathway and an auditory pathway. In the visual pathway, the sample layer learns the original shape and color features of an object and clusters them autonomously; the symbol layer receives the autonomous clustering results of the shape and color sample layers and abstracts them into corresponding symbols. In the auditory pathway, the sample layer learns the word vectors of names; the symbol layer receives the word-vector category of a name and reduces it to a symbol. The association layer establishes associations between symbols in the visual and auditory pathways and feeds response signals back to the lower-layer networks according to known associations. Based on self-organizing neural networks, the system can autonomously develop object concepts and realize audio-visual fusion.

Description

Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an autonomous cognitive development system and method based on an incremental associative neural network and dynamic audio-visual fusion.
Background
With more and more robots participating in the daily life of human beings, cognitive development has become a hot topic in the field of intelligent robots. Recognition and understanding are required in robot-human communication, so a knowledge base shared with humans, such as object concepts, must be established. Robots typically rely on internal knowledge pre-designed by humans and cannot adapt to unknown, dynamic environments. To solve this problem, a robot needs the ability to develop and recognize autonomously, like a human infant.
Human infants, using their own cognitive principles and parental guidance, are able to develop representations of the world rapidly before the age of two, forming original sample representations of objects and gradually developing symbolic representations from simple to complex. The entire process is multi-modal, and when several modalities occur simultaneously, the brain establishes internal associations between them. This facilitates the development of complete object concepts during an infant's cognitive development. Thus, an infant can learn about objects from simple features (such as shape and color) and map these visual representations to the names taught by its parents.
Currently, some studies have applied infant cognitive development theory or brain mechanisms to robot cognitive development. For example: learning sample and symbolic representations of objects using an SVM; the Development Environment-Refraction (Dev E-R) model, which can simulate the human "assimilation-adaptation" regulation process; and robots that learn new knowledge using salient object detection and genetic algorithms. However, these methods have the following disadvantages:
first, most learning processes are offline and take a lot of time to train the model;
second, the parameters or structure of the learning model are predefined and need to be retrained each time a new sample is encountered;
third, the robot cannot develop cognition through human-computer interaction and establish a common knowledge base with humans.
Therefore, autonomous cognitive development remains a great challenge for robots. The cognitive development process of infants, however, offers a useful reference. An infant gradually comes to know the world by observing objects and listening to the names spoken by its parents; a robot can imitate this process, learning object concepts through human-computer interaction and improving its own intelligence. This process mainly involves audio-visual fusion and open-ended incremental learning.
Of the many audio-visual fusion studies that have been proposed, most are directed at target detection and recognition, and few address cognitive development. For example, some fusion networks use two deep neural network branches to learn visual images and sounds separately, and fuse information by concatenating the feature vectors of the two modalities. However, these computational models have fixed topologies and require offline training with large amounts of data. This exposes another problem of multi-modal fusion: how to design a general multi-modal learning algorithm, so that a specific structure need not be designed for each modality. In the prior art, SOMs have been used to learn information from three modalities (vision, hearing and posture), with the posture branch serving as a bridge for conversion between modalities, thereby realizing multi-modal fusion. However, the SOM is also a fixed-topology network requiring a predefined number of nodes, which greatly limits the learning ability of the robot. Therefore, a robot's cognitive algorithm should be general across modalities and should dynamically expand its network as learned knowledge grows.
Although the SOM cannot meet all of these requirements, incremental self-organizing neural networks can make up for its deficiencies. GNG can learn new classes in an online manner and gradually expand its network nodes, thereby enabling incremental learning, but its fixed iteration scheme may cause the network to react too slowly to new inputs. GWR learns faster than GNG: it inserts a new node when a new sample exceeds the activation threshold and the best matching node has been activated multiple times. Another advantage of GWR is that the final weight of a node is stabilized by a strategy of adjustable, decaying learning rates. SOINN is also a very efficient incremental self-organizing neural network. Its biggest difference from GWR is that a new node is represented directly by the input vector, whereas GWR represents it by the average of the input vector and the best-matching node's weights, which distorts the true representation of the new sample.
Some research has applied incremental self-organizing neural networks to multimodal fusion. For example: a two-layer connection structure based on SOMs for fusing the spatial position, shape and color of an object; and a hierarchical GWR structure for fusing multi-modal action representations, whose fusion strategy, however, is to concatenate the weights of lower-layer network neurons, which increases the dimensionality of the higher-layer neurons. In addition, that architecture sets a fixed similarity threshold for all GWR nodes. It is difficult for an experimenter to set an appropriate threshold for all categories, which puts the network into a quantity-quality dilemma. GAM can overcome this GWR disadvantage: it uses connections between nodes to establish associations between modalities and can dynamically adjust the similarity threshold of each node. However, GAM employs supervised learning, where the class of each sample is known, so the network only needs to consider intra-class distances. STAR-SOINN and M-SOINN consider only inter-class distances. PCN enables online multimodal concept acquisition and binding, but during learning it must rely on extensive human guidance to make decisions. Thus, there is little existing research that learns object classes and within-class instances simultaneously in an unsupervised manner.
Cognitive structures based on hierarchical object representations and extensions of the Latent Dirichlet Allocation model also take classification as their main task and require predefined parameters. Although a two-way learning structure has also been studied, its high-level information can only recall the associated part and cannot use known experience to guide the clustering of the lower-level network.
Disclosure of Invention
In order to solve the above problems, the invention provides an autonomous cognitive development system and method based on an incremental associative neural network and dynamic audio-visual fusion, which can simulate the cognitive development process of infants, learn multi-modal object concepts, and establish associations between objects and names.
In order to achieve the purpose, the invention adopts the following technical scheme:
disclosed in one or more embodiments is an autonomic cognitive development system based on incremental associative neural networks and dynamic audiovisual fusion, comprising: a sample layer, a symbol layer and an associated layer three-layer network structure; the three-layer network structure comprises a visual path and an auditory path;
in the visual pathway:
the sample layer is used for respectively learning the original shape and the color characteristics of the object and carrying out autonomous clustering;
the symbol layer receives the autonomous clustering results of the shape and color sample layers and abstracts the autonomous clustering results into corresponding symbols;
in the auditory pathway:
a sample layer for learning word vectors of names;
the symbol layer receives the word vector type of the name and simplifies the word vector type into a symbol;
and the association layer establishes association relationship between symbols in the visual path and the auditory path and feeds back a response signal to the lower layer network according to the known association relationship.
Further, the sample layer of the visual pathway extracts a shape-normalized Fourier descriptor and a color histogram of the object as visual features, and constructs two networks to represent their dedicated regions; the activation function of the network is defined according to a difference rate, and the learning model of the visual pathway sample layer adopts a dynamic adaptive similarity threshold strategy so that the network clusters according to the data.
Further, the auditory pathway sample layer learns the word vectors of names and computes the difference between word vectors using the Levenshtein distance as the activation function; when a word vector is received, the best matching node is found; if the two word vectors are identical, the node is updated by increasing its instance count, and if not, a new node is created.
Furthermore, the learning algorithm of the symbol layer adopts incremental competitive learning: starting from an empty network, a new symbol node is added whenever an unknown class transmitted by the sample layer is encountered.
Disclosed in one or more embodiments is an autonomous cognitive development method based on an incremental associative neural network and dynamic audio-visual fusion, comprising:
the visual pathway sample layer learns the original shape and color features of the object respectively and performs autonomous clustering;

the visual pathway symbol layer receives the autonomous clustering results of the shape or color sample layer and abstracts them into corresponding symbols;

the auditory pathway sample layer learns the word vectors of names;

the auditory pathway symbol layer receives the word-vector category of the name and reduces it to a symbol;

the association layer establishes associations between the symbols in the visual pathway and the auditory pathway.
Further, the visual pathway sample layer learns the original shape and color features of the object respectively and performs autonomous clustering, with the following specific process:

(1) inputting a sample x into the sample layer;

(2) if the visual sample network is empty, adding the first node with its category and instance count; returning to step (1);

(3) if the sample layer has only one node, calculating the difference rate with respect to that node; if the difference rate is smaller than the set minimum difference rate, updating the weight of the node and increasing its instance count; otherwise, creating a new node;

(4) if the number of nodes in the sample layer is more than one, finding the node that best matches the sample x and calculating the difference rate; if the difference rate is larger than the set maximum difference rate, creating a new class node;

if the difference rate is smaller than the set minimum difference rate, or the distance between the input sample and the best matching node is smaller than the intra-class threshold, updating the best matching node and its neighborhood and updating the instance count; otherwise, checking whether the node fusion condition is satisfied: if the input can be merged with the best matching node, updating the best matching node; if they cannot be merged, calculating the distance between the two; if the distance exceeds the inter-class threshold, creating a new class node, and otherwise the network creates a node of the same class as the best matching node;

(5) updating the synaptic effects of the best matching node and its neighborhood;

(6) transmitting the determined category information to the visual pathway symbol layer, waiting for the response information fed back by the association layer, and then adjusting the current learning result.
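For illustration only, the decision logic of steps (1)-(4) can be sketched in Python as below. This is a minimal reconstruction under assumed names (VisualSampleLayer, a_low, a_high and lr are all hypothetical, not the patent's identifiers); the intermediate case is simplified to creating an intra-class node, and the fusion check, threshold updates and association-layer feedback are omitted.

    import numpy as np

    class Node:
        def __init__(self, w, category):
            self.w = np.asarray(w, dtype=float)   # weight vector
            self.category = category              # class label assigned by the layer
            self.ins_num = 1                      # instances merged into this node

    class VisualSampleLayer:
        # Minimal sketch of the difference-rate decisions; not the full DT-SOINN.
        def __init__(self, a_low=0.1, a_high=0.5, lr=0.1):
            self.nodes = []        # the network starts empty
            self.a_low = a_low     # minimum difference rate (update decision)
            self.a_high = a_high   # maximum difference rate (new-class decision)
            self.lr = lr           # learning rate for weight updates

        def learn(self, x):
            x = np.asarray(x, dtype=float)
            if not self.nodes:                                   # step (2): empty network
                self.nodes.append(Node(x, category=1))
                return 1
            b = min(self.nodes, key=lambda n: np.linalg.norm(x - n.w))
            a_b = np.linalg.norm(x - b.w) / np.linalg.norm(b.w)  # difference rate
            if a_b > self.a_high:                                # very different: new class
                new = Node(x, category=max(n.category for n in self.nodes) + 1)
                self.nodes.append(new)
                return new.category
            if a_b < self.a_low:                                 # very similar: update b
                b.w += self.lr * (x - b.w)
                b.ins_num += 1
                return b.category
            self.nodes.append(Node(x, category=b.category))      # simplified middle case
            return b.category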
Further, the auditory pathway sample layer learns the word vectors of names, specifically:

(1) inputting a name sample x into the sample layer;

(2) if the auditory sample layer network is empty, adding the first node with its category and instance count, and returning to step (1);

(3) if the auditory sample layer network is not empty, finding the best matching node and calculating the Levenshtein distance;

(4) if the Levenshtein distance between the two word vectors is zero, the two word vectors are identical and the node is updated by increasing its instance count; otherwise, a new node is created;

(5) passing the recognition result to the auditory pathway symbol layer.
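As a concrete illustration of step (4), under the standard definition of the Levenshtein distance (minimum number of insertions, deletions and substitutions) and with purely hypothetical recognition results: L("apple", "apple") = 0, so the existing node is updated, while L("apple", "apples") = 1 > 0, so a new node is created.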
Further, the visual pathway symbol layer receives the autonomous clustering result of the shape or color sample layer and abstracts it into a corresponding symbol, specifically:

(1) initializing an empty visual symbol layer and receiving a class number l ∈ N+ from the visual sample layer;

(2) combining the number l and the corresponding feature symbol f to form the symbol f_l;

(3) if the combined symbol does not exist, the symbol layer creates a new node; if the combined symbol has been learned before, activating the corresponding symbol node and increasing its instance count;

(4) passing the symbol to the association layer;

(5) waiting for the response signal of the association layer, then adjusting the symbol node and passing the response signal to the corresponding visual sample layer.
Further, the association layer establishes associations between the symbols in the visual and auditory pathways, specifically:

if a signal is received only from the visual pathway and the visual symbol pair is the same as the visual portion of an association node a, that node is activated; the association layer feeds the auditory portion of the node back to the lower-layer network as a top-down response, so that the name of the object can be recalled;

if only a name is received from the auditory pathway and the auditory symbol matches the auditory portion of association nodes, the association layer finds the most frequently activated node among the matching nodes as the best matching node and extracts its visual symbol pair to recall the visual features of the object;

when symbols arrive from the visual and auditory channels simultaneously and an association node matching both parts exists, activating that node and updating its instance count; if no node is activated, combining the audio-visual symbols and creating a new association node.
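A minimal Python sketch of this activation logic is given below, under assumed names (AssocNode, AssociationLayer and the signal tuples are hypothetical); conflict detection and the guidance signal described later are omitted.

    class AssocNode:
        def __init__(self, visual, name):
            self.visual = visual        # visual symbol pair, e.g. ("s1", "c2")
            self.names = {name}         # an object may acquire several aliases
            self.ins_num = 1

    class AssociationLayer:
        # Hypothetical sketch: associates visual symbol pairs with name symbols.
        def __init__(self):
            self.nodes = []

        def receive(self, visual=None, name=None):
            if visual is not None and name is not None:      # both modalities present
                for n in self.nodes:
                    if n.visual == visual and name in n.names:
                        n.ins_num += 1                       # known association: activate
                        return None
                self.nodes.append(AssocNode(visual, name))   # new association node
                return None
            if visual is not None:                           # vision only: recall the name
                for n in self.nodes:
                    if n.visual == visual:
                        return ("recall_name", n.names)
                return None
            if name is not None:                             # name only: recall visual pair
                matches = [n for n in self.nodes if name in n.names]
                if matches:
                    best = max(matches, key=lambda n: n.ins_num)
                    return ("recall_visual", best.visual)
            return None

For example, receive(visual=("s1", "c2"), name="n3") first creates an association node, after which receive(name="n3") recalls ("recall_visual", ("s1", "c2")).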
Further, the method also includes a top-down response process from the association layer to the sample layer, specifically:

a guidance signal, which uses the knowledge learned by the association layer to guide the lower-layer network in processing new input. Specifically: if the name of the current object has been heard before and the association layer has learned the visual symbols associated with that name, the association layer can compare the learned visual portion of the node with the newly recognized visual symbol and judge whether a new class node, or an intra-class node of the best matching node, needs to be created in the visual sample layer;

or,

a conflict signal: when the visual portions of association nodes are the same but the name symbols are different, the association layer returns a conflict signal.
Compared with the prior art, the invention has the following beneficial effects:

based on self-organizing neural networks, a novel cognitive structure is provided that can autonomously develop object concepts and realize audio-visual fusion;

a dynamic adaptive similarity threshold strategy is provided that automatically adjusts the intra-class and inter-class distances, enables the network to self-organize and cluster from the data, and learns new classes and various intra-class instances at the same time;

a top-down response strategy is provided, so that the higher-layer network can autonomously feed response information back to the lower-layer network, realizing recall of associated modalities, resolution of conflicting associations, and adjustment of learned knowledge without human guidance.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a schematic diagram of an autonomous cognitive development system based on incremental associative neural networks and dynamic audiovisual fusion;
FIG. 2 is a schematic view of a node region structure;
FIG. 3(a) is a schematic diagram of the network updating a node when the input can be covered;

FIG. 3(b) is a schematic diagram of the network creating a new intra-class node when the input cannot be covered;
FIG. 4 is a schematic diagram of 20 common fruit and vegetable datasets;
FIG. 5 is a graphical illustration of the trend of the number of visual sample layer network categories;
fig. 6(a) and fig. 6(b) show, respectively, the intra-class and inter-class similarity thresholds of each node in the visual sample layer after one pass of network learning.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
In one or more embodiments, an autonomous cognitive development system based on an incremental associative neural network and dynamic audio-visual fusion is disclosed, which is composed of a three-layer network and can learn the visual features and audio names of objects online. During learning, the structure starts from an empty network and autonomously develops concrete and abstract object concepts. In addition, it can synchronously establish associations between visual features and names. A system block diagram is shown in fig. 1. In the visual pathway, two sample layers learn the original shape and color features of an object respectively, simulating the brain's mechanism of extracting and storing object features in separate regions. A symbol layer receives the self-organizing results of the shape and color sample layers and abstracts them into corresponding symbols. In the auditory pathway, a sample layer learns the word vectors of names, and a symbol layer reduces the word-vector classes learned by the sample layer to symbols. The association layer realizes audio-visual fusion and develops the internal relationship between the two modalities.
In order to autonomously learn the intrinsic relationships of the visual representation, the visual sample layer employs a strategy of dynamically adjusting the similarity threshold of each neuron. During learning, bottom-up excitatory activity produced by visual and auditory inputs drives the gradual development of the cognitive structure and forms the robot's own knowledge. At the same time, known knowledge provides top-down guidance for current learning. Therefore, this embodiment extends the one-way information transfer adopted by OSS-GWR and GAM into two-way transfer, so that the robot can realize autonomous cognitive development.
1.1 sample layer
In the visual pathway, evidence from human brain physiology suggests that object concepts are stored in different neural circuits according to their attributes. For example, shape features are stored in the ventral and lateral occipitotemporal cortex, and color features are located in the lingual gyrus and the fusiform area of the anterior cortex. Thus, shape-normalized Fourier descriptors and color histograms of objects are extracted here as visual features, and two networks are constructed to represent their dedicated regions. For the auditory pathway, neurons in the superior temporal gyrus are responsible for auditory word recognition. This embodiment uses iFLYTEK's Automatic Speech Recognition (ASR) technology to translate human speech into words, and assigns another network to learn auditory representations.
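For illustration, visual features of this kind could be computed as in the following Python sketch using OpenCV and NumPy. The function names and bin counts are assumptions, not the patent's implementation (the experiments below use a 23-dimensional shape descriptor and a 63-dimensional color histogram).

    import cv2
    import numpy as np

    def shape_fourier_descriptor(mask, n_coeffs=23):
        # Normalized Fourier descriptor of the largest object boundary in a binary mask.
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        boundary = max(contours, key=cv2.contourArea).squeeze(1)   # (N, 2) boundary points
        z = boundary[:, 0] + 1j * boundary[:, 1]                   # complex boundary signal
        coeffs = np.fft.fft(z)
        mags = np.abs(coeffs[1:n_coeffs + 1])
        return mags / (np.abs(coeffs[1]) + 1e-12)   # magnitude normalization for invariance

    def color_histogram(image_bgr, mask=None, bins=(4, 4, 4)):
        # Coarse BGR color histogram, L1-normalized (4*4*4 = 64 bins in this sketch).
        hist = cv2.calcHist([image_bgr], [0, 1, 2], mask, list(bins),
                            [0, 256, 0, 256, 0, 256]).flatten()
        return hist / (hist.sum() + 1e-12)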
1.1.1 visual pathway
The learning model of the visual sample layer is the Dynamic Threshold Self-Organizing Incremental Neural Network (DT-SOINN), which combines characteristics of the GWR and SOINN networks but changes their clustering method, adopting a dynamic adaptive similarity threshold strategy so that the network clusters according to the data. Unlike the GWR model, which configures all nodes with the same fixed threshold, and the SOINN network, which depends entirely on the connections between nodes, DT-SOINN assigns each node an intra-class threshold and an inter-class threshold, allowing the network to learn different classes and intra-class instances step by step.
In the learning process, DT-SOINN starts from an empty network, processes the input samples in turn, and dynamically adjusts its topology according to competitive Hebbian learning. The activation function of the network is defined in terms of a difference rate, calculated from the distance between the input and its best matching node and the weight of that node:

a_b = ||x − w_b|| / ||w_b|| (1)

where x denotes the input vector and w_b the weight of the best matching node b. If the difference rate is very large or very small, the network can easily decide to create a new class node or update the weight of node b. The network can thus first use a larger difference rate a_H and a smaller difference rate a_L to process part of the inputs. However, while development is incomplete, the network cannot make an accurate determination in the intermediate cases.
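As a purely illustrative numerical example: if w_b = (3, 4), so that ||w_b|| = 5, and x = (3.3, 4.4), then ||x − w_b|| = 0.5 and the difference rate is a_b = 0.5 / 5 = 0.1. With a_L = 0.1 and a_H = 0.5 (the values used in the experiments below), this input lies exactly on the update boundary, while an input with ||x − w_b|| > 2.5 (i.e. a_b > 0.5) would instead trigger a new class node.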
For this case, two similarity thresholds are defined here for each node: an intra-class threshold TL and an inter-class threshold TH. As shown in fig. 2, the two parameters divide the area around the node into three parts: the coverage area, the intra-class area and the extra-class area. TL determines the coverage area of the node and TH represents the class boundary. The two thresholds are initialized to a small value determined by the weight of each node, as shown in equation (2).
TL = TH = a_L·||w|| (2)
In the subsequent learning process, the two thresholds are gradually updated by the input data. When a new sample is input, the network can make one of three action decisions according to the two similarity thresholds: adding a new class node, adding an intra-class node, or updating the weight of the best matching node.
The present scheme replaces volumes with distances and compares the two cases shown in figs. 3(a) and 3(b): one is to update the best matching node b to b', and the other is to create a new intra-class node represented by the input x. If the coverage area of node b' can cover both b and x, as expressed by equation (3), i.e. twice the intra-class threshold of node b' is greater than the sum of the intra-class thresholds of the two nodes, then the input x can be considered very similar to the best matching node b, and TL_b is updated by equation (4). To ensure the reliability of the results, TL_b is required not to exceed the minimum connection length of node b.

2·TL_b' > TL_b + TL_x (3)

TL'_b = min_{n∈C_b} ||w_b − w_n|| (4)

where TL'_b denotes the updated value of TL_b, and C_b is the neighborhood set of node b.
Since it is difficult to configure a proper TH for each node, the dynamic adaptive similarity threshold strategy lets the network learn from the data itself and adjust the threshold. The network initially sets a very small TH value for each node; when the network creates a new intra-class node for node b, TH_b is updated by equation (5): if b has a neighborhood, TH_b is updated by the longest connection between node b and its neighbors, and if b is an isolated node, TH_b equals TL_b. The strategy for creating intra-class nodes is therefore of great importance for updating TH_b. This scheme uses the empirical knowledge stored in the association layer, via its top-down guidance signal, to control the generation of intra-class nodes. At the same time, the new TH'_b also serves as the threshold for deciding whether to generate an intra-class node in future learning, e.g. when the distance between the input and the best matching node lies in [TL_b, TH_b]. The TH of each node thus expands gradually during learning, guiding the network to learn all samples belonging to the same class as the node. Eventually, the network forms stable boundaries between classes.

TH_b = max_{n∈C_b} ||w_b − w_n|| if C_b ≠ ∅; otherwise TH_b = TL_b (5)
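Under the reconstructed forms of equations (4) and (5) given above, the per-node threshold updates could be sketched in Python as follows (hypothetical helper names; this assumes the min/max-connection reading of the two equations).

    import numpy as np

    def update_tl(w_b, neighbor_ws, tl_b):
        # Reconstructed eq. (4): intra-class threshold capped by the shortest
        # connection between node b and its neighborhood.
        if not neighbor_ws:
            return tl_b
        return min(np.linalg.norm(w_b - w_n) for w_n in neighbor_ws)

    def update_th(w_b, neighbor_ws, tl_b):
        # Reconstructed eq. (5): inter-class threshold = longest connection of b,
        # falling back to TL_b when the node is isolated.
        if not neighbor_ws:
            return tl_b
        return max(np.linalg.norm(w_b - w_n) for w_n in neighbor_ws)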
During the learning process, the network needs to calculate the activation value a_b of the best matching node. If the activation value exceeds the difference rate a_H, DT-SOINN creates a new class node at the location of the input sample. If the activation value is less than the difference rate a_L and the distance between the input and the best matching node is less than TL_b, DT-SOINN updates the best matching node. This scheme adopts the GWR node weight update strategy instead of SOINN's. Although both use learning rates with regulated decay, GWR models the adaptation mechanism of synapses under repeated stimuli and is therefore more suitable for biomimetic cognitive studies.
w'_b = w_b + γ_b·η_b·(x − w_b) (6)

w'_n = w_n + γ_n·η_n·(x − w_n) (7)
where 0 < γ_b, γ_n < 1 are the learning rates of node b and its neighboring nodes n, and η_b and η_n respectively denote the synaptic effects of node b and its neighboring nodes n, as shown in equations (8) and (9).

η_b(t) = η_0 − (S(t)/α_b)·(1 − e^(−α_b·t/τ_b)) (8)

η_n(t) = η_0 − (S(t)/α_n)·(1 − e^(−α_n·t/τ_n)) (9)

where η_0 is the initial value of the synaptic effect, α_b, τ_b, α_n, τ_n are time constants that determine the decay rate of the synaptic effect, and S(t) is the stimulus strength. The synaptic effect decreases as the activation count increases, so the update amplitude of the node weight gradually decreases to 0 and the node finally reaches a stable state.
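A compact Python sketch of equations (6)-(9), under the habituation form reconstructed above (function names and η_0 = 1.0 are assumptions; α and τ follow the experimental settings given later):

    import numpy as np

    def habituation(t, eta0=1.0, alpha=1.05, tau=0.3, stimulus=1.0):
        # Synaptic effect after t activations: decays from eta0 toward a floor.
        return eta0 - (stimulus / alpha) * (1.0 - np.exp(-alpha * t / tau))

    def update_weights(w_b, w_ns, x, t_b, t_ns, gamma_b=0.1, gamma_n=0.01):
        # GWR-style update of the best matching node and its neighbors;
        # t_b / t_ns are activation counts feeding the synaptic effect.
        w_b = w_b + gamma_b * habituation(t_b) * (x - w_b)
        w_ns = [w + gamma_n * habituation(t, tau=0.1) * (x - w)
                for w, t in zip(w_ns, t_ns)]
        return w_b, w_ns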
However, if the above conditions are not satisfied, DT-SOINN checks whether the node fusion condition of equation (3) is satisfied. If the input can be merged with the best matching node b, the network updates b. Otherwise, the network calculates the distance between the two. If the distance exceeds the inter-class threshold TH_b, the network creates a new class node; otherwise, the network creates a node of the same class as b. In addition, DT-SOINN also receives response signals fed back from the higher-level network and adjusts its learning results according to experience.
DT-SOINN uses two fixed difference rates for coarse decisions and two dynamically adjustable similarity thresholds for fine decisions. This reduces the amount of computation and endows the robot with more autonomous cognitive ability, so that an appropriate similarity threshold is learned for each node. The complete DT-SOINN algorithm is as follows:
1. Initialize an empty visual sample network: V = {}.

2. Input a sample x to the sample layer.

3. If the network is empty:

a. Add the first node 1: V = V ∪ {1}, w_1 = x, TL_1 = TH_1 = a_L·||x||, class c_1 = 1 and instance count ins_num_1 = 1.

b. Return to step 2 to process the next sample.

4. If the sample layer has only one node:

a. Calculate a_1 = ||x − w_1|| / ||w_1||.

b. If a_1 < a_L:

i. Update the weight of node 1: w_1 ← w_1 + γ_b·η_1·(x − w_1).

ii. Increase the instance count: ins_num_1 = ins_num_1 + 1.

c. Otherwise, create a new node 2: V = V ∪ {2}, w_2 = x, TL_2 = TH_2 = a_L·||x||, c_2 = 2 and ins_num_2 = 1.

5. Otherwise, find the best matching node b = argmin_{i∈V} ||x − w_i|| and calculate a_b = ||x − w_b|| / ||w_b||.

a. If a_b > a_H:

i. Create a new node r: V = V ∪ {r}, w_r = x, TL_r = TH_r = a_L·||x||, c_r = len(c) + 1 and ins_num_r = 1.

b. Otherwise, if a_b < a_L or ||x − w_b|| < TL_b:

i. Update node b and its neighborhood n: w_b ← w_b + γ_b·η_b·(x − w_b), w_n ← w_n + γ_n·η_n·(x − w_n).

ii. Update the instance count: ins_num_b = ins_num_b + 1.

c. Otherwise:

i. Hypothesis 1: the best matching node is updated to b': w_b' = w_b + γ_b·η_b·(x − w_b), and its intra-class threshold TL_b' is updated to TL_b' = a_L·||w_b'||.

ii. Hypothesis 2: the input x is created as a new intra-class node of b: w_x = x and TL_x = a_L·||x||.

iii. If 2·TL_b' > TL_b + TL_x, select hypothesis 1 and update node b and its neighborhood weights with equations (6) and (7), respectively, while updating TL_b with equation (4).

iv. Otherwise, select hypothesis 2 and calculate the current TH_b using equation (5).

v. If ||x − w_b|| > TH_b: add a new class node as in step 5-a-i.

vi. Otherwise, create an intra-class node r and connect nodes b and r: V = V ∪ {r}, w_r = x, TL_r = TH_r = a_L·||x||, c_r = c_b, ins_num_b = ins_num_b + 1.

6. Update the synaptic effects of node b and its neighborhood n: Δη_b = τ_b·α_b·(1 − η_b) − τ_b, Δη_n = τ_n·α_n·(1 − η_n) − τ_n.

7. Pass the determined category information to the visual symbol layer.

8. Wait for the response information fed back by the association layer, then adjust the learning result.

9. Return to step 2 and process the next sample.
1.1.2 auditory pathway
In the auditory pathway, each name is first translated into a word vector using ASR. Since the ASR has a speech recognition accuracy of over 98% and only simple words are involved here, it can be assumed that each recognition result corresponds to one name. Therefore, acoustic characteristics (such as tone and volume) beyond the scope of this study, and class nodes in the network, need not be considered; only the corresponding word vectors need to be compared to distinguish names. Although OSS-GWR uses Google's ASR to generate action vocabularies, those vocabularies are predefined and cannot be expanded online through learning. The present method aims at lifelong learning, handling any name the robot hears. Each word vector can be regarded as a separate category, recorded in a node.
The learning model in the auditory sample layer is the Levenshtein Distance Self-Organizing Incremental Neural Network (LD-SOINN). The network uses the Levenshtein distance as its activation function to compute the difference between word vectors, and the best matching node is found by

b = argmin_{i∈D} L(x, w_i) (10)

where L denotes the Levenshtein distance. If the two word vectors are identical, i.e. L(x, w_b) = 0, LD-SOINN updates the node by increasing its instance count.
1. Initialize an empty auditory sample network: D = {}.

2. Input a name x.

3. If the network is empty:

a. Create the first node 1: D = D ∪ {1}, w_1 = x, class c_1 = 1, instance count ins_num_1 = 1.

b. Return to step 2 to process the next name.

4. Otherwise, find the best matching node b according to equation (10).

a. If L(x, w_b) = 0, update node b: ins_num_b = ins_num_b + 1.

b. Otherwise, add a new node r: D = D ∪ {r}, w_r = x, c_r = len(c) + 1, ins_num_r = 1.

5. Pass the recognition result to the auditory symbol layer.

6. Return to step 2 to process the next name.
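A minimal Python sketch of this loop is shown below (hypothetical names; the Levenshtein routine is a standard dynamic-programming implementation, not the patent's code).

    def levenshtein(a, b):
        # Minimum number of insertions, deletions and substitutions between sequences.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    class AuditorySampleLayer:
        # Hypothetical LD-SOINN-style sketch: one node per distinct name.
        def __init__(self):
            self.nodes = []   # list of (word_vector, ins_num) pairs

        def learn(self, x):
            if self.nodes:
                idx = min(range(len(self.nodes)),
                          key=lambda i: levenshtein(x, self.nodes[i][0]))
                w, n = self.nodes[idx]
                if levenshtein(x, w) == 0:       # identical name: update the node
                    self.nodes[idx] = (w, n + 1)
                    return idx
            self.nodes.append((x, 1))            # new name: create a new node
            return len(self.nodes) - 1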
1.2 symbol layer
The symbol layer receives the clustering information of the visual and auditory sample layers and forms abstract, concise symbolic representations. As shown in fig. 1, this layer is still divided into two pathways: the visual part learns the categories of visual features, and the auditory part handles the symbolic representations of names. The learning algorithm of the symbol layer also adopts incremental competitive learning; starting from an empty network, a new symbol node is added whenever an unknown class transmitted by the sample layer is encountered.
This layer uses s_i, c_j, n_k to denote the three category symbols, i.e. shape, color and name symbols, where i, j, k ∈ N+. Each time, the sample layer passes a set of class numbers i, j, k to the symbol layer, and the symbol layer combines these numbers with the corresponding feature symbols s, c, n to form the visual symbols s_i, c_j and the auditory symbol n_k. The learning model of each symbol layer uses the LD-SOINN algorithm, since the weight of each symbol node is a string. If the combined symbol does not exist, the symbol layer creates a new node; if the symbol has been learned previously, the corresponding symbol node is activated and its instance count is increased.
1. Initialize an empty visual symbol layer: S = {}.

2. Receive a class number l ∈ N+ from the visual sample layer.

3. Combine the number l and the corresponding feature symbol f to form the symbol f_l, where f ∈ {s, c, n}.

4. Learn according to the LD-SOINN learning algorithm.

5. Pass the symbol to the association layer.

6. Wait for the response signal of the association layer, then adjust the symbol node and pass the response signal to the corresponding visual sample layer.

7. Return to step 2 and process the next class number.
The symbol layer realizes online incremental learning. Although the symbols have no explicit semantic meaning, they are an internal representation automatically generated by the robot. These symbols reduce the complexity of cognitive computation, can be used by other high-level cognitive processes, and promote the robot's cognitive development. In the associative learning that follows, the symbols are used for audio-visual fusion and play an important role. In addition, the symbol layer is the bridge for bi-directional information transfer between the sample layer and the association layer, carrying both bottom-up input and top-down responses.
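A toy Python sketch of such a symbol layer (all names hypothetical) could be:

    class SymbolLayer:
        # Hypothetical sketch: maps class numbers to string symbols like "s3" or "c1".
        def __init__(self, feature_prefix):
            self.prefix = feature_prefix   # "s" (shape), "c" (color) or "n" (name)
            self.counts = {}               # symbol -> instance count

        def receive(self, class_number):
            symbol = f"{self.prefix}{class_number}"   # combine number and feature symbol
            if symbol not in self.counts:
                self.counts[symbol] = 0               # create a new symbol node
            self.counts[symbol] += 1                  # activate it / increase instances
            return symbol                             # passed on to the association layer

    # e.g. SymbolLayer("s").receive(3) returns "s3"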
1.3 Association layer
The human brain has three white-matter tracts that connect different brain regions and dominate object recognition and understanding. This scheme designs an association layer and its learning algorithm, R-SOINN, which imitates this brain mechanism. The algorithm establishes associations between the visual and auditory symbols transmitted by the two symbol layers so as to connect the visual and auditory pathways, and feeds response signals back to the lower-layer network according to known associations.
In the association layer, the weight w_a of each node is a combination of three symbols, as shown in equation (11).

w_a = {s_i, c_j, n_k} (11)
To improve the robot's autonomy, R-SOINN does not require that the audio-visual information occur simultaneously. The association layer may receive only two visual symbols, only one auditory symbol, or symbols of both modalities. An association node can be activated by a symbol of any modality, as long as the auditory symbol or visual symbol pair matches the corresponding modality portion of the node. Specifically, if the network receives signals only from the visual pathway and the visual symbol pair s_i, c_j is the same as the visual portion of an association node a, that node is activated. The association layer feeds the auditory portion n_k of the node back to the lower-layer network as a top-down response, so that the name of the object can be recalled. Likewise, if the network receives only a name from the auditory pathway and the auditory symbol n_k matches the auditory portion of association nodes, the association layer finds the most frequently activated node among these matching nodes as the best matching node and extracts its visual portion s_i, c_j to recall the visual features of the object. This means that a name symbol can belong to many association nodes, but a visual symbol pair maps to only one node. In addition, considering that an object may have multiple aliases, the invention extends the weight of the association node to:

w_a = {s_i, c_j, n_k, …, n_m} (12)

where a, i, j, k, …, m ∈ N+, and {n_k, …, n_m} denotes all names of the object whose visual features are s_i, c_j. The method does not limit the number of name symbols each association node can learn, and each name symbol can activate the association node and recall the visual portion.
In addition, when the audio-visual channels transmit symbols simultaneously and an association node matching both audio-visual parts exists, R-SOINN activates that node and updates its instance count. If no node is activated, R-SOINN combines the audio-visual symbols and creates a new association node. The audio-visual associations are learned online in real time, and their concise form helps the robot develop high-level cognitive abilities.
During learning, the association layer not only receives bottom-up information to learn associations between modalities, but also generates top-down responses from known experience so that the robot can autonomously handle different situations, such as recall, conflict resolution and knowledge adjustment. The details of R-SOINN are as follows:
1. Initialize an empty association layer network: R = {}.

2. Receive a set of symbols f_l from the symbol layer.

3. If f_l = {s_i, c_j, n_k}:

a. Find the best matching node b, i.e. a node whose visual portion equals {s_i, c_j} and whose N name symbols contain n_k, where N denotes the number of name symbols in node b.

b. If node b does not exist and the combination causes no association conflict:

i. Obtain all association nodes containing the name symbol n_k: R_1 = {w_1, w_2, …, w_q} ⊆ R, where q is the number of elements in R_1.

ii. Create a new node r: R = R ∪ {r}, w_r = {s_i, c_j, n_k}, ins_num_r = 1.

iii. Take the visual symbol pairs in R_1 as the top-down guidance signal: guidance = {{w_1[1], w_1[2]}, …, {w_q[1], w_q[2]}}.

iv. Wait for the bottom-up adjustment result.

c. Otherwise, if the symbol combination conflicts with a known association:

i. Find the conflicting association node and feed its auditory portion back to the lower-layer network as a top-down conflict signal: conflict = {n_k, …, n_m}.

ii. Wait for the bottom-up adjustment result.

d. Otherwise:

i. Update the best matching node b: ins_num_b = ins_num_b + 1.

ii. Return conflict = {}, guidance = {}.

4. If f_l = {s_i, c_j}:

a. Find the best matching node b whose visual portion equals {s_i, c_j}.

b. If node b does not exist, return to step 2 and process the next set of symbols.

c. Otherwise:

i. Update node b: ins_num_b = ins_num_b + 1.

ii. Feed all its name symbols back to the lower-layer network as a callback signal: recall = {n_k, …, n_m}.

5. Otherwise, if f_l = {n_k}:

a. Find the set of matching nodes A_b = {a ∈ R : n_k ∈ w_a}, u = len(A_b).

b. If no such node exists, return to step 2 and process the next set of symbols.

c. Otherwise:

i. Find the most frequently activated node as the best matching node: b = argmax_{a∈A_b} ins_num_a.

ii. Update node b: ins_num_b = ins_num_b + 1.

iii. Return the visual symbol pair as a visual callback signal: recall = {s_i, c_j}.

6. Otherwise f_l = {}.

7. Return to step 2 and process the next set of symbols.
1.4 Top-down response signals
The three sections above describe the bottom-up information transfer process. The complete learning process of the cognitive structure also includes top-down responses from the association layer to the sample layer. The bottom-up learning process realizes the growth of knowledge in cognitive development, while the top-down response process aims at using and adjusting the learned knowledge, thereby improving the robot's cognitive level and making it more intelligent. The association layer generates three types of response signals: callback, guidance and conflict resolution.
1.4.1 callback
Numerous brain studies provide evidence that single-modality information can also activate neurons of other modalities associated with it. The callback process aims to mimic this cognitive activity of the brain. When the robot receives only single-modality information, the callback signal can recall the representation of another modality. The premise of a callback, however, is that the network must already have learned the audio-visual association; otherwise the callback fails.
When the robot sees an object, if an association node is activated by the visual symbol pair, the association layer returns the auditory portion of that node as a callback signal. Similarly, if the robot hears a name, the association layer extracts the corresponding visual symbol pair and feeds it back to the lower-layer network as a visual callback signal. When the association layer returns a callback signal, the signal propagates through the symbol layer and reaches the corresponding sample layer. The sample layer then selects the most frequently activated node in the matching category set as the representative weight and outputs it as the target modality representation. The details are as follows:
1. If the association layer returns a callback signal recall:

a. If recall = {n_k, …, n_m}:

i. Pass recall to the auditory symbol layer and convert the symbols to their original numeric form: n_p → p, (p ∈ {k, …, m}).

ii. Pass p to the auditory sample layer, find the relevant word vector and output it.

b. If recall = {s_i, c_j}:

i. Pass recall to the visual symbol layer and convert the symbols to their original numeric form: s_i → i, c_j → j.

ii. Pass i and j to the shape and color sample layers respectively, find the most frequently activated node of the corresponding category as the typical node, t = argmax_{v: c_v = l} ins_num_v (l = i or j), and output its weight w_t as the visual sample representation.

2. End.
Callback is the most important top-down response process and also the basis of the other two. The learned audio-visual associations are fully utilized in this process. Unlike GAM and OSS-GWR, which directly store and invoke low-level representations in the association layer in a unidirectional manner, this method lets each layer learn and process specific information and recall layer by layer, closer to the way the brain processes information. When complex associations are involved, GAM and OSS-GWR may occupy extra memory or suffer from growing dimensionality. Because this method adopts symbolic representations, the association layer structure is simpler and more effective, and the problem of increasing dimensionality is avoided.
1.4.2 Guidance signals
The guidance signal uses the knowledge learned by the association layer to guide the lower-layer network in processing new inputs. Specifically, if the name of the current object has been heard before and the association layer has learned the visual symbols associated with that name, the association layer can compare the learned visual portion of the node with the newly recognized visual symbol to determine whether a new class node, or an intra-class node of the best matching node, needs to be created in the visual sample layer. Thus, if the current name activates an association node, the network generates the guidance signal.
The association layer selects the most frequently activated visual symbol pairs and returns them to the visual symbol layer, and the guidance signal is finally passed to the visual sample layer to adjust the current learning result. After the adjustment, the change in the visual sample layer is again transmitted bottom-up and updates the entire visual pathway and the association layer. The specific process is as follows:
1. If the association layer returns a guidance signal guidance = {{w_1[1], w_1[2]}, …, {w_q[1], w_q[2]}}:

a. Pass guidance to the corresponding symbol layer and convert the symbols to their original numeric form: f_l → l, (l = {l_1, l_2, …, l_z}), where f_l = {w_1[1], …, w_q[1]} when shape symbols are passed, and f_l = {w_1[2], …, w_q[2]} when color symbols are passed.

b. Pass l to the corresponding sample layer and find the corresponding categories: c = {l_1, l_2, …, l_z}.

c. If the current best matching category c_b is not in the category set c and a new class node r has just been added in the sample layer:

i. In the corresponding sample layer, change the new class node into an intra-class node of node b, and update the category of node r: c_r = c_b.

ii. Add an edge between nodes b and r and update the TH of both nodes.

iii. Pass the adjusted result to the corresponding symbol layer, remove the new symbol node and update the symbol node to which c_b maps: ins_num = ins_num + 1.

iv. Pass the new symbol recognition result to the association layer, remove the new association node, and learn the new symbol combination.

2. Judge whether a conflict exists.
1.4.3 Conflict signals
When the visual portions of association nodes are the same but the name symbols are different, the association layer returns a conflict signal. There are three possible causes of a conflict. First, the two names are aliases of each other. Second, the sample node created in the visual sample layer is inaccurate: the current features should not be treated as an intra-class node but as a new class node. Third, the robot heard a wrong name. The robot can handle these cases by making full use of known knowledge, without asking humans for answers as PCN does. Therefore, rules for resolving conflicts are designed so that the robot autonomously develops the ability to infer and judge.
In the early learning stage, the network cannot make accurate judgments with little experience. Recording the current learning actions in the visual sample layer provides clues for resolving these conflicts. When the robot hears a new name for which no corresponding association node exists in the association layer, there are two possibilities. If the learning action in both the shape and color sample layers was to update a node, the robot has seen the object before and the new name is an alias of the conflicting name; the association layer then feeds back the resolution signal γ = 1 and adds the new name to the conflicting association node. Otherwise, the association layer checks which feature node was created as an intra-class node and feeds the resolution signal γ = 2 back to the corresponding sample layer, thereby changing that node into a new class node.
When the new name does have a corresponding association node, the association layer extracts all its visual symbol pairs and recalls the typical sample representations among them, and then calculates the distances between the current input and these typical representations. If both the shape and color distances are less than the TH of the typical node, the current object is very similar to the conflicting object and the current name can be regarded as its alias; the association layer feeds back the resolution signal γ = 1. If one feature satisfies the threshold TH and the other exceeds it, the new class node created by the latter is replaced with an intra-class node. If neither feature satisfies the threshold, the association layer recalls the first conflicting name, prompts that the current name is wrong, and then tells the correct name of the object based on learned knowledge. The specific steps are as follows.
1. If conflict = {n_k, …, n_m}:

a. Find all association nodes matching the current auditory symbol and treat them as conflict nodes.

b. If no such node exists, i.e. the current name symbol n_curr is new:

i. If the learning actions of both the shape and color sample layers were to update nodes:

1) The association layer feeds back γ = 1: take n_curr as an alias of the conflicting name.

2) Add n_curr to the association node: w_a = w_a ∪ {n_curr}.

ii. Otherwise, if the learning action of the shape (color) layer was to create an intra-class node:

1) Feed γ = 2 back to the shape (color) sample layer.

2) Remove the intra-class node and add a new class node, as in step 5-a-i of DT-SOINN.

3) Pass the adjusted result to the association layer for learning again.

c. Otherwise:

i. Extract all visual symbol pairs and recall their typical sample representations: {[w_tp1, w_tc1], [w_tp2, w_tc2], …}.

ii. Calculate the distance between the current input and each typical sample node: d = ||x − w_t||.

iii. If there is a visual symbol pair q satisfying d(x_shape, w_tpq) < TH_tpq and d(x_color, w_tcq) < TH_tcq, go to step 1-b-i: feed back γ = 1 and add n_curr to the association node: w_a = w_a ∪ {n_curr}.

iv. Otherwise, if d(x_shape, w_tpq) < TH_tpq and d(x_color, w_tcq) > TH_tcq, go to step 1-b-ii: feed back γ = 2 and add a new color class node in the color sample layer.

v. Otherwise, if d(x_color, w_tcq) < TH_tcq and d(x_shape, w_tpq) > TH_tpq, go to step 1-b-ii: feed back γ = 2 and add a new shape class node in the shape sample layer.

vi. Otherwise, recall the sample representation w_conf of the first conflicting name and output "this name is wrong; the correct name is w_conf".

2. Process the next object.
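A toy Python sketch of the decision in step 1-c (hypothetical names; the γ codes follow the rules above, with 0 standing for the wrong-name case):

    def resolve_conflict(shape_dist, color_dist, th_shape, th_color):
        # Toy decision for a name conflict: 1 = alias, 2 = split a class, 0 = wrong name.
        if shape_dist < th_shape and color_dist < th_color:
            return 1   # same object: treat the new name as an alias
        if shape_dist < th_shape:
            return 2   # shape matches, color differs: new color class node
        if color_dist < th_color:
            return 2   # color matches, shape differs: new shape class node
        return 0       # neither matches: the heard name is wrong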
In this bi-directional cognitive structure, all cognitive activities (e.g. recognition, learning and decision making) proceed in parallel.
Most studies use one-way models that can only accumulate knowledge but cannot exploit higher-level information to adjust lower-level networks. Although a unidirectional model can also recall other modalities, the lower-level representations must then be passed to and stored in the higher-level network. The bi-directional approach here allows each layer of the network to handle only representations of its own kind. In addition, the response signals of the association layer can assist lower-layer network learning, so that the robot can handle more complex situations.
2. Experiment and results
To test the effectiveness of the proposed vision- and hearing-based bi-directional cognitive development algorithm, we performed validation on a dataset of 20 common fruits and foods (as shown in fig. 4). The dataset has 176 samples, two of which are objects with aliases. Each object has 8 views, each obtained after rotating the object by a fixed angle. During the experiment, we let the cognitive structure learn the views and names of the objects. First, the camera captures an image of an object. The algorithm obtains visual features by extracting a normalized Fourier descriptor S of the object boundary and a color histogram C, where S is a 23-dimensional vector and C is a 63-dimensional vector. Each object can be represented by a visual feature pair (S, C). Meanwhile, the experimenter speaks the name of the object into the microphone, and the iFLYTEK ASR translates the speech into words. The structure starts learning after receiving the visual and auditory information; after the current learning round ends, it switches to another object and enters the next learning round, until all objects and names are learned.
2.1 Experimental evaluation protocol
In order to evaluate the effectiveness of the dynamic threshold strategy in DT-SOINN and of the bidirectional cognition process, the cognitive development results of the algorithm are compared with the learning results of the unidirectional self-organizing learning structures GWR and PCN. To ensure consistent evaluation criteria, we used an evaluation protocol similar to that of PCN and performed the experiments in closed and open environments, respectively. In the closed environment, we randomly select objects and input them into the cognitive system, and all objects in the dataset are learned within one learning cycle. In the open environment, the dataset is divided into two parts containing different types of objects; the cognitive system first learns half of the objects and then the remaining ones. The experiments in the two environments were each performed 30 times.
The invention combines the evaluation indices widely adopted for self-organizing neural networks with a standardized evaluation scheme for open-ended class learning algorithms provided in the prior art, and adopts the following evaluation indices:
a) the number of nodes in each layer; in particular, the number of nodes in the symbol layer also represents the number of categories learned in the sample layer;
b) the average number of nodes per category and the average number of stored instances per category, which reflect the generalization ability of the nodes;
c) the change in the number of categories as instances are input, which indicates the online learning process;
d) the visual and auditory callback rates, which equal the recognition accuracy and indicate the external learning effect;
e) the learned similarity thresholds of each node, which evaluate the internal performance of the dynamic adaptive similarity threshold strategy.
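For indices a) and b), the bookkeeping might look like the sketch below; the per-node "category" and "instances" fields are hypothetical names for quantities the network maintains:

def layer_statistics(layers):
    """Node counts per layer and per-category averages (indices a and b)."""
    stats = {}
    for name, nodes in layers.items():
        cats = {}
        for node in nodes:
            cats.setdefault(node["category"], []).append(node["instances"])
        n_cat = max(len(cats), 1)
        stats[name] = {
            "nodes": len(nodes),                                   # index a)
            "categories": len(cats),
            "avg_nodes_per_category": len(nodes) / n_cat,          # index b)
            "avg_instances_per_category": sum(map(sum, cats.values())) / n_cat,
        }
    return stats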
2.2 Experimental parameter settings
The network parameters of DT-SOINN are set as follows: learning rates γ_b = 0.1 and γ_n = 0.01; synapse effect parameters α_b = α_n = 1.05, τ_b = 0.3, and τ_n = 0.1. These parameters were set with reference to the experience of other studies. Considering classification reliability and experimental experience, the maximum and minimum difference rates are set to 0.5 and 0.1, respectively. The PCN can fuse vision, hearing, and taste; in this experiment we let the PCN learn only the audiovisual information and set the PCN parameters with reference to the original paper, where shape and color thresholds of 4 correspond to a difference rate of 0.25. The cognitive structure and the PCN each need to be trained only once, so their online learning abilities can be compared directly. Since the GWR must be trained multiple times to form stable clusters, we first train the GWR network 200 times to learn the shape and color clusters; we then replace DT-SOINN in our cognitive structure with the trained GWR and learn all objects again. The network parameters of the GWR are the same as those of OSS-GWR. In addition, the similarity threshold a_T of GWR is set to four values, {0.8, 0.85, 0.9, 0.95}, to compare the performance of the dynamically adjustable threshold with fixed thresholds.
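For reference, the stated values could be collected in a configuration object such as the following sketch; the field names, including diff_max and diff_min for the two difference rates, are assumptions:

from dataclasses import dataclass

@dataclass
class DTSOINNConfig:
    gamma_b: float = 0.1   # learning rate of the best matching node
    gamma_n: float = 0.01  # learning rate of its neighborhood
    alpha_b: float = 1.05  # synapse effect parameter (best node)
    alpha_n: float = 1.05  # synapse effect parameter (neighbors)
    tau_b: float = 0.3
    tau_n: float = 0.1
    diff_max: float = 0.5  # maximum difference rate
    diff_min: float = 0.1  # minimum difference rate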
2.3 Results and evaluation
The results of the experiment are shown in Table I. Over the 60 experiments, our method learned on average 94 shape sample nodes, 73 color sample nodes, 32 shape symbol nodes, 22 color symbol nodes, 21 name sample and symbol nodes, and 57 associated nodes. In both environments, the number of nodes per layer is very close, which means that our method is stable under different circumstances. The nodes of the symbol level indicate that our method forms a total of 42 shape classes and 26 color classes. The number of these categories exceeds the number of objects because different views of an object may have different shapes and colors, and rotating an object affects the shape categories more than the color categories. The average number of nodes per shape and color category is 3, and the average number of stored instances per category is 4 for shapes and 7 for colors. This shows that our method can both identify similar instances and learn new classes, i.e., intra-class instances and inter-class instances can be learned at the same time.
The change in the number of categories reflects the online learning process, as shown in FIG. 5. We can see that the number of classes increases rapidly at the beginning and gradually stabilizes as the number of instances increases. This shows that our method achieves online identification and learning. GWR, by contrast, can only learn during a training phase and identify during a testing phase. The PCN clusters the shape and color features only through topological connections and does not store category information; the number of categories in the PCN conceptual layer shown in Table I is calculated after learning is finished.
When a_T = 0.8, GWR learns only a small number of shape and color symbols, which is insufficient to represent all categories. When a_T = 0.95, the symbol nodes of GWR can match the number of classes, but there are too many sample nodes. This indicates that the performance of GWR depends largely on the similarity threshold; however, it is difficult to find, before the experiment, a suitable threshold that resolves this quantity-quality dilemma. In contrast, our approach lets the network learn categories autonomously by using dynamic adaptive thresholds and the top-down response strategy. Furthermore, our method does not require multiple training passes and is therefore more efficient than GWR.
Our approach is also more autonomous than PCN, since the proposed top-down response strategy can use learned knowledge to resolve conflict situations without querying a human. During the learning process, the PCN queries the experimenter 104 times on average when an unknown object or a conflicting recognition result is encountered. As for the number of nodes, although our shape and color sample nodes both exceed the corresponding node counts in the PCN, the numbers of nodes in our auditory sample layer and symbol layers are significantly smaller than in the PCN, and our number of associated nodes is close to that of the PCN. In summary, our cognitive structure is comparable in complexity to the PCN.
Table I. Experimental results in closed and open environments
Table II. Callback rates in closed and open environments
To verify the learning effect of the network, we compared the visual and auditory callback rates by testing one modality to recall the other. The callback test is performed after each learning step, and all previously learned objects and their names are tested. We again performed 30 experiments each in the closed and open environments. Since GWR is only used to learn visual features, its auditory callback rate need not be considered. As shown in Table II, the overall visual callback rate of our method is 90.02%, which is higher than the 83.98% obtained with PCN. When a_T ≥ 0.9, the visual callback rate of GWR is better than that of our method, and when a_T ≤ 0.85, the GWR results are worse than ours. This means that the similarity threshold not only affects the number of nodes generated by GWR but also determines its recognition performance. For the auditory callback test, our method achieves 100% accuracy, while PCN reaches only 61.47%. One reason is that we use words to represent auditory concepts, which is simpler than the syllables used by PCN; thus the method of the present invention can correctly identify what people say every time. Another reason is that the method employs the top-down response strategy, which can map names to the correct visual representations.
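The per-step callback test described above could be driven by a loop of the following shape; learn and test_recall are placeholders for the cognitive structure's actual interfaces, not functions defined by the invention:

def incremental_callback_curve(stream, learn, test_recall):
    """After each newly learned object, test recall on everything seen so far."""
    seen, curve = [], []
    for obj in stream:
        learn(obj)                       # one learning round on the new object
        seen.append(obj)
        curve.append(test_recall(seen))  # callback accuracy over all learned objects
    return curve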
To test the effectiveness of the dynamic adaptive similarity threshold strategy, we recorded the final learning results of the two dynamic adaptive thresholds, shown in fig. 6(a) and 6(b) (the vertical axis represents the error rate, and the horizontal axis the index of each node). Most of the nodes were significantly adjusted during the learning process, which indicates that the method of the present invention can autonomously learn from the data and form two reliable similarity thresholds for each sample node.
The experimental results show that the cognitive structure can autonomously realize online and incremental learning. Knowledge grows as nodes are added, and the weight and similarity thresholds of each node are gradually adjusted as samples are input. In addition, the structure uses a suitable number of nodes to form stable object view and name representations, compact class symbol representations, and appropriate audiovisual associations. Without any human help, the method improves the callback rate of objects. The proposed dynamic adaptive threshold strategy uses the characteristics of the data to autonomously adjust node thresholds, so that the network learns category concepts and intra-class instances at the same time. The proposed top-down response strategy enables the network to autonomously guide type judgments, recall information, and resolve conflict situations without human guidance, giving the robot autonomous cognitive ability. Our method can thus develop the cognitive ability of the robot step by step.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention, and it should be understood that various modifications and variations can be made by those skilled in the art, without inventive effort, based on the technical solution of the present invention.

Claims (9)

1. An autonomic cognitive development system based on incremental associative neural networks and dynamic audiovisual fusion, comprising: a sample layer, a symbol layer and an associated layer three-layer network structure; the three-layer network structure comprises a visual path and an auditory path;
in the visual pathway:
the sample layer is used for respectively learning the original shape and the color characteristics of the object and carrying out autonomous clustering;
the symbol layer receives the autonomous clustering results of the shape and color sample layers and abstracts the autonomous clustering results into corresponding symbols;
the sample layer of the visual pathway extracts a shape normalized Fourier descriptor and a color histogram of an object as visual features, and constructs two networks to represent the exclusive areas of the object; an activation function of the network is defined according to the difference rate, and the learning model of the visual pathway sample layer adopts a dynamic adaptive similarity threshold strategy so that the network clusters according to the data;
in the auditory pathway:
a sample layer for learning word vectors of names;
the symbol layer receives the word vector type of the name and simplifies the word vector type into a symbol;
and the association layer establishes association relationship between symbols in the visual path and the auditory path and feeds back a response signal to the lower layer network according to the known association relationship.
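As an illustrative skeleton only (the class and field names below are assumptions, not claim language), the claimed three-layer, two-pathway structure might be organized as:

class SampleLayer:
    """Bottom layer: clusters raw features (shape, color, or name word vectors)."""
    def __init__(self):
        self.nodes = []  # each node: weight vector, category, instance count, thresholds

class SymbolLayer:
    """Middle layer: abstracts sample-layer cluster labels into discrete symbols."""
    def __init__(self):
        self.symbols = {}  # symbol -> instance/activation count

class AssociationLayer:
    """Top layer: binds visual symbol pairs to auditory symbols and feeds back responses."""
    def __init__(self):
        self.assoc_nodes = []  # each: (shape symbol, color symbol, name set, count)

class CognitiveStructure:
    """Visual pathway (shape + color) and auditory pathway joined at the top."""
    def __init__(self):
        self.shape_samples, self.color_samples = SampleLayer(), SampleLayer()
        self.name_samples = SampleLayer()
        self.shape_symbols, self.color_symbols = SymbolLayer(), SymbolLayer()
        self.name_symbols = SymbolLayer()
        self.association = AssociationLayer()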
2. The autonomic cognitive development system based on incremental associative neural network and dynamic audiovisual fusion of claim 1, wherein the auditory pathway sample layer learns a word vector for each name and calculates the difference between word vectors using the Levenshtein distance as an activation function; when receiving a word vector, it finds the best matching node, updates that node by increasing its instance count if the two word vectors are identical, and otherwise creates a new node.
3. The autonomic cognitive development system based on incremental associative neural network and dynamic audiovisual fusion of claim 1, wherein the learning algorithm of the symbol layer adopts incremental competitive learning: starting from an empty network, a new symbol node is added whenever an unknown class transmitted by the sample layer is encountered.
4. An autonomous cognitive development method based on incremental associative neural network and dynamic audio-visual fusion is characterized by comprising the following steps:
the visual pathway sample layer learns the original shape and the color characteristics of the object respectively and performs autonomous clustering;
the specific clustering process is as follows: the sample layer of the visual pathway extracts a shape normalized Fourier descriptor and a color histogram of the object as visual features, and constructs two networks to represent the exclusive areas of the object; an activation function of the network is defined according to the difference rate, and the learning model of the visual pathway sample layer adopts a dynamic adaptive similarity threshold strategy so that the network clusters according to the data;
the visual path symbol layer receives the autonomous clustering result of the original shape or color characteristic sample layer and abstracts the autonomous clustering result into corresponding symbols;
the auditory pathway sample layer learns word vectors of names;
the auditory pathway symbolic layer receives the word vector category of the name and reduces the word vector category into a symbol;
the association layer establishes an association relationship between symbols in the visual pathway and the auditory pathway.
5. The autonomic cognitive development method based on incremental associative neural network and dynamic audio-visual fusion as claimed in claim 4, wherein the visual pathway sample layer learns the original shape and color characteristics of the object respectively, and performs autonomic clustering, specifically comprising:
(1) inputting a sample x to a sample layer;
(2) if the visual sample network is empty, creating the first node and initializing its category and instance count; returning to step (1);
(3) if the sample layer has only one node, calculating the difference rate with the node; if the difference rate is smaller than the set maximum difference rate, updating the weight of the node and increasing the number of instances of the node; otherwise, a new node is created;
(4) if the number of nodes in the sample layer is more than one, finding the node that best matches the sample x and calculating the difference rate; if the difference rate is larger than the set maximum difference rate, a new class node is created;
if the difference rate is smaller than the set minimum difference rate, or the distance between the input sample and the best matching node is smaller than the intra-class distance, updating the best matching node and its neighborhood and updating the number of instances; otherwise, checking whether the node fusion condition is satisfied: if the input can be merged with the best matching node, updating the best matching node; if the two cannot be merged, calculating the distance between the two nodes: if the distance exceeds the inter-class threshold, creating a new class node; otherwise, the network creates a node of the same class as the best matching node;
(5) updating the most matched node and the synapse effect of the neighborhood thereof;
(6) and transmitting the determined category information to a visual path symbol layer, waiting for response information fed back by the association layer, and then adjusting the learning result of the time.
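By way of a hedged illustration of the steps in claim 5, the following Python sketch implements one possible reading; the node fields, the difference-rate definition, and the omission of the fusion check are all assumptions:

import numpy as np

def dt_soinn_step(x, nodes, diff_max=0.5, diff_min=0.1, gamma_b=0.1):
    """One sample-layer learning step (sketch of one reading of claim 5).

    Each node is a dict with assumed fields: 'w' (weight vector), 'category',
    'instances', and the node's dynamic thresholds 'T_intra' and 'T_inter'.
    The difference rate is assumed to be the distance relative to ||w||.
    """
    def new_node(category):
        nodes.append({"w": x.copy(), "category": category, "instances": 1,
                      "T_intra": 0.0, "T_inter": np.inf})

    if not nodes:                                        # step (2): empty network
        new_node(0)
        return "first node"
    dists = [np.linalg.norm(x - n["w"]) for n in nodes]  # steps (3)/(4): best match
    b = int(np.argmin(dists))
    best = nodes[b]
    diff = dists[b] / (np.linalg.norm(best["w"]) + 1e-12)
    next_cat = max(n["category"] for n in nodes) + 1
    if diff > diff_max:                                  # clearly novel: new class node
        new_node(next_cat)
        return "new class node"
    if diff < diff_min or dists[b] < best["T_intra"]:    # clearly known: update best node
        best["w"] += gamma_b * (x - best["w"])
        best["instances"] += 1
        return "updated best node"
    # ambiguous region (the node-fusion check of claim 5 is omitted in this sketch):
    if dists[b] > best["T_inter"]:
        new_node(next_cat)
        return "new class node (beyond inter-class threshold)"
    new_node(best["category"])                           # same class as the best match
    return "new intra-class node"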
6. The method of claim 4, wherein the auditory pathway sample layer learns word vectors of names, specifically:
(1) inputting a name sample x into a sample layer;
(2) if the auditory sample network is empty, creating the first node and initializing its category and instance count; returning to step (1);
(3) if the auditory sample network is not empty, finding the best matching node and calculating the Levenshtein distance;
(4) if the Levenshtein distance between the two word vectors is zero, the two word vectors are identical, and the node is updated by increasing the instance count of the best node; otherwise, a new node is created;
(5) the result of the recognition is passed to the auditory pathway symbol layer.
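A hedged sketch of claim 6's matching rule, using plain strings in place of word vectors and a standard dynamic-programming Levenshtein distance:

def levenshtein(a, b):
    """Edit distance between two word strings (standard dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def learn_name(word, name_nodes):
    """Update the best node on an exact match, otherwise create a new node."""
    if not name_nodes:
        name_nodes.append({"word": word, "instances": 1})
        return
    best = min(name_nodes, key=lambda n: levenshtein(word, n["word"]))
    if levenshtein(word, best["word"]) == 0:  # the two words are identical
        best["instances"] += 1
    else:
        name_nodes.append({"word": word, "instances": 1})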
7. The method for autonomic cognitive development based on incremental associative neural network and dynamic audiovisual fusion as claimed in claim 4, wherein the visual pathway symbol layer receives the autonomic clustering result of the original shape or color feature sample layer and abstracts the result into corresponding symbols, specifically:
(1) initializing an empty visual symbol layer and receiving a class number l ∈ N+ from the visual sample layer;
(2) combining the number l and the corresponding feature f to form the symbol f_l;
(3) If the combined symbol does not exist, the symbol layer creates a new node; if the combined symbol is learned before, activating the corresponding symbol node and increasing the number of instances of the node;
(4) passing the symbols to an association layer;
(5) waiting for the reply signal of the associated layer, then adjusting the symbol node and passing the reply signal to the corresponding visual sample layer.
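A minimal sketch of steps (1) to (4), assuming the symbol is encoded as the string "<f>_<l>" and that symbol nodes are kept in a plain dictionary of instance counts:

def symbol_layer_update(feature_tag, class_number, symbol_nodes):
    """Combine feature tag f and class number l into the symbol f_l (sketch)."""
    symbol = f"{feature_tag}_{class_number}"   # e.g. 'shape_3' or 'color_7'
    if symbol in symbol_nodes:
        symbol_nodes[symbol] += 1              # activate the existing symbol node
    else:
        symbol_nodes[symbol] = 1               # create a new symbol node
    return symbol                              # passed on to the association layer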
8. The autonomic cognitive development method based on incremental relevance neural network and dynamic audio-visual fusion as claimed in claim 4, wherein the relevance layer establishes the relevance relationship between symbols in the visual pathway and the auditory pathway, specifically:
if only a signal is received from the visual pathway and the visual symbol pair is the same as the visual part of an association node a, that node is activated; the association layer feeds back the auditory part of the node to the lower-layer network as a top-down response, so that the name of the object can be called back;
if only a name is received from the auditory pathway and the auditory symbol matches the auditory part of an association node a, the association layer finds the most frequently activated node among the matched nodes as the best matching node and extracts its visual symbol pair to call back the visual features of the object;
when symbols are transmitted from the visual and auditory channels at the same time and an associated node matching both the visual and auditory parts exists, that node is activated and its instance count is updated; if no node is activated, the audiovisual symbols are combined and a new associated node is created.
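A hedged Python sketch of this association behavior follows; the node fields "visual", "names", and "count" are assumed stand-ins for the association node's visual symbol pair, name set, and activation count:

def associate(visual_pair=None, name=None, assoc_nodes=None):
    """Association-layer recall and learning (illustrative sketch of claim 8)."""
    assoc_nodes = assoc_nodes if assoc_nodes is not None else []
    if visual_pair and name:                      # both modalities present
        for node in assoc_nodes:
            if node["visual"] == visual_pair and name in node["names"]:
                node["count"] += 1                # activate and update the instance count
                return node
        node = {"visual": visual_pair, "names": {name}, "count": 1}
        assoc_nodes.append(node)                  # combine into a new associated node
        return node
    if visual_pair:                               # vision only: call back the name(s)
        for node in assoc_nodes:
            if node["visual"] == visual_pair:
                return node["names"]              # top-down response with the auditory part
    if name:                                      # audition only: call back the visuals
        matches = [n for n in assoc_nodes if name in n["names"]]
        if matches:                               # most frequently activated match wins
            return max(matches, key=lambda n: n["count"])["visual"]
    return None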
9. The method of claim 4, further comprising a top-down response process from the association layer to the sample layer, which specifically comprises:
the guiding signal is used for guiding the lower-layer network to process new input using the knowledge learned by the association layer; specifically, if the name of the current object has been heard before and the association layer has learned the visual symbols associated with that name, the association layer can compare the visual part of the learned node with the newly recognized visual symbols, and judge whether a new class node, or an intra-class node of the best matching node, needs to be created in the visual sample layer;
or,
when the visual parts of the associated nodes are the same but the name symbols are different, the association layer returns a conflict signal.
CN201811527643.6A 2018-12-13 2018-12-13 Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion Active CN109685196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811527643.6A CN109685196B (en) 2018-12-13 2018-12-13 Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811527643.6A CN109685196B (en) 2018-12-13 2018-12-13 Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion

Publications (2)

Publication Number Publication Date
CN109685196A CN109685196A (en) 2019-04-26
CN109685196B true CN109685196B (en) 2020-07-31

Family

ID=66187654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811527643.6A Active CN109685196B (en) 2018-12-13 2018-12-13 Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion

Country Status (1)

Country Link
CN (1) CN109685196B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070188B (en) * 2019-04-30 2021-03-30 山东大学 Incremental cognitive development system and method integrating interactive reinforcement learning
CN111012342B (en) * 2019-11-01 2022-08-02 天津大学 Audio-visual dual-channel competition mechanism brain-computer interface method based on P300
CN111062494B (en) * 2019-12-26 2023-06-16 山东大学 Robot self-organizing-thinking-back cognitive development method and system with life learning capability
CN113344215B (en) * 2021-06-01 2022-12-30 山东大学 Extensible cognitive development method and system supporting new mode online learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6615197B1 (en) * 2000-03-13 2003-09-02 Songhai Chai Brain programmer for increasing human information processing capacity
CN103353883B (en) * 2013-06-19 2017-02-22 华南师范大学 Big data stream type cluster processing system and method for on-demand clustering
RU2637300C1 (en) * 2016-11-29 2017-12-01 Государственное бюджетное образовательное учреждение высшего профессионального образования "Рязанский государственный медицинский университет имени академика И.П. Павлова" Министерства здравоохранения Российской Федерации Epilepsy diagnostics method based on set of electroencephalographic indicators, characteristics of exogenous and cognitive evoked potentials, motor and autonomic provision activities using artificial neural networks technology
CN108133259A (en) * 2017-12-14 2018-06-08 深圳狗尾草智能科技有限公司 The system and method that artificial virtual life is interacted with the external world
CN108333941A (en) * 2018-02-13 2018-07-27 华南理工大学 A kind of robot cooperated learning method of cloud based on mixing enhancing intelligence
CN108647850A (en) * 2018-04-03 2018-10-12 杭州布谷科技有限责任公司 It is a kind of based on artificial intelligence colleges and universities aspiration make a report on decision-making technique and system
CN108764447A (en) * 2018-05-16 2018-11-06 西安交通大学 A kind of group robot Majiang game intelligence dynamicization system and mahjong identification learning algorithm
CN109299777B (en) * 2018-09-20 2021-12-03 于江 Data processing method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN109685196A (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN109685196B (en) Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion
US11514305B1 (en) Intelligent control with hierarchical stacked neural networks
CN109783666B (en) Image scene graph generation method based on iterative refinement
Steels et al. Coordinating perceptually grounded categories through language: A case study for colour
EP1406135B1 (en) Man-machine interface unit control method; robot apparatus; and its action control method
Rathi et al. STDP based unsupervised multimodal learning with cross-modal processing in spiking neural networks
Taniguchi et al. Cross-situational learning with Bayesian generative models for multimodal category and word learning in robots
WO2021217282A1 (en) Method for implementing universal artificial intelligence
Hagiwara et al. Multiagent multimodal categorization for symbol emergence: emergent communication via interpersonal cross-modal inference
KR100306848B1 (en) A selective attention method using neural networks
CN113344215B (en) Extensible cognitive development method and system supporting new mode online learning
Nakamura et al. Concept formation by robots using an infinite mixture of models
Huang et al. An autonomous developmental cognitive architecture based on incremental associative neural network with dynamic audiovisual fusion
WO2021218614A1 (en) Establishment of general artificial intelligence system
Weng et al. Emergent Turing machines and operating systems for brain-like auto-programming for general purposes
Wang et al. Emergent spatio-temporal multimodal learning using a developmental network
Andrade et al. Implementation of Incremental Learning in Artificial Neural Networks.
US20200257503A1 (en) Auto-Programming for General Purposes and Auto-Programming Operating Systems
Levinson et al. Automatic language acquisition by an autonomous robot
Weng A model for auto-programming for general purposes
Kuremoto et al. A human-machine interaction system: A voice command learning system using PL-G-SOM
Xing et al. Artificial evolution network: A computational perspective on the expansibility of the nervous system
CN117093733A (en) Training method of media classification model, media data classification method and device
Ghayoumi et al. An adaptive fuzzy multimodal biometric system for identification and verification
KR102478367B1 (en) Method, apparatus and system for matching and recommendation of sound source based on image recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant