CN109685196A - Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion - Google Patents
- Publication number: CN109685196A
- Application number: CN201811527643.6A
- Authority
- CN
- China
- Prior art keywords
- node
- layer
- visual
- symbol
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06N3/004 — Artificial life, i.e. computing arrangements simulating life
- G06N3/008 — Artificial life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
- G06N3/045 — Combinations of networks
- G06N3/08 — Learning methods
Abstract
The invention discloses an autonomous cognitive development system and method based on an incremental associative neural network and dynamic audio-visual fusion, comprising a three-layer network structure of a sample layer, a symbol layer and an association layer. The three-layer network structure contains a visual pathway and an auditory pathway. In the visual pathway, the sample layer learns the original shape and color features of an object separately and clusters them autonomously; the symbol layer receives the autonomous clustering results of the shape and color sample layers and abstracts them into corresponding symbols. In the auditory pathway, the sample layer learns the word vectors of names; the symbol layer receives the word-vector categories of names and reduces them to symbols. The association layer establishes association relationships between symbols in the visual and auditory pathways, and feeds response signals back to the lower-layer networks according to known associations. Based on self-organizing neural networks, the system can autonomously develop object concepts and realize audio-visual fusion.
Description
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion.
Background
With more and more robots participating in the daily life of human beings, cognitive development has become a hot spot in the field of intelligent robots. Robots must be able to recognize and understand when communicating with humans, so a knowledge base shared with humans, such as object concepts, must be established. Robots often rely on internal knowledge pre-designed by humans and cannot adapt to unknown and dynamic environments. To solve this problem, the robot needs the ability to develop and recognize autonomously, like a human infant.
Human infants, using their own cognitive principles and parental guidance, are able to develop representations of the world rapidly before the age of two, forming original sample representations of objects and gradually developing symbolic representations ranging from simple to complex. The entire process is multi-modal, and when several modalities occur simultaneously, the brain establishes internal associations between them. This facilitates the development of complete object concepts in the cognitive process of the infant. Thus, a baby can learn about objects from some simple features (such as shape and color) and map these visual representations to the names taught by its parents.
Currently, some studies have applied the cognitive development theory or brain mechanisms of infants to the cognitive development of robots. Examples include: learning sample and symbolic representations of an object using an SVM; a developmental Environment-Refraction (Dev E-R) model capable of simulating the human "assimilation-adaptation" regulation process; and a robot that learns new knowledge using a salient object detection method and a genetic algorithm. However, these methods have the following disadvantages:
first, most learning processes are offline and take a lot of time to train the model;
second, the parameters or structure of the learning model are predefined and need to be retrained each time a new sample is encountered;
third, the robot cannot develop cognition through human-computer interaction and establish a common knowledge base with humans.
Therefore, autonomous cognitive development remains a great challenge for robots. Nevertheless, the cognitive development process of infants offers a useful reference. A baby gradually gets to know the world by observing objects and listening to names spoken by its parents; the robot can simulate this process, learn object concepts through human-computer interaction, and thereby improve its own intelligence. This process mainly involves audio-visual fusion and open-ended incremental learning.
Of the many audio-visual fusion studies that have been proposed, most are directed at target detection and recognition, and few address cognitive development. For example, some fusion networks use two deep-neural-network branches to learn visual images and sounds separately, and fuse information by concatenating the feature vectors of the two modalities. However, these computational models have fixed topologies and require offline training with large amounts of data. This exposes another problem of multi-modal fusion: how to design a generic multi-modal learning algorithm, so that a specific structure does not have to be designed for each modality. In the prior art, SOM has been used to learn information from three modalities (vision, hearing and posture), with the posture branch serving as a bridge for conversion between modalities, thereby realizing multi-modal fusion. However, SOM is also a fixed-topology network requiring a predefined number of nodes, which greatly limits the learning ability of the robot. Therefore, the robot's cognitive algorithm must be generic across modalities and dynamically expand the network as learned knowledge grows.
Although SOM cannot meet all of these requirements, incremental self-organizing neural networks can make up for its deficiencies. GNG can learn new classes in an online manner and gradually expand its network nodes, thereby enabling incremental learning, but the fixed iteration scheme it employs may make the network react too slowly to new inputs. GWR learns faster than GNG: it inserts a new node when a new sample exceeds the activation threshold and the best-matching node has been activated multiple times. Another advantage of GWR is that node weights are stabilized by a strategy of gradually reducing the learning rate. SOINN is also a very efficient incremental self-organizing neural network. Its biggest difference from GWR is that a new node is represented directly by the input vector, whereas in GWR a new node is represented by the average of the input vector and the best-matching node's weight, which destroys the true representation of the new sample.
Prior research has applied incremental self-organizing neural networks to multimodal fusion. Examples include a two-layer connection structure based on SOM for fusing the spatial position, shape and color of an object, and a hierarchical GWR structure for fusing multi-modal action representations; however, the fusion strategy in the latter work is to concatenate the weights of the lower-network neurons, which increases the dimensionality of the higher-layer neurons. In addition, that architecture sets a fixed similarity threshold for all nodes of the GWR, and it is difficult for an experimenter to set appropriate thresholds for all categories, which puts the network into a quantity-quality dilemma. GAM can overcome this disadvantage of GWR: it uses the connections among nodes to establish association relationships among modalities and can dynamically adjust the similarity threshold of each node. However, GAM employs supervised learning, where the class of each sample is known, so the network only needs to consider the intra-class distance; STAR-SOINN and M-SOINN consider only the inter-class distance. PCN enables online acquisition and binding of multimodal concepts, but during learning it relies on extensive human guidance to make decisions. Thus, there is little existing research on learning object classes and intra-class instances simultaneously in an unsupervised manner.
In addition, most methods have a unidirectional information flow, so higher-layer information cannot be fed back into lower-layer networks. One prior approach adopts an ART-based learning model in which the input can only activate the cognitive nodes of its category domain and read the weight of the best-matching node. A cognitive structure based on hierarchical object representations and an extended latent Dirichlet Allocation model likewise mainly implements classification tasks and requires predefined parameters. Although bidirectional learning structures have also been studied, their high-level information can only recover the relevant parts and cannot use known experience to guide the clustering of lower-level networks. The prior art also proposes a bidirectional cognitive structure based on a combination of multiple classifiers; however, the object names taught by humans in human-computer interaction are used as category symbols rather than as independent auditory knowledge. Another prior approach adopts a graphical menu interface for human-computer interaction and develops the robot's cognition through a model based on a dictionary generator and a naive Bayes classifier; this method is computationally simple and consumes little memory, but is not suitable for restoring object representations.
Disclosure of Invention
In order to solve the problems, the invention provides an autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion, which can simulate the cognitive development process of infants, learn multi-modal object concepts and establish the association relationship between objects and names.
In order to achieve the purpose, the invention adopts the following technical scheme:
Disclosed in one or more embodiments is an autonomous cognitive development system based on an incremental associative neural network and dynamic audio-visual fusion, comprising a three-layer network structure of a sample layer, a symbol layer and an association layer, the three-layer network structure comprising a visual pathway and an auditory pathway;
in the visual pathway:
the sample layer is used for respectively learning the original shape and the color characteristics of the object and carrying out autonomous clustering;
the symbol layer receives the autonomous clustering results of the shape and color sample layers and abstracts the autonomous clustering results into corresponding symbols;
in the auditory pathway:
a sample layer for learning word vectors of names;
the symbol layer receives the word vector type of the name and simplifies the word vector type into a symbol;
and the association layer establishes association relationship between symbols in the visual path and the auditory path and feeds back a response signal to the lower layer network according to the known association relationship.
Further, the sample layer of the visual pathway extracts a shape-normalized Fourier descriptor and a color histogram of the object as visual features, and constructs two networks to represent their dedicated regions; the activation function of the network is defined according to the difference rate, and the learning model of the visual pathway sample layer adopts a dynamic adaptive similarity threshold strategy so that the network clusters according to the data.
Further, the auditory pathway sample layer learns word vectors of names and calculates the difference between word vectors using the Levenshtein distance as an activation function; when a word vector is received, the best-matching node is found, and if the two word vectors are identical, the node is updated by increasing the instance count of the best node; otherwise, a new node is created.
Further, the learning algorithm of the symbol layer adopts incremental competitive learning, starting from an empty network and adding a new symbol node whenever an unknown class passed up from the sample layer is encountered.
Disclosed in one or more embodiments is an autonomous cognitive development method based on an incremental associative neural network and dynamic audio-visual fusion, comprising:
the visual pathway sample layer learns the original shape and color features of the object separately and performs autonomous clustering;
the visual pathway symbol layer receives the autonomous clustering result of the shape or color sample layer and abstracts it into corresponding symbols;
the auditory pathway sample layer learns word vectors of names;
the auditory pathway symbol layer receives the word vector category of the name and reduces it to a symbol;
the association layer establishes an association relationship between symbols in the visual pathway and the auditory pathway.
Further, the visual pathway sample layer learns the original shape and color features of the object separately and performs autonomous clustering, with the following specific process:
(1) inputting a sample x to the sample layer;
(2) if the visual sample network is empty, adding the first node with its category and instance count, and returning to step (1);
(3) if the sample layer has only one node, calculating the difference rate with that node; if the difference rate is smaller than the set minimum difference rate, updating the weight of the node and increasing its instance count; otherwise, creating a new node;
(4) if the sample layer has more than one node, finding the node best matching the sample x and calculating the difference rate; if the difference rate is larger than the set maximum difference rate, creating a new class node;
if the difference rate is smaller than the set minimum difference rate, or the distance between the input sample and the best-matching node is smaller than the intra-class threshold, updating the best-matching node and its neighborhood and updating its instance count; otherwise, checking whether the node fusion condition is satisfied: if the input can be merged with the best-matching node, updating the best-matching node; if they cannot be merged, calculating the distance between the two, and if the distance exceeds the inter-class threshold, creating a new class node, otherwise creating a node of the same class as the best-matching node;
(5) updating the synaptic effect of the best-matching node and its neighborhood;
(6) passing the determined category information to the visual pathway symbol layer, waiting for the response information fed back by the association layer, and then adjusting the current learning result.
Further, the auditory pathway sample layer learns word vectors of names, specifically:
(1) inputting a name sample x into the sample layer;
(2) if the auditory sample layer network is empty, adding the first node with its category and instance count, and returning to step (1);
(3) if the auditory sample layer network is not empty, finding the best-matching node and calculating the Levenshtein distance;
(4) if the Levenshtein distance between the two word vectors is zero, the two word vectors are identical, and the node is updated by increasing the instance count of the best node; otherwise, creating a new node;
(5) passing the recognition result to the auditory pathway symbol layer.
Further, the visual pathway symbol layer receives the autonomous clustering result of the shape or color sample layer and abstracts it into a corresponding symbol, specifically:
(1) initializing an empty visual symbol layer, and receiving a category number l from the visual sample layer, l ∈ N+;
(2) combining the number l with the corresponding feature f to form the symbol f_l;
(3) if the combined symbol does not exist, the symbol layer creates a new node; if the combined symbol has been learned before, activating the corresponding symbol node and increasing the instance count of that node;
(4) passing the symbol to the association layer;
(5) waiting for the response signal of the association layer, then adjusting the symbol node and passing the response signal to the corresponding visual sample layer.
Further, the association layer establishes association relationships between symbols in the visual pathway and the auditory pathway, specifically:
if only a signal is received from the visual pathway and the visual symbol pair is the same as the visual portion of an association node a, that node is activated; the association layer feeds the auditory part of the node back to the lower-layer network as a top-down response, so that the name of the object can be recalled;
if only a name is received from the auditory pathway and the auditory symbol matches the auditory part of association nodes, the association layer finds the most frequently activated node among the matching nodes as the best-matching node and extracts its visual symbol pair to recall the visual features of the object;
when symbols are transmitted from the audio-visual channels simultaneously and an association node matching both audio-visual parts exists, activating that node and updating its instance count; if no node is activated, combining the audio-visual symbols and creating a new association node.
Further, the method also includes a top-down response process from the association layer to the sample layer, specifically:
a guidance signal uses the knowledge learned by the association layer to guide the lower-layer network in processing new input; specifically, if the name of the current object has been heard before and the association layer has learned the visual symbols associated with that name, the association layer compares the learned visual part of the node with the newly recognized visual symbol and judges whether a new class node, or an intra-class node of the best-matching node, needs to be created in the visual sample layer;
or,
when the visual parts of an association node are the same but the name symbols are different, the association layer returns a conflict signal.
Compared with the prior art, the invention has the beneficial effects that:
based on the self-organizing neural network, a novel cognitive structure is provided, and the concept of an object can be autonomously developed and audio-visual fusion can be realized;
a dynamic self-adaptive similarity threshold strategy is provided, the intra-class distance and the inter-class distance can be automatically adjusted, the network can self-organize and cluster from data, and a new class and various intra-class examples can be learned at the same time;
the top-down response strategy is provided, so that a high-level network can autonomously feed back response information to a low-level network, and callback of an association mode, conflict association resolution and learned knowledge adjustment are realized without human guidance.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a schematic diagram of an autonomous cognitive development system based on incremental associative neural networks and dynamic audiovisual fusion;
FIG. 2 is a schematic view of a node region structure;
FIG. 3(a) is a schematic diagram of the network updating a node when the input can be covered;
FIG. 3(b) is a schematic diagram of the network creating a new intra-class node when the input cannot be covered;
FIG. 4 is a schematic diagram of 20 common fruit and vegetable datasets;
FIG. 5 is a graphical illustration of the trend of the number of visual sample layer network categories;
fig. 6(a) and fig. 6(b) show the intra-class and inter-class similarity thresholds of each node in the visual sample layer, respectively, after one pass of network learning.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
In one or more embodiments, an autonomous cognitive development system based on an incremental associative neural network and dynamic audio-visual fusion is disclosed, which is composed of a three-layer network and can learn the visual features and auditory names of objects online. During learning, the structure starts from an empty network and autonomously develops concrete and abstract object concepts. In addition, it can synchronously establish association relationships between visual features and names. A system block diagram is shown in fig. 1. In the visual pathway, two sample layers respectively learn the original shape and color features of an object, simulating the brain's mechanism of extracting and storing object features in separate regions. A symbol layer receives the self-organizing results of the shape and color sample layers and abstracts them into corresponding symbols. In the auditory pathway, a sample layer learns word vectors of names, and a symbol layer reduces the word-vector classes learned by the sample layer to symbols. The association layer realizes audio-visual fusion and develops the internal relation between the two modalities.
In order to autonomously learn the intrinsic relationships of the visual representation, the visual sample layer employs a strategy of dynamically adjusting the similarity threshold of each neuron. During learning, bottom-up excitatory activity produced by visual and auditory inputs drives the gradual development of the cognitive structure and forms the robot's own knowledge. At the same time, the known knowledge provides top-down guidance for current learning. Therefore, this embodiment extends the one-way information transfer adopted by OSS-GWR and GAM into bidirectional transfer, so that the robot can realize autonomous cognitive development.
1.1 sample layer
In the visual pathway, evidence from human brain physiology suggests that object concepts are stored in different neural circuits according to their attributes. For example, shape features are stored in the ventral and lateral occipito-temporal cortex, and color features are located in the lingual and fusiform gyri. Thus, shape-normalized Fourier descriptors and color histograms of objects are extracted here as visual features, and two networks are constructed to represent their dedicated regions. For the auditory pathway, neurons in the superior temporal gyrus are responsible for auditory word recognition. This document uses iFlytek's Automatic Speech Recognition (ASR) technology to translate human speech into words and designates another network to learn auditory representations.
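To make the visual features concrete, the following is a minimal Python sketch of a normalized Fourier descriptor and a simple color histogram of the kind described above; the function names, coefficient count, and the assumption that a boundary contour and HSV pixels are already available (e.g. via OpenCV) are illustrative, not taken from the patent.

```python
import numpy as np

def fourier_descriptor(contour_xy: np.ndarray, n_coeffs: int = 32) -> np.ndarray:
    """Shape feature: FFT magnitudes of the complex boundary, scale-normalized."""
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]   # boundary points as complex numbers
    F = np.fft.fft(z)
    mags = np.abs(F[1:n_coeffs + 1])               # drop F[0]: translation invariance
    return mags / mags[0]                          # divide by |F[1]|: scale invariance

def color_histogram(pixels_hsv: np.ndarray, bins: int = 16) -> np.ndarray:
    """Color feature: normalized hue histogram over the object's pixels (an assumed
    simplification; the patent only specifies 'color histogram')."""
    hist, _ = np.histogram(pixels_hsv[:, 0], bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)
```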
1.1.1 Visual pathway
The learning model of the visual sample layer is the Dynamic Threshold Self-Organizing Incremental Neural Network (DT-SOINN), which combines the characteristics of the GWR and SOINN networks but changes the clustering method of the network, adopting a dynamic adaptive similarity threshold strategy so that the network clusters according to the data. Unlike GWR, which configures all nodes with the same fixed threshold, and SOINN, whose thresholds depend entirely on the connections between nodes, DT-SOINN assigns each node an intra-class threshold and an inter-class threshold, allowing the network to learn different classes and intra-class instances step by step.
In the learning process, DT-SOINN starts from an empty network, processes the input samples in turn, and dynamically adjusts its topology according to competitive Hebbian learning. The activation function of the network is defined by a difference rate, calculated from the distance between the input and its best-matching node and the weight of that node:

a_b = ||x - w_b|| / ||w_b||  (1)

where x denotes the input vector and w_b the weight of the best-matching node b. If the difference rate is very large or very small, the network can easily decide to create a new class node or update the weight of node b. The network can thus first use a larger difference rate ε_H and a smaller difference rate ε_L to process part of the inputs. When development is incomplete, however, the network cannot make an accurate decision in the intermediate cases.
For this case, two similarity thresholds are defined for each node: an intra-class threshold TL and an inter-class threshold TH. As shown in fig. 2, the two parameters divide the area around a node into three parts: the coverage area, the intra-class area and the out-of-class area. TL determines the coverage area of the node and TH denotes the class boundary. Both thresholds are initialized with a small value determined by the weight of each node, as shown in equation (2).

TL = TH = ε_L·||w||  (2)

In the subsequent learning process, the two thresholds are gradually updated by the input data. When a new sample is input, the network makes one of three action decisions according to the two similarity thresholds: add a new class node, add an intra-class node, or update the weight of the best-matching node.
TL is updated using a fusion policy. The prior art treats each node as a hyper-ellipsoid: when two nodes are very close to each other, if the volume of the fused node is smaller than the sum of the volumes of the two nodes, the two nodes can be merged into one to make the network more compact. This scheme replaces volume with distance and compares the two cases shown in fig. 3(a), (b): one is to update the best-matching node b to b', and the other is to create a new intra-class node represented by the input x. If the coverage area of node b' can cover both b and x, i.e. twice the intra-class threshold of node b' exceeds the sum of the intra-class thresholds of the two nodes, as in equation (3), then the input x can be considered very similar to the best-matching node. The network updates node b and its neighborhood, and TL_b is updated by equation (4). To ensure the reliability of the result, TL_b must be smaller than the shortest connection of node b.

2·TL_b' > TL_b + TL_x  (3)

where TL'_b represents the updated value of TL_b and C_b is the neighborhood set of node b.
Since it is difficult to configure a proper TH for each node, the dynamic adaptive similarity threshold strategy enables the network to learn from the data itself and adjust the threshold. The network sets a very small initial TH for each node; when the network creates a new intra-class node for node b, TH_b is updated by equation (5). If b has a neighborhood, TH_b is updated by the longest connection between node b and its neighbors; if b is an isolated node, TH_b and TL_b are equal. The strategy of creating intra-class nodes is crucial for updating TH_b: the generation of intra-class nodes is controlled by the association layer, which produces a top-down response using the empirical knowledge it stores. At the same time, the new TH_b also serves in future learning as the threshold for deciding whether to generate an intra-class node, e.g. when the distance between the input and the best-matching node lies in [TL_b, TH_b]. The TH of each node can thus be gradually enlarged during learning, guiding the network to learn all samples that belong to the same class as the node. Eventually, the network forms stable boundaries between classes.
During learning, the network needs to calculate the activation value a of the best-matching node. If the activation value exceeds the difference rate ε_H, DT-SOINN creates a new class node at the location of the input sample. If the activation value is less than the difference rate ε_L and the distance between the input and the best-matching node is less than TL_b, DT-SOINN updates the best-matching node. This scheme adopts the node weight update strategy of GWR rather than SOINN: although both use a regulated decaying learning rate, GWR models the adaptation mechanism of synapses under repeated stimulation and is therefore more suitable for biomimetic cognitive studies.
w'_b = w_b + γ_b·η_b·(x - w_b)  (6)
w'_n = w_n + γ_n·η_n·(x - w_n)  (7)

where 0 < γ_n < γ_b < 1 are the learning rates of node b and its neighbors n, respectively, and η_b and η_n denote the synaptic effects of node b and its neighboring nodes n, as shown in equations (8) and (9):

Δη_b = τ_b·α_b·(1 - η_b) - τ_b  (8)
Δη_n = τ_n·α_n·(1 - η_n) - τ_n  (9)

where η_0 is the initial value of the synaptic effect, α_b, τ_b, α_n, τ_n are the time constants determining the decay rate of the synaptic effect, and S(t) is the stimulus strength. The synaptic effect decreases as the number of activations increases, so the update amplitude of the node weight gradually decays to 0 and the node finally reaches a stable state.
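As a sketch of equations (6)-(9), the following Python fragment shows how the weight update is scaled by the synaptic effect and how that effect decays with repeated activation; the function names and any constants are illustrative assumptions.

```python
import numpy as np

def update_weights(x, w_b, w_neighbors, eta_b, eta_n, gamma_b=0.5, gamma_n=0.05):
    """Move the best-matching node b (and its neighbors) toward input x,
    scaled by learning rate gamma and synaptic effect eta (eqs. 6-7)."""
    w_b = w_b + gamma_b * eta_b * (x - w_b)
    w_neighbors = [w + gamma_n * eta_n * (x - w) for w in w_neighbors]
    return w_b, w_neighbors

def update_synaptic_effect(eta, alpha, tau):
    """Delta_eta = tau*alpha*(1 - eta) - tau, as in eqs. (8)-(9); eta shrinks
    with repeated activation, so weight updates fade toward zero."""
    return eta + tau * alpha * (1.0 - eta) - tau
```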
However, if none of the above conditions is satisfied, DT-SOINN checks whether the node fusion condition of equation (3) is satisfied. If the input can be merged with the best-matching node b, the network updates b. Otherwise, the network calculates the distance between the two: if the distance exceeds the inter-class threshold TH_b, the network creates a new class node; otherwise it creates a node of the same class as b. In addition, DT-SOINN also receives response signals fed back from the higher-level network and adjusts the learning result according to experience.
DT-SOINN uses two fixed difference rates for coarse decisions and two dynamically adjustable similarity thresholds for fine decisions. This reduces the amount of computation and endows the robot with more autonomous cognitive ability, since a proper similarity threshold is learned for each node. The complete algorithm of DT-SOINN is as follows:
1. Initialize an empty visual sample network: V = {}.
2. Input a sample x to the sample layer.
3. If the network is empty:
a. Add the first node 1: V = V ∪ {1}, w_1 = x, TL_1 = TH_1 = ε_L·||x||, category c_1 = 1 and instance number ins_num_1 = 1.
b. Return to step 2 to process the next sample.
4. If the sample layer has only one node:
a. Calculate a_1 = ||x - w_1|| / ||w_1||.
b. If a_1 < ε_L:
i. Update the weight of node 1: w_1 ← w_1 + γ_b·η_1·(x - w_1).
ii. Increase the instance number: ins_num_1 = ins_num_1 + 1.
c. Otherwise, create a new node 2: V = V ∪ {2}, w_2 = x, TL_2 = TH_2 = ε_L·||x||, c_2 = 2 and ins_num_2 = 1.
5. Otherwise, find the best-matching node b = argmin_{i∈V} ||x - w_i|| and calculate a_b = ||x - w_b|| / ||w_b||.
a. If a_b > ε_H:
i. Create a new node r: V = V ∪ {r}, w_r = x, TL_r = TH_r = ε_L·||x||, c_r = len(c)+1 and ins_num_r = 1.
b. Otherwise, if a_b < ε_L or ||x - w_b|| < TL_b:
i. Update node b and its neighborhood n: w_b ← w_b + γ_b·η_b·(x - w_b), w_n ← w_n + γ_n·η_n·(x - w_n).
ii. Update the instance number: ins_num_b = ins_num_b + 1.
c. Otherwise:
i. Hypothesis 1: the best-matching node is updated to b': w_b' = w_b + γ_b·η_b·(x - w_b), and its intra-class threshold is updated as TL_b' = ε_L·||w_b'||.
ii. Hypothesis 2: the input x is created as a new intra-class node of b: w_x = x and TL_x = ε_L·||x||.
iii. If 2·TL_b' > TL_b + TL_x, choose hypothesis 1 and update the weights of node b and its neighborhood using equations (6) and (7) respectively, updating TL_b with equation (4) at the same time.
iv. Otherwise, choose hypothesis 2 and calculate the current TH_b using equation (5).
v. If ||x - w_b|| > TH_b, add a new class node as in step 5-a-i.
vi. Otherwise, build an intra-class node r and connect nodes b and r: V = V ∪ {r}, w_r = x, TL_r = TH_r = ε_L·||x||, c_r = c_b, ins_num_b = ins_num_b + 1.
6. Update the synaptic effect of node b and its neighborhood n: Δη_b = τ_b·α_b·(1 - η_b) - τ_b, Δη_n = τ_n·α_n·(1 - η_n) - τ_n.
7. Pass the determined category information to the visual symbol layer.
8. Wait for the response information fed back by the association layer, then adjust the learning result.
9. Return to step 2 and process the next sample.
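The following is a condensed Python sketch of this learning loop, assuming Euclidean distance on feature vectors; node fusion (eq. 3), threshold adaptation (eqs. 4-5), neighborhood edges, habituation and the top-down response hook are omitted for brevity, so it illustrates the decision structure rather than the full DT-SOINN. All names and parameter values are assumptions.

```python
import numpy as np

class DTSOINN:
    """Condensed, illustrative sketch of one DT-SOINN learning step."""

    def __init__(self, eps_L=0.1, eps_H=0.6, gamma_b=0.5):
        self.eps_L, self.eps_H, self.gamma_b = eps_L, eps_H, gamma_b
        self.w, self.TL, self.TH, self.cls, self.ins = [], [], [], [], []

    def _new_node(self, x, cls):
        t = self.eps_L * np.linalg.norm(x)      # initial thresholds, eq. (2)
        self.w.append(np.array(x, dtype=float))
        self.TL.append(t); self.TH.append(t)
        self.cls.append(cls); self.ins.append(1)

    def learn(self, x):
        x = np.asarray(x, dtype=float)
        if not self.w:                                    # step 3: empty network
            self._new_node(x, cls=1)
            return self.cls[-1]
        dists = [np.linalg.norm(x - w) for w in self.w]
        b = int(np.argmin(dists))                         # best-matching node
        a_b = dists[b] / np.linalg.norm(self.w[b])        # difference rate, eq. (1)
        if a_b > self.eps_H:                              # step 5-a: new class node
            self._new_node(x, cls=max(self.cls) + 1)
        elif a_b < self.eps_L or dists[b] < self.TL[b]:   # step 5-b: covered by b
            self.w[b] += self.gamma_b * (x - self.w[b])
            self.ins[b] += 1
            return self.cls[b]
        elif dists[b] > self.TH[b]:                       # outside class boundary
            self._new_node(x, cls=max(self.cls) + 1)
        else:                                             # between TL_b and TH_b
            self._new_node(x, cls=self.cls[b])            # new intra-class node
        return self.cls[-1]                               # category passed upward
```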
1.1.2 auditory pathway
In the auditory pathway, each name is first translated into a word vector using ASR. Since the ASR has a speech recognition accuracy of over 98% and this document only involves simple words, it can be assumed that each recognition result corresponds to one name. Acoustic characteristics (such as tone and volume) are beyond the scope of this study and need not be considered here; the network only needs to compare the corresponding word vectors to distinguish names. Although OSS-GWR uses Google's ASR to generate action vocabularies, those vocabularies are predefined and cannot be expanded online through learning. The method of the invention aims at lifelong learning, handling any name the robot hears. Each word vector can be considered a separate category, recorded in a node.
The learning model in the auditory sample layer is the Levenshtein Distance Self-Organizing Incremental Neural Network (LD-SOINN). The network computes the difference between word vectors using the Levenshtein distance as an activation function. When a word vector is received, the network finds the best-matching node according to equation (10):

b = argmin_{i∈D} L(x, w_i)  (10)

where L represents the Levenshtein distance. If the two word vectors are identical, i.e. L(x, w_b) = 0, LD-SOINN updates the node by increasing its instance count. Otherwise, the network creates a new node. The specific process is as follows:
1. Initialize an empty auditory sample network: D = {}.
2. Input a name x.
3. If the network is empty:
a. Create the first node 1: D = D ∪ {1}, w_1 = x, category c_1 = 1, instance number ins_num_1 = 1.
b. Return to step 2 to process the next name.
4. Otherwise, find the best-matching node b according to equation (10).
a. If L(x, w_b) = 0, update node b: ins_num_b = ins_num_b + 1.
b. Otherwise, add a new node r: D = D ∪ {r}, w_r = x, c_r = len(c)+1, ins_num_r = 1.
5. Pass the recognition result to the auditory symbol layer.
6. Return to step 2 to process the next name.
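A minimal Python sketch of the LD-SOINN step follows, using a standard dynamic-programming Levenshtein distance; the node layout (dicts with weight, class and instance count) is an assumption for illustration, not the patent's data structure.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ld_soinn_step(x: str, nodes: list) -> int:
    """Return the category of name x; create a node when no exact match exists."""
    if nodes:
        b = min(nodes, key=lambda n: levenshtein(x, n["w"]))   # eq. (10)
        if levenshtein(x, b["w"]) == 0:                        # identical name
            b["ins_num"] += 1
            return b["cls"]
    nodes.append({"w": x, "cls": len(nodes) + 1, "ins_num": 1})
    return nodes[-1]["cls"]
```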
1.2 symbol layer
The symbol layer receives the clustering information of the visual and auditory sample layers and forms abstract, short symbolic representations. As shown in fig. 1, this layer is likewise divided into two pathways: the visual part learns the categories of visual features, and the auditory part handles the symbolic representations of names. The learning algorithm of the symbol layer also adopts incremental competitive learning, starting from an empty network and adding a new symbol node whenever an unknown class is passed up from the sample layer.
This layer uses s_i, c_j, n_k to represent the three category symbols, i.e. shape, color and name symbols, where i, j, k ∈ N+. Each time, the sample layer passes a set of category numbers i, j, k to the symbol layer, and the symbol layer combines these numbers with the corresponding feature symbols s, c, n to form the visual symbols s_i, c_j and the auditory symbol n_k. Because the weight of each symbol node is a character string, the learning model of each symbol layer adopts the LD-SOINN algorithm. If a combined symbol does not exist, the symbol layer creates a new node; if the symbol has been learned previously, the corresponding symbol node is activated and its instance count is incremented. The algorithm of the symbol layer is called the Symbol Self-Organizing Incremental Neural Network (S-SOINN). The learning procedure of the visual pathway is as follows; the auditory pathway is the same.
1. Initialize an empty visual symbol layer: S = {}.
2. Receive a category number l from the visual sample layer, l ∈ N+.
3. Combine the number l and the corresponding feature f to form the symbol f_l, where f ∈ {s, c, n}.
4. Learn according to the LD-SOINN learning algorithm.
5. Pass the symbol to the association layer.
6. Wait for the response signal of the association layer, then adjust the symbol node and pass the response signal to the corresponding visual sample layer.
7. Return to step 2 and process the next category number.
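A minimal Python sketch of this symbol-layer step follows; representing the layer as a symbol-to-instance-count dictionary is an illustrative simplification of the node structure.

```python
def s_soinn_step(feature: str, number: int, nodes: dict) -> str:
    """Fuse a category number with its feature tag into a string symbol and
    grow a node the first time that symbol is seen."""
    assert feature in {"s", "c", "n"}        # shape, color, name
    symbol = f"{feature}{number}"            # e.g. shape category 3 -> "s3"
    if symbol in nodes:
        nodes[symbol] += 1                   # activate: bump instance count
    else:
        nodes[symbol] = 1                    # unknown class: new symbol node
    return symbol                            # passed on to the association layer
```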
The symbol layer may implement online incremental learning. Although symbols have no clear semantic meaning, they are an internal representation that is automatically generated by the robot. These symbols can reduce the complexity of cognitive computation, can be used for other high-level cognitive processes, and promote the cognitive development of the robot. In the following associative learning, symbols are used for audiovisual fusion and play an important role. In addition, the symbol layer is a bridge for bi-directional information transfer between the sample layer and the associated layer, including bottom-up input and top-down response.
1.3 Association layer
The human brain has three white-matter tracts that connect different regions of the brain and dominate object recognition and understanding. This scheme designs an association layer and its learning algorithm, R-SOINN, to simulate this mechanism of the brain. The algorithm can establish association relationships between the visual and auditory symbols transmitted by the two symbol layers so as to connect the visual and auditory pathways, and can feed response signals back to the lower-layer networks according to known associations.
In the association layer, the weight w_a of each node is formed by combining three symbols, as shown in equation (11).

w_a = {s_i, c_j, n_k}  (11)

To improve the autonomy of the robot, R-SOINN does not require the audio-visual information to occur simultaneously. The association layer may receive only two visual symbols, only one auditory symbol, or symbols of both modalities. An association node can be activated by a symbol of either modality, as long as the auditory symbol or the visual symbol pair matches the corresponding modality portion of the node. Specifically, if the network receives signals from the visual pathway only and the visual symbol pair s_i, c_j is the same as the visual portion of an association node a, that node is activated. The association layer feeds the auditory portion n_k of the node back to the lower-layer network as a top-down response, so that the name of the object can be recalled. Likewise, if the network receives only a name from the auditory pathway and the auditory symbol n_k matches the auditory part of association nodes, the association layer finds the most frequently activated node among these matching nodes as the best-matching node and extracts its visual part s_i, c_j to recall the visual features of the object. This means that a name symbol can belong to many association nodes, but a visual symbol pair maps to only one node. In addition, considering that an object may have multiple aliases, the invention extends the weight of the association node:

w_a = {s_i, c_j, n_k, …, n_m}  (12)

where a, i, j, k, …, m ∈ N+ and {n_k, …, n_m} denotes all the names of the object whose visual features are s_i, c_j. The method does not limit the number of name symbols each association node can learn, and each name symbol can activate the association node and call back the visual portion.
In addition, when the audio-visual channels transmit symbols simultaneously and an association node whose audio-visual parts match exists, R-SOINN activates that node and updates its instance count. If no node is activated, R-SOINN combines the audio-visual symbols and creates a new association node. The audio-visual associations can be learned online in real time, and their concise form helps the robot develop high-level cognitive abilities.
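A minimal Python sketch of the association-node matching just described follows, assuming each node stores a visual pair plus a set of name symbols (eq. 12); the structure and helper names are illustrative, not from the patent.

```python
def r_soinn_step(nodes: list, visual=None, name=None):
    """visual: an (s_i, c_j) pair or None; name: an n_k symbol or None."""
    if visual and name:
        for a in nodes:
            if a["visual"] == visual and name in a["names"]:
                a["ins_num"] += 1                  # known association: activate
                return a
        node = {"visual": visual, "names": {name}, "ins_num": 1}
        nodes.append(node)                         # new audio-visual association
        return node
    if visual:                                     # vision only: recall the name(s)
        for a in nodes:
            if a["visual"] == visual:
                a["ins_num"] += 1
                return {"recall": a["names"]}
    if name:                                       # name only: recall the visual pair
        matches = [a for a in nodes if name in a["names"]]
        if matches:
            b = max(matches, key=lambda a: a["ins_num"])  # most activated node
            b["ins_num"] += 1
            return {"recall": b["visual"]}
    return None                                    # nothing matched: no callback
```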
In the learning process, the association layer not only receives the bottom-up information to learn the association relationship between the modalities, but also generates top-down response by using the known experience so that the robot can autonomously process different situations, such as callback, conflict resolution and knowledge adjustment. The specific details of R-SOINN are as follows:
1. Initialize an empty association layer network: R = {}.
2. Receive a set of symbols f_l from the symbol layer.
3. If f_l = {s_i, c_j, n_k}:
a. Find the best-matching node b, where N represents the number of name symbols in node b.
b. If node b does not exist and no conflict with known associations is caused:
i. Obtain all association nodes containing the name symbol n_k: R_1 = {w_1, w_2, …, w_q} ⊆ R, where q is the number of elements in R_1.
ii. Create a new node r: R = R ∪ {r}, w_r = {s_i, c_j, n_k}, ins_num_r = 1.
iii. Use the visual symbol pairs in R_1 as a top-down guidance signal: guidance = {{w_1[1], w_1[2]}, …, {w_q[1], w_q[2]}}.
iv. Wait for the bottom-up adjustment result.
c. Otherwise, if node b exists and the symbol combination conflicts with a known association:
i. Find the conflicting association node and feed its auditory part back to the lower-layer network as a top-down conflict signal: conflict = {n_k, …, n_m}.
ii. Wait for the bottom-up adjustment result.
d. Otherwise:
i. Update the best-matching node b: ins_num_b = ins_num_b + 1.
ii. Return conflict = {}, guidance = {}.
4. If f_l = {s_i, c_j}:
a. Find the best-matching node b.
b. If node b does not exist, return to step 2 and process the next set of symbols.
c. Otherwise:
i. Update node b: ins_num_b = ins_num_b + 1.
ii. Feed all name symbols back to the lower-layer network as a callback signal: recall = {n_k, …, n_m}.
5. If f_l = {n_k}:
a. Find the set of matching nodes A_b.
b. If no matching node exists, return to step 2 and process the next set of symbols.
c. Otherwise:
i. Find the most frequently activated node in A_b as the best-matching node b: b = argmax_{a∈A_b} ins_num_a.
ii. Update node b: ins_num_b = ins_num_b + 1.
iii. Return the visual symbol pair as a visual callback signal: recall = {s_i, c_j}.
6. Otherwise, f_l = {}.
7. Return to step 2 and process the next set of symbols.
1.4 Top-down response signals
The three sections above describe the bottom-up information transfer process. The complete learning process of the cognitive structure also includes top-down responses from the association layer to the sample layer. The bottom-up learning process realizes the growth of knowledge in cognitive development, while the top-down response process aims at using and adjusting the learned knowledge, thereby improving the robot's cognitive level and making it more intelligent. The response signals generated by the association layer are of three types: callback, guidance and conflict resolution.
1.4.1 callback
Numerous studies of the brain have provided evidence that single modality information can also activate other modality neurons associated with it. The callback process aims to mimic this cognitive activity of the brain. When the robot receives only single modality information, the callback signal can recall a representation of another modality. But the premise of callbacks is that the network must have learned this audiovisual association or else the callbacks will fail.
When the robot sees an object, if an associated node is activated by a visual symbol pair, the association layer will return the audible portion of that node as a callback signal. Similarly, if the robot hears the name, the association layer extracts the corresponding pair of visual symbols and feeds them back to the lower layer network as callback visual signals. When the association layer returns a callback signal, the signal will propagate through the symbol layer and reach the corresponding sample layer. The sample layer then selects the most frequently activated node in the set of matching categories as a representative weight and outputs it as the target modality representation. The details are as follows:
1. If the association layer returns a callback signal recall:
a. If recall = {n_k, …, n_m}:
i. Pass recall to the auditory symbol layer and convert the symbols to their original numeric form: n_p → p, (p ∈ {k, …, m}).
ii. Pass p to the auditory sample layer, find the relevant word vector and output it.
b. If recall = {s_i, c_j}:
i. Pass recall to the visual symbol layer and convert the symbols to their original numeric form: s_i → i, c_j → j.
ii. Pass i and j to the shape and color sample layers respectively, find the most frequently activated node of the corresponding category as the typical node, t = argmax_{p: c_p = i (resp. j)} ins_num_p, and output its weight w_t as the visual sample representation.
2. End.
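A minimal Python sketch of the final step of a callback follows, where a sample layer turns a recalled category into a typical representation; the symbol format and node layout continue the earlier sketches and are assumptions.

```python
def recall_sample(recall_symbol: str, sample_nodes: list):
    """Resolve a recalled symbol (e.g. 's3') to the weight of the most
    frequently activated node of that category in the sample layer."""
    number = int(recall_symbol[1:])          # strip the feature tag: "s3" -> 3
    candidates = [n for n in sample_nodes if n["cls"] == number]
    if not candidates:
        return None                          # association was never learned
    typical = max(candidates, key=lambda n: n["ins_num"])
    return typical["w"]                      # typical representation of the class
```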
Callback is the most important process in the top-down response and the basis of the other two response processes, since the learned audio-visual associations are fully utilized here. Unlike GAM and OSS-GWR, which directly store and invoke low-level representations in the association layer in a unidirectional manner, this approach lets each layer learn and process its own type of information and recall layer by layer, which is closer to the way the brain processes information. When complex associations are involved, GAM and OSS-GWR may occupy additional memory or suffer the curse of dimensionality. Because this method adopts symbolic representations, the association layer structure is simpler and more effective, and the problem of growing dimensionality is avoided.
1.4.2 Guidance signals
The guidance signal uses knowledge learned by the association layer to guide the lower-layer network in processing new inputs. Specifically, if the name of the current object has been heard before and the association layer has learned the visual symbols associated with that name, the association layer can compare the learned visual part of the node with the newly recognized visual symbol to determine whether a new class node, or an intra-class node of the best-matching node, needs to be created in the visual sample layer. Thus, if the current name activates an association node, the network generates the guidance signal. The association layer selects the most frequently activated visual symbol pairs and returns them to the visual symbol layer, and the guidance signal is finally passed to the visual sample layer to adjust the current learning result. After the adjustment, the change in the visual sample layer is again transmitted bottom-up and updates the entire visual pathway and the association layer. The specific process is as follows:
1. If the association layer returns a guidance signal guidance = {{w_1[1], w_1[2]}, …, {w_q[1], w_q[2]}}:
a. Pass the symbols through the symbol layer and convert them to their original numeric form: f_l → l, (l = {l_1, l_2, … l_z}), where f_l = {w_1[1], …, w_q[1]} when shape symbols are passed and f_l = {w_1[2], …, w_q[2]} when color symbols are passed.
b. Pass l to the corresponding sample layer and find the corresponding categories: c = {l_1, l_2, … l_z}.
c. If the current best-matching category c_b is not in the class set c and a new class node r has just been added in the sample layer:
i. In the corresponding sample layer, change the new class node into an intra-class node of node b and update the class of node r: c_r = c_b.
ii. Add an edge between nodes b and r and update the TH of both nodes.
iii. Pass the adjusted result to the corresponding symbol layer, remove the new symbol node and update the symbol node to which c_b maps: ins_num = ins_num + 1.
iv. Pass the new symbol recognition result to the association layer, remove the new association node and learn the new symbol combination.
2. Judge whether a conflict exists.
1.4.3 Conflict signals
When the visual parts of an association node are the same but the name symbols differ, the association layer returns a conflict signal. There are three possible causes of a conflict. First, the two names are aliases of each other. Second, the node created in the visual sample layer is inaccurate: the current features should not be treated as an intra-class node but as a new class node. Third, the robot heard a wrong name. The robot can handle these cases by fully utilizing known knowledge, without asking humans for answers as PCN does. Therefore, several conflict-resolution rules are designed so that the robot autonomously develops the ability to infer and judge.
In the early learning process, the network cannot make accurate judgments with little experience. Recording the current learning actions in the visual sample layers provides a clue for resolving these conflicts. When the robot hears a new name, no corresponding association node exists in the association layer, and there are two possibilities. If the learning action in both the shape and color sample layers was to update a node, then the robot has seen the object before and the new name is an alias of the conflicting name; the association layer feeds back the resolution signal γ = 1 and adds the new name to the conflicting association node. Otherwise, the association layer finds which feature node was created as an intra-class node and feeds back the resolution signal γ = 2 to the corresponding sample layer, changing that node into a new class node.
When the new name already has a corresponding association node, the association layer extracts all its visual symbol pairs, calls up their typical sample representations, and then calculates the distance between the current input and these representations. If both the shape and color distances are less than the similarity thresholds TH of a typical node, the current object is very similar to the conflicting object and the current name can be treated as its alias; the association layer feeds back the resolution signal γ = 1. If one feature is within its threshold TH while the other exceeds it, γ = 2 is fed back so that the node created for the exceeding feature becomes a new class node. If neither feature meets its threshold, the association layer recalls the first conflicting name, indicates that the current name is wrong, and then states the object's correct name based on the learned knowledge. The specific steps are as follows, with a code sketch after them.
1. If conflict = {n_k, …, n_m}, then:
a. Find all association nodes matching the current auditory symbol and treat these nodes as the conflict set.
b. If no such node exists, i.e., the current name symbol n_curr is new, then:
i. If the learning actions of both the shape and color sample layers are to update a node, then:
1) The association layer feeds back γ = 1: treat n_curr as an alias of the conflicting name.
2) Add n_curr to the association node: w_a = w_a ∪ {n_curr}.
ii. Otherwise, if the learning action of the shape (color) layer is to create an intra-class node:
1) Feed back γ = 2 to the shape (color) sample layer.
2) Remove the intra-class node and add a new class node, as in step 5-a-i of DT-SOINN.
3) Transmit the adjusted result to the association layer for learning again.
c. Otherwise:
i. Extract all visual symbol pairs and recall their typical sample representations.
ii. Calculate the distance between the current input and each typical sample node: d = ||x − w_t||.
iii. If there is one visual symbol pair q whose shape and color distances are both within the thresholds TH, then go to step 1-b-i, feed back γ = 1 and add n_curr to the association node: w_a = w_a ∪ {n_curr}.
iv. Else, if the shape distance is within its TH and the color distance exceeds it, go to step 1-b-ii and feed back γ = 2 to add a new color class node to the color sample layer.
v. Else, if the color distance is within its TH and the shape distance exceeds it, go to step 1-b-ii and feed back γ = 2 to add a new shape class node to the shape sample layer.
vi. Else, recall the sample representation w_conf of the first conflicting name and output: "this name is wrong, and the correct name is w_conf".
2. Process the next object.
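A condensed Python sketch of these rules (the node structure, thresholds and the `actions` record are assumptions; only the branching mirrors the steps above):

```python
import numpy as np

# Hypothetical sketch of conflict resolution. gamma = 1 marks the current
# name as an alias; gamma = 2 turns an intra-class node into a new class
# node; gamma = 0 here stands for the wrong-name case.

def resolve_conflict(x, n_curr, conflict_nodes, actions):
    # x              -- current features, e.g. {'shape': ..., 'color': ...}
    # n_curr         -- current (conflicting) name symbol
    # conflict_nodes -- association nodes matching the auditory symbol; each
    #                   has typical representations rep[...], thresholds
    #                   th[...] and a set of names
    # actions        -- this round's learning action per sample layer
    if not conflict_nodes:                        # step 1-b: the name is new
        if actions['shape'] == 'update' and actions['color'] == 'update':
            return 1, 'n_curr is an alias of the conflicting name'
        layer = 'shape' if actions['shape'] == 'create_intra' else 'color'
        return 2, f'replace the intra-class node in the {layer} layer'

    for node in conflict_nodes:                   # step 1-c: recall, compare
        d = {m: np.linalg.norm(x[m] - node.rep[m]) for m in ('shape', 'color')}
        if d['shape'] <= node.th['shape'] and d['color'] <= node.th['color']:
            node.names.add(n_curr)                # gamma = 1: record alias
            return 1, 'alias added to the association node'
        if d['shape'] <= node.th['shape']:
            return 2, 'add a new color class node'
        if d['color'] <= node.th['color']:
            return 2, 'add a new shape class node'
    return 0, f'wrong name; the correct name is {conflict_nodes[0].names}'
```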
In this bidirectional cognitive structure, all cognitive activities (e.g., recognition, learning, and decision making) proceed in parallel. Most studies use one-way models that can only accumulate knowledge but cannot exploit higher-level information to adjust the lower-level networks. Although a unidirectional model can also call back other modalities, the lower-level representations must be passed to and stored in the higher-level network. The bidirectional approach of this document allows each layer network to handle only homogeneous representations. In addition, the response signals of the association layer can assist lower-layer network learning, so the robot can handle more complex situations.
2. Experiment and results
To test the effectiveness of the vision- and hearing-based bidirectional cognitive development algorithm, we validated it on a dataset of 20 common fruits and foods (shown in Fig. 4). The dataset has 176 samples, and two of the objects have aliases. Each object has 8 views, each obtained after rotating the object by a fixed angle. During the experiment, we let the cognitive structure learn the views and names of the objects. First, the camera captures an image of an object. The algorithm obtains visual features by extracting a normalized Fourier descriptor S of the object boundary and a color histogram C, where S is a 23-dimensional vector and C is a 63-dimensional vector, so each object is represented by a visual feature pair (S, C). Meanwhile, the experimenter speaks the name of the object into the microphone and the iFLYTEK ASR engine transcribes the speech into words. The structure begins learning after receiving the visual and auditory information; once the current round finishes, it switches to another object for the next learning round, until all objects and names have been learned.
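The patent does not include extraction code, but the two features can be sketched with OpenCV and NumPy as follows (the 7×3×3 histogram bin layout and the descriptor normalization are assumptions chosen to match the stated 23 and 63 dimensions):

```python
import cv2
import numpy as np

def shape_descriptor(mask, n_coeff=23):
    """Normalized Fourier descriptor of the largest object boundary."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea).squeeze()
    z = boundary[:, 0] + 1j * boundary[:, 1]   # boundary as a complex signal
    f = np.fft.fft(z)
    mag = np.abs(f[1:n_coeff + 1])   # drop DC term: translation invariant
    return mag / mag[0]              # divide by 1st harmonic: scale invariant

def color_histogram(bgr, mask, bins=(7, 3, 3)):
    """HSV color histogram with 7*3*3 = 63 bins."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], mask, list(bins),
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-8)          # normalize to unit mass
```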
2.1 Experimental evaluation protocol
In order to evaluate the effectiveness of the dynamic threshold strategy proposed in DT-SOINN and of the bidirectional cognition process, the cognitive development result of the algorithm is compared with the learning results of the one-way self-organizing learning structures GWR and PCN. To ensure consistent evaluation criteria, we used an evaluation protocol similar to PCN's and performed the experiments in closed and open environments, respectively. In the closed environment, we randomly select and input objects into the cognitive system, and all objects in the dataset are learned in one learning cycle. In the open environment, the dataset is divided into two parts containing different types of objects; the cognitive system first learns half of the objects and then the remaining ones. The experiments in the two environments were each performed 30 times.
The invention combines the evaluation indexes widely adopted for self-organizing neural networks with a standardized evaluation scheme for open-ended class-learning algorithms from the prior art, adopting the following indexes:
a) the number of nodes in each layer; in particular, the number of nodes in a symbol layer also represents the number of categories learned in the corresponding sample layer;
b) the average number of nodes per category and the average number of stored instances per category, which reflect the generalization capability of the nodes;
c) the change in the number of categories as instances are input, which indicates the online learning process;
d) the visual and auditory callback rates, which equal the recognition accuracy and indicate the external learning effect;
e) the learned similarity threshold of each node, which evaluates the internal performance of the dynamic adaptive similarity threshold strategy.
2.2 Experimental parameter settings
The network parameters of DT-SOINN are set as follows: learning rates γ_b = 0.1 and γ_n = 0.01; synaptic efficacy parameters α_b = α_n = 1.05, τ_b = 0.3, τ_n = 0.1. These values were chosen with reference to experience in other studies. Considering classification reliability and experimental experience, the two difference rates are set to ε_H = 0.5 and ε_L = 0.1. PCN fuses visual, auditory and gustatory information; in our experiments we let it learn only audiovisual information and set its parameters according to the original paper, with shape and color thresholds of 4, corresponding to a difference rate of 0.25. The proposed cognitive structure and PCN each need to be trained only once, so their online learning abilities can be compared directly. Since GWR must be trained multiple times to form stable clusters, we first train the GWR network 200 times to learn the shape and color clusters, then replace DT-SOINN in our cognitive structure with the trained GWR and learn all objects again. The network parameter settings of GWR are the same as in OSS-GWR. In addition, the GWR similarity threshold a_T is set to the four values {0.8, 0.85, 0.9, 0.95} to compare the performance of a dynamically adjustable threshold against fixed thresholds.
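For reference, the quoted settings collected in one place (a minimal sketch; the key names are illustrative, since the patent does not define a parameter API):

```python
# DT-SOINN parameters as quoted in the description above.
DT_SOINN_PARAMS = {
    'gamma_b': 0.1,   # learning rate of the best matching node
    'gamma_n': 0.01,  # learning rate of its neighbors
    'alpha_b': 1.05,  # synaptic efficacy parameters
    'alpha_n': 1.05,
    'tau_b': 0.3,
    'tau_n': 0.1,
    'eps_H': 0.5,     # maximum difference rate
    'eps_L': 0.1,     # minimum difference rate
}
GWR_THRESHOLDS = [0.8, 0.85, 0.9, 0.95]  # fixed a_T values compared against
```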
2.3 Results and evaluation
The results of the experiments are shown in Table I. Across the 60 experiments, our method learned on average 94 shape sample nodes, 73 color sample nodes, 32 shape symbol nodes, 22 color symbol nodes, 21 name sample and symbol nodes, and 57 association nodes. In both environments the number of nodes per layer is very close, which means our method is stable under different circumstances. The symbol-layer nodes show that our method formed a total of 42 shape categories and 26 color categories. These counts exceed the number of objects because different views of an object can have different shapes and colors, and rotating an object affects the shape categories more than the color categories. The average number of nodes per shape or color category is 3, and the average number of stored instances per category is 4 for shapes and 7 for colors. This shows that our method can both recognize similar instances and learn new classes, i.e., it learns intra-class instances and inter-class categories at the same time.
The change in the number of categories reflects the online learning process, as shown in Fig. 5. The number of categories increases rapidly at the beginning and gradually stabilizes as the number of instances grows, which shows that our method achieves online recognition and learning. In contrast, GWR can only learn during the training phase and recognize during the testing phase. PCN clusters the shape and color features only through topological connections and does not store category information; the number of categories in the PCN concept layer shown in Table I was counted after learning finished.
When a_T = 0.8, GWR learns only a small number of shape and color symbols, which is insufficient to represent all categories. When a_T = 0.95, the symbol nodes of GWR can cover the number of categories, but there are too many sample nodes. This indicates that the performance of GWR depends largely on the similarity threshold, and it is difficult to find a suitable threshold before the experiment that resolves the quantity-quality dilemma. Our approach, by contrast, lets the network learn categories autonomously by using dynamic adaptive thresholds and the top-down response strategy. Furthermore, our method does not require multiple training passes and is therefore more efficient than GWR.
Our approach is also more autonomous than PCN, since the proposed top-down response strategy can use learned knowledge to resolve conflict situations without asking a human. During learning, PCN queried the experimenter 104 times on average when an unknown object or a conflicting recognition result was encountered. As for the number of nodes, although our shape and color sample nodes both exceed those of PCN, the node counts of our auditory sample layer and symbol layers are significantly smaller than those of PCN, and our number of association nodes is close to that of PCN. In summary, our cognitive structure is comparable in complexity to PCN.
Table I. Experimental results in closed and open environments
Table II. Callback rates in closed and open environments
To verify the learning effect of the network, we compared the visual and auditory callback rates by using one modality to recall the other. The callback test is performed after each learning round and covers all objects and names learned so far. We again ran 30 trials each in the closed and open environments. Since GWR is only used to learn visual features, its auditory callback rate need not be considered. As shown in Table II, the overall visual callback rate of our method is 90.02%, higher than the 83.98% obtained with PCN. When a_T ≥ 0.9, the visual callback rate of GWR is better than our method's, and when a_T ≤ 0.85 it is worse. This means the similarity threshold not only affects the number of nodes GWR generates but also determines its recognition performance. For the auditory callback test, our method achieves 100% accuracy, while PCN reaches only 61.47%. One reason is that we use words to represent auditory concepts, which is simpler than the syllables used by PCN, so the method can correctly recognize what is said every time. Another reason is that the top-down response strategy maps names to the correct visual representations.
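Schematically, the cross-modal callback test can be written as follows (the `recall_name` / `recall_visual` interface is hypothetical):

```python
# Sketch of the callback test: recall one modality from the other for every
# object learned so far and report the two accuracies.

def callback_rates(system, learned_objects):
    vis_hits = aud_hits = 0
    for obj in learned_objects:
        # visual callback: present the view, recall the name
        if system.recall_name(obj.view) == obj.name:
            vis_hits += 1
        # auditory callback: present the name, recall the visual class
        if system.recall_visual(obj.name) == obj.visual_class:
            aud_hits += 1
    n = len(learned_objects)
    return vis_hits / n, aud_hits / n
```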
To test the effectiveness of the dynamic adaptive similarity threshold strategy, we recorded the final learning results of the two dynamic thresholds, shown in Fig. 6(a) and 6(b) (the vertical axis is the difference rate and the horizontal axis is the node index). The thresholds of most nodes are adjusted significantly during learning, which shows that the method can learn autonomously from the data and form two reliable similarity thresholds for each sample node. In contrast to GWR, we do not need to search for a suitable threshold, so the quantity-quality dilemma is avoided. Although PCN assigns a fixed difference-rate threshold to each node, it cannot distinguish the small differences between intra-class nodes lying in the [T_L, T_H] region and may treat some homogeneous nodes as heterogeneous ones.
The experimental results show that the cognitive structure achieves online, incremental learning autonomously. Knowledge grows as nodes are added, and the weight and similarity thresholds of each node are gradually adjusted as samples arrive. The structure uses a suitable number of nodes to form stable object-view and name representations, compact category symbol representations, and appropriate audiovisual associations. Without manual help, the method improves the object callback rate. The proposed dynamic adaptive threshold strategy exploits the characteristics of the data to adjust node thresholds autonomously, so the network learns category concepts and intra-class instances at the same time. The proposed top-down response strategy enables the network to guide category judgments, recall information and resolve conflict situations without human guidance, giving the robot autonomous cognitive ability. Our method can thus develop the robot's cognitive ability step by step.
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of the present invention. Those skilled in the art should understand that various modifications and variations made without inventive effort on the basis of the technical solution of the present invention still fall within its scope.
Claims (10)
1. An autonomous cognitive development system based on an incremental associative neural network and dynamic audiovisual fusion, comprising a three-layer network structure of a sample layer, a symbol layer and an association layer, the three-layer network structure comprising a visual pathway and an auditory pathway;
in the visual pathway:
the sample layer is used for respectively learning the original shape and the color characteristics of the object and carrying out autonomous clustering;
the symbol layer receives the autonomous clustering results of the shape and color sample layers and abstracts the autonomous clustering results into corresponding symbols;
in the auditory pathway:
a sample layer for learning word vectors of names;
the symbol layer receives the word vector category of the name and reduces it to a symbol;
and the association layer establishes association relationships between the symbols in the visual pathway and the auditory pathway, and feeds back response signals to the lower-layer networks according to the known association relationships.
2. The autonomous cognitive development system based on an incremental associative neural network and dynamic audiovisual fusion of claim 1, characterized in that the sample layer of the visual pathway extracts the normalized Fourier descriptor of the object's shape and its color histogram as visual features and constructs two networks to represent their respective exclusive areas; the activation function of each network is defined according to the difference rate, and the learning model of the visual pathway sample layer adopts a dynamic adaptive similarity threshold strategy so that the network clusters according to the data.
3. The autonomous cognitive development system based on an incremental associative neural network and dynamic audiovisual fusion of claim 1, wherein the auditory pathway sample layer learns a word vector for each name and uses the Levenshtein distance between word vectors as the activation function; upon receiving a word vector, the best matching node is found, and if the two word vectors are completely identical, the node is updated by increasing the instance count of the best node; otherwise, a new node is created.
4. The autonomous cognitive development system based on an incremental associative neural network and dynamic audiovisual fusion of claim 1, wherein the learning algorithm of the symbol layer adopts incremental competitive learning: starting from an empty network, a new symbol node is added whenever an unknown category transmitted by the sample layer is encountered.
5. An autonomous cognitive development method based on incremental associative neural network and dynamic audio-visual fusion is characterized by comprising the following steps:
the visual pathway sample layer learns the original shape and color features of the object respectively and performs autonomous clustering;
the visual path symbol layer receives the autonomous clustering result of the original shape or color characteristic sample layer and abstracts the autonomous clustering result into corresponding symbols;
the auditory pathway sample layer learns word vectors of names;
the auditory pathway symbolic layer receives the word vector category of the name and reduces the word vector category into a symbol;
the association layer establishes an association relationship between symbols in the visual pathway and the auditory pathway.
6. The autonomic cognitive development method based on incremental associative neural network and dynamic audio-visual fusion as claimed in claim 5, wherein the visual pathway sample layer learns the original shape and color characteristics of the object respectively, and performs autonomic clustering, specifically comprising:
(1) inputting a sample x to a sample layer;
(2) if the visual sample network is empty, increasing the category and the number of instances of the first node; returning to the step (1);
(3) if the sample layer has only one node, calculating the difference rate between x and that node; if the difference rate is smaller than the set maximum difference rate, updating the weight of the node and increasing its number of instances; otherwise, creating a new node;
(4) if the sample layer has more than one node, finding the node that best matches the sample x and calculating the difference rate; if the difference rate is larger than the set maximum difference rate, creating a new class node;
if the difference rate is smaller than the set minimum difference rate, or the distance between the input sample and the best matching node is smaller than the intra-class distance, updating the best matching node and its neighborhood and updating the instance counts; otherwise, checking whether the node fusion condition is satisfied: if the input can be merged with the best matching node, updating the best matching node; if they cannot be merged, calculating the distance between the two nodes; if this distance exceeds the inter-class threshold, creating a new class node; otherwise, the network creates a node of the same class as the best matching node;
(5) updating the synaptic efficacy of the best matching node and its neighborhood;
(6) transmitting the determined category information to the visual pathway symbol layer, waiting for the response information fed back by the association layer, and then adjusting the current learning result.
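A compressed Python sketch of this learning step (the layer and node structures and helpers such as `can_merge` are assumptions; the branching follows the clauses above, with ε_H and ε_L as the maximum and minimum difference rates):

```python
import numpy as np

def learn_sample(layer, x, eps_H=0.5, eps_L=0.1):
    if not layer.nodes:                              # (2) empty network
        layer.add_class_node(x)
        return
    # (3)/(4) find the best matching node; the difference-rate formula
    # below is an assumption (relative distance to the node weight).
    b = min(layer.nodes, key=lambda n: np.linalg.norm(x - n.w))
    rate = np.linalg.norm(x - b.w) / (np.linalg.norm(b.w) + 1e-8)

    if rate > eps_H:                                 # clearly a new class
        layer.add_class_node(x)
    elif rate < eps_L or np.linalg.norm(x - b.w) < b.intra_dist:
        b.w += layer.gamma_b * (x - b.w)             # update best match...
        for n in layer.neighbors(b):
            n.w += layer.gamma_n * (x - n.w)         # ...and its neighborhood
        b.instances += 1
    elif layer.can_merge(x, b):                      # ambiguous: try fusion
        b.w += layer.gamma_b * (x - b.w)
    elif np.linalg.norm(x - b.w) > layer.inter_class_th:
        layer.add_class_node(x)                      # beyond inter-class TH
    else:
        layer.add_intra_class_node(x, category=b.category)
```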
7. The method according to claim 5, wherein the auditory pathway sample layer learns word vectors of names, specifically:
(1) inputting a name sample x into a sample layer;
(2) if the auditory sample network is empty, increasing the category and the number of instances of the first node; returning to the step (1);
(3) if the auditory sample network is not empty, finding the best matching node, and calculating the Levenshtein distance;
(4) if the Levenshtein distance between the two word vectors is zero, the two word vectors are completely identical, and the node is updated by increasing the instance count of the best node; otherwise, a new node is created;
(5) the result of the recognition is passed to the auditory pathway symbol layer.
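Claim 7's activation function is the standard Levenshtein edit distance; a self-contained sketch follows (the `layer` object is hypothetical):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def learn_name(layer, name):
    if not layer.nodes:
        layer.add_node(name)          # (2) first auditory node
        return
    best = min(layer.nodes, key=lambda n: levenshtein(name, n.word))
    if levenshtein(name, best.word) == 0:
        best.instances += 1           # (4) identical word: update the node
    else:
        layer.add_node(name)          # otherwise create a new node
```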
8. The method for autonomic cognitive development based on incremental associative neural network and dynamic audiovisual fusion as claimed in claim 5, wherein the visual pathway symbol layer receives the autonomic clustering result of the original shape or color feature sample layer and abstracts the result into corresponding symbols, specifically:
(1) initializing an empty visual symbol layer, and receiving a category number l from the visual sample layer;
(2) combining the number l and the corresponding feature f to form the symbol f_l;
(3) If the combined symbol does not exist, the symbol layer creates a new node; if the combined symbol is learned before, activating the corresponding symbol node and increasing the number of instances of the node;
(4) passing the symbols to an association layer;
(5) waiting for the response signal of the association layer, then adjusting the symbol nodes and passing the response signal to the corresponding visual sample layer.
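A minimal sketch of this symbol layer (the tuple encoding of f_l is an assumption):

```python
class SymbolLayer:
    """Visual symbol layer: combines a feature tag f with a category number l."""

    def __init__(self):
        self.nodes = {}               # symbol f_l -> instance count

    def receive(self, feature, category):
        symbol = (feature, category)  # step (2): combine f and l into f_l
        if symbol not in self.nodes:
            self.nodes[symbol] = 1    # step (3): create a new symbol node
        else:
            self.nodes[symbol] += 1   # or activate it and bump its count
        return symbol                 # step (4): passed to the association layer
```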
9. The autonomous cognitive development method based on incremental associative neural network and dynamic audiovisual fusion of claim 5, wherein the association layer establishes association relationships between symbols in the visual pathway and the auditory pathway, specifically:
if only a signal is received from the visual pathway and the visual symbol pair is the same as the visual part of an association node a, that node is activated; the association layer feeds back the auditory part of the node to the lower-layer network as a top-down response, so that the name of the object can be recalled;
if only a name is received from the auditory pathway and the auditory symbol matches association nodes, the association layer selects the most frequently activated of the matched nodes as the best matching node and extracts its visual symbol pair to call back the visual features of the object;
when symbols arrive from the visual and auditory channels at the same time and an association node matching both audiovisual parts exists, that node is activated and its instance count is updated; if no node is activated, the audiovisual symbols are combined and a new association node is created.
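The three clauses can be sketched in one routine (the node structure is assumed):

```python
class AssociationLayer:
    """Each association node binds a visual symbol pair to a set of names."""

    def __init__(self):
        self.nodes = []   # {'visual': (s, c), 'names': set, 'count': int}

    def associate(self, visual=None, name=None):
        matches = [n for n in self.nodes
                   if (visual is None or n['visual'] == visual)
                   and (name is None or name in n['names'])]
        if matches:
            best = max(matches, key=lambda n: n['count'])  # most activated
            best['count'] += 1
            return best      # caller recalls the missing modality from it
        if visual is not None and name is not None:
            node = {'visual': visual, 'names': {name}, 'count': 1}
            self.nodes.append(node)   # combine symbols into a new node
            return node
        return None          # single-modality cue with nothing learned yet
```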
10. The method of claim 5, further comprising a top-down response process from the association layer to the sample layers, specifically comprising:
a guidance signal, which uses the knowledge learned by the association layer to guide the lower-layer network in processing new input; specifically, if the name of the current object has been heard before and the association layer has learned the visual symbols associated with the name, the association layer compares the learned visual part of the node with the newly recognized visual symbol and judges whether a new class node or an intra-class node of the best matching node needs to be created in the visual sample layer;
or,
a conflict signal, which the association layer returns when the visual parts of association nodes are the same but the name symbols differ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811527643.6A CN109685196B (en) | 2018-12-13 | 2018-12-13 | Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109685196A true CN109685196A (en) | 2019-04-26 |
CN109685196B CN109685196B (en) | 2020-07-31 |
Family
ID=66187654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811527643.6A Active CN109685196B (en) | 2018-12-13 | 2018-12-13 | Autonomous cognitive development system and method based on incremental associative neural network and dynamic audio-visual fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109685196B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6615197B1 (en) * | 2000-03-13 | 2003-09-02 | Songhai Chai | Brain programmer for increasing human information processing capacity |
CN103353883A (en) * | 2013-06-19 | 2013-10-16 | 华南师范大学 | Big data stream type cluster processing system and method for on-demand clustering |
RU2637300C1 (en) * | 2016-11-29 | 2017-12-01 | Государственное бюджетное образовательное учреждение высшего профессионального образования "Рязанский государственный медицинский университет имени академика И.П. Павлова" Министерства здравоохранения Российской Федерации | Epilepsy diagnostics method based on set of electroencephalographic indicators, characteristics of exogenous and cognitive evoked potentials, motor and autonomic provision activities using artificial neural networks technology |
CN108133259A (en) * | 2017-12-14 | 2018-06-08 | 深圳狗尾草智能科技有限公司 | The system and method that artificial virtual life is interacted with the external world |
CN108333941A (en) * | 2018-02-13 | 2018-07-27 | 华南理工大学 | A kind of robot cooperated learning method of cloud based on mixing enhancing intelligence |
CN108647850A (en) * | 2018-04-03 | 2018-10-12 | 杭州布谷科技有限责任公司 | It is a kind of based on artificial intelligence colleges and universities aspiration make a report on decision-making technique and system |
CN108764447A (en) * | 2018-05-16 | 2018-11-06 | 西安交通大学 | A kind of group robot Majiang game intelligence dynamicization system and mahjong identification learning algorithm |
CN109299777A (en) * | 2018-09-20 | 2019-02-01 | 于江 | A kind of data processing method and its system based on artificial intelligence |
Non-Patent Citations (2)
Title |
---|
M. Vavrečka et al., "A Multimodal Connectionist Architecture for Unsupervised Grounding of Spatial Language", Cognitive Computation. |
Ma Shuang, "Research on Perception and Cognition Methods for Robots with Autonomous Development Capability", China Doctoral Dissertations Full-text Database, Information Science and Technology. |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110070188A (en) * | 2019-04-30 | 2019-07-30 | 山东大学 | A kind of increment type cognitive development system and method merging interactive intensified learning |
CN110070188B (en) * | 2019-04-30 | 2021-03-30 | 山东大学 | Incremental cognitive development system and method integrating interactive reinforcement learning |
CN111012342A (en) * | 2019-11-01 | 2020-04-17 | 天津大学 | Audio-visual dual-channel competition mechanism brain-computer interface method based on P300 |
CN111012342B (en) * | 2019-11-01 | 2022-08-02 | 天津大学 | Audio-visual dual-channel competition mechanism brain-computer interface method based on P300 |
CN111062494A (en) * | 2019-12-26 | 2020-04-24 | 山东大学 | Robot self-organization-thinking-reversal cognitive development method and system with lifelong learning ability |
CN111062494B (en) * | 2019-12-26 | 2023-06-16 | 山东大学 | Robot self-organizing-thinking-back cognitive development method and system with life learning capability |
CN113344215A (en) * | 2021-06-01 | 2021-09-03 | 山东大学 | Extensible cognitive development method and system supporting new mode online learning |
Also Published As
Publication number | Publication date |
---|---|
CN109685196B (en) | 2020-07-31 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |