CN118293927B - Visual-voice navigation method and system with enhanced knowledge graph - Google Patents

Visual-voice navigation method and system with enhanced knowledge graph

Info

Publication number
CN118293927B
CN118293927B (application number CN202410726056.9A; also published as CN118293927A)
Authority
CN
China
Prior art keywords
objects
scene
voice
visual
room
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410726056.9A
Other languages
Chinese (zh)
Other versions
CN118293927A (en)
Inventor
刘坤华
张云青
陈成军
代成刚
郑义
卢涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Technology
Original Assignee
Qingdao University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Technology
Priority to CN202410726056.9A
Publication of CN118293927A
Application granted
Publication of CN118293927B
Legal status: Active
Anticipated expiration


Landscapes

  • Image Analysis (AREA)
  • Manipulator (AREA)

Abstract

The invention provides a knowledge-graph-enhanced visual-voice navigation method and system, belonging to the technical field of active navigation. The scheme guides the whole navigation process by extracting a hierarchical knowledge graph from real scenes and utilizing its features, so that prior knowledge of the scene is effectively exploited and the accuracy of behavior decisions during navigation is effectively improved. At the same time, by directly fusing voice and visual observation features, the scheme ensures navigation accuracy under the guidance of the hierarchical knowledge-graph features. In the training of the deep-learning-based active navigation strategy model, a sub-goal-based supervised learning method is adopted: a plurality of intermediate positions passed on the way to the target position are treated as sub-goals, and the robot obtains an additional forward reward whenever it reaches a sub-goal, which accelerates the training of the active navigation strategy model and improves training efficiency.

Description

Visual-voice navigation method and system with enhanced knowledge graph
Technical Field
The invention belongs to the technical field of active navigation, and particularly relates to a visual-voice navigation method and system with enhanced knowledge-graph.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Current visual-voice navigation methods do not process navigation tasks directly from voice data; instead, the voice is first converted into text by a speech recognition model and the subsequent navigation task is then performed on the text. This approach has the following problems: first, the robot must carry both a speech recognition model and a navigation model, which demands considerable computation and storage and is unsuitable for robots with limited storage and computing resources; second, the training of the speech recognition model cannot interact directly with the robot's visual observations, which easily causes error accumulation and degrades the performance of the subsequent navigation task.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a knowledge-graph-enhanced visual-voice navigation method and system. The technical scheme guides the whole navigation process by extracting a hierarchical knowledge graph from real scenes and utilizing its features, so that prior knowledge of the scene is effectively exploited and the accuracy of behavior decisions during navigation is effectively improved; at the same time, by directly fusing voice and visual observation features, the scheme ensures navigation accuracy under the guidance of the hierarchical knowledge-graph features.
According to a first aspect of the embodiment of the present invention, there is provided a visual-speech navigation method with enhanced knowledge-graph, including:
Collecting a voice instruction of a user and a scene image under a current visual angle in real time;
respectively extracting voice features and visual features based on the voice command and the scene image of the user;
Based on the voice characteristics and the visual characteristics, obtaining visual-voice fusion characteristics through characteristic fusion;
based on vision-voice fusion characteristics, the current pose of the robot and scene level knowledge graph characteristics constructed based on a real scene data set, a pre-trained active navigation strategy model based on deep learning is utilized to obtain a predicted behavior under a user voice instruction, and the robot is navigated based on the obtained predicted behavior;
The construction and feature extraction of the scene level knowledge graph comprise the following steps: hierarchical division is carried out on data in the real scene data set according to scenes, rooms and objects; based on the data after the hierarchical division, constructing a scene hierarchical knowledge graph by taking a scene, a room and an object as nodes and taking the relation between the scene and the room, the relation between the room and the object and the relation between the object as edges; based on a scene level knowledge graph corresponding to a scene to which a region to be navigated belongs, combining a relation between objects and a room and a relation between objects to obtain a frequency relation graph between objects and the room, a frequency relation graph between objects, a conditional probability relation graph between objects and the room and a conditional probability relation graph between objects; based on the obtained relation diagram, a pre-trained diagram convolution neural network is utilized to obtain scene level knowledge graph characteristics.
Further, based on the data after hierarchical division, a scene, a room and an object are taken as nodes, and a relation between the scene and the room, a relation between the room and the object and a relation between the object and the object are taken as edges, so that a scene hierarchical knowledge graph is constructed, specifically: obtaining a panoramic image corresponding to each view point from a real scene data set, wherein the panoramic image is composed of a plurality of image frames occupying a preset angle view angle; object appearing under the current viewpoint is obtained by carrying out target detection on each frame of image of each viewpoint; based on the obtained objects at each view point, the connection relation between nodes is determined by combining the relation between the objects and the relation between the rooms and the objects, so that the construction of the hierarchical knowledge graph is realized.
Further, for the obtained objects at the respective viewpoints, when two objects appear in the image under the same viewing angle, there is a correlation between the objects; when an object appears in a room, the object has a correlation with the room; edges are then added between the corresponding object and room nodes, and between the corresponding object nodes, for the correlated object-room and object-object pairs.
Further, in constructing a scene-level knowledge graph based on a real scene data set, the object node is specifically obtained by: and performing target detection on each frame of image of each viewpoint by taking Faster R-CNN as a detector, recording the detection result of each viewpoint, and acquiring an object appearing under the current viewpoint through an adjacent frame object matching algorithm.
Furthermore, the active navigation strategy model based on deep learning adopts an A3C-based behavior estimation network model, wherein the behavior estimation network model comprises a multi-layer perceptron with a plurality of layers.
Further, the behavior estimation network model adopts the following loss function:
$$L_{nav} = L_{policy} + \lambda_v L_{value},\qquad L_{policy} = -\mathbb{E}\big[\log \pi(a_t \mid s_t, g;\theta)\,(R_t^{(k)} - V(s_t, g;\theta_v))\big],\qquad L_{value} = \mathbb{E}\big[(R_t^{(k)} - V(s_t, g;\theta_v))^2\big]$$

wherein $L_{nav}$ is the loss function of the behavior estimation network, $L_{policy}$ is the policy gradient loss, $L_{value}$ is the value residual loss, $\pi$ and $V$ respectively represent the strategy function and the value function, $\theta$ and $\theta_v$ respectively represent their parameters, $R_t^{(k)}$ is the accumulated return over $k$ steps, $s_t$ indicates the current system state, $g$ indicates the state of the navigation target, $a_t$ represents the currently selected behavior, $T$ is the total time for which a Markov decision run lasts and the expectation $\mathbb{E}[\cdot]$ is taken over its time steps, and $\lambda_v$ is a hyper-parameter.
Furthermore, in the training of the active navigation strategy model based on deep learning, a supervised learning method based on a sub-target is adopted, and a plurality of intermediate positions passed by the robot when the robot reaches the target position are used as the sub-target, so that additional forward rewards are obtained when the robot reaches the sub-target, and the training of the active navigation strategy model is accelerated.
Further, the extracting the voice feature and the visual feature respectively specifically includes: voice feature extraction is carried out by adopting a voice understanding module based on a Whisper neural network; visual feature extraction is performed using a ResNet-based visual feature extraction module.
Further, the feature fusion specifically adopts a visual-voice fusion module based on a multiway Transformer network; the visual-voice fusion module consists of a shared self-attention module and two modal expert networks, wherein each modal expert network consists of a feedforward neural network. For the input visual features and voice features, the shared self-attention module first splices the two to obtain a fused feature and applies self-attention to it; two feedforward neural networks are then used as a visual expert network and a voice expert network respectively, learning new visual and voice features from the fusion space; the new visual features and new voice features are finally spliced to obtain the visual-voice fusion feature.
According to a second aspect of an embodiment of the present invention, there is provided a knowledge-graph enhanced visual-voice navigation system, including:
The data acquisition unit is used for acquiring a voice instruction of a user and a scene image under a current visual angle in real time;
A vision-voice fusion feature extraction unit for extracting a voice feature and a visual feature, respectively, based on a user voice instruction and a scene image; based on the voice characteristics and the visual characteristics, obtaining visual-voice fusion characteristics through characteristic fusion;
The behavior prediction unit is used for obtaining a predicted behavior under a user voice instruction by utilizing a pre-trained active navigation strategy model based on deep learning based on the vision-voice fusion characteristic, the current pose of the robot and scene level knowledge graph characteristics constructed based on a real scene data set; performing navigation of the robot based on the obtained predicted behavior;
The construction and feature extraction of the scene level knowledge graph comprise the following steps: hierarchical division is carried out on buildings under different scenes by scenes, rooms and objects; constructing a scene level knowledge graph by taking a scene, a room and an object as nodes, and taking the relation between the room and the object and the relation between the object and the object as edges; based on the constructed hierarchical knowledge graph, combining the relation between the objects and the room and the relation between the objects to obtain a plurality of relation graphs; based on the obtained relation diagrams, a scene level knowledge graph characteristic is obtained by utilizing a pre-trained diagram convolution neural network.
The one or more of the above technical solutions have the following beneficial effects:
(1) The invention provides a knowledge-graph-enhanced visual-voice navigation method and system. The scheme guides the whole navigation process by extracting a hierarchical knowledge graph from real scenes and utilizing its features, so that prior knowledge of the scene is effectively exploited and the accuracy of behavior decisions during navigation is effectively improved; meanwhile, by directly fusing voice and visual observation features, the scheme ensures navigation accuracy under the guidance of the hierarchical knowledge-graph features.
(2) According to the scheme, in the training of the active navigation strategy model based on deep learning, the supervised learning method based on the sub-targets is adopted, a plurality of intermediate positions passed by the robot when the robot reaches the target position are used as the sub-targets, and when the robot reaches the sub-targets, additional forward rewards are obtained, so that the training of the active navigation strategy model is accelerated, and the training efficiency is improved.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of a knowledge-graph enhanced visual-audio navigation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a hierarchical knowledge graph according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hierarchical knowledge graph construction process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a network model structure for A3C-based behavior estimation according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of expert trajectories including a plurality of sub-targets in a sub-target-based supervised learning method according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a learning process based on sub-objective rewards in a sub-objective-based supervised learning method according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
The embodiment aims to provide a visual-voice navigation method with enhanced knowledge graph.
In order to solve the problems in the prior art, this embodiment provides a knowledge-graph-enhanced visual-voice navigation method whose main technical idea is as follows: a hierarchical knowledge graph is constructed from real scenes, hierarchical knowledge is acquired from a large-scale real-world dataset, and the information in the knowledge graph is extracted with a graph convolutional network; an active navigation strategy model is built with a behavior prediction network based on reinforcement learning, together with a sub-goal-based reinforcement learning training method; the active navigation strategy model then selects the optimal behavior according to the vision-voice fusion features and the hierarchical knowledge-graph features to complete the navigation task.
As shown in fig. 1, a visual-speech navigation method with enhanced knowledge graph specifically includes the following processing steps:
step 1: collecting a voice instruction of a user and a scene image under a current visual angle in real time;
Step 2: respectively extracting voice features and visual features based on the voice command and the scene image of the user;
Specifically, voice feature extraction adopts a voice understanding module and visual feature extraction adopts a visual understanding module, wherein the voice understanding module specifically adopts the Whisper neural network, and the visual feature extraction module specifically adopts ResNet-50.
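As an illustration of how these two modules can be wired up, a minimal sketch is given below. It assumes the Hugging Face transformers implementation of the Whisper encoder and the torchvision ResNet-50 backbone; the model name "openai/whisper-base", the mean-pooling of the encoder output and the feature dimensions are illustrative assumptions rather than the exact configuration of this embodiment.

```python
# Minimal sketch of the speech / visual feature extractors (assumed libraries:
# torch, torchvision, transformers; model and pooling choices are illustrative).
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import WhisperFeatureExtractor, WhisperModel


class SpeechEncoder(nn.Module):
    """Whisper-encoder-based speech understanding module (sketch)."""

    def __init__(self, model_name: str = "openai/whisper-base"):
        super().__init__()
        self.processor = WhisperFeatureExtractor.from_pretrained(model_name)
        self.whisper = WhisperModel.from_pretrained(model_name)

    @torch.no_grad()
    def forward(self, waveform, sampling_rate: int = 16000) -> torch.Tensor:
        feats = self.processor(waveform, sampling_rate=sampling_rate,
                               return_tensors="pt")
        # Use only the audio encoder; mean-pool over time to get one speech vector.
        enc = self.whisper.encoder(feats.input_features).last_hidden_state
        return enc.mean(dim=1)                      # (1, d_speech)


class VisualEncoder(nn.Module):
    """ResNet-50-based visual feature extraction module (sketch)."""

    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)           # load pretrained weights in practice
        # Drop the final classification layer, keep the global-pooled feature.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])

    @torch.no_grad()
    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) normalized RGB observation at the current view.
        return self.backbone(image).flatten(1)      # (B, 2048)
```

In use, the two encoders would run on the real-time voice instruction and the current scene image respectively, and their outputs feed the fusion module of the next step.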
Step3: based on the voice characteristics and the visual characteristics, obtaining visual-voice fusion characteristics through characteristic fusion;
The vision-voice fusion feature extraction specifically adopts a pre-constructed vision-voice fusion module based on a multiway Transformer network, which consists of a shared self-attention module and two modal experts, each modal expert being a feedforward neural network. The vision-voice fusion module fuses and aligns the acquired voice features and visual features, specifically: for the input visual features $f_v$ and speech features $f_s$, the two are first spliced by the shared self-attention module to obtain the fused feature $X = [f_v; f_s]$, and self-attention is then applied to the fused feature:

$$Q = \mathrm{LN}(X)W_Q,\quad K = \mathrm{LN}(X)W_K,\quad V = \mathrm{LN}(X)W_V,\quad X' = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

wherein $W_Q$, $W_K$ and $W_V$ are learnable parameters, $\mathrm{LN}(\cdot)$ denotes layer normalization, $Q$, $K$ and $V$ are the query, key and value in the self-attention mechanism, and $d$ is the feature dimension.
The shared self-attention layer models the dependency between visual features and speech features, projecting both into a fusion space. The visual-voice feature fusion module then uses two feedforward neural networks as a visual expert network and a voice expert network respectively to learn new visual and voice features from the fusion space; each expert network consists of two fully connected layers with a ReLU activation function. After this processing, the new visual and voice features carry cross-modal fusion information, so the correspondence between image and voice can be learned more easily.
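The following sketch illustrates one way to realize the shared self-attention plus modality-expert structure in PyTorch; the hidden width, number of attention heads, residual connection and expert width are assumptions for illustration, not the exact settings of this embodiment.

```python
import torch
import torch.nn as nn


class VisualSpeechFusion(nn.Module):
    """Multiway-Transformer-style fusion block: shared self-attention followed
    by separate visual / speech feed-forward experts (illustrative sketch)."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                     # LN(.) applied before attention
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Two modality experts, each a two-layer feed-forward network with ReLU.
        self.visual_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.speech_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, f_v: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        # f_v: (B, Nv, dim) visual tokens, f_s: (B, Ns, dim) speech tokens.
        n_v = f_v.size(1)
        x = torch.cat([f_v, f_s], dim=1)                  # splice into one fused sequence
        h = self.norm(x)
        h, _ = self.attn(h, h, h)                         # shared self-attention
        h = x + h                                         # residual connection (assumed)
        # Route the fused tokens back through the corresponding modality expert.
        new_v = self.visual_expert(h[:, :n_v])
        new_s = self.speech_expert(h[:, n_v:])
        return torch.cat([new_v, new_s], dim=1)           # final vision-speech fusion feature
```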
Step 4: based on vision-voice fusion characteristics, the current pose of the robot and scene level knowledge graph characteristics constructed based on a real scene data set, a pre-trained active navigation strategy model based on deep learning is utilized to obtain a predicted behavior under a user voice instruction, and the robot is navigated based on the obtained predicted behavior;
The construction and feature extraction of the scene level knowledge graph comprise the following steps: hierarchical division is carried out on buildings under different scenes by scenes, rooms and objects; constructing a scene level knowledge graph by taking a scene, a room and an object as nodes, and taking the relation between the room and the object and the relation between the object and the object as edges; based on the constructed hierarchical knowledge graph, combining the relation between the objects and the room and the relation between the objects to obtain a plurality of relation graphs; based on the obtained relation diagrams, a scene level knowledge graph characteristic is obtained by utilizing a pre-trained diagram convolution neural network.
In a specific implementation, the construction of the scene-level knowledge graph takes scenes, rooms and objects as nodes, and the relationships between rooms and objects and between objects as edges, specifically: knowledge graphs are constructed separately for different scenes, wherein when an object appears in a room, the object and the room have a correlation; when two objects appear in the image under the same viewing angle, there is a correlation between the objects; edges are added between the corresponding object and room nodes, and between the corresponding object nodes, for such correlated pairs; no edges are placed between rooms.
In a specific implementation, a plurality of relationship graphs are obtained based on the constructed hierarchical knowledge graph by combining the relationship between objects and rooms and the relationship between objects, specifically: based on the constructed hierarchical knowledge graph, four relationship graphs are obtained according to the frequency relationship between objects and rooms, the frequency relationship between objects, the conditional probability relationship between objects and rooms, and the conditional probability relationship between objects, respectively.
In a specific implementation, in constructing a scene level knowledge graph based on a real scene, the object node is obtained specifically as follows: and performing target detection on each frame of image of each viewpoint by taking Faster R-CNN as a detector, recording the detection result of each viewpoint, and acquiring an object appearing under the current viewpoint through an adjacent frame object matching algorithm.
In specific implementation, the active navigation strategy model based on deep learning adopts an A3C behavior estimation network model, wherein the behavior estimation network model comprises a plurality of layers of multi-layer perceptrons.
In specific implementation, in the training of the active navigation strategy model based on deep learning, a supervised learning method based on a sub-target is adopted, and when the robot reaches the sub-target, additional forward rewards are obtained by taking a plurality of intermediate positions passed by the robot when the robot reaches the target position as the sub-target, so that the training of the active navigation strategy model is accelerated.
Specifically, the step4 specifically includes the following processing procedures:
step 401: hierarchical knowledge graph construction and feature extraction
In the field of active navigation, there are existing methods that assist navigation with knowledge graphs or semantic maps. These methods are mostly based on large general-purpose knowledge graphs, for example using semantic labels that may occur in the scene to retrieve correlations. However, the knowledge collected in this way is not fully oriented to the specific application scenario of the robot and is difficult to generalize. At the same time, these knowledge bases are not hierarchical, which means that room nodes carrying scene-level information (e.g. bedroom, kitchen) and nodes providing only object-level information (e.g. table, chair) are treated as equally important, which is clearly unreasonable.
Based on the above-mentioned problems, the present embodiment proposes a hierarchical knowledge graph to extract hierarchical knowledge from a specific real scene.
The solution described in this embodiment uses Neo4j as the graph database engine; Neo4j is a high-performance, scalable, natively stored graph database system. Neo4j uses a graph data structure to store and manage data, where nodes represent entities and edges represent relationships between entities; it can record rich nodes and relationships and ensures efficient retrieval. The knowledge graph proposed in this embodiment has a hierarchical structure, as shown in fig. 2; taking a home scene as an example, the scheme divides the scene into three levels: a scene layer, a room layer and an object layer.
The scene layer node is named by a scene name (it can be understood that the scene layer node in this embodiment may include other scenes such as a company scene, a factory scene, etc. besides a home scene), and is an entry node of the entire knowledge graph, and multiple knowledge graphs can be constructed according to different scenes, and then the corresponding knowledge graph is retrieved for use with respect to the scenes. The room layer nodes are named by room names, and common family rooms such as living rooms, bedrooms, kitchens and the like are included; there is no edge between the nodes of the room layers, because there is not necessarily a fixed positional relationship between rooms, only the room needs to be identified during navigation, and no further reasoning is needed. For the object layer, the nodes are named object names. Color, shape, size, or other attributes may also be included.
In order to facilitate management and application of the graph, the scheme of this embodiment sets up a landmark object directory, which includes entries such as tv (television), sofa (sofa) and refrigerator (refrigerator).
The landmark objects indicated by dotted circles in fig. 2 are those listed in the above landmark object directory. Landmark objects refer to objects that have a fixed position, a large volume, or are easy to observe in the environment. These objects play a positive role in revealing correlations with other objects or rooms.
In a specific implementation, whether the object is a landmark object is taken as an attribute of an object layer node, and the attribute of the node is encoded into a feature vector when the graph convolution is performed, wherein the attribute of the node comprises a node name and whether the object is a landmark object.
In particular implementations, the present embodiment uses Faster R-CNN as a detector to identify and label objects, while Neo4j is used as the data storage and query engine for the knowledge graph; it stores structured data as a graph rather than in tables. Neo4j allows one machine to handle billions of nodes and relationships and can be extended to multiple devices running in parallel, which is very friendly for deployment on mobile devices or robots.
First, the constructed knowledge graph is defined as $G=(V,E,W)$, where $V$ represents the set of all nodes in the graph, $E$ represents the set of edges, and the edge weights $W$ represent the relationships between connected nodes. The hierarchical structure of the knowledge graph is embodied by two relations, $f_{or}$ and $f_{oo}$, defined as follows:

$f_{or}(o_i, r_j)$ represents the correlation between an object and a room, where $f_{or}(o_i, r_j)$ refers to the frequency with which object $o_i$ appears in room $r_j$.

$f_{oo}(o_i, o_j)$ represents the correlation between objects, where $f_{oo}(o_i, o_j)$ refers to the frequency with which object $o_i$ and object $o_j$ co-occur in the same view.
The process of constructing a hierarchical knowledge graph from a real scene dataset (i.e., an existing general dataset) is shown in fig. 3. The real data used to generate the knowledge graph includes three levels of elements, namely buildings, viewpoints and frames. Each building is independent and contains multiple distributed viewpoints; each viewpoint consists of 12 image frames, each occupying a 30-degree viewing angle, which together constitute the panorama at the current viewpoint. This embodiment performs target detection on each frame image of each viewpoint using a Faster R-CNN detector, and then records the detection result of each viewpoint to acquire the objects appearing at the current viewpoint.
Since the same object may appear between adjacent frames, it is not sufficient to identify the object alone, and in addition to accurately detecting objects in the environment, it is also important to design instance tracking and matching strategies across multiple frames to ensure that the obtained instance relationships are not redundant or missing. In order to solve this problem, the present embodiment proposes an adjacent frame object matching algorithm, specifically:
For a frame image $I_t$, the set of object instances extracted by the instance detector is denoted $O_t=\{(c_i, l_i)\}$, where $c_i$ is the image coordinate of the detected object center and $l_i$ is the category to which the object belongs. Whether an object in the next frame has already appeared in the adjacent frame can then be decided using image similarity. Let $P(c_i)$ denote the image block of size $s \times s$ centered on the object center $c_i$, and let $\phi$ be a feature extractor that computes the image-block features; the similarity between two image blocks is measured by the cosine similarity of their features. If the similarity between an object of the next frame and all objects of the same category in the current frame is smaller than the threshold, it is regarded as a new object; otherwise it is treated as a repeated observation of the same instance and is not counted again.

Since the surroundings of the same object in adjacent frames are similar, the texture of the image block alone can be used to judge whether two detections correspond to the same object. To improve statistical efficiency, the feature extractor $\phi$ in this implementation uses the first three layers of a pre-trained VGGNet (Visual Geometry Group Net) to extract simple texture features of the image block. The image block size is fixed at 64×64 and the threshold is set to 0.6.
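A minimal sketch of this adjacent-frame object matching strategy is given below; it assumes torchvision's VGG16 as the texture feature extractor (the description only states that the first layers of a pre-trained VGGNet are used) and follows the 64×64 patch size and 0.6 cosine-similarity threshold stated above. The helper names and data layout are hypothetical.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Shallow VGG feature extractor for patch textures (first conv layers only; an
# assumption). Load pretrained weights in practice.
_texture_net = vgg16(weights=None).features[:7].eval()

PATCH = 64          # image block size (64 x 64)
THRESH = 0.6        # cosine-similarity threshold


def _patch_feature(image, center):
    """Crop a PATCH x PATCH block around a detected object center and embed it.
    image: (1, 3, H, W) tensor; center: (x, y) pixel coordinates."""
    cx, cy = int(center[0]), int(center[1])
    half = PATCH // 2
    h, w = image.shape[-2:]
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    block = image[:, :, y0:y1, x0:x1]
    block = F.interpolate(block, size=(PATCH, PATCH), mode="bilinear",
                          align_corners=False)
    with torch.no_grad():
        return _texture_net(block).flatten(1)   # (1, d) texture feature


def is_new_object(next_img, next_det, cur_img, cur_dets):
    """next_det and each entry of cur_dets are (center, label) pairs from the
    Faster R-CNN detector. The next-frame detection is counted as a new object
    only if its patch is dissimilar to every same-category detection in the
    current frame."""
    feat = _patch_feature(next_img, next_det[0])
    for center, label in cur_dets:
        if label != next_det[1]:
            continue
        sim = F.cosine_similarity(feat, _patch_feature(cur_img, center)).item()
        if sim >= THRESH:
            return False    # same instance seen in the adjacent frame; not re-counted
    return True
```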
After the detected objects have been de-duplicated, the frequency of object occurrences is counted. Let $B$ denote the set of all buildings contained in the dataset, $N_b$ the number of buildings in the dataset, and $N_v$ the number of viewpoints in each building. The object-room frequency over the whole dataset is calculated as

$$f_{or}(o_i, r) = \sum_{b \in B} \sum_{v=1}^{N_v} \delta_{or}(o_i, r)\,\mathbb{1}_{or}(o_i, b, v)$$

wherein $f_{or}(o_i, r)$ represents the frequency with which object $o_i$ is present in room $r$ throughout the dataset; $\delta_{or}(o_i, r)$ identifies whether the relationship between object $o_i$ and room $r$ has already been counted in adjacent frames; and $\mathbb{1}_{or}(o_i, b, v)$ indicates whether object $o_i$ is present at viewpoint $v$ of building $b$. In a similar manner, $f_{oo}$ is obtained by

$$f_{oo}(o_i, o_j) = \sum_{b \in B} \sum_{v=1}^{N_v} \sum_{f=1}^{12} \delta_{oo}(o_i, o_j)\,\mathbb{1}_{oo}(o_i, o_j, b, v, f)$$

which calculates the frequency with which object $o_i$ and object $o_j$ co-occur in the same viewpoint over the whole dataset. Analogously to $\delta_{or}$, $\delta_{oo}(o_i, o_j)$ indicates whether the relationship between objects $o_i$ and $o_j$ has already been counted in adjacent frames, and $\mathbb{1}_{oo}(o_i, o_j, b, v, f)$ indicates whether objects $o_i$ and $o_j$ co-occur in frame $f$ of viewpoint $v$ in building $b$.
In summary, the hierarchical knowledge graph proposed in this embodiment actually extracts the co-occurrence information of object-object and room-object pairs rather than positional information between objects. This is because object-level positional relationships tend to differ across scenes; even in the same type of room, the furniture arrangement may differ. Considering fine-grained object position relationships therefore does not help the robot navigate efficiently, whereas the co-occurrence relations of objects better reflect the characteristics of the scene and greatly help the robot determine the direction of the next navigation step.
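To make the counting procedure concrete, a small aggregation sketch is given below. The data layout (viewpoints carrying a room label and the de-duplicated object labels of their frames) and the helper names are assumptions for illustration; the per-viewpoint set of labels plays the role of the de-duplication indicators in the equations above.

```python
from collections import defaultdict
from itertools import combinations

# f_or[(object, room)] : how often an object class is observed in a room type
# f_oo[(obj_a, obj_b)] : how often two object classes co-occur in one viewpoint
f_or = defaultdict(int)
f_oo = defaultdict(int)


def accumulate(dataset):
    """dataset: iterable of buildings; each building is a list of viewpoints;
    each viewpoint carries its room label and the de-duplicated object labels
    detected across its 12 frames (hypothetical layout)."""
    for building in dataset:
        for view in building:
            room = view["room"]
            objects = set(view["objects"])   # de-duplicated by the matching algorithm
            for obj in objects:
                f_or[(obj, room)] += 1
            for a, b in combinations(sorted(objects), 2):
                f_oo[(a, b)] += 1            # co-occurrence within the same viewpoint
```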
In a specific implementation, the extraction of the hierarchical knowledge graph features specifically comprises the following processing steps:
A human can perform navigation tasks efficiently in a room because the human has a priori knowledge of the layout of the room, such as a refrigerator often in a kitchen, a sofa often in a living room, a bed often in a bedroom, etc. The present embodiment uses probabilistic models to model this knowledge, extracting relevant information from the knowledge graph, causing the robot to reason like a human.
(1) Object-room reasoning:
One common scenario in human navigation is that, when a mobile phone needs to be found, a human considers that the phone may be on the desk in the study or on a cabinet in the bedroom. This is because a human can estimate, from prior knowledge, the rooms in which an object is likely to appear and thereby infer the most likely room. Similarly, the knowledge-graph reasoning proposed in this embodiment aims to provide prior knowledge by calculating the probability of the object position. To build a probability model between an object $o_i$ and a room $r_j$, the conditional probability is defined as follows:

$$P(r_j \mid o_i) = \frac{f_{or}(o_i, r_j)}{\sum_{k} f_{or}(o_i, r_k)}$$

wherein $P(r_j \mid o_i)$ denotes the probability that object $o_i$ can be found in room $r_j$; the sum runs over the set of all rooms in the current navigation task, and $r_k$ is the $k$-th room in that set.
(2) Object-object reasoning:
Likewise, the relationship between objects also provides an important reference for determining the position of objects; for example, tables and chairs tend to appear in the same frame of an image. The conditional probability of this relationship is expressed as follows:

$$P(o_j \mid o_i) = \frac{f_{oo}(o_i, o_j)}{\sum_{k} f_{oo}(o_i, o_k)}$$

wherein $P(o_j \mid o_i)$ represents the probability that object $o_i$ and object $o_j$ appear in the same frame; the sum runs over the set of objects detected at the current viewing angle, and $o_k$ is the $k$-th object in that set.
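The two conditional-probability relation graphs can then be obtained by normalizing the frequency counts, for example as in the short numpy sketch below; the matrix layout and names are assumptions for illustration, and in practice the normalization runs over the candidate set (rooms of the current task, or objects detected at the current view).

```python
import numpy as np


def conditional_from_frequency(freq: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """freq[i, k]: frequency of object i with room k (or with object k).
    Returns P[i, k] = freq[i, k] / sum_k freq[i, k], i.e. the probability of
    finding object i in room k (or of co-occurring with object k)."""
    return freq / (freq.sum(axis=1, keepdims=True) + eps)


# Example (hypothetical names): A_or_f / A_oo_f are the frequency adjacency matrices.
# A_or_p = conditional_from_frequency(A_or_f)   # object-room conditional probabilities
# A_oo_p = conditional_from_frequency(A_oo_f)   # object-object conditional probabilities
```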
(3) Graph convolution network feature extraction:
Four relationship graphs can be obtained through the above operations, representing respectively the object-room frequency relationship, the object-object frequency relationship, the object-room conditional probability relationship and the object-object conditional probability relationship. The four relationship graphs are expressed in the form of adjacency matrices $A_{or}^{f}$, $A_{oo}^{f}$, $A_{or}^{p}$ and $A_{oo}^{p}$.

In feature extraction, the four relationship graphs are used as the input of the graph convolutional neural network, which outputs the hierarchical knowledge-graph features; the network contains four input channels that respectively receive the input of each relationship graph, specifically:

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}\,W^{(l)}\right)$$

wherein $H^{(l)} \in \mathbb{R}^{n \times m}$ represents the graph features extracted by the $l$-th layer, its rows being the feature vectors of the nodes in the graph, with $m$ the dimension of the feature vectors and $n$ the number of nodes; $\tilde{A} = A + I_n$ is the adjacency matrix with self-connections, where $I_n$ is the identity matrix; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $W^{(l)}$ is the learnable weight of the $l$-th layer; and $\sigma$ is the activation function.
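A minimal sketch of the four-channel graph convolution is given below; each channel applies the propagation rule above to one relation graph and the channel outputs are concatenated into the knowledge-graph feature. The layer widths, two-layer depth, mean pooling and concatenation strategy are assumptions for illustration.

```python
import torch
import torch.nn as nn


def normalized_adj(adj: torch.Tensor) -> torch.Tensor:
    """D^{-1/2} (A + I) D^{-1/2} with self-connections added."""
    a_hat = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-8).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)


class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)   # learnable W of this layer

    def forward(self, h: torch.Tensor, adj_norm: torch.Tensor) -> torch.Tensor:
        return torch.relu(adj_norm @ self.weight(h))


class FourChannelKGEncoder(nn.Module):
    """One input channel per relation graph (object-room / object-object,
    frequency / conditional probability); outputs one knowledge-graph feature."""

    def __init__(self, node_dim: int, hidden: int = 128, out: int = 64):
        super().__init__()
        self.layer1 = nn.ModuleList([GCNLayer(node_dim, hidden) for _ in range(4)])
        self.layer2 = nn.ModuleList([GCNLayer(hidden, out) for _ in range(4)])

    def forward(self, node_feats, adjs):
        # node_feats: (n, node_dim) node attribute embeddings (name + landmark flag);
        # adjs: the four relation graphs as (n, n) adjacency matrices.
        outs = []
        for l1, l2, adj in zip(self.layer1, self.layer2, adjs):
            a = normalized_adj(adj)
            outs.append(l2(l1(node_feats, a), a).mean(dim=0))   # pool node features
        return torch.cat(outs, dim=-1)                          # concatenated KG feature
```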
Step 402: construction and training of active navigation strategy model based on deep learning
In a specific implementation, this embodiment provides an active exploration strategy under which the robot obtains the vision-voice features and the knowledge-graph features at the current viewing angle through the foregoing method and then, based on the A3C algorithm, estimates the optimal behavior by combining them with its own pose. In order to mitigate the sparse-reward and delayed-reward problems in reinforcement learning, this embodiment provides a sub-goal-based supervision method that uses a ground-truth path to provide additional forward rewards, guiding the agent to learn the causal relationship between state changes and behavior selection between successive sub-goals.
(1) Construction of active navigation strategy model
In this embodiment, the active navigation strategy model is constructed with a behavior prediction network based on A3C (Asynchronous Advantage Actor Critic). The A3C algorithm is used to learn the optimal behavior; it can train multiple agents simultaneously, so computing resources are used more effectively and the training effect is enhanced, while strong correlation between samples is reduced and faster convergence is achieved. A3C is therefore robust and fast, and can be used to handle navigation tasks and perform behavior-prediction learning in different scenes. A3C learns the policy function and the value function simultaneously in a deep network and is implemented through multi-threaded processing. Specifically, this embodiment formats the trajectory of robot navigation as $\{(s_t, a_t, r_t, s_{t+1})\}$: for each time step $t$, $s_t$ represents the current system state, consisting of the robot pose and the observed RGB image, $g$ indicates the state of the navigation target, $a_t$ represents the currently selected behavior, $r_t$ represents the reward the system obtains by taking action $a_t$ at the current time step, and $s_{t+1}$ indicates the state at the next moment.

The task of the behavior prediction network is, given $s_t$ and $g$, to predict the behavior $a_t$ and the state value $V(s_t, g)$. The behavior estimation network proposed in this embodiment is implemented by a five-layer multilayer perceptron (MLP), and the network structure is shown in fig. 4:
The loss function adopted by the A3C algorithm in this embodiment is that of the Actor-Critic (AC) framework. The policy function is updated by minimizing the policy loss function:

$$L_{policy} = -\mathbb{E}\big[\log \pi(a_t \mid s_t, g;\theta)\,(R_t^{(k)} - V(s_t, g;\theta_v))\big]$$

where the expectation $\mathbb{E}[\cdot]$ is taken over the time steps $t = 0,\dots,T$ of a Markov decision run, $T$ being the total time for which the decision process runs; $\pi$ and $V$ respectively represent the strategy function and the value function, $\theta$ and $\theta_v$ are their parameters, and $R_t^{(k)}$, the accumulated return over $k$ steps, is expressed as:

$$R_t^{(k)} = \sum_{i=0}^{k-1} \gamma^{i}\, r_{t+i} + \gamma^{k}\, V(s_{t+k}, g;\theta_v)$$

wherein $\gamma$ represents the discount factor, so that the system pays more attention to immediate rewards while future rewards are attenuated appropriately; $r_t$ represents the real-time reward value at time step $t$, and $k$ is the upper limit of the time steps. The reward value in this embodiment is set as follows: when the robot successfully completes the navigation, the system gives a large reward value of 10; in addition, a positive reward is given at each time step if the distance between the robot and the target decreases,

$$r_t = \lambda_d\,(d_{t-1} - d_t)$$

wherein $d_t$ indicates the distance from the robot to the target at time step $t$ and $\lambda_d$ is a scale factor. Meanwhile, in order to make the robot reach the target in the shortest possible time, a penalty term is introduced into the reward: each step taken incurs a penalty of 0.01, forcing every step of the robot to close the distance to the target as much as possible. The value function is learned by minimizing the value loss function:

$$L_{value} = \mathbb{E}\big[(R_t^{(k)} - V(s_t, g;\theta_v))^2\big]$$

The loss function of the final behavior estimation network is:

$$L_{nav} = L_{policy} + \lambda_v L_{value}$$

wherein $\lambda_v$ is a hyper-parameter; as above, $L_{nav}$ is the loss function of the behavior estimation network, $L_{policy}$ is the policy gradient loss and $L_{value}$ is the value residual loss.
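For concreteness, a compact sketch of the behavior estimation network and the corresponding A3C loss is given below; the hidden width, the discrete action set, the way the fused features are concatenated into the input, and the coefficient name value_coef are illustrative assumptions rather than the exact settings of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BehaviorEstimationNet(nn.Module):
    """Five-layer MLP mapping [vision-speech fusion, KG feature, robot pose]
    to an action distribution (actor) and a state value (critic)."""

    def __init__(self, in_dim: int, n_actions: int = 4, hidden: int = 512):
        super().__init__()
        dims = [in_dim, hidden, hidden, hidden, hidden]
        self.trunk = nn.Sequential(*[
            layer for i in range(4)
            for layer in (nn.Linear(dims[i], dims[i + 1]), nn.ReLU())
        ])
        self.policy_head = nn.Linear(hidden, n_actions)   # pi(a | s, g)
        self.value_head = nn.Linear(hidden, 1)            # V(s, g)

    def forward(self, state_goal: torch.Tensor):
        h = self.trunk(state_goal)
        return F.log_softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)


def a3c_loss(log_probs, values, returns, actions, value_coef: float = 0.5):
    """L_nav = L_policy + lambda_v * L_value, with `returns` the k-step returns R_t."""
    advantage = returns - values                          # R_t^{(k)} - V(s_t, g)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()
    return policy_loss + value_coef * value_loss
```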
(2) Training of active navigation strategy models
Sparse rewards are one of the biggest challenges faced in deep reinforcement learning. In the active navigation process, a larger positive reward can be obtained only after the robot reaches the target point, so the convergence speed of the network is slower.
Therefore, in this embodiment, a supervised learning method based on sub-targets is provided, in which a truth-value path is used as an expert trajectory, and includes a plurality of sub-targets, and when a robot reaches a sub-target, an additional forward reward is obtained, so as to accelerate training of a strategy network, where the truth-value path is a predetermined shortest path, and is used for error calculation in a navigation process.
As shown in fig. 5 and 6, the extraction and learning process of the sub-targets is illustrated, and specifically includes the following processing steps:
Given a random initial state $s_0$ and a target state $g$, the optimal path can be obtained from the dataset; it is called the expert trajectory, as shown in fig. 5, and contains a plurality of intermediate sub-goals. The sub-goals are used to correct the robot's offset so that it follows an approximately optimal path; as shown in fig. 6, each time the robot reaches a sub-goal, it obtains an additional sub-goal reward:

$$r_{sub} = \frac{\lambda_s}{N}$$

where $N$ is the shortest number of steps from the start point to the end point and $\lambda_s$ is a hyper-parameter.
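A minimal sketch of the resulting reward shaping is given below; the success reward of 10, the per-step penalty of 0.01 and the sub-goal bonus $\lambda_s / N$ follow the description above, while the function signature and the default hyper-parameter values are assumptions for illustration.

```python
def step_reward(prev_dist, cur_dist, reached_goal, reached_subgoal,
                n_expert_steps, lambda_d=0.1, lambda_s=1.0):
    """Reward for one navigation step (sketch).

    prev_dist / cur_dist : distance to the target before and after the step
    reached_goal         : True when the navigation target is reached
    reached_subgoal      : True when an intermediate sub-goal on the expert
                           trajectory is reached
    n_expert_steps       : N, shortest number of steps from start to goal
    """
    r = -0.01                                   # per-step penalty to encourage short paths
    r += lambda_d * (prev_dist - cur_dist)      # positive if the robot moved closer
    if reached_subgoal:
        r += lambda_s / n_expert_steps          # extra forward reward from the expert trajectory
    if reached_goal:
        r += 10.0                               # large reward on successful navigation
    return r
```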
During navigation along the expert trajectory, the sub-goal images taken from the expert trajectory help the network learn the causal relationship between state changes and behavior selection, and help the network generalize to unfamiliar scenes; for example, the pattern of passing through a door or walking along a corridor learned from expert trajectories can be applied directly to similar situations in unfamiliar rooms, thereby improving the generalization capability of the network.
Embodiment two:
The embodiment provides a visual-voice navigation system with enhanced knowledge graph.
A knowledge-graph enhanced visual-to-speech navigation system, comprising:
The data acquisition unit is used for acquiring a voice instruction of a user and a scene image under a current visual angle in real time;
A vision-voice fusion feature extraction unit for extracting a voice feature and a visual feature, respectively, based on a user voice instruction and a scene image; based on the voice characteristics and the visual characteristics, obtaining visual-voice fusion characteristics through characteristic fusion;
The behavior prediction unit is used for obtaining a predicted behavior under a user voice instruction by utilizing a pre-trained active navigation strategy model based on deep learning based on the vision-voice fusion characteristic, the current pose of the robot and scene level knowledge graph characteristics constructed based on a real scene data set; performing navigation of the robot based on the obtained predicted behavior;
The construction and feature extraction of the scene level knowledge graph comprise the following steps: hierarchical division is carried out on buildings under different scenes by scenes, rooms and objects; constructing a scene level knowledge graph by taking a scene, a room and an object as nodes, and taking the relation between the room and the object and the relation between the object and the object as edges; based on the constructed hierarchical knowledge graph, combining the relation between the objects and the room and the relation between the objects to obtain a plurality of relation graphs; based on the obtained relation diagrams, a scene level knowledge graph characteristic is obtained by utilizing a pre-trained diagram convolution neural network.
It will be appreciated that the system in this embodiment corresponds to the method in the first embodiment, and its technical details have been described in the first embodiment, so they are not described in detail herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A knowledge-based enhanced visual-to-speech navigation method, comprising:
Collecting a voice instruction of a user and a scene image under a current visual angle in real time;
respectively extracting voice features and visual features based on the voice command and the scene image of the user;
Based on the voice characteristics and the visual characteristics, obtaining visual-voice fusion characteristics through characteristic fusion;
based on vision-voice fusion characteristics, the current pose of the robot and scene level knowledge graph characteristics constructed based on a real scene data set, a pre-trained active navigation strategy model based on deep learning is utilized to obtain a predicted behavior under a user voice instruction, and the robot is navigated based on the obtained predicted behavior;
The construction and feature extraction of the scene level knowledge graph comprise the following steps: hierarchical division is carried out on data in the real scene data set according to scenes, rooms and objects; based on the data after the hierarchical division, constructing a scene hierarchical knowledge graph by taking a scene, a room and an object as nodes and taking the relation between the scene and the room, the relation between the room and the object and the relation between the object as edges; based on a scene level knowledge graph corresponding to a scene to which a region to be navigated belongs, combining a relation between objects and a room and a relation between objects to obtain a frequency relation graph between objects and the room, a frequency relation graph between objects, a conditional probability relation graph between objects and the room and a conditional probability relation graph between objects; based on the obtained relation diagram, a pre-trained diagram convolution neural network is utilized to obtain scene level knowledge graph characteristics;
The data after hierarchical division is based on the scene, the room and the object as nodes, and the relation between the scene and the room, the relation between the room and the object and the relation between the object and the object as edges, so as to construct a scene hierarchical knowledge graph, specifically comprising the following steps: obtaining a panoramic image corresponding to each view point from a real scene data set, wherein the panoramic image is composed of a plurality of image frames occupying a preset angle view angle; object appearing under the current viewpoint is obtained by carrying out target detection on each frame of image of each viewpoint; based on the obtained objects at each view point, the connection relation between nodes is determined by combining the relation between the objects and the relation between the rooms and the objects, so that the construction of the hierarchical knowledge graph is realized;
Setting a directory of landmark objects, wherein the landmark objects refer to objects that have a fixed position, a large volume, or are easy to observe in the environment, and these objects help reveal correlations with other objects or rooms; the attributes of the nodes comprise node names and whether the nodes are landmark objects or not;
For the obtained objects at the respective viewpoints, there is a correlation between the objects when two objects appear in the image at the same viewing angle; when an object is present in a room, the object has a correlation with the room; edges are added between the corresponding object nodes and room nodes, and between the corresponding object nodes, for the correlated object-room and object-object pairs; no edge is arranged between rooms;
In constructing a scene-level knowledge graph based on a real scene data set, the object nodes are specifically obtained by: performing target detection on each frame of image of each view point by taking Faster R-CNN as a detector, recording a detection result of each view point, and acquiring an object appearing under the current view point through an adjacent frame object matching algorithm, wherein whether the object in the next frame appears in an adjacent frame or not can be calculated by using image similarity;
In the training of the active navigation strategy model based on deep learning, a supervised learning method based on a sub-target is adopted, and a plurality of intermediate positions passed by when the robot reaches a target position are used as the sub-target, so that additional forward rewards are obtained when the robot reaches the sub-target, and the training of the active navigation strategy model is accelerated;
By directly fusing the voice and visual observation features, the navigation accuracy is ensured under the guidance of the hierarchical knowledge graph features.
2. The knowledge-graph-enhanced visual-speech navigation method of claim 1, wherein the deep-learning-based active navigation strategy model employs an A3C-based behavior estimation network model, wherein the behavior estimation network model comprises a multi-layer perceptron with a plurality of layers.
3. The knowledge-graph enhanced visual-speech navigation method of claim 2, wherein the behavior estimation network model employs the following loss function:
$$L_{nav} = L_{policy} + \lambda_v L_{value},\qquad L_{policy} = -\mathbb{E}\big[\log \pi(a_t \mid s_t, g;\theta)\,(R_t^{(k)} - V(s_t, g;\theta_v))\big],\qquad L_{value} = \mathbb{E}\big[(R_t^{(k)} - V(s_t, g;\theta_v))^2\big]$$

wherein $L_{nav}$ is the loss function of the behavior estimation network, $L_{policy}$ is the policy gradient loss, $L_{value}$ is the value residual loss, $\pi$ and $V$ respectively represent the strategy function and the value function, $\theta$ and $\theta_v$ respectively represent their parameters, $R_t^{(k)}$ is the accumulated return over $k$ steps, $s_t$ indicates the current system state, $g$ indicates the state of the navigation target, $a_t$ represents the currently selected behavior, $T$ is the total time for which a Markov decision run lasts and the expectation $\mathbb{E}[\cdot]$ is taken over its time steps, and $\lambda_v$ is a hyper-parameter.
4. The visual-speech navigation method with enhanced knowledge-graph according to claim 1, wherein the extracting of the speech feature and the visual feature respectively comprises: voice feature extraction is carried out by adopting a voice understanding module based on a Whisper neural network; visual feature extraction is performed using a ResNet-based visual feature extraction module.
5. The visual-speech navigation method of claim 1, wherein the feature fusion specifically adopts a visual-voice fusion module based on a multiway Transformer network, the visual-voice fusion module consisting of a shared self-attention module and two modal expert networks, wherein each modal expert network consists of a feedforward neural network; for the input visual features and voice features, the shared self-attention module first splices the two to obtain a fused feature and applies self-attention to it; two feedforward neural networks are then used as a visual expert network and a voice expert network respectively, learning new visual and voice features from the fusion space; the new visual features and new voice features are finally spliced to obtain the fusion feature.
6. A knowledge-graph enhanced visual-to-speech navigation system, comprising:
The data acquisition unit is used for acquiring a voice instruction of a user and a scene image under a current visual angle in real time;
A vision-voice fusion feature extraction unit for extracting a voice feature and a visual feature, respectively, based on a user voice instruction and a scene image; based on the voice characteristics and the visual characteristics, obtaining visual-voice fusion characteristics through characteristic fusion;
The behavior prediction unit is used for obtaining a predicted behavior under a user voice instruction by utilizing a pre-trained active navigation strategy model based on deep learning based on the vision-voice fusion characteristic, the current pose of the robot and scene level knowledge graph characteristics constructed based on a real scene data set; performing navigation of the robot based on the obtained predicted behavior;
The construction and feature extraction of the scene level knowledge graph comprise the following steps: hierarchical division is carried out on buildings under different scenes by scenes, rooms and objects; constructing a scene level knowledge graph by taking a scene, a room and an object as nodes, and taking the relation between the room and the object and the relation between the object and the object as edges; based on the constructed hierarchical knowledge graph, combining the relation between the objects and the room and the relation between the objects to obtain a plurality of relation graphs; based on the obtained relation diagrams, a pre-trained diagram convolution neural network is utilized to obtain scene-level knowledge graph characteristics;
Based on the data after hierarchical division, a scene, a room and an object are taken as nodes, and the relationship between the scene and the room, the relationship between the room and the object and the relationship between the object and the object are taken as edges, so that a scene hierarchical knowledge graph is constructed, specifically: obtaining a panoramic image corresponding to each view point from a real scene data set, wherein the panoramic image is composed of a plurality of image frames occupying a preset angle view angle; object appearing under the current viewpoint is obtained by carrying out target detection on each frame of image of each viewpoint; based on the obtained objects at each view point, the connection relation between nodes is determined by combining the relation between the objects and the relation between the rooms and the objects, so that the construction of the hierarchical knowledge graph is realized;
Setting a directory of landmark objects, wherein the landmark objects refer to objects that have a fixed position, a large volume, or are easy to observe in the environment, and these objects help reveal correlations with other objects or rooms; the attributes of the nodes comprise node names and whether the nodes are landmark objects or not;
For the obtained objects at the respective viewpoints, there is a correlation between the objects when two objects appear in the image at the same viewing angle; when an object is present in a room, the object has a correlation with the room; edges are added between the corresponding object nodes and room nodes, and between the corresponding object nodes, for the correlated object-room and object-object pairs; no edge is arranged between rooms;
In constructing a scene-level knowledge graph based on a real scene data set, the object nodes are specifically obtained by: performing target detection on each frame of image of each view point by taking Faster R-CNN as a detector, recording a detection result of each view point, and acquiring an object appearing under the current view point through an adjacent frame object matching algorithm, wherein whether the object in the next frame appears in an adjacent frame or not can be calculated by using image similarity;
In the training of the active navigation strategy model based on deep learning, a supervised learning method based on a sub-target is adopted, and a plurality of intermediate positions passed by when the robot reaches a target position are used as the sub-target, so that additional forward rewards are obtained when the robot reaches the sub-target, and the training of the active navigation strategy model is accelerated;
By directly fusing the voice and visual observation features, the navigation accuracy is ensured under the guidance of the hierarchical knowledge graph features.
CN202410726056.9A 2024-06-06 2024-06-06 Visual-voice navigation method and system with enhanced knowledge graph Active CN118293927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410726056.9A CN118293927B (en) 2024-06-06 2024-06-06 Visual-voice navigation method and system with enhanced knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410726056.9A CN118293927B (en) 2024-06-06 2024-06-06 Visual-voice navigation method and system with enhanced knowledge graph

Publications (2)

Publication Number Publication Date
CN118293927A CN118293927A (en) 2024-07-05
CN118293927B true CN118293927B (en) 2024-08-20

Family

ID=91684851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410726056.9A Active CN118293927B (en) 2024-06-06 2024-06-06 Visual-voice navigation method and system with enhanced knowledge graph

Country Status (1)

Country Link
CN (1) CN118293927B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111645073A (en) * 2020-05-29 2020-09-11 武汉理工大学 Robot visual semantic navigation method, device and system
CN116089609A (en) * 2022-12-19 2023-05-09 之江实验室 Autonomous decision-making execution method and device for robot task

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668687B (en) * 2020-12-01 2022-08-26 达闼机器人股份有限公司 Cloud robot system, cloud server, robot control module and robot
CN112948508B (en) * 2021-04-15 2024-04-30 平安科技(深圳)有限公司 Information prediction method, device, equipment and medium based on multi-layer associated knowledge graph
US20230086248A1 (en) * 2021-09-21 2023-03-23 Meta Platforms Technologies, Llc Visual navigation elements for artificial reality environments
CN116798254A (en) * 2022-03-18 2023-09-22 北京百度网讯科技有限公司 Intelligent interaction method, device, equipment and storage medium
CN115100643B (en) * 2022-08-26 2022-11-11 潍坊现代农业与生态环境研究院 Monocular vision positioning enhancement method and equipment fusing three-dimensional scene semantics
CN117874258A (en) * 2024-01-26 2024-04-12 北京工业大学 Task sequence intelligent planning method based on language visual large model and knowledge graph
CN118010009B (en) * 2024-04-10 2024-06-11 北京爱宾果科技有限公司 Multi-mode navigation system of educational robot in complex environment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111645073A (en) * 2020-05-29 2020-09-11 武汉理工大学 Robot visual semantic navigation method, device and system
CN116089609A (en) * 2022-12-19 2023-05-09 之江实验室 Autonomous decision-making execution method and device for robot task

Also Published As

Publication number Publication date
CN118293927A (en) 2024-07-05

Similar Documents

Publication Publication Date Title
Gu et al. Vision-and-language navigation: A survey of tasks, methods, and future directions
JP6921341B2 (en) Navigation methods, devices, devices, and storage media based on ground texture images
Ravichandran et al. Hierarchical representations and explicit memory: Learning effective navigation policies on 3d scene graphs using graph neural networks
Eysenbach et al. Search on the replay buffer: Bridging planning and reinforcement learning
Thomason et al. Shifting the baseline: Single modality performance on visual navigation & qa
KR102177412B1 (en) System and Method for Matching Similarity between Image and Text
Landi et al. Embodied vision-and-language navigation with dynamic convolutional filters
Luo et al. Hierarchical semantic mapping using convolutional neural networks for intelligent service robotics
EP4137997B1 (en) Methods and system for goal-conditioned exploration for object goal navigation
Wu et al. Vision-language navigation: a survey and taxonomy
Rajvanshi et al. Saynav: Grounding large language models for dynamic planning to navigation in new environments
Zhou et al. Optimal graph transformer viterbi knowledge inference network for more successful visual navigation
CN117634486B (en) Directional 3D instance segmentation method based on text information
Kato et al. Users' preference prediction of real estate properties based on floor plan analysis
CN111105442A (en) Switching type target tracking method
Kim et al. Active object search in an unknown large-scale environment using commonsense knowledge and spatial relations
Wang et al. Camp: Causal multi-policy planning for interactive navigation in multi-room scenes
Münch et al. High-level situation recognition using fuzzy metric temporal logic, case studies in surveillance and smart environments
CN118293927B (en) Visual-voice navigation method and system with enhanced knowledge graph
Lin et al. The development of llms for embodied navigation
Zhou et al. Improving indoor visual navigation generalization with scene priors and Markov relational reasoning
CN117058235A (en) Visual positioning method crossing various indoor scenes
Guo et al. Object goal visual navigation using semantic spatial relationships
Yan et al. Indoor target-driven visual navigation based on spatial semantic information
Wei et al. OVExp: Open Vocabulary Exploration for Object-Oriented Navigation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant