CN117685968A - Method for navigating intelligent agent and intelligent agent - Google Patents

Method for navigating intelligent agent and intelligent agent

Info

Publication number
CN117685968A
Authority
CN
China
Prior art keywords
navigation
point
updated
reachable
navigation point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211035796.5A
Other languages
Chinese (zh)
Inventor
杨立荣
赵禹昇
张立鹏
孔祥浩
任海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority claimed from CN202211035796.5A
Publication of CN117685968A
Legal status: Pending

Landscapes

  • Navigation (AREA)

Abstract

The application discloses a method for navigating an agent, and an agent, and belongs to the technical field of navigation. The method comprises: determining a first reachable navigation point corresponding to a current navigation point, and acquiring a first image corresponding to the first reachable navigation point; determining a predicted navigation end point, and determining a navigation point to be driven among candidate navigation points, according to a navigation instruction of the current navigation, position information corresponding to historical navigation points, pose information of the agent at the historical navigation points, topology information of the historical navigation points and the first image, wherein the candidate navigation points comprise the historical navigation points and the first reachable navigation point; and controlling the agent to travel to the navigation point to be driven. With the method and the agent of the application, navigation of the agent can be realized.

Description

Method for navigating intelligent agent and intelligent agent
Technical Field
The present disclosure relates to the field of navigation technologies, and in particular, to a method and apparatus for navigating an agent, and a storage medium.
Background
With the development of science and technology, intelligent agents have gradually come into public view. Agents are applied in various industries and currently include household service robots, delivery robots, public place guide robots and the like.
Navigating according to a navigation instruction input by a user is one of the key capabilities of an agent. How to realize navigation of an agent is a problem to be solved at present.
Disclosure of Invention
The embodiment of the application provides an agent navigation method and an agent. The technical scheme is as follows:
in a first aspect, a method for agent navigation is provided, the method comprising:
determining a first reachable navigation point corresponding to a current navigation point, and acquiring a first image corresponding to the first reachable navigation point;
determining a predicted navigation end point according to a navigation instruction of the current navigation, position information corresponding to a history navigation point, pose information of an agent at the history navigation point, topology information of the history navigation point and the first image, and determining a navigation point to be driven in candidate navigation points, wherein the candidate navigation points comprise the history navigation point and the first reachable navigation point;
and controlling the agent to travel to the navigation point to be driven.
In one possible implementation, the method further includes:
determining a second reachable navigation point corresponding to a navigation starting point, acquiring a second image corresponding to the second reachable navigation point, and determining an image feature vector corresponding to the second reachable navigation point according to the second image;
Determining a language feature vector corresponding to each word in the navigation instruction and an overall feature vector corresponding to the navigation instruction;
determining a navigation point marking vector corresponding to the navigation starting point according to the image feature vector corresponding to the second reachable navigation point, the position information corresponding to the navigation starting point, the pose information of the agent at the navigation starting point and the time sequence information corresponding to the navigation starting point;
selecting a plurality of initial candidate navigation end points from candidate navigation end points whose distance from the navigation starting point is a preset distance, and determining navigation end point mark vectors respectively corresponding to the plurality of initial candidate navigation end points;
and determining a navigation point to be driven in the second reachable navigation point and determining a predicted navigation end point in the plurality of initial candidate navigation end points according to the image feature vector corresponding to the second reachable navigation point, the language feature vector, the overall feature vector, the navigation point mark vector corresponding to the navigation starting point and the navigation end point mark vector corresponding to each initial candidate navigation end point.
In one possible implementation manner, the determining the navigation point to be driven according to the image feature vector corresponding to the second reachable navigation point, the language feature vector and the overall feature vector, the navigation point mark vector corresponding to the navigation start point, and the navigation end point mark vector corresponding to each initial candidate navigation end point, and determining the predicted navigation end point in the plurality of initial candidate navigation end points includes:
The overall feature vector is used as a global mark vector, the language feature vector, the global mark vector, a navigation point mark vector corresponding to the navigation starting point, an image feature vector corresponding to the second reachable navigation point and a navigation end point mark vector corresponding to each initial candidate navigation end point are input into a transformer model, and a first updated global mark vector, a first updated navigation point mark vector corresponding to the navigation starting point, a first updated image feature vector corresponding to the second reachable navigation point and a first updated navigation end point mark vector corresponding to each initial candidate navigation end point are obtained;
determining a navigation point to be driven in the second reachable navigation point according to the first updated image feature vector and the first updated global mark vector;
and determining a predicted navigation endpoint from the plurality of initial candidate navigation endpoints according to the first updated navigation endpoint marker vector.
In one possible implementation manner, the determining a predicted navigation end point according to a navigation instruction of the current navigation, position information corresponding to a historical navigation point, pose information of an agent at the historical navigation point, topology information of the historical navigation point and the first image, and determining a navigation point to be driven in candidate navigation points includes:
acquiring a second updated global marker vector obtained at a previous historical navigation point through the transformer model, a second updated navigation point marker vector corresponding to each historical navigation point other than the previous historical navigation point, and a plurality of second updated navigation end point marker vectors;
determining a navigation point marking vector corresponding to the previous historical navigation point according to the image feature vector corresponding to a third reachable navigation point corresponding to the previous historical navigation point, the position information corresponding to the previous historical navigation point, the pose information of the agent at the previous historical navigation point and the time sequence information corresponding to the previous historical navigation point;
determining an image feature vector corresponding to the first reachable navigation point according to the first image;
and determining a navigation point to be driven in the candidate navigation points according to the second updated global mark vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point mark vector, the navigation point mark vector corresponding to the previous historical navigation point, the topology information of the historical navigation point, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point mark vectors, and determining a predicted navigation end point.
In one possible implementation manner, the determining, according to the first image, an image feature vector corresponding to the first reachable navigation point includes:
determining a position feature vector corresponding to the first reachable navigation point according to the position information corresponding to the first reachable navigation point;
inputting the first image into an image feature extractor to obtain an initial feature vector of the first image;
and adding the initial feature vector and the position feature vector corresponding to the first reachable navigation point to obtain the image feature vector corresponding to the first reachable navigation point.
In one possible implementation manner, the determining a navigation point to be driven among candidate navigation points according to the second updated global marker vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point marker vector, the navigation point marker vector corresponding to the previous history navigation point, topology information of the history navigation point, the image feature vector corresponding to the first reachable navigation point, and the plurality of second updated navigation end point marker vectors, and determining a predicted navigation end point include:
determining an adjacency matrix corresponding to the historical navigation point according to the topology information of the historical navigation point, and taking the adjacency matrix as the attention matrix for the second updated navigation point marker vector and the navigation point marker vector corresponding to the previous historical navigation point;
inputting the second updated global marker vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point marker vector, the navigation point marker vector corresponding to the previous history navigation point, the attention matrix, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point marker vectors into the transformer model to obtain a third updated global marker vector, a third updated navigation point marker vector corresponding to each history navigation point, a second updated image feature vector corresponding to the first reachable navigation point and a plurality of third updated navigation end point marker vectors;
determining a navigation point to be driven in the candidate navigation points according to the third updated global marker vector, the third updated navigation point marker vector and the second updated image feature vector;
determining update candidate navigation terminal points respectively corresponding to the plurality of third updated navigation terminal point mark vectors;
and determining a predicted navigation terminal point in the update candidate navigation terminal points respectively corresponding to the third updated navigation terminal point mark vectors according to the third updated navigation terminal point mark vectors.
In a possible implementation manner, the determining the navigation point to be driven according to the third updated global marker vector, the third updated navigation point marker vector and the second updated image feature vector in the candidate navigation points includes:
calculating the similarity of the second updated image feature vector corresponding to each first reachable navigation point and the third updated global marker vector, and taking the similarity as the confidence coefficient corresponding to each first reachable navigation point;
calculating the similarity of a third updated navigation point mark vector corresponding to each historical navigation point and the third updated global mark vector, and taking the similarity as the confidence coefficient corresponding to each historical navigation point;
and determining a navigation point corresponding to the maximum confidence coefficient from the first reachable navigation point and the historical navigation point as a navigation point to be driven.
In one possible implementation manner, the determining, according to the plurality of third updated navigation endpoint marker vectors, a predicted navigation endpoint in the update candidate navigation endpoints respectively corresponding to the plurality of third updated navigation endpoint marker vectors includes:
inputting the plurality of third updated navigation terminal point marking vectors into a navigation terminal point prediction model to obtain the confidence coefficient of the updated candidate navigation terminal points respectively corresponding to the plurality of third updated navigation terminal point marking vectors;
And taking the updated candidate navigation terminal corresponding to the maximum confidence as a predicted navigation terminal.
In one possible implementation, the navigation endpoint prediction model consists of a multi-layer perceptron MLP and a normalization function.
In a second aspect, there is provided an apparatus for agent navigation, the apparatus comprising:
the determining module is used for determining a first reachable navigation point corresponding to the current navigation point and acquiring a first image corresponding to the first reachable navigation point;
the prediction module is used for determining a predicted navigation end point according to a navigation instruction of the current navigation, position information corresponding to a history navigation point, pose information of an agent at the history navigation point, topology information of the history navigation point and the first image, and determining a navigation point to be driven in candidate navigation points, wherein the candidate navigation points comprise the history navigation point and the first reachable navigation point;
and the navigation module is used for controlling the agent to travel to the navigation point to be driven.
In one possible implementation, the prediction module is further configured to:
determining a second reachable navigation point corresponding to a navigation starting point, acquiring a second image corresponding to the second reachable navigation point, and determining an image feature vector corresponding to the second reachable navigation point according to the second image;
Determining a language feature vector corresponding to each word in the navigation instruction and an overall feature vector corresponding to the navigation instruction;
determining a navigation point marking vector corresponding to the navigation starting point according to the image feature vector corresponding to the second reachable navigation point, the position information corresponding to the navigation starting point, the pose information of the agent at the navigation starting point and the time sequence information corresponding to the navigation starting point;
selecting a plurality of initial candidate navigation end points from candidate navigation end points whose distance from the navigation starting point is a preset distance, and determining navigation end point mark vectors respectively corresponding to the plurality of initial candidate navigation end points;
and determining a navigation point to be driven in the second reachable navigation point and determining a predicted navigation end point in the plurality of initial candidate navigation end points according to the image feature vector corresponding to the second reachable navigation point, the language feature vector, the overall feature vector, the navigation point mark vector corresponding to the navigation starting point and the navigation end point mark vector corresponding to each initial candidate navigation end point.
In one possible implementation, the prediction module is configured to:
taking the overall feature vector as a global mark vector, inputting the language feature vector, the global mark vector, a navigation point mark vector corresponding to the navigation starting point, an image feature vector corresponding to the second reachable navigation point and a navigation end point mark vector corresponding to each initial candidate navigation end point into a transformer model to obtain a first updated global mark vector, a first updated navigation point mark vector corresponding to the navigation starting point, a first updated image feature vector corresponding to the second reachable navigation point and a first updated navigation end point mark vector corresponding to each of the plurality of initial candidate navigation end points;
Determining a navigation point to be driven in the second reachable navigation point according to the first updated image feature vector and the first updated global mark vector;
and determining a predicted navigation endpoint from the plurality of initial candidate navigation endpoints according to the first updated navigation endpoint marker vector.
In one possible implementation, the prediction module is configured to:
acquiring a second updated global marker vector obtained at a previous historical navigation point through the transformer model, a second updated navigation point marker vector corresponding to each historical navigation point other than the previous historical navigation point, and a plurality of second updated navigation end point marker vectors;
determining a navigation point marking vector corresponding to the previous historical navigation point according to the image feature vector corresponding to a third reachable navigation point corresponding to the previous historical navigation point, the position information corresponding to the previous historical navigation point, the pose information of the agent at the previous historical navigation point and the time sequence information corresponding to the previous historical navigation point;
determining an image feature vector corresponding to the first reachable navigation point according to the first image;
and determining a navigation point to be driven in the candidate navigation points according to the second updated global mark vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point mark vector, the navigation point mark vector corresponding to the previous historical navigation point, the topology information of the historical navigation point, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point mark vectors, and determining a predicted navigation end point.
In one possible implementation, the prediction module is configured to:
determining a position feature vector corresponding to the first reachable navigation point according to the position information corresponding to the first reachable navigation point;
inputting the first image into an image feature extractor to obtain an initial feature vector of the first image;
and adding the initial feature vector and the position feature vector corresponding to the first reachable navigation point to obtain the image feature vector corresponding to the first reachable navigation point.
In one possible implementation, the prediction module is configured to:
determining an adjacency matrix corresponding to the historical navigation point according to the topology information of the historical navigation point, and taking the adjacency matrix as the attention matrix for the second updated navigation point marker vector and the navigation point marker vector corresponding to the previous historical navigation point;
inputting the second updated global marker vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point marker vector, the navigation point marker vector corresponding to the previous history navigation point, the attention matrix, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point marker vectors into the transformer model to obtain a third updated global marker vector, a third updated navigation point marker vector corresponding to each history navigation point, a second updated image feature vector corresponding to the first reachable navigation point and a plurality of third updated navigation end point marker vectors;
Determining a navigation point to be driven in the candidate navigation points according to the third updated global marker vector, the third updated navigation point marker vector and the second updated image feature vector;
determining update candidate navigation terminal points respectively corresponding to the plurality of third updated navigation terminal point mark vectors;
and determining a predicted navigation terminal point in the update candidate navigation terminal points respectively corresponding to the third updated navigation terminal point mark vectors according to the third updated navigation terminal point mark vectors.
In one possible implementation, the prediction module is configured to:
calculating the similarity of the second updated image feature vector corresponding to each first reachable navigation point and the third updated global marker vector, and taking the similarity as the confidence coefficient corresponding to each first reachable navigation point;
calculating the similarity of a third updated navigation point mark vector corresponding to each historical navigation point and the third updated global mark vector, and taking the similarity as the confidence coefficient corresponding to each historical navigation point;
and determining a navigation point corresponding to the maximum confidence coefficient from the first reachable navigation point and the historical navigation point as a navigation point to be driven.
In one possible implementation, the prediction module is configured to:
Inputting the plurality of third updated navigation terminal point marking vectors into a navigation terminal point prediction model to obtain the confidence coefficient of the updated candidate navigation terminal points respectively corresponding to the plurality of third updated navigation terminal point marking vectors;
and taking the updated candidate navigation terminal corresponding to the maximum confidence as a predicted navigation terminal.
In one possible implementation, the navigation endpoint prediction model consists of a multi-layer perceptron and a normalization function.
In a third aspect, there is provided an agent comprising a processor and a memory having stored therein at least one instruction loaded and executed by the processor to implement the method of agent navigation as described in the first aspect above.
In a fourth aspect, a computer readable storage medium is provided, in which at least one instruction is stored, the at least one instruction being loaded and executed by a processor to implement a method of agent navigation as described in the first aspect above.
In a fifth aspect, there is provided a computer readable storage medium having stored therein at least one instruction that is loaded and executed by the processor to implement the method of agent navigation as described in the first aspect above.
The technical solutions provided in the embodiments of the present application have at least the following beneficial effects:
in the embodiment of the application, the agent first determines a first reachable navigation point corresponding to the current navigation point, and acquires a first image corresponding to the first reachable navigation point. Then, in combination with the navigation instruction of the current navigation, the position information corresponding to the historical navigation points, the pose information of the agent at the historical navigation points, the topology information of the historical navigation points and the first image, a predicted navigation end point is determined, and a navigation point to be driven is determined among the candidate navigation points. In the embodiment of the application, the candidate navigation points include not only the reachable navigation points corresponding to the current navigation point but also the historical navigation points, so that even if the agent makes a navigation error, it can return to a historical navigation point in the subsequent navigation process, and the navigation accuracy is higher. In addition, in the embodiment of the application, the topology information of the historical navigation points is also introduced when determining the navigation point to be driven, which can further improve the navigation accuracy to a certain extent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation scenario provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method for agent navigation provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a candidate navigation endpoint provided in an embodiment of the present application;
FIG. 4 is a flowchart of a method for agent navigation provided in an embodiment of the present application;
FIG. 5 is a flow chart of a method for agent navigation provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus for navigating an agent according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an agent according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the application provides a method for navigating an agent, which can be implemented by the agent. The agent may be a home service robot, a delivery robot, a public place guide robot, or the like. The method can be applied to various scenarios, such as public places like shops, museums and restaurants, private residences, designated outdoor delivery areas and the like. Referring to fig. 1, an implementation scenario is shown. The scenario shown in fig. 1 is a private residence in which the agent is in a bedroom; a user can instruct the agent to move from a current location (which may be referred to as a navigation start point) to a specified location (which may be referred to as a navigation end point) by issuing a navigation instruction to the agent.
In the method for navigating an agent provided by the embodiment of the application, the agent can determine reachable navigation points according to scene images around the current position. Then, the agent predicts a navigation end point according to the navigation instruction input by the user, the images corresponding to the reachable navigation points, the position information of the historical navigation points, the pose information of the agent at the historical navigation points and the topology information of the historical navigation points, and determines a navigation point to be driven among the historical navigation points and the reachable navigation points. The agent can then travel to the navigation point to be driven. Therefore, in the navigation scheme provided by the embodiment of the application, the topology information of the historical navigation points is introduced, and the historical navigation points are added to the candidate navigation points, so that the navigation accuracy is higher.
The following describes a method for navigating an agent according to an embodiment of the present application with reference to the accompanying drawings. As shown in fig. 2, the process flow of the method may include the following steps:
step 201, obtaining a navigation instruction.
The navigation instruction may be a natural language instruction, such as a Chinese instruction, an English instruction, etc.
In practice, a user may issue navigation instructions to an agent when he wants to instruct the agent to go to a certain location. There are a number of ways to issue navigation instructions, a few of which are described below.
Method one: issuing the navigation instruction to the agent by voice.
For example, in the scenario shown in fig. 1, where the agent is at the bedside, the user may speak a navigation instruction, such as "walk out of the room, then turn right, continue straight forward, walk to the end of the corridor," or "Go out of the room, then turn right and continue to the end of the hallway." The agent can then acquire the navigation instruction spoken by the user through an audio acquisition device.
Method two: issuing the navigation instruction to the agent through a terminal.
In the second method, the terminal can establish a communication connection with the agent, where the communication connection may be a wired connection or a wireless connection, and the wireless connection may be WiFi (Wireless Fidelity), Bluetooth or the like. The terminal may be a mobile terminal such as a mobile phone or a tablet computer, or may be a computer device.
For example, the user may enter a navigation instruction in the terminal, such as "walk out of the room, then turn right, continue straight forward, walk to the end of the corridor," or "Go out of the room, then turn right and continue to the end of the hallway," and the terminal may send the navigation instruction to the agent via the communication connection with the agent. The agent can then receive the navigation instruction sent by the terminal.
Method three: inputting the navigation instruction through a touch screen of the agent.
In the third method, the agent may be provided with a touch screen; the user can find a navigation instruction input interface on the touch screen of the agent and input a navigation instruction on the interface, for example, "walk out of the room, then turn right, continue straight forward, walk to the end of the corridor," or "Go out of the room, then turn right and continue to the end of the hallway." The agent can then acquire the navigation instruction input by the user.
In addition, a work area map may be configured in the agent in advance by relevant personnel, so that the agent knows the maximum range within which it can move. For example, if the work area is the personal residence shown in fig. 1, a map of the inside of the residence may be configured in the agent. Each time a navigation instruction is acquired, the agent can generate, on the basis of the known work area, a minimum rectangle that encloses the work area, and then divide the minimum rectangle into a plurality of squares of equal area according to a preset scale, the center of each square being a candidate navigation end point.
For example, as shown in fig. 3, a minimum rectangle enclosing the personal residence in fig. 1 is generated on the basis of fig. 1, and the minimum rectangle is then divided into a plurality of squares of equal area, the center of each square being a candidate navigation end point.
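For illustration only, the following Python sketch shows one way such a grid of candidate navigation end points could be generated from the minimum rectangle enclosing the work area; the cell size and function names are assumptions and are not specified by the application.

```python
import numpy as np

def candidate_nav_endpoints(min_x, min_y, max_x, max_y, cell=0.5):
    """Divide the minimum rectangle enclosing the work area into squares of equal
    area and return the centre of each square as a candidate navigation end point."""
    xs = np.arange(min_x + cell / 2.0, max_x, cell)
    ys = np.arange(min_y + cell / 2.0, max_y, cell)
    return [(float(x), float(y)) for x in xs for y in ys]

# e.g. a 10 m x 8 m residence discretised with 0.5 m squares
endpoints = candidate_nav_endpoints(0.0, 0.0, 10.0, 8.0)
```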
Step 202, acquiring a scene image.
In practice, at least one image acquisition device, such as a camera, is mounted on the agent. The agent may continuously acquire surrounding images of the scene via the at least one image acquisition device.
Step 203, determining the reachable navigation points according to the scene images, and obtaining images corresponding to the reachable navigation points.
A reachable navigation point is a navigation point reachable from the current position, and the reachable navigation points do not include the historical navigation points of the current navigation.
In practice, the agent may discretize the work area into a plurality of points, each of which is a possible navigation point. When the agent starts the current navigation, or each time it reaches a navigation point, the agent can perform obstacle analysis on the scene images acquired at the current navigation point to determine the reachable navigation points. The agent can then take, from the acquired scene images, an image containing a reachable navigation point as the image corresponding to that reachable navigation point, as illustrated in the sketch below.
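A minimal sketch of how the reachable navigation points could be filtered from the discretized work area is given below; the obstacle test `is_blocked` stands in for the obstacle analysis of the scene images and is an assumption, not part of the application.

```python
def reachable_nav_points(current, nav_points, is_blocked, max_dist=2.0):
    """A navigation point is treated as reachable if it lies within max_dist of the
    current position and the obstacle analysis (is_blocked, derived from the scene
    images) does not flag it; historical navigation points are excluded by the caller."""
    cx, cy = current
    reachable = []
    for x, y in nav_points:
        if (x, y) == current:
            continue
        if ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 <= max_dist and not is_blocked((x, y)):
            reachable.append((x, y))
    return reachable
```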
Step 204, determining a predicted navigation end point according to the navigation instruction, the position information corresponding to the historical navigation points, the pose information of the agent at the historical navigation points, the topology information of the historical navigation points and the images corresponding to the reachable navigation points, and determining the navigation point to be driven among the candidate navigation points.
Wherein the candidate navigation points include historical navigation points and reachable navigation points.
The following describes determining a navigation point to be driven and predicting a navigation end point when an agent is at a navigation start point (i.e. a first navigation point). Referring to fig. 4, the determination process may include the steps of:
S11, inputting the navigation instruction into a language model to obtain a language feature vector corresponding to each word in the navigation instruction and an overall feature vector corresponding to the navigation instruction.
The language model may be, for example, a Bidirectional Encoder Representations from Transformers (BERT) model.
The navigation instruction is input into the BERT model to obtain the language feature vector corresponding to each word (or character) in the navigation instruction and the overall feature vector corresponding to the navigation instruction. For ease of description, the language feature vectors corresponding to the words (or characters) are respectively denoted as x_1, x_2, …, x_m, where m is the number of words in the navigation instruction, and the overall feature vector corresponding to the navigation instruction is denoted as x_0.
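As an illustrative sketch only, the per-word language feature vectors x_1, …, x_m and the overall feature vector x_0 could be obtained with a pretrained BERT model as follows; the use of the HuggingFace transformers library, the bert-base-chinese checkpoint, and taking the pooled output as x_0 are assumptions rather than requirements of the application.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def language_features(instruction: str):
    """Return per-token language feature vectors x_1 ... x_m and an overall
    feature vector x_0 for the navigation instruction."""
    inputs = tokenizer(instruction, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    token_vectors = outputs.last_hidden_state[0]   # x_1 ... x_m (plus special tokens)
    overall_vector = outputs.pooler_output[0]      # taken here as x_0
    return token_vectors, overall_vector
```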
S12, obtaining a position feature vector corresponding to the reachable navigation point corresponding to the navigation starting point.
In the embodiment of the present application, the coordinates of the navigation start point may be preset to be (0, 0), that is, the navigation start point is taken as the origin of the navigation coordinate system.
The position information of a reachable navigation point may be the coordinates of the reachable navigation point in the navigation coordinate system. For ease of description, the position information of the i-th reachable navigation point is denoted as l_i^t, where i ∈ {1, 2, …, k_t}, k_t is the number of reachable navigation points corresponding to the t-th moment (also referred to as time sequence t), and the t-th moment is the moment (or time sequence information) corresponding to the current navigation point. The moment may be an increasing positive integer, for example: the moment corresponding to the first navigation point (the navigation start point) is 1, the moment corresponding to the second navigation point is 2, and so on.
The position information of each reachable navigation point is input into a position encoder to obtain the position feature vector corresponding to each reachable navigation point. For ease of description, the position feature vector corresponding to the i-th reachable navigation point is denoted as f_p(l_i^t), where f_p(·) represents the position encoder.
S13, obtaining an image feature vector corresponding to the reachable navigation point corresponding to the navigation starting point.
The image corresponding to each reachable navigation point is input into an image feature extractor to obtain the initial feature vector of the image corresponding to each reachable navigation point. For ease of description, the initial feature vector is denoted as v_i^t, where i ∈ {1, 2, …, k_t}, and k_t is the number of reachable navigation points corresponding to the t-th moment; if the agent is at the navigation start point, t = 1.
The initial feature vector and the position feature vector corresponding to a reachable navigation point are added to obtain the image feature vector corresponding to the reachable navigation point. The formula is as follows:
v̂_i^t = v_i^t + f_p(l_i^t)
where v̂_i^t is the image feature vector corresponding to the i-th reachable navigation point; if the agent is currently at the navigation start point, t = 1.
S14, determining a panoramic view feature vector corresponding to the navigation start point according to the image feature vectors corresponding to the reachable navigation points.
The image feature vectors corresponding to the reachable navigation points corresponding to the navigation start point are input into a panoramic view feature extractor to obtain the panoramic view feature vector corresponding to the navigation start point. For ease of description, the panoramic view feature vector is denoted as f_V({v̂_i^t}), where f_V(·) represents the panoramic view feature extractor.
S15, acquiring a time sequence feature vector of the moment corresponding to the navigation starting point.
The moment corresponding to the navigation start point is input into a time sequence feature extractor to obtain the time sequence feature vector of the moment corresponding to the navigation start point. For ease of description, the time sequence feature vector is denoted as f_T(t), where f_T(·) represents the time sequence feature extractor; if the agent is at the navigation start point, t = 1.
S16, acquiring a pose feature vector corresponding to the navigation start point.
The pose information of the agent at the navigation start point is input into an action feature extractor to obtain the pose feature vector corresponding to the agent at the navigation start point. The pose information is the orientation of the agent and can be expressed as r_t = (sin θ_t, cos θ_t, sin φ_t, cos φ_t), where θ_t is the horizontal orientation angle of the agent at the t-th moment and φ_t is the vertical orientation angle of the agent at the t-th moment; if the agent is at the navigation start point, t = 1. For ease of description, the pose feature vector is denoted as f_A(r_t), where f_A(·) represents the action feature extractor.
S17, adding the panoramic view feature vector, the time sequence feature vector, the position feature vector and the pose feature vector corresponding to the navigation start point to obtain the navigation point mark vector corresponding to the navigation start point.
The formula may be as follows:
h_t = f_V({v̂_i^t}) + f_T(t) + f_p(l_t) + f_A(r_t)
where h_t is the navigation point mark vector corresponding to the t-th navigation point and l_t is the position information of the t-th navigation point; if the agent is at the navigation start point, t = 1.
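A minimal sketch of S14 to S17 is given below; the extractors f_V, f_T, f_A and f_p are simple stand-ins whose internal structure is assumed, and the feature dimension D is arbitrary.

```python
import torch
import torch.nn as nn

D = 512  # assumed shared feature dimension

# Minimal stand-ins for the named extractors; the application only names them.
panoramic_extractor = lambda feats: feats.mean(dim=0)      # f_V: pools the image feature vectors
timing_extractor = nn.Embedding(64, D)                      # f_T: time-step embedding
action_extractor = nn.Linear(4, D)                          # f_A: pose (sinθ, cosθ, sinφ, cosφ)
position_encoder = nn.Linear(2, D)                          # f_p: 2-D position in the navigation frame

def nav_point_marker_vector(image_feats, t, position_xy, pose):
    """h_t = f_V({v̂_i^t}) + f_T(t) + f_p(l_t) + f_A(r_t); t = 1 at the navigation start point."""
    return (panoramic_extractor(image_feats)
            + timing_extractor(torch.tensor(t))
            + position_encoder(torch.tensor(position_xy, dtype=torch.float32))
            + action_extractor(torch.tensor(pose, dtype=torch.float32)))

# e.g. h_1 = nav_point_marker_vector(torch.rand(4, D), 1, (0.0, 0.0), (0.0, 1.0, 0.0, 1.0))
```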
S18, selecting a plurality of initial candidate navigation end points from the candidate navigation end points whose distance from the navigation start point is a preset distance.
From the candidate navigation end points obtained in step 201, q candidate navigation end points whose distance from the navigation start point is a preset distance are selected, where q is a preset value and is a positive integer greater than 1.
S19, for each of the selected initial candidate navigation end points, acquiring a position feature vector corresponding to the initial candidate navigation end point, and multiplying the position feature vector corresponding to the initial candidate navigation end point by the overall feature vector corresponding to the navigation instruction to obtain a navigation end point mark vector corresponding to the initial candidate navigation end point.
The position information of each selected initial candidate navigation end point is input into the position encoder to obtain the position feature vector corresponding to each initial candidate navigation end point. For ease of description, the position information of the i-th initial candidate navigation end point is denoted as l_i, where i ∈ {1, 2, …, q}, and q is the number of candidate navigation end points selected in S18.
Then, for each selected initial candidate navigation end point, the position feature vector corresponding to the initial candidate navigation end point is multiplied by the overall feature vector corresponding to the navigation instruction to obtain the navigation end point mark vector corresponding to the initial candidate navigation end point. The formula is as follows:
c_i = f_p(l_i) * x_0
where c_i is the navigation end point mark vector corresponding to the i-th initial candidate navigation end point.
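Continuing the same assumed setup (the position_encoder and the overall feature vector x_0 from the sketches above are illustrative assumptions), the navigation end point mark vectors of S19 could be formed as follows.

```python
def endpoint_marker_vectors(endpoint_positions, x_0):
    """c_i = f_p(l_i) * x_0: element-wise product of each initial candidate
    navigation end point's position feature with the instruction's overall feature vector."""
    return [position_encoder(torch.tensor(l, dtype=torch.float32)) * x_0
            for l in endpoint_positions]
```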
S20, taking the overall feature vector corresponding to the navigation instruction as a global mark vector, and inputting the global mark vector, the language feature vector corresponding to each word in the navigation instruction, the navigation point mark vector corresponding to the navigation start point, the image feature vectors corresponding to the reachable navigation points and the navigation end point mark vectors corresponding to the initial candidate navigation end points into a pre-trained transformer model to obtain an updated global mark vector, an updated navigation point mark vector corresponding to the navigation start point, updated image feature vectors corresponding to the reachable navigation points and a plurality of updated navigation end point mark vectors.
The transformer model is a transformer-based cross-modal structured (Cross-modal Structured Transformer) model. The number of updated navigation end point mark vectors output by the model in S20 is the same as the number of candidate navigation end points selected in S18, and each updated navigation end point mark vector corresponds to one candidate navigation end point.
S21, determining the navigation point to be driven in the reachable navigation points according to the updated image feature vector and the updated global mark vector corresponding to the reachable navigation points.
For each reachable navigation point, the similarity between the updated image feature vector corresponding to the reachable navigation point and the updated global mark vector is calculated, and the reachable navigation point corresponding to the highest similarity is taken as the navigation point to be driven. The similarity may be a cosine distance, a Euclidean distance, or the like.
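A sketch of the similarity-based selection in S21, assuming cosine similarity and PyTorch tensors:

```python
import torch
import torch.nn.functional as F

def pick_nav_point(updated_image_feats, updated_global_vector):
    """Return the index of the reachable navigation point whose updated image
    feature vector is most similar (cosine) to the updated global mark vector."""
    sims = F.cosine_similarity(updated_image_feats, updated_global_vector.unsqueeze(0), dim=1)
    return int(torch.argmax(sims))
```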
S22, determining a predicted navigation end point among the plurality of initial candidate navigation end points according to the updated navigation end point mark vectors respectively corresponding to the plurality of initial candidate navigation end points.
The plurality of output updated navigation end point mark vectors are input into a navigation end point prediction model to obtain the confidence of the initial candidate navigation end point corresponding to each updated navigation end point mark vector, and the initial candidate navigation end point corresponding to the maximum confidence is taken as the predicted navigation end point. The navigation end point prediction model consists of a multi-layer perceptron (Multilayer Perceptron, MLP) and a normalization function.
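An assumed form of the navigation end point prediction model described in S22 (a small MLP followed by a softmax normalization function over the candidate scores) is sketched below; the exact layer sizes are not specified by the application.

```python
import torch
import torch.nn as nn

class EndpointPredictor(nn.Module):
    """Assumed form of the navigation end point prediction model: a small MLP
    scoring each candidate, followed by a softmax normalization function."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, endpoint_marker_vectors):                   # shape (q, dim)
        scores = self.mlp(endpoint_marker_vectors).squeeze(-1)    # shape (q,)
        return torch.softmax(scores, dim=0)                       # confidence per candidate

# predicted = int(torch.argmax(EndpointPredictor()(torch.rand(16, 512))))
```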
The following describes determining a navigation point to be driven and predicting a navigation end point when the agent is at other navigation points except the navigation start point. Referring to fig. 5, the determination process may include the steps of:
S30, acquiring the first updated global marker vector, the first updated navigation point marker vectors corresponding to the historical navigation points other than the previous historical navigation point, and the plurality of first updated navigation end point marker vectors output by the transformer model at the previous historical navigation point.
The navigation points (including the navigation start point) that the agent has already passed in the current navigation may be referred to as historical navigation points; the navigation point where the agent is currently located is not counted as a historical navigation point. Assuming that the current navigation point is the t-th navigation point, the previous historical navigation point is the (t-1)-th navigation point.
For ease of description, the first updated global marker vector output by the transformer model at the previous historical navigation point is denoted as g_{t-1}, and the first updated navigation point marker vectors output there are denoted as ĥ_1^{t-1}, ĥ_2^{t-1}, …, where ĥ_1^{t-1} is the updated navigation point marker vector, output by the transformer model at the (t-1)-th moment (the moment at the current navigation point being the t-th moment and the moment at the previous historical navigation point being the (t-1)-th moment), corresponding to the first navigation point (the navigation start point) of the current navigation, ĥ_2^{t-1} is the updated navigation point marker vector corresponding to the second navigation point of the current navigation output at the (t-1)-th moment, and so on. The first updated navigation end point marker vectors output by the transformer model at the previous historical navigation point are denoted as ĉ_1^{t-1}, ĉ_2^{t-1}, …, ĉ_q^{t-1}, where q is the number of candidate navigation end points selected at the start of the navigation.
S31, obtaining a position feature vector corresponding to the previous historical navigation point.
For ease of description, the position feature vector corresponding to the previous historical navigation point is denoted as f_p(l_{t-1}), where l_{t-1} is the position information of the previous historical navigation point.
S32, obtaining the image feature vector corresponding to the reachable navigation point corresponding to the previous history navigation point.
For ease of description, the image feature vectors corresponding to the reachable navigation points corresponding to the previous historical navigation point are denoted as v̂_i^{t-1}, where i ∈ {1, 2, …, k_{t-1}}, and k_{t-1} is the number of reachable navigation points corresponding to the (t-1)-th moment, that is, the number of reachable navigation points corresponding to the previous historical navigation point.
S33, determining a panoramic view feature vector corresponding to the previous historical navigation point according to the image feature vectors corresponding to the reachable navigation points corresponding to the previous historical navigation point.
The image feature vectors corresponding to the reachable navigation points corresponding to the previous historical navigation point are input into the panoramic view feature extractor to obtain the panoramic view feature vector corresponding to the previous historical navigation point. For ease of description, this panoramic view feature vector is denoted as f_V({v̂_i^{t-1}}).
S34, acquiring a time sequence feature vector of the moment corresponding to the previous historical navigation point.
The moment corresponding to the previous historical navigation point is input into the time sequence feature extractor to obtain the time sequence feature vector of the moment corresponding to the previous historical navigation point. For ease of description, this time sequence feature vector is denoted as f_T(t-1).
S35, acquiring a pose feature vector corresponding to the previous historical navigation point.
The pose information of the agent at the previous historical navigation point is input into the action feature extractor to obtain the corresponding pose feature vector. The pose information of the agent at the previous historical navigation point can be expressed as r_{t-1} = (sin θ_{t-1}, cos θ_{t-1}, sin φ_{t-1}, cos φ_{t-1}), where θ_{t-1} is the horizontal orientation angle of the agent at the (t-1)-th moment (i.e., the orientation of the agent when it travelled to the previous historical navigation point) and φ_{t-1} is the vertical orientation angle of the agent at the (t-1)-th moment. For ease of description, the pose feature vector corresponding to the previous historical navigation point is denoted as f_A(r_{t-1}).
S36, adding the panoramic view feature vector, the time sequence feature vector, the position feature vector and the pose feature vector corresponding to the previous historical navigation point to obtain the navigation point marker vector corresponding to the previous historical navigation point.
The formula may be as follows:
h_{t-1} = f_V({v̂_i^{t-1}}) + f_T(t-1) + f_p(l_{t-1}) + f_A(r_{t-1})
S37, obtaining the image feature vectors corresponding to the reachable navigation points corresponding to the current navigation point.
The specific acquisition method is the same as that of S13 described above and is not repeated here. For ease of description, the image feature vectors corresponding to the reachable navigation points corresponding to the current navigation point are denoted as v̂_i^t, where i ∈ {1, 2, …, k_t}.
s38, determining an adjacent matrix corresponding to the historical navigation point according to the topology information of the historical navigation point, and taking the adjacent matrix as the attention matrix of the first updated navigation point mark vector corresponding to the historical navigation point except the previous historical navigation point and the navigation point mark vector corresponding to the previous historical navigation point.
An undirected graph corresponding to the historical navigation points is generated according to the topology information of the historical navigation points, and the adjacency matrix corresponding to the historical navigation points is generated according to the undirected graph. Assuming that the current navigation point is the fourth navigation point, and the agent travelled from the first navigation point to the second navigation point and then from the second navigation point to the third navigation point, then in the undirected graph corresponding to the historical navigation points there is a connection between the first navigation point and the second navigation point and a connection between the second navigation point and the third navigation point, and the generated adjacency matrix may be as follows:
[ 0 1 0 ]
[ 1 0 1 ]
[ 0 1 0 ]
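The adjacency matrix used as the attention matrix in S38 can be built directly from the travelled path, as in the following sketch (illustrative only; whether self-connections are added is an implementation choice not fixed by the application).

```python
import numpy as np

def adjacency_from_path(num_history_points, edges):
    """Adjacency matrix of the undirected graph over the historical navigation
    points; `edges` lists the pairs of points the agent travelled between."""
    a = np.zeros((num_history_points, num_history_points), dtype=np.float32)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a

# Example from the description: the agent travelled 1 -> 2 -> 3 (0-based indices)
attention_mask = adjacency_from_path(3, [(0, 1), (1, 2)])
```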
S39, inputting the first updated global marker vector, the language feature vector corresponding to each word in the navigation instruction, the first updated navigation point marker vectors, the navigation point marker vector corresponding to the previous historical navigation point, the attention matrix, the image feature vectors corresponding to the reachable navigation points corresponding to the current navigation point and the plurality of first updated navigation end point marker vectors into the transformer model to obtain a second updated global marker vector, second updated navigation point marker vectors corresponding to the historical navigation points, second updated image feature vectors corresponding to the reachable navigation points and a plurality of second updated navigation end point marker vectors.
For ease of description, the second updated global marker vector is denoted as g_t, the second updated navigation point marker vectors as ĥ_1^t, …, ĥ_{t-1}^t, the second updated image feature vectors as v̂'_1^t, …, v̂'_{k_t}^t, and the second updated navigation end point marker vectors as ĉ_1^t, …, ĉ_q^t.
S40, determining the navigation points to be driven in the reachable navigation points and the historical navigation points according to the second updated global marker vector, the second updated navigation point marker vector and the second updated image feature vector.
The similarity between each second updated navigation point marker vector and the second updated global marker vector is calculated, and the similarity between each second updated image feature vector and the second updated global marker vector is calculated. The navigation point corresponding to the maximum similarity (which may be a reachable navigation point or a historical navigation point) is determined as the navigation point to be driven.
S41, determining the update candidate navigation end points respectively corresponding to the plurality of second updated navigation end point marker vectors.
For each output second updated navigation end point marker vector ĉ_i^t, the updated position feature vector corresponding to the corresponding update candidate navigation end point is determined according to the second updated navigation end point marker vector and the overall feature vector corresponding to the navigation instruction. The formula may be as follows:
ĉ_i^t = f_p(l'_i) * x_0
where f_p(l'_i) is the updated position feature vector corresponding to the i-th update candidate navigation end point and ĉ_i^t is the navigation end point marker vector corresponding to the i-th update candidate navigation end point; since ĉ_i^t and x_0 are both known, the updated position feature vector f_p(l'_i) can be obtained.
Then, each determined updated position feature vector is input into a position decoder to obtain the position information of each update candidate navigation end point.
The position decoder corresponds to the position encoder: inputting position information into the position encoder yields the corresponding position feature vector, and inputting a position feature vector into the position decoder yields the corresponding position information.
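A sketch of S41 under stated assumptions: the updated position feature vector is recovered by inverting c_i = f_p(l_i) * x_0 (the element-wise division below is an assumed inversion, not stated in the application), and an assumed MLP position decoder maps it back to position information.

```python
import torch
import torch.nn as nn

# Assumed decoder paired with the position encoder; only its existence is stated.
position_decoder = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))

def update_candidate_endpoint(updated_endpoint_marker, x_0, eps=1e-8):
    """Recover an update candidate navigation end point: solve the updated position
    feature vector from the marker vector and x_0 (element-wise division is an
    assumed inversion), then decode it into position information."""
    position_feature = updated_endpoint_marker / (x_0 + eps)
    return position_decoder(position_feature)      # (x, y) position information
```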
S42, determining a predicted navigation end point among the plurality of update candidate navigation end points according to the plurality of second updated navigation end point marker vectors.
The specific process of step S42 is the same as the specific process of step S22, and will not be described here.
Step 205, controlling the agent to travel to the navigation point to be driven.
In implementation, after the navigation point to be driven and the predicted navigation endpoint are determined, the agent travels to the navigation point to be driven. If the navigation point to be driven and the predicted navigation endpoint satisfy a short-distance condition, the agent stops after reaching the navigation point to be driven. The short-distance condition may be that the distance between the navigation point to be driven and the predicted navigation endpoint is smaller than a preset value; that is, when this distance is smaller than the preset value, the two points may be approximately regarded as the same point.
The preset value can be configured according to the required navigation accuracy. For example, if high navigation accuracy is required, the preset value may be set to a small value, such as 5 cm.
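A trivial sketch of the short-distance condition, assuming two-dimensional positions in metres and the 5 cm example above:

```python
# Hypothetical sketch: stop once the navigation point to be driven is within a
# preset distance of the predicted navigation endpoint. Values are illustrative.
import math

PRESET_DISTANCE = 0.05  # metres; e.g. 5 cm for high navigation accuracy

def should_stop(point_to_drive, predicted_endpoint, preset=PRESET_DISTANCE):
    dx = point_to_drive[0] - predicted_endpoint[0]
    dy = point_to_drive[1] - predicted_endpoint[1]
    return math.hypot(dx, dy) < preset

print(should_stop((1.00, 2.00), (1.02, 2.01)))  # True: points treated as the same point
```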
Based on the same technical concept, an embodiment of the present application further provides an apparatus for navigating an agent. As shown in Fig. 6, the apparatus includes a determining module 610, a prediction module 620 and a navigation module 630, where:
a determining module 610, configured to determine a first reachable navigation point corresponding to a current navigation point, and obtain a first image corresponding to the first reachable navigation point;
the prediction module 620 is configured to determine a predicted navigation endpoint according to a navigation instruction of the current navigation, position information corresponding to a historical navigation point, gesture information of an agent at the historical navigation point, topology information of the historical navigation point, and the first image, and determine a navigation point to be driven in candidate navigation points, where the candidate navigation points include the historical navigation point and the first reachable navigation point;
and the navigation module 630 is used for controlling the intelligent body to travel to the navigation point to be traveled.
In one possible implementation, the prediction module 620 is further configured to:
determining a second reachable navigation point corresponding to a navigation starting point, acquiring a second image corresponding to the second reachable navigation point, and determining an image feature vector corresponding to the second reachable navigation point according to the second image;
Determining a language feature vector corresponding to each word in the navigation instruction and an overall feature vector corresponding to the navigation instruction;
determining a navigation point marking vector corresponding to the navigation starting point according to the image feature vector corresponding to the second reachable navigation point, the position information corresponding to the navigation starting point, the gesture information corresponding to the navigation starting point of the intelligent body and the time sequence information corresponding to the navigation starting point;
selecting a plurality of initial candidate navigation terminals from candidate navigation terminals with the distance between the navigation terminals and the navigation start point being a preset distance, and determining navigation terminal mark vectors corresponding to the initial candidate navigation terminals respectively;
and determining a navigation point to be driven in the second reachable navigation point, and determining a predicted navigation terminal point in the plurality of initial candidate navigation terminals according to the image feature vector corresponding to the second reachable navigation point, the language feature vector, the integral feature vector, the navigation point mark vector corresponding to the navigation starting point and the navigation terminal point mark vector corresponding to each initial candidate navigation terminal point.
In one possible implementation, the prediction module 620 is configured to:
Taking the integral feature vector as a global mark vector, inputting the language feature vector, the global mark vector, a navigation point mark vector corresponding to the navigation starting point, an image feature vector corresponding to the second reachable navigation point and a navigation end point mark vector corresponding to each initial candidate navigation end point into a transformer model to obtain a first updated global mark vector, a first updated navigation point mark vector corresponding to the navigation starting point, a first updated image feature vector corresponding to the second reachable navigation point and a first updated navigation end point mark vector corresponding to each of the plurality of initial candidate navigation end points;
determining a navigation point to be driven in the second reachable navigation point according to the first updated image feature vector and the first updated global mark vector;
and determining a predicted navigation endpoint from the plurality of initial candidate navigation endpoints according to the first updated navigation endpoint marker vector.
In one possible implementation, the prediction module 620 is configured to:
acquiring a second updated global marker vector obtained from a previous historical navigation point through the transformer model, a second updated navigation point marker vector corresponding to a historical navigation point except the previous historical navigation point and a plurality of second updated navigation terminal marker vectors;
Determining a navigation point marking vector corresponding to the previous navigation point according to the image characteristic vector corresponding to the third reachable navigation point corresponding to the previous navigation point, the position information corresponding to the previous navigation point, the gesture information of the intelligent agent at the previous navigation point and the time sequence information corresponding to the previous navigation point;
determining an image feature vector corresponding to the first reachable navigation point according to the first image;
and determining a navigation point to be driven in the candidate navigation points according to the second updated global mark vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point mark vector, the navigation point mark vector corresponding to the previous historical navigation point, the topology information of the historical navigation point, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point mark vectors, and determining a predicted navigation end point.
In one possible implementation, the prediction module 620 is configured to:
determining a position feature vector corresponding to the first reachable navigation point according to the position information corresponding to the first reachable navigation point;
Inputting the first image into an image feature extractor to obtain an initial feature vector of the first image;
and adding the initial feature vector and the position feature vector corresponding to the first reachable navigation point to obtain the image feature vector corresponding to the first reachable navigation point, as illustrated in the sketch below.
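As a sketch only, with a torchvision ResNet-18 standing in for the image feature extractor and an assumed MLP position encoder (neither is prescribed by the application):

```python
# Hypothetical sketch: the image feature vector of a reachable navigation point is
# the extractor output plus the position feature of that point (element-wise sum).
# The ResNet-18 backbone, projection layer and position encoder are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 256)   # project to the shared feature dimension

position_encoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 256))

def reachable_point_feature(image, position):
    initial = backbone(image.unsqueeze(0)).squeeze(0)    # initial feature vector of the first image
    pos_feat = position_encoder(position)                # position feature vector of the point
    return initial + pos_feat                            # image feature vector used by the model

feature = reachable_point_feature(torch.randn(3, 224, 224), torch.tensor([1.0, 2.0]))
```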
In one possible implementation, the prediction module 620 is configured to:
determining an adjacency matrix corresponding to the historical navigation points according to the topology information of the historical navigation points, and taking the adjacency matrix as the attention matrix for the second updated navigation point marker vectors and the navigation point marker vector corresponding to the previous historical navigation point (a sketch of this masking is given after this implementation);
inputting the second updated global marker vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point marker vector, the navigation point marker vector corresponding to the previous history navigation point, the attention matrix, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point marker vectors into the transformer model to obtain a third updated global marker vector, a third updated navigation point marker vector corresponding to each history navigation point, a second updated image feature vector corresponding to the first reachable navigation point and a plurality of third updated navigation end point marker vectors;
Determining a navigation point to be driven in the candidate navigation points according to the third updated global marker vector, the third updated navigation point marker vector and the second updated image feature vector;
determining update candidate navigation terminals respectively corresponding to the plurality of third update navigation terminal mark vectors;
and determining a predicted navigation terminal point in the update candidate navigation terminal points respectively corresponding to the third updated navigation terminal point mark vectors according to the third updated navigation terminal point mark vectors.
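The use of the adjacency matrix as an attention matrix can be illustrated as follows. This sketch assumes a standard PyTorch transformer encoder with an additive attention mask and, for simplicity, masks only the navigation point tokens; none of these choices are stated in the application.

```python
# Hypothetical sketch: the adjacency matrix of the historical navigation points is
# converted into an additive attention mask, so that navigation point tokens attend
# only to themselves and to points connected in the undirected graph.
import torch
import torch.nn as nn

# Example topology from the method description: edges 1-2 and 2-3, point 4 isolated.
adjacency = torch.tensor([[0, 1, 0, 0],
                          [1, 0, 1, 0],
                          [0, 1, 0, 0],
                          [0, 0, 0, 0]], dtype=torch.float32)
allowed = adjacency.bool() | torch.eye(adjacency.size(0)).bool()   # always attend to self
attn_mask = torch.where(allowed,
                        torch.zeros_like(adjacency),
                        torch.full_like(adjacency, float("-inf")))

encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
nav_point_tokens = torch.randn(1, 4, 256)                  # navigation point marker vectors
updated_tokens = encoder(nav_point_tokens, mask=attn_mask)  # attention restricted by topology
```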
In one possible implementation, the prediction module 620 is configured to:
calculating the similarity of the second updated image feature vector corresponding to each first reachable navigation point and the third updated global marker vector, and taking the similarity as the confidence coefficient corresponding to each first reachable navigation point;
calculating the similarity of a third updated navigation point mark vector corresponding to each historical navigation point and the third updated global mark vector, and taking the similarity as the confidence coefficient corresponding to each historical navigation point;
and determining a navigation point corresponding to the maximum confidence coefficient from the first reachable navigation point and the historical navigation point as a navigation point to be driven.
In one possible implementation, the prediction module 620 is configured to:
Inputting the plurality of third updated navigation terminal point marking vectors into a navigation terminal point prediction model to obtain the confidence coefficient of the updated candidate navigation terminal points respectively corresponding to the plurality of third updated navigation terminal point marking vectors;
and taking the updated candidate navigation terminal corresponding to the maximum confidence as a predicted navigation terminal.
In one possible implementation, the navigation endpoint prediction model consists of a multi-layer perceptron and a normalization function.
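A minimal sketch of such a navigation endpoint prediction model, with softmax as the normalization function and illustrative layer sizes (both assumptions for illustration), could be:

```python
# Hypothetical sketch: an MLP scores each updated navigation endpoint marker vector
# and a softmax normalizes the scores into confidences; the highest-confidence
# candidate is taken as the predicted navigation endpoint.
import torch
import torch.nn as nn

class EndpointPredictor(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, endpoint_marker_vectors):            # (num_candidates, d_model)
        scores = self.mlp(endpoint_marker_vectors).squeeze(-1)
        return torch.softmax(scores, dim=0)                 # confidence per candidate

predictor = EndpointPredictor()
confidences = predictor(torch.randn(5, 256))
predicted_endpoint_index = int(confidences.argmax())
```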
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in the method embodiments, and will not be described again here.
It should be noted that: in the device for navigating an agent provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the device for navigating an agent provided in the above embodiment and the method embodiment for navigating an agent belong to the same concept, and the specific implementation process of the device is detailed in the method embodiment, which is not described herein again.
Fig. 7 shows a block diagram of an agent 500 according to an exemplary embodiment of the present application. The agent 500 may be an unmanned aerial vehicle, an unmanned delivery vehicle, or the like.
Generally, the agent 500 includes: a processor 501 and a memory 502.
Processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 501 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) or PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing the content that needs to be displayed on the display screen. In some embodiments, the processor 501 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the method for navigating an agent provided by the method embodiments herein.
In some embodiments, the agent 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502, and peripheral interface 503 may be connected by buses or signal lines. The individual peripheral devices may be connected to the peripheral device interface 503 by buses, signal lines or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, a display 505, a camera assembly 506, audio circuitry 507, a positioning assembly 508, and a power supply 509.
Peripheral interface 503 may be used to connect at least one Input/Output (I/O) related peripheral to processor 501 and memory 502. In some embodiments, processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 501, memory 502, and peripheral interface 503 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 504 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuitry 504 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 504 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuitry 504 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity ) networks. In some embodiments, the radio frequency circuitry 504 may also include NFC (Near Field Communication ) related circuitry, which is not limited in this application.
The display 505 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 505 is a touch display, the display 505 also has the ability to collect touch signals at or above the surface of the display 505. The touch signal may be input as a control signal to the processor 501 for processing. At this time, the display 505 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 505 may be one, and disposed on the front panel of the smart body 500; in other embodiments, the display screen 505 may be at least two, and disposed on different surfaces of the smart body 500 or in a folded design; in other embodiments, the display 505 may be a flexible display disposed on an exterior surface of the agent 500. Even more, the display 505 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 505 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 506 is used to capture images or video. Optionally, the camera assembly 506 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments, camera assembly 506 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and can be used for light compensation under different color temperatures.
The audio circuitry 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 501 for processing, or inputting the electric signals to the radio frequency circuit 504 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different locations of the intelligent agent 500. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuitry 507 may also include a headphone jack.
The location component 508 is used to locate the current geographic location of the agent 500 to enable navigation or LBS (Location Based Service, location-based services). The positioning component 508 may be a GPS (Global Positioning System ), beidou system or galileo system based positioning component.
The power supply 509 is used to power the various components in the agent 500. The power supply 509 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 509 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, agent 500 further includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: an acceleration sensor 511, a gyro sensor 512, and a proximity sensor 513.
The acceleration sensor 511 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the agent 500. For example, the acceleration sensor 511 may be used to detect components of gravitational acceleration on three coordinate axes. The processor 501 may control the display 505 to display a user interface in a landscape view or a portrait view according to a gravitational acceleration signal acquired by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 512 may detect the body direction and the rotation angle of the smart body 500, and the gyro sensor 512 may collect the 3D motion of the user to the smart body 500 in cooperation with the acceleration sensor 511. The processor 501 may implement the following functions based on the data collected by the gyro sensor 512: motion sensing, image stabilization at shooting, and inertial navigation.
A proximity sensor 513, also referred to as a distance sensor, is typically provided on the front panel of the agent 500. The proximity sensor 513 is used to collect the distance between the external object and the front surface of the smart body 500.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is not limiting of the agent 500 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an exemplary embodiment, a computer readable storage medium, such as a memory comprising instructions executable by a processor in an agent to perform the method of smart navigation of the above embodiments is also provided. The computer readable storage medium may be non-transitory. For example, the computer readable storage medium may be Read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), compact disc Read-Only Memory (Compact Disc Read-Only Memory, CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the present application; any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.
It should be noted that, the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.), and signals (including but not limited to signals transmitted between the user terminal and other devices, etc.) referred to in this application are all authorized by the user or are fully authorized by the parties, and the collection, use, and processing of relevant data is required to comply with relevant laws and regulations and standards of relevant countries and regions. For example, the images, location information, etc. referred to in this application are acquired with sufficient authorization.

Claims (10)

1. A method of agent navigation, the method comprising:
determining a first reachable navigation point corresponding to a current navigation point, and acquiring a first image corresponding to the first reachable navigation point;
determining a predicted navigation end point according to a navigation instruction of the navigation, position information corresponding to a history navigation point, gesture information of an agent at the history navigation point, topology information of the history navigation point and the first image, and determining a navigation point to be driven in candidate navigation points, wherein the candidate navigation points comprise the history navigation point and the first reachable navigation point;
and controlling the intelligent body to drive to the navigation point to be driven.
2. The method according to claim 1, wherein the method further comprises:
determining a second reachable navigation point corresponding to a navigation starting point, acquiring a second image corresponding to the second reachable navigation point, and determining an image feature vector corresponding to the second reachable navigation point according to the second image;
determining a language feature vector corresponding to each word in the navigation instruction and an overall feature vector corresponding to the navigation instruction;
determining a navigation point marking vector corresponding to the navigation starting point according to the image feature vector corresponding to the second reachable navigation point, the position information corresponding to the navigation starting point, the gesture information corresponding to the navigation starting point of the intelligent body and the time sequence information corresponding to the navigation starting point;
Selecting a plurality of initial candidate navigation terminals from candidate navigation terminals with the distance between the navigation terminals and the navigation start point being a preset distance, and determining navigation terminal mark vectors corresponding to the initial candidate navigation terminals respectively;
and determining a navigation point to be driven in the second reachable navigation point, and determining a predicted navigation terminal point in the plurality of initial candidate navigation terminals according to the image feature vector corresponding to the second reachable navigation point, the language feature vector, the integral feature vector, the navigation point mark vector corresponding to the navigation starting point and the navigation terminal point mark vector corresponding to each initial candidate navigation terminal point.
3. The method of claim 2, wherein the determining a navigation point to be driven in the second reachable navigation point, and determining a predicted navigation end point in the plurality of initial candidate navigation end points according to the image feature vector corresponding to the second reachable navigation point, the language feature vector, the integral feature vector, the navigation point mark vector corresponding to the navigation start point and the navigation end point mark vector corresponding to each initial candidate navigation end point comprises:
the overall feature vector is used as a global mark vector, the language feature vector, the global mark vector, a navigation point mark vector corresponding to the navigation starting point, an image feature vector corresponding to the second reachable navigation point and a navigation end point mark vector corresponding to each initial candidate navigation end point are input into a transformer model, and a first updated global mark vector, a first updated navigation point mark vector corresponding to the navigation starting point, a first updated image feature vector corresponding to the second reachable navigation point and a first updated navigation end point mark vector corresponding to each initial candidate navigation end point are obtained;
Determining a navigation point to be driven in the second reachable navigation point according to the first updated image feature vector and the first updated global mark vector;
and determining a predicted navigation endpoint from the plurality of initial candidate navigation endpoints according to the first updated navigation endpoint marker vector.
4. The method according to claim 3, wherein determining a predicted navigation end point according to the navigation instruction of the present navigation, the position information corresponding to the history navigation point, the posture information of the agent at the history navigation point, the topology information of the history navigation point, and the first image, and determining a navigation point to be driven among candidate navigation points includes:
acquiring a second updated global marker vector obtained from a previous historical navigation point through the transformer model, a second updated navigation point marker vector corresponding to a historical navigation point except the previous historical navigation point and a plurality of second updated navigation terminal marker vectors;
determining a navigation point marking vector corresponding to the previous navigation point according to the image characteristic vector corresponding to the third reachable navigation point corresponding to the previous navigation point, the position information corresponding to the previous navigation point, the gesture information of the intelligent agent at the previous navigation point and the time sequence information corresponding to the previous navigation point;
Determining an image feature vector corresponding to the first reachable navigation point according to the first image;
and determining a navigation point to be driven in the candidate navigation points according to the second updated global mark vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point mark vector, the navigation point mark vector corresponding to the previous historical navigation point, the topology information of the historical navigation point, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point mark vectors, and determining a predicted navigation end point.
5. The method of claim 4, wherein determining the image feature vector corresponding to the first reachable navigation point according to the first image comprises:
determining a position feature vector corresponding to the first reachable navigation point according to the position information corresponding to the first reachable navigation point;
inputting the first image into an image feature extractor to obtain an initial feature vector of the first image;
and adding the initial feature vector and the position feature vector corresponding to the first reachable navigation point to obtain the image feature vector corresponding to the first reachable navigation point.
6. The method of claim 4, wherein the determining a navigation point to be driven among candidate navigation points and determining a predicted navigation endpoint according to the second updated global marker vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point marker vector, the navigation point marker vector corresponding to the previous history navigation point, the topology information of the history navigation point, the image feature vector corresponding to the first reachable navigation point, and the plurality of second updated navigation endpoint marker vectors comprises:
determining an adjacency matrix corresponding to the historical navigation point according to the topology information of the historical navigation point, and taking the adjacency matrix as the attention matrix for the second updated navigation point mark vector and the navigation point mark vector corresponding to the previous historical navigation point;
inputting the second updated global marker vector, the language feature vector corresponding to each word in the navigation instruction, the second updated navigation point marker vector, the navigation point marker vector corresponding to the previous history navigation point, the attention matrix, the image feature vector corresponding to the first reachable navigation point and the plurality of second updated navigation end point marker vectors into the transformer model to obtain a third updated global marker vector, a third updated navigation point marker vector corresponding to each history navigation point, a second updated image feature vector corresponding to the first reachable navigation point and a plurality of third updated navigation end point marker vectors;
Determining a navigation point to be driven in the candidate navigation points according to the third updated global marker vector, the third updated navigation point marker vector and the second updated image feature vector;
determining update candidate navigation terminals respectively corresponding to the plurality of third update navigation terminal mark vectors;
and determining a predicted navigation terminal point in the update candidate navigation terminal points respectively corresponding to the third updated navigation terminal point mark vectors according to the third updated navigation terminal point mark vectors.
7. The method of claim 6, wherein the determining a navigation point to be driven among candidate navigation points based on the third updated global marker vector, the third updated navigation point marker vector, and the second updated image feature vector comprises:
calculating the similarity of the second updated image feature vector corresponding to each first reachable navigation point and the third updated global marker vector, and taking the similarity as the confidence coefficient corresponding to each first reachable navigation point;
calculating the similarity of a third updated navigation point mark vector corresponding to each historical navigation point and the third updated global mark vector, and taking the similarity as the confidence coefficient corresponding to each historical navigation point;
And determining a navigation point corresponding to the maximum confidence coefficient from the first reachable navigation point and the historical navigation point as a navigation point to be driven.
8. The method according to claim 6 or 7, wherein determining a predicted navigation end point from the plurality of third updated navigation end point marker vectors among the updated candidate navigation end points respectively corresponding to the plurality of third updated navigation end point marker vectors comprises:
inputting the plurality of third updated navigation terminal point marking vectors into a navigation terminal point prediction model to obtain the confidence coefficient of the updated candidate navigation terminal points respectively corresponding to the plurality of third updated navigation terminal point marking vectors;
and taking the updated candidate navigation terminal corresponding to the maximum confidence as a predicted navigation terminal.
9. The method of claim 8, wherein the navigation endpoint prediction model consists of a multi-layer perceptron MLP and a normalization function.
10. An agent comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement the method of agent navigation of any of claims 1-9.
CN202211035796.5A 2022-08-26 2022-08-26 Method for navigating intelligent agent and intelligent agent Pending CN117685968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211035796.5A CN117685968A (en) 2022-08-26 2022-08-26 Method for navigating intelligent agent and intelligent agent

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211035796.5A CN117685968A (en) 2022-08-26 2022-08-26 Method for navigating intelligent agent and intelligent agent

Publications (1)

Publication Number Publication Date
CN117685968A true CN117685968A (en) 2024-03-12

Family

ID=90126932

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211035796.5A Pending CN117685968A (en) 2022-08-26 2022-08-26 Method for navigating intelligent agent and intelligent agent

Country Status (1)

Country Link
CN (1) CN117685968A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117899487A (en) * 2024-03-15 2024-04-19 腾讯科技(深圳)有限公司 Data processing method, device, equipment, storage medium and program product
CN117899487B (en) * 2024-03-15 2024-05-31 腾讯科技(深圳)有限公司 Data processing method, device, equipment, storage medium and program product


Similar Documents

Publication Publication Date Title
US8914232B2 (en) Systems, apparatus and methods for delivery of location-oriented information
CN110463165B (en) Information processing apparatus, information processing method, and recording medium
CN103471580A (en) Method for providing navigation information, mobile terminal, and server
EP4113452A1 (en) Data sharing method and device
WO2021103841A1 (en) Control vehicle
KR101726227B1 (en) Method for providing location based service using augmented reality and terminal thereof
CN113532444B (en) Navigation path processing method and device, electronic equipment and storage medium
JP7400882B2 (en) Information processing device, mobile object, remote control system, information processing method and program
KR101413605B1 (en) System and method for Navigation
CN111176338B (en) Navigation method, electronic device and storage medium
CN111486816A (en) Altitude measurement method and electronic device
CN117685968A (en) Method for navigating intelligent agent and intelligent agent
CN117128959A (en) Car searching navigation method, electronic equipment, server and system
CN112365088B (en) Method, device and equipment for determining travel key points and readable storage medium
CN112734346B (en) Method, device and equipment for determining lane coverage and readable storage medium
CN114764295B (en) Stereoscopic scene switching method, stereoscopic scene switching device, terminal and storage medium
CN111179628B (en) Positioning method and device for automatic driving vehicle, electronic equipment and storage medium
CN113936064A (en) Positioning method and device
CN111984755A (en) Method and device for determining target parking point, electronic equipment and storage medium
CN104023130B (en) Position reminding method and apparatus
Singh et al. Navigation system for blind people using GPS & GSM Techniques
CN114079855A (en) Low-power-consumption positioning method and related device
Rao et al. Combining schematic and augmented reality representations in a remote spatial assistance system
KR20200004135A (en) Method for providing model house virtual image based on augmented reality
CN113359851B (en) Method, device, equipment and storage medium for controlling navigation of aircraft

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination