CN108537820B - Dynamic prediction method, system and applicable equipment - Google Patents

Info

Publication number
CN108537820B
Authority
CN
China
Prior art keywords
predicted
mask matrix
reference object
object mask
frame image
Prior art date
Legal status
Active
Application number
CN201810348528.6A
Other languages
Chinese (zh)
Other versions
CN108537820A (en)
Inventor
张崇洁
朱广翔
Current Assignee
Turing Artificial Intelligence Research Institute (Nanjing) Co., Ltd.
Original Assignee
Tuling Artificial Intelligence Institute Nanjing Co ltd
Priority date
Filing date
Publication date
Application filed by Tuling Artificial Intelligence Institute Nanjing Co ltd
Priority to CN201810348528.6A
Publication of CN108537820A
Application granted
Publication of CN108537820B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/215: Motion-based segmentation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The application provides a dynamic prediction method, a dynamic prediction system, and equipment to which they are applicable. The dynamic prediction method comprises the following steps: acquiring a current frame image; determining, based on the current frame image, an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object; and predicting the motion of the object to be predicted based on the relationship between the object mask matrix to be predicted and the reference object mask matrix and a preset behavior. Because the acquired current frame image is divided into the object to be predicted and the reference object, and the motion of the object to be predicted is predicted from the relationship between these mask matrices together with the preset behavior, the generalization capability of dynamic prediction is improved; and because the objects are represented by mask matrices, the prediction process is interpretable.

Description

Dynamic prediction method, system and applicable equipment
Technical Field
The present application relates to the field of image analysis technologies, and in particular, to a dynamic prediction method, a dynamic prediction system, and an apparatus suitable for the dynamic prediction method.
Background
In recent years, owing to the spread of big data and improvements in computing power, deep reinforcement learning, which combines reinforcement learning and deep learning, has developed rapidly. In practical applications, generalization and interpretability are the major challenges facing deep reinforcement learning. Generalization refers to a model's adaptation to new samples, i.e. the performance of the trained model on unseen data. Interpretability refers to the ability to explain how a problem is solved, as distinguished from a "black box". Generally, the key to solving both problems is to learn the effects of an agent's behavior at the object level, where an agent is an entity with the capability to act, such as a robot or an unmanned vehicle.
However, for learning the effects of an agent's behavior, the behavior-conditioned dynamics prediction generally adopted in the prior art focuses on pixel-level motion and predicts the effect of a behavior directly, which limits the interpretability and generalization capability of the learned dynamics.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present application aims to provide a dynamic prediction method, system and applicable device, so as to solve the problems of low generalization capability and lack of interpretability in dynamic prediction in the prior art.
To achieve the above and other related objects, a first aspect of the present application provides a dynamic prediction method, comprising the steps of: acquiring a current frame image; determining an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image; and predicting the motion of the object to be predicted based on the relation between the object mask matrix to be predicted and the reference object mask matrix and preset behaviors.
In certain embodiments of the first aspect of the present application, the object mask matrix to be predicted is determined based on the object to be predicted, and the reference object mask matrix is determined based on the association relationship between the reference object and the motion of the object to be predicted or the type of the reference object.
In certain embodiments of the first aspect of the present application, the determining, based on the current frame image, an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object, includes: and determining an object mask matrix to be predicted comprising an object to be predicted and a reference object mask matrix comprising a reference object based on the current frame image by using a pre-trained first convolutional neural network.
In certain embodiments of the first aspect of the present application, the predicting the motion of the object to be predicted based on the relation between the object mask matrix to be predicted and the reference object mask matrix and a preset behavior comprises: cutting the reference object mask matrix by taking the position of the object to be predicted in the object mask matrix to be predicted as a center according to the size of a preset view window to obtain a cut reference object mask matrix; determining the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a pre-trained second convolutional neural network; predicting the motion of the object to be predicted based on a preset behavior and the determined effect.
In certain embodiments of the first aspect of the present application, the dynamic prediction method further comprises the steps of: adding position information to the obtained clipped reference object mask matrix and determining the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a second convolutional neural network trained in advance.
In certain embodiments of the first aspect of the present application, the effect further comprises a preset effect of the object to be predicted itself.
In certain embodiments of the first aspect of the present application, the dynamic prediction method further comprises the steps of: extracting a time-invariant background from the current frame image; and combining the extracted time-invariant background and the predicted motion of the object to be predicted to obtain a next frame image.
In certain embodiments of the first aspect of the present application, the dynamic prediction method further comprises the steps of: extracting a time-invariant background from the current frame image based on a pre-trained third convolutional neural network; and combining the extracted time-invariant background and the predicted motion of the object to be predicted to obtain a next frame image.
In certain embodiments of the first aspect of the present application, the third convolutional neural network is arranged as a convolutional deconvolution structure.
In certain embodiments of the first aspect of the present application, the first convolutional neural network, the second convolutional neural network, and the third convolutional neural network are jointly trained according to a loss function.
In certain embodiments of the first aspect of the present application, the current frame image is obtained based on raw data or external input data with a priori knowledge.
The second aspect of the present application also provides a dynamic prediction system, including: the acquisition unit is used for acquiring a current frame image; an object detection unit for determining an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image; and the prediction unit is used for predicting the motion of the object to be predicted based on the relation between the mask matrix of the object to be predicted and the mask matrix of the reference object and preset behaviors.
In certain embodiments of the second aspect of the present application, the object mask matrix to be predicted is determined based on the object to be predicted, and the reference object mask matrix is determined based on the association relationship between the reference object and the motion of the object to be predicted or the type of the reference object.
In certain embodiments of the second aspect of the present application, the object detection unit is configured to determine an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image using a pre-trained first convolutional neural network.
In certain embodiments of the second aspect of the present application, the prediction unit comprises: a clipping module, configured to clip the reference object mask matrix according to a preset view window size, centered on the position of the object to be predicted in the object mask matrix to be predicted, so as to obtain a clipped reference object mask matrix; an effect determination module, configured to determine the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a pre-trained second convolutional neural network; and a prediction module, configured to predict the motion of the object to be predicted based on a preset behavior and the determined effect.
In certain embodiments of the second aspect of the present application, the effect determination module is configured to add position information to the obtained clipped reference object mask matrix and to determine the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on the pre-trained second convolutional neural network.
In certain embodiments of the second aspect of the present application, the effect further comprises a preset effect of the object to be predicted itself.
In certain embodiments of the second aspect of the present application, the dynamic prediction system further comprises: an extraction unit, configured to extract a time-invariant background from the current frame image; the prediction unit is further configured to obtain a next frame image in combination with the extracted time-invariant background and the predicted motion of the object to be predicted.
In certain embodiments of the second aspect of the present application, the extraction unit is configured to extract a time-invariant background from the current frame image based on a pre-trained third convolutional neural network; the prediction unit is further configured to obtain a next frame image in combination with the extracted time-invariant background and the predicted motion of the object to be predicted.
In certain embodiments of the second aspect of the present application, the third convolutional neural network is arranged as a convolutional deconvolution structure.
In certain embodiments of the second aspect of the present application, the first convolutional neural network, the second convolutional neural network, and the third convolutional neural network are jointly trained according to a loss function.
In certain embodiments of the second aspect of the present application, the current frame image is obtained based on raw data or external input data with a priori knowledge.
The third aspect of the present application also provides a computer-readable storage medium storing at least one program which, when executed, implements the dynamic prediction method described in any one of the above.
The fourth aspect of the present application also provides an apparatus comprising: storage means for storing at least one program; processing means, coupled to the storage means, for invoking the at least one program to perform the dynamic prediction method of any of claims 1-11.
In certain embodiments of the fourth aspect of the present application, the apparatus further comprises a display device for displaying at least one of the object mask matrix to be predicted, the reference object mask matrix, and the predicted motion data of the object to be predicted.
In certain embodiments of the fourth aspect of the present application, the processing device is further configured to generate an object mask image to be predicted and a reference object mask image based on the current frame image, the object mask matrix to be predicted and the reference object mask matrix; the display device is further configured to display the object mask image to be predicted and/or the reference object mask image.
In certain embodiments of the fourth aspect of the present application, the processing device is further configured to generate an object mask image based on the object mask image to be predicted and the reference object mask image; the display device is further configured to display the object mask image.
As described above, the dynamic prediction method, system, and applicable equipment of the present application have the following beneficial effects: because the acquired current frame image is divided into the object to be predicted and the reference object, and the motion of the object to be predicted is predicted from the relationship between the object mask matrix to be predicted and the reference object mask matrix together with a preset behavior, the generalization capability of dynamic prediction is improved; and because the objects are represented by mask matrices, the prediction process is interpretable.
Drawings
FIG. 1 is a schematic diagram of an exemplary scenario in which the dynamic prediction method of the present application is applied, in one embodiment.
FIG. 2 is a flow chart of a dynamic prediction method according to an embodiment of the present application.
FIG. 3 is a schematic diagram of a convolutional neural network used in the dynamic prediction method of the present application in one embodiment.
FIG. 4 is a flowchart illustrating the step S150 of the dynamic prediction method according to an embodiment of the present invention.
FIG. 5 is a flow chart of a dynamic prediction method according to another embodiment of the present application.
Fig. 6 is a schematic structural diagram of a third convolutional neural network used in the dynamic prediction method of the present application in another embodiment.
FIG. 7 is a schematic diagram of a dynamic prediction system according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of a prediction unit in the dynamic prediction system according to an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of a prediction unit in the dynamic prediction system according to another embodiment of the present application.
Fig. 10 is a schematic structural diagram of a dynamic prediction system according to another embodiment of the present application.
Fig. 11 is a schematic structural diagram of a dynamic prediction system according to still another embodiment of the present application.
FIG. 12 is a schematic diagram of the apparatus of the present application in one embodiment.
Fig. 13 shows a schematic structural view of the apparatus of the present application in another embodiment.
Detailed Description
The following description of the embodiments of the present application is provided for illustrative purposes, and other advantages and capabilities of the present application will become apparent to those skilled in the art from the present disclosure.
In the following description, reference is made to the accompanying drawings that describe several embodiments of the application. It is to be understood that other embodiments may be utilized and that compositional and operational changes may be made without departing from the spirit and scope of the present disclosure. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the patent of the present application. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
Although the terms first, second, etc. may be used herein to describe various elements in some instances, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, the first object to be predicted may be referred to as the second object to be predicted, and similarly, the second object to be predicted may be referred to as the first object to be predicted, without departing from the scope of the various described embodiments. The first object to be predicted and the second object to be predicted are both describing one object to be predicted, but they are not the same object to be predicted unless the context clearly indicates otherwise. Similar situations also include a first set of reference objects and a second set of reference objects, etc.
Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.
Deep reinforcement learning is end-to-end learning that combines deep learning and reinforcement learning to map perception to action, and it has the potential to enable an agent to learn fully autonomously. In practical applications, generalization and interpretability are the major challenges it faces, and the key to addressing these challenges is to learn the effects of an agent's behavior at the object level. The prior art has made significant advances in behavior-conditioned dynamics prediction for learning the effects of agent behavior, but it still suffers from some limitations. First, the behavior-conditioned dynamics prediction adopted in the prior art focuses only on pixel-level motion and does not follow an object-oriented paradigm; it ignores the fact that objects are basic prototypes of physical dynamics and always move as a whole, so the predicted moving objects generally have blurred contours and textures. Second, the prior art conditions its prediction directly on the behavior rather than on the relationships between objects, which limits the interpretability and generalization capability of the learned dynamics.
In view of this, the present application provides a dynamic prediction method that takes a video frame and the agent's behavior as inputs, segments the environment into objects, and conditions its prediction on the relationships between the behavior and the objects; it may therefore also be referred to as an object-oriented dynamic prediction method. Here, an object refers to a target thing; in an environment to which dynamic prediction is applied, an object generally refers to a thing in the environment, such as a ladder or an agent in the example application scenario described later.
In order to clearly describe the dynamic prediction method of the present application, the dynamic prediction method will now be described in detail with reference to an example application scenario. Referring to fig. 1, fig. 1 is a schematic diagram of an exemplary scenario in which the dynamic prediction method of the present application is applied, wherein the exemplary scenario includes a ladder A, a wall B, a space C, and an agent D. Further, based on this example scenario, the preset behaviors of agent D may include up, down, left, right, and no operation. In practical application, agent D can move up or down when it is on ladder A, is blocked and cannot move when it encounters wall B, and falls when it is in space C.
The form of the application scenario, the elements, and the number thereof are merely examples, and are not intended to limit the application scenario and the elements of the present application. In fact, the application scenario may be other scenarios that need to be dynamically predicted, and in addition, there may be a case where there are multiple agents in the same application scenario, which is not described herein any more.
Referring to fig. 2, fig. 2 is a flowchart illustrating a dynamic prediction method according to an embodiment of the present invention, wherein the dynamic prediction method includes steps S110, S130, and S150. The dynamic prediction method of the present application is described below with reference to fig. 1 and 2.
In step S110, a current frame image is acquired.
The current frame image is defined relative to the next frame image described later. In this example, the current frame image refers to the image I(t) at time t, and the next frame image refers to the image I(t+1) at time t+1 that is to be predicted. In the following description, the current frame image I(t) is represented by the example scene image shown in fig. 1.
In addition, in some embodiments, the current frame image may be obtained based on raw data. In other embodiments, the current frame image may be obtained based on external input data carrying a priori knowledge, where a priori knowledge refers to information known in advance. In this example, the external input data with a priori knowledge may include dynamic region information, obtained for example by foreground detection, which facilitates the determination of the object mask matrix to be predicted described later; in the step of determining that mask matrix, the dynamic region information can be used to concentrate the search within the dynamic region and thereby improve the recognition rate.
In step S130, an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object are determined based on the current frame image.
The object to be predicted refers to a movable object whose dynamics need to be predicted in the current scene, for example the agent D shown in fig. 1; because of its movable nature, the object to be predicted may also be referred to as a dynamic object. A reference object refers to any other object in the current scene apart from the object to be predicted. In some embodiments, the objects in the current scene may be divided into static objects and dynamic objects according to their motion attributes, in which case the reference objects may include the static objects and any dynamic objects other than the dynamic object serving as the object to be predicted. Taking fig. 1 as an example, the reference objects may include the ladder A, the wall B and the space C shown in fig. 1, wherein the ladder A is stationary in the current scene and allows the object to be predicted to move up and down and translate left and right at positions coinciding with the ladder A, the wall B is stationary in the current scene and prevents the object to be predicted from moving toward the position of the wall B, and the space C is stationary in the current scene and lets the object to be predicted move through it in various directions. The ladder A, the wall B, and the space C may also be referred to as static objects because they do not move. In addition, if an agent D' were also included in fig. 1, the agent D' would be a movable dynamic object like the agent D, and the agent D' would also be a reference object for the agent D taken as the object to be predicted; in that case, for the object to be predicted, agent D, the corresponding reference objects would include the ladder A, the wall B, the space C, and the agent D'.
In some embodiments, when the example scenario contains a single independent dynamic object, that dynamic object is the object to be predicted, and one or more reference objects correspond to it; these reference objects are referred to as the group of reference objects corresponding to the object to be predicted. Taking fig. 1 as an example, the object to be predicted is the agent D, and the group of reference objects includes the ladder A, the wall B, and the space C. In other embodiments, when the example scene contains two or more independent dynamic objects, the dynamic objects may be predicted separately; there are then two or more objects to be predicted corresponding to the dynamic objects, and two or more groups of reference objects corresponding to the objects to be predicted, where each group of reference objects includes the other objects in the current frame image apart from the corresponding object to be predicted. For example, when the example scene contains two agents D whose movement patterns in the current scene are similar, the two agents D are represented as a first object to be predicted and a second object to be predicted, respectively. A first set of reference objects corresponds to the first object to be predicted and includes the ladder A, the wall B, the space C, and the second object to be predicted; a second set of reference objects corresponds to the second object to be predicted and includes the ladder A, the wall B, the space C, and the first object to be predicted.
It should be noted that the above-mentioned object to be predicted and the reference object are only examples, and those skilled in the art may determine the corresponding object to be predicted and the reference object based on different application scenarios, which is not described herein again.
In addition, the object mask matrix to be predicted is the mask matrix obtained by masking the current frame image so that only the object to be predicted remains, and the reference object mask matrix is the mask matrix obtained by masking the current frame image so that only the reference object remains. An object's mask matrix gives, for each pixel of the image, the probability that the pixel belongs to that object, a number between 0 and 1, where 0 means the pixel does not belong to the object and 1 means it certainly does. For convenience of description, the object mask matrix to be predicted and the reference object mask matrices are collectively referred to as object mask matrices. In addition, since one object to be predicted corresponds to one group of reference objects, one object mask matrix to be predicted correspondingly corresponds to one group of reference object mask matrices.
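As a purely illustrative sketch (the array values, the 4x4 resolution, and the use of NumPy are assumptions, not part of the patent), the following shows what an object mask matrix and a reference object mask matrix of the same frame might look like as per-pixel probabilities:

```python
import numpy as np

# Hypothetical 4x4 frame: the object to be predicted occupies the top-left 2x2 block.
# Each entry is the probability that the corresponding pixel belongs to that object.
object_mask = np.array([
    [0.97, 0.95, 0.02, 0.01],
    [0.96, 0.98, 0.03, 0.01],
    [0.02, 0.03, 0.01, 0.00],
    [0.01, 0.01, 0.00, 0.00],
])

# A reference object mask (e.g. a ladder class) covers a different region of the same frame.
ladder_mask = np.array([
    [0.01, 0.02, 0.96, 0.97],
    [0.01, 0.01, 0.95, 0.98],
    [0.00, 0.01, 0.97, 0.96],
    [0.00, 0.00, 0.96, 0.95],
])

assert object_mask.shape == ladder_mask.shape  # both masks share the frame's H x W size
```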
In some embodiments, the object mask matrix to be predicted is determined based on the object to be predicted, and the reference object mask matrix is determined based on the association relationship between the reference object and the motion of the object to be predicted or the type of the reference object. That is, the object mask matrix to be predicted is determined object-specifically, and the reference object mask matrix is determined class-specifically.
For example, regarding the mask matrix of the object to be predicted, a corresponding mask matrix of the object to be predicted is generated for the object to be predicted, for example, the agent D, and in the case where there are a plurality of agents D, a plurality of mask matrices of the object to be predicted corresponding to each agent D are generated.
Regarding the reference object mask matrices, the various types of reference object mask matrices are generated according to the association relationship between the reference objects and the motion of the corresponding object to be predicted, where the association relationship is the influence of a reference object on the movement of the object to be predicted. That is, the association relationship may be divided up according to the influence of the reference object on the object to be predicted, and this influence depends on the motion state of the reference object relative to the object to be predicted and on the motion properties of the reference object. Taking fig. 1 as an example, the example scenario shown in fig. 1 includes the static object ladder A, which the object to be predicted can climb, the static object wall B, which prevents the object to be predicted from moving, and the static object space C, into which the object to be predicted can fall. Although several ladders A, walls B, and spaces C appear in the example scenario, a single ladder class reference object mask matrix may be generated for reference objects such as ladder A based on their motion relationship with respect to the object to be predicted; similarly, a wall class reference object mask matrix may be generated for reference objects such as wall B, and a space class reference object mask matrix for reference objects such as space C. In addition, if a color flag is also included as a static object in the example scenario of fig. 1, and the flag merely marks the final destination that the agent D needs to reach without affecting the movement of the agent D, a color flag class reference object mask matrix may likewise be generated based on that reference object's motion relationship with respect to the object to be predicted. Further, if the example scenario of fig. 1 also includes an obstacle, which is likewise a static object that prevents the movement of the object to be predicted, then, based on the motion relationship between reference object and object to be predicted, a single blocking class reference object mask matrix may be generated covering both reference objects such as the wall B and reference objects such as the obstacle. Furthermore, if the scene shown in fig. 1 includes two agents, i.e. two dynamic objects, the two agents can be predicted separately, in which case they are referred to as a first object to be predicted and a second object to be predicted. When the first object to be predicted is predicted, the second object to be predicted, itself a dynamic object, serves as a reference object of the first object to be predicted; when the second object to be predicted is predicted, the first object to be predicted serves as a reference object of the second object to be predicted. Accordingly, there are a first to-be-predicted object mask matrix, a second to-be-predicted object mask matrix, a first set of reference object mask matrices corresponding to the first to-be-predicted object mask matrix, and a second set of reference object mask matrices corresponding to the second to-be-predicted object mask matrix.
The first group of reference object mask matrices comprises a ladder class reference object mask matrix, a wall class reference object mask matrix (or a blocking class reference object mask matrix), a space class reference object mask matrix, a color flag class reference object mask matrix, and the second to-be-predicted object mask matrix. The second group of reference object mask matrices comprises a ladder class reference object mask matrix, a wall class reference object mask matrix (or a blocking class reference object mask matrix), a space class reference object mask matrix, a color flag class reference object mask matrix, and the first to-be-predicted object mask matrix.
Alternatively, the reference object mask matrices may be determined according to the type of the reference object. For example, when an obstacle is included in the exemplary scene of fig. 1 as described above, the obstacle is also a static object that prevents the movement of the object to be predicted but is of a different type from the wall B; in that case, a wall class reference object mask matrix corresponding to reference objects of the wall type and an obstacle class reference object mask matrix corresponding to reference objects of the obstacle type may be generated, based on the type of the reference object.
For the sake of simplicity, the present application is described by taking an example that an exemplary scene includes one object to be predicted, the object to be predicted corresponds to a group of reference objects, and a reference object mask matrix including the reference objects is determined based on the association relationship between the reference objects and the motion of the object to be predicted, but the present application is not limited thereto. Those skilled in the art should understand that, based on different application scenarios, the present application may also be applied to a case that includes a plurality of objects to be predicted and a plurality of groups of reference objects corresponding to the objects to be predicted, and details are not repeated here.
In an embodiment, the object to be predicted may be obtained from an image sequence by, for example, foreground detection, the reference objects may be obtained through feature recognition based on features of the reference objects in the application scene that are input in advance, and the current frame image containing the object to be predicted and the reference objects may then be masked to obtain the object mask matrix to be predicted and the reference object mask matrices. The object mask matrix to be predicted and the reference object mask matrices each carry position information relative to the current frame image, so that their positions with respect to the current frame image can be determined from this position information. In an example, the reference object mask matrices and the object mask matrix to be predicted are obtained by performing the mask matrix operation on the current frame image at its original size.
In another embodiment, a pre-trained first convolutional neural network may be used to determine, based on the current frame image, the object mask matrix to be predicted including the object to be predicted and the reference object mask matrices including the reference objects. In an example, the first convolutional neural network may include a plurality of convolutional neural networks that are identical in structure but differ in weights. The current frame image may be input into each of the pre-trained convolutional neural networks, the output layers of the convolutional neural networks may be connected along the channel dimension to form a combined feature map, and a pixel-wise softmax layer may then be applied to obtain the object-specific object mask matrix to be predicted and the class-specific reference object mask matrices; the number of convolutional neural networks may be determined by the number of object mask matrices. Taking fig. 1 as an example, in the exemplary scenario shown in fig. 1 the object mask matrices include one object mask matrix to be predicted and three reference object mask matrices, so the four corresponding object mask matrices may be obtained through four convolutional neural networks, which may have the same structure but different weights. If the example scenario shown in fig. 1 instead includes two dynamic objects, i.e. two agents D, then correspondingly five object mask matrices are obtained through five convolutional neural networks, which again may have the same structure but different weights.
For example, referring to fig. 3, fig. 3 is a schematic structural diagram of a convolutional neural network used in the dynamic prediction method of the present application in an embodiment. As shown in the figure, the convolutional neural network may have a multi-layer convolution plus full convolution structure, where I(t) represents the current frame image, a solid arrow represents convolution plus an activation function, a dashed arrow represents amplification plus a full connection, and a dash-dot arrow represents a copy plus a full connection; in this example, ReLU is used as the activation function. Let Conv(F, K, S) denote a convolutional layer with F filters, a convolution kernel of K and a stride of S, let R() denote the activation function layer (i.e. the ReLU layer), and let BN() denote a batch normalization layer. The convolutional layers shown in fig. 3 may then be represented as R(BN(Conv(64, 5, 2))), R(BN(Conv(64, 3, 1))), R(BN(Conv(32, 1, 1))), and R(BN(Conv(1, 3, 1))).
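As a non-authoritative illustration of this branch-per-object design, the following PyTorch sketch (the framework, the padding choices, and the bilinear upsampling used to stand in for the "amplification plus full connection" arrows are assumptions) stacks one Conv-BN-ReLU branch per object mask and applies a pixel-wise softmax across the branches:

```python
# Minimal sketch, assuming PyTorch; layer sizes follow the R(BN(Conv(F, K, S))) spec loosely.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k, s):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class MaskBranch(nn.Module):
    """One branch producing a single-channel (unnormalized) mask map."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_relu(in_ch, 64, 5, 2),
            conv_bn_relu(64, 64, 3, 1),
            conv_bn_relu(64, 32, 1, 1),
            nn.Conv2d(32, 1, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, frame):
        h = self.body(frame)
        # Resize back to the frame's spatial size before the pixel-wise softmax.
        return nn.functional.interpolate(h, size=frame.shape[-2:], mode="bilinear",
                                         align_corners=False)

class ObjectDetector(nn.Module):
    """n_objects branches (identical structure, separate weights) + pixel-wise softmax."""
    def __init__(self, n_objects, in_ch=3):
        super().__init__()
        self.branches = nn.ModuleList(MaskBranch(in_ch) for _ in range(n_objects))

    def forward(self, frame):                                         # frame: (B, C, H, W)
        logits = torch.cat([b(frame) for b in self.branches], dim=1)  # (B, n_objects, H, W)
        return torch.softmax(logits, dim=1)   # per-pixel distribution over the objects
```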
It should be noted that the structure and the parameters of the convolutional neural network are only examples, and those skilled in the art may modify and modify the structure and the parameters of the convolutional neural network based on the object to be predicted and the reference object included in different application scenarios, which is not described herein again.
In step S150, the motion of the object to be predicted is predicted based on the relationship between the object mask matrix to be predicted and the reference object mask matrix and the preset behavior.
The preset behavior is preset based on the application scenario. To the machine, the preset behavior may be output in coded form as one or more behaviors that control the motion of the object to be predicted, such as "behavior 1", "behavior 2", and so on; in a specific application scenario, each behavior corresponds to a concrete action. Taking fig. 1 as an example, in the example scenario shown in fig. 1 the preset behaviors may include behaviors 1 to 5, which in this scenario represent up, down, left, right, and no operation, respectively. The preset behavior can be set by means of encoding, for example one-hot encoding.
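A minimal sketch of such an encoding, assuming the five behaviors of the example scenario and NumPy for illustration:

```python
# Hypothetical one-hot encoding of the five preset behaviors (up, down, left, right, no-op);
# the names and ordering are assumptions made purely for illustration.
import numpy as np

BEHAVIORS = ["up", "down", "left", "right", "no_op"]

def one_hot(behavior: str) -> np.ndarray:
    a = np.zeros(len(BEHAVIORS), dtype=np.float32)
    a[BEHAVIORS.index(behavior)] = 1.0
    return a

print(one_hot("up"))  # [1. 0. 0. 0. 0.]
```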
In the present application, the motion of the object to be predicted is predicted by the preset behavior and the relationship between the object mask matrix to be predicted and the reference object mask matrix. In some embodiments, the motion of all objects including the object to be predicted and the reference object may also be predicted based on the relationship between the mask matrix of the object to be predicted and the mask matrix of the reference object and the preset behavior, but the prediction method is more computationally intensive and inefficient than predicting the motion of the object to be predicted only.
Taking fig. 1 as an example, in the application scenario shown in fig. 1, the object mask matrix to be predicted is a first object mask matrix to be predicted including the agent D, and the reference object mask matrices are respectively a first type reference object mask matrix including the ladder A, a second type reference object mask matrix including the wall B, and a third type reference object mask matrix including the space C, where the first type, second type, and third type reference object mask matrices are collectively referred to as the first group of reference object mask matrices corresponding to the first object mask matrix to be predicted. The relationship between the object mask matrix to be predicted and the reference object mask matrices comprises the effect of the first type reference object mask matrix on the object mask matrix to be predicted under the preset behavior, the effect of the second type reference object mask matrix on the object mask matrix to be predicted under the preset behavior, and the effect of the third type reference object mask matrix on the object mask matrix to be predicted under the preset behavior.
In addition, the predicted motion of the object to be predicted may include, but is not limited to, a direction of motion of the object to be predicted, a moving distance, position information of the object to be predicted after the motion, and the like.
Referring to fig. 4, fig. 4 is a flowchart illustrating a step S150 of the dynamic prediction method according to an embodiment of the present invention, and as shown in the drawing, the step S150 may include a step S1501, a step S1503 and a step S1505.
In step S1501, the reference object mask matrix is clipped according to the preset view window size centering on the position of the object to be predicted in the object mask matrix to be predicted to obtain a clipped reference object mask matrix.
Wherein the position of the object to be predicted is defined based on the expected position under its object mask matrix. For example, for the j-th object to be predicted \(D_j\), its position \(\hat{p}_{D_j} = (\hat{x}_{D_j}, \hat{y}_{D_j})\) can be expressed by the following formula (1):

\[
\hat{x}_{D_j} = \frac{\sum_{u=1}^{H}\sum_{v=1}^{W} u \, M_{D_j}(u,v)}{\sum_{u=1}^{H}\sum_{v=1}^{W} M_{D_j}(u,v)}, \qquad
\hat{y}_{D_j} = \frac{\sum_{u=1}^{H}\sum_{v=1}^{W} v \, M_{D_j}(u,v)}{\sum_{u=1}^{H}\sum_{v=1}^{W} M_{D_j}(u,v)}
\tag{1}
\]

wherein H and W represent the height and width of the image, respectively, and \(M_{D_j}\) represents the object mask matrix to be predicted of the j-th object to be predicted.
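The following NumPy sketch mirrors formula (1) as reconstructed above (the small epsilon guarding against an all-zero mask is an added assumption):

```python
import numpy as np
from typing import Tuple

def expected_position(mask: np.ndarray) -> Tuple[float, float]:
    # mask: (H, W) object mask matrix; returns the mask-weighted expected pixel position.
    H, W = mask.shape
    u = np.arange(H).reshape(H, 1)        # row (height) coordinates
    v = np.arange(W).reshape(1, W)        # column (width) coordinates
    total = mask.sum() + 1e-8             # avoid division by zero for an empty mask
    x_hat = float((u * mask).sum() / total)
    y_hat = float((v * mask).sum() / total)
    return x_hat, y_hat
```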
The view window size is the maximum effective range over which the relationship between objects can be represented, where the inter-object relationship refers to the relationship between an object to be predicted and a reference object. The view window size may be preset empirically by the technician. Assuming the size of the view window is w, the view window \(B_w\) of size w centered on \(\hat{p}_{D_j} = (\hat{x}_{D_j}, \hat{y}_{D_j})\) is expressed by the following formula (2):

\[
B_w(\hat{p}_{D_j}) = \left\{ (u, v) \;:\; |u - \hat{x}_{D_j}| \le \tfrac{w}{2},\; |v - \hat{y}_{D_j}| \le \tfrac{w}{2} \right\}
\tag{2}
\]

That is, based on the above formula (1) and formula (2), the reference object mask matrix is clipped according to \(B_w\) with the position of the object to be predicted as the center. In one example, the clipping process may be implemented by bilinear sampling. In addition, in the case where the preset view window size equals the original input image size, this is regarded as no cropping. In practical applications, the relationships between objects usually obey a locality principle; the present application therefore introduces locality through the clipping processing, focusing the dynamic influence on the object to be predicted onto its relationships with the other objects adjacent to it.
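A simple sketch of the clipping in step S1501 under stated assumptions (integer rounding of the center and zero padding at the image border; the bilinear-sampling variant mentioned above is not reproduced):

```python
import numpy as np

def crop_window(ref_mask: np.ndarray, center: tuple, w: int) -> np.ndarray:
    # ref_mask: (H, W) reference object mask; center: expected position of the object
    # to be predicted; w: preset view window size. Pixels outside the image stay zero.
    H, W = ref_mask.shape
    cx, cy = int(round(center[0])), int(round(center[1]))
    half = w // 2
    out = np.zeros((w, w), dtype=ref_mask.dtype)
    for i in range(w):
        for j in range(w):
            u, v = cx - half + i, cy - half + j
            if 0 <= u < H and 0 <= v < W:
                out[i, j] = ref_mask[u, v]
    return out
```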
In step S1503, the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted is determined based on the second convolutional neural network trained in advance.
In an embodiment, the clipped reference object mask matrix obtained in step S1501 may be input to the pre-trained second convolutional neural network, wherein the second convolutional neural network may include a plurality of convolutional neural networks having the same structure but different weights. In another embodiment, position information may first be added to the clipped reference object mask matrix obtained in step S1501, and the result may then be input to the pre-trained second convolutional neural network. Adding position information to the clipped reference object mask matrix makes the subsequent processing more sensitive to position; for example, the clipped reference object mask matrix is concatenated with a constant xy coordinate map to add spatial information to the network, thereby capturing changes in position and reducing symmetry.
The second convolutional neural network is used for determining the effect of the reference objects on the motion of the object to be predicted. Suppose that in an application scenario n_O represents the total number of object mask matrices and n_D represents the number of objects to be predicted; there are then (n_O - 1) × n_D object pairs, and the second convolutional neural network accordingly comprises (n_O - 1) × n_D convolutional neural networks. Taking fig. 1 as an example, in the scenario shown in fig. 1 the object mask matrices include one object mask matrix to be predicted and three reference object mask matrices, where the object mask matrix to be predicted is the one including agent D, and the reference object mask matrices are respectively a first type reference object mask matrix including ladder A, a second type reference object mask matrix including wall B, and a third type reference object mask matrix including space C; therefore, the effects of the corresponding three types of reference objects on the one object to be predicted can be obtained through three convolutional neural networks. Similarly, if two dynamic objects, i.e. two agents D, are included in the application scenario, the two dynamic objects can be predicted separately and are denoted the first object to be predicted and the second object to be predicted; accordingly, there are five object mask matrices in the application scenario, namely the first to-be-predicted object mask matrix, the second to-be-predicted object mask matrix, the first type reference object mask matrix, the second type reference object mask matrix, and the third type reference object mask matrix. The first set of reference object mask matrices, corresponding to the first to-be-predicted object mask matrix, then includes: the second to-be-predicted object mask matrix, the first type reference object mask matrix, the second type reference object mask matrix, and the third type reference object mask matrix. For the first object to be predicted there must therefore be four convolutional neural networks, which constitute a first group of convolutional neural networks corresponding to the first object to be predicted. Likewise, the second set of reference object mask matrices, corresponding to the second to-be-predicted object mask matrix, includes: the first to-be-predicted object mask matrix, the first type reference object mask matrix, the second type reference object mask matrix, and the third type reference object mask matrix; for the second object to be predicted there are four corresponding convolutional neural networks, which constitute a second group of convolutional neural networks. In summary, the second convolutional neural network then includes two groups of eight convolutional neural networks in total, corresponding respectively to the two objects to be predicted.
For convenience of description, the example scenario includes one object to be predicted, and a corresponding set of convolutional neural networks is taken as an example for description, but the present application is not limited thereto, and it should be understood by those skilled in the art that in the case of including two or more objects to be predicted, and correspondingly including two or more sets of convolutional neural networks, the processing may be performed in parallel to obtain the effect of each set of reference objects on the corresponding object to be predicted respectively.
In the example illustrated in fig. 1, the object mask matrices include one object mask matrix to be predicted and three reference object mask matrices, and the second convolutional neural network includes three convolutional neural networks, which may have the same structure but different weights. In one particular implementation, the structure of each convolutional neural network is similar to that shown in fig. 3, with the connection sequence R(BN(Conv(16, 3, 2))), R(BN(Conv(32, 3, 2))), R(BN(Conv(64, 3, 2))), R(BN(Conv(128, 3, 2))); the output of the last convolutional layer is then flattened and passed through a fully connected 128-dimensional hidden layer and a 2-dimensional output layer.
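The sketch below illustrates one such effect network in PyTorch (the framework, the window size of 24, the per-behavior 2-d output, and the way the constant xy coordinate maps from the embodiment above are concatenated are all assumptions made for illustration):

```python
# A minimal sketch, not the patent's implementation: cropped reference mask + coordinate
# maps -> four Conv-BN-ReLU blocks -> 128-d hidden layer -> one 2-d effect per behavior.
import torch
import torch.nn as nn

class EffectNet(nn.Module):
    def __init__(self, window=24, n_behaviors=5):
        super().__init__()
        blocks, in_ch = [], 3                       # mask + x-coord map + y-coord map
        for out_ch in (16, 32, 64, 128):
            blocks += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.conv = nn.Sequential(*blocks)
        feat = 128 * ((window + 15) // 16) ** 2     # spatial size after four stride-2 convs
        self.head = nn.Sequential(nn.Linear(feat, 128), nn.ReLU(inplace=True),
                                  nn.Linear(128, n_behaviors * 2))
        self.n_behaviors = n_behaviors

    def forward(self, cropped_mask):                # cropped_mask: (B, 1, w, w)
        B, _, w, _ = cropped_mask.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, w), torch.linspace(-1, 1, w),
                                indexing="ij")
        coords = torch.stack([xs, ys]).expand(B, 2, w, w).to(cropped_mask)
        h = self.conv(torch.cat([cropped_mask, coords], dim=1))
        return self.head(h.flatten(1)).view(B, self.n_behaviors, 2)  # effect per behavior
```

One such network would be instantiated for each (reference class, object to be predicted) pair, matching the (n_O - 1) × n_D count described above.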
It should be noted that the structure and the parameters of the convolutional neural network are only examples, and those skilled in the art may modify and modify the structure and the parameters of the convolutional neural network based on the object to be predicted, the reference object, and the effect of the reference object on the object to be predicted based on the preset behavior, which are included in different application scenarios, and are not described in detail herein.
Here, the effect is the influence, learned by the convolutional neural network, that the reference object exerts on the behavior-conditioned movement of the object to be predicted. For example, when the object to be predicted is currently located at a ladder and the input behavior is up, the effect, i.e. the vector representing the effect, indicates that the object to be predicted moves a set distance upward along the ladder. For another example, when the object to be predicted is currently located to the left of a flag and the input behavior is right, the effect of the corresponding flag class reference object mask on the object mask to be predicted may be represented as 0, because the flag has no influence on the motion of the object to be predicted.
In addition, the effect may also include the preset effect of the object to be predicted itself. For example, an object to be predicted may be preset, based on the application scene, to move a certain distance to the right in any case; therefore, when considering the effects of the reference objects on the object to be predicted, the effect of the object to be predicted itself is also taken into account in order to finally determine the behavior-conditioned movement of the object to be predicted. The effects are represented, for example, by vectors.
In step S1505, the motion of the object to be predicted is predicted based on the preset behavior and the determined effect.
In one embodiment, the effects that the reference objects represented by the respective clipped reference object mask matrices exert on the object to be predicted, as output by the respective convolutional neural networks, are first added together with the effect of the object to be predicted itself, and the sum is then multiplied by the preset behavior encoded, for example, with one-hot coding, to obtain the dynamic prediction of the object to be predicted.
In another embodiment, each of those effects and the effect of the object to be predicted itself is first multiplied by the preset behavior encoded, for example, with one-hot coding, and the products are then added to obtain the dynamic prediction of the object to be predicted.
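A small NumPy sketch of this combination (the shapes, the five-behavior encoding, and the example numbers are assumptions made purely for illustration):

```python
import numpy as np

def predict_motion(reference_effects, self_effect, behavior_one_hot):
    # reference_effects: list of (n_behaviors, 2) arrays, one per reference object class
    # self_effect:       (n_behaviors, 2) array for the object to be predicted itself
    # behavior_one_hot:  (n_behaviors,) one-hot vector for the chosen behavior
    total = self_effect + sum(reference_effects)          # (n_behaviors, 2)
    return behavior_one_hot @ total                       # (2,) predicted motion vector

motion = predict_motion(
    [np.zeros((5, 2)), np.array([[0, 0], [0, 0], [-1, 0], [1, 0], [0, 0]])],
    np.zeros((5, 2)),
    np.array([0, 0, 0, 1, 0]),                            # e.g. "behavior 4", move right
)
print(motion)   # [1. 0.]
```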
In summary, in the dynamic prediction method of the present application, the acquired current frame image is divided into the object to be predicted and the reference objects, and the motion of the object to be predicted is predicted from the relationship between the object mask matrix to be predicted and the reference object mask matrices together with the preset behavior; this improves the generalization capability of dynamic prediction, and because the objects are represented by mask matrices, the prediction process is interpretable.
In practical applications, in some cases, it is necessary to predict not only the motion of an object to be predicted but also the next frame image. In view of the above, referring to fig. 5, fig. 5 is a flowchart illustrating a dynamic prediction method according to another embodiment of the present application, and as shown in the drawing, the dynamic prediction method includes steps S510, S530, S550, S570 and S590.
In step S510, a current frame image is acquired. Step S510 is the same as or similar to step S110 in the previous example, and will not be described in detail here.
In step S570, a time-invariant background is extracted from the current frame image.
The time-invariant background is the image formed by the objects in the frame that do not change over time. In some embodiments, the image background may be obtained by, for example, foreground detection. In other embodiments, the time-invariant background may be extracted from the current frame image based on a pre-trained third convolutional neural network, whose structure includes, but is not limited to: a fully convolutional network, a convolution-deconvolution network, a residual network (ResNet), U-Net, and the like.
In one example, the third convolutional neural network is arranged as a convolution-deconvolution structure. Referring to fig. 6, fig. 6 is a schematic structural diagram of a third convolutional neural network used in the dynamic prediction method of the present application in another embodiment, where the third convolutional neural network is an encoder-decoder structure. Here I(t) represents the current frame image, Ibg(t) represents the current background image, a solid arrow represents convolution plus an activation function, a dashed arrow represents reconstruction, a single-dash-line arrow represents a full connection, and a two-dot chain-line arrow represents deconvolution plus an activation function; in this example, ReLU is used as the activation function. For all convolutions and deconvolutions, the convolution kernel, stride and number of channels are set to 3, 2 and 64, respectively, and the dimension of the hidden layer between the encoder and the decoder is 128. In addition, when training in a large number of environments, the number of convolution channels may be set to 128 to improve the effect of background separation. The activation function ReLU of the last deconvolution layer may also be replaced with a tanh function to output values in the range of -1 to 1.
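A minimal PyTorch sketch of such an encoder-decoder background extractor under stated assumptions (the framework, a 64x64 input, four stages, and the exact reshaping between encoder and decoder are assumptions; kernel 3, stride 2, 64 channels, a 128-d hidden layer, and the tanh output follow the description above):

```python
import torch
import torch.nn as nn

class BackgroundNet(nn.Module):
    def __init__(self, ch=64, latent=128, img=64, in_ch=3, stages=4):
        super().__init__()
        enc, c = [], in_ch
        for _ in range(stages):
            enc += [nn.Conv2d(c, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
            c = ch
        self.enc = nn.Sequential(*enc)
        side = img // (2 ** stages)                 # spatial size at the bottleneck
        self.to_latent = nn.Linear(ch * side * side, latent)
        self.from_latent = nn.Linear(latent, ch * side * side)
        dec = []
        for i in range(stages):
            last = i == stages - 1
            dec += [nn.ConvTranspose2d(ch, in_ch if last else ch, 3, stride=2,
                                       padding=1, output_padding=1),
                    nn.Tanh() if last else nn.ReLU(inplace=True)]
        self.dec = nn.Sequential(*dec)
        self.ch, self.side = ch, side

    def forward(self, frame):                       # frame: (B, 3, img, img)
        h = self.enc(frame).flatten(1)
        z = self.from_latent(self.to_latent(h))     # 128-d hidden layer between enc/dec
        h = z.view(-1, self.ch, self.side, self.side)
        return self.dec(h)                          # time-invariant background estimate
```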
In step S530, an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object are determined based on the current frame image. Step S530 is the same as or similar to step S130 in the previous example, and will not be described in detail here.
In step S550, the motion of the object to be predicted is predicted based on the relationship between the object mask matrix to be predicted and the reference object mask matrix and the preset behavior. Step S550 is the same as or similar to step S150 in the previous example, and will not be described in detail here.
In step S590, a next frame image is obtained in conjunction with the extracted time-invariant background and the predicted motion of the object to be predicted.
The next frame image is the image I(t+1) at time t+1 corresponding to the current frame image I(t). In one embodiment, a Spatial Transformer Network (STN) is used to perform spatial transformation based on the current frame image, the background image, the object mask matrices and the predicted motion of the object to be predicted, so as to obtain the next frame image. Specifically, on the one hand, a first spatial transformer network transforms the object mask matrix to be predicted according to the predicted motion of the object to be predicted, a negation operation is applied, and the result is multiplied with the extracted time-invariant background image to obtain the background image at time t+1, where the multiplication refers to an array element multiplication algorithm. On the other hand, a second spatial transformer network performs spatial transformation based on the object mask matrix to be predicted, the current frame image and the predicted motion of the object to be predicted, to obtain the object image at time t+1. The background image at time t+1 and the object image at time t+1 are then added to obtain the image at time t+1, i.e., the next frame image, where the addition refers to an array element addition algorithm. Similarly, when the application scene includes two objects to be predicted, dynamic prediction is performed for each of them and the results are displayed simultaneously on the same next frame image.
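The compositing described above can be sketched as follows; this is a simplified illustration with our own helper names, in which the spatial transformation is reduced to a pure per-object translation.

```python
import torch
import torch.nn.functional as F

def translate(img, dxdy):
    """Warp a (B, C, H, W) tensor by the per-sample pixel translation dxdy of shape (B, 2)."""
    B, _, H, W = img.shape
    theta = torch.zeros(B, 2, 3, device=img.device)
    theta[:, 0, 0] = theta[:, 1, 1] = 1.0
    theta[:, 0, 2] = -2.0 * dxdy[:, 0] / W          # pixel shift converted to normalized coordinates
    theta[:, 1, 2] = -2.0 * dxdy[:, 1] / H
    grid = F.affine_grid(theta, img.shape, align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

def next_frame(frame, background, mask, motion):
    """frame, background: (B,3,H,W); mask: (B,1,H,W); motion: (B,2) predicted displacement."""
    moved_mask = translate(mask, motion)            # where the object will be at time t+1
    moved_obj = translate(mask * frame, motion)     # the object's pixels moved to that place
    bg_next = (1.0 - moved_mask) * background       # negation + array element multiplication
    return bg_next + moved_obj                      # array element addition -> I(t+1)
```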
In the case where the next frame image is obtained, the next frame image, i.e., the predicted image, may be provided as a new frame image to step S510, thus looping the entire process of predicting the movement of the object to be predicted.
In addition, based on the above description, a loss function may be introduced to adjust the neural networks used in the present application during training, and the current model may be evaluated according to the loss function when the neural networks are applied. For example, the present application may introduce, but is not limited to: an entropy loss function to limit the entropy of the object mask matrices, a regression loss function to optimize the object mask matrices and the motion vector of the object to be predicted, and pixel loss functions to limit the image prediction error and to reconstruct the current image. In one embodiment, the first convolutional neural network, the second convolutional neural network and the third convolutional neural network are trained jointly according to the loss function. In one example, for the case where the current frame image is obtained based on raw data, the introduced loss functions are as follows.
1) For the step of extracting a time-invariant background from the current frame image, L_background denotes a background loss function used to make the network satisfy time invariance:

$$L_{background} = \frac{1}{HW}\,\big\| \hat{I}_{bg}^{(t+1)} - \hat{I}_{bg}^{(t)} \big\|_2^2$$

where H and W denote the height and width of the image respectively, I_bg^(t) denotes the background image at time t, and I_bg^(t+1) denotes the background image at time t+1. If the current frame image I^(t) ∈ R^(H×W×3), then the background image corresponding to the current frame image satisfies I_bg^(t) ∈ R^(H×W×3), where R denotes the real space, and the pixels of the background image do not change over time.
2) For the step of determining, based on the current frame image, an object mask matrix to be predicted including the object to be predicted and reference object mask matrices including the reference objects:

i) L_entropy denotes a pixel-wise entropy loss used to limit the entropy of the object mask matrices. The pixel-wise entropy loss is introduced to reduce the uncertainty of the assignment of each pixel I(u, v) and to encourage the object mask matrices to take a sharper, more nearly discrete distribution:

$$p(u,v,c) = \frac{\exp\big(f(u,v,c)\big)}{\sum_{c'=1}^{n_O} \exp\big(f(u,v,c')\big)}, \qquad
L_{entropy} = -\frac{1}{HW} \sum_{u,v} \sum_{c=1}^{n_O} p(u,v,c)\,\log p(u,v,c)$$

where n_O denotes the total number of object mask matrices, c denotes the c-th channel of the fully connected feature layer in the first convolutional neural network, f(u, v, c) denotes the value at position (u, v) in the c-th channel of the fully connected feature layer, i denotes the i-th reference object mask matrix, and p(u, v, c) denotes the probability that pixel I(u, v) of the input image belongs to the object of the c-th channel.
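A small sketch of such a pixel-wise softmax and entropy term, under the tensor layout assumed here (one channel per object mask matrix), might look as follows:

```python
import torch
import torch.nn.functional as F

def entropy_loss(f):
    """f: (B, n_O, H, W) fully connected feature layer; returns the mean per-pixel entropy."""
    p = F.softmax(f, dim=1)                  # probability that each pixel belongs to each object
    ent = -(p * torch.log(p + 1e-8)).sum(1)  # per-pixel entropy over the n_O channels
    return ent.mean()                        # minimizing this sharpens the mask matrices
```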
ii) L_shortcut denotes a regression loss used to optimize the reference object mask matrices and the motion vectors, so as to provide early feedback before the image is reconstructed:

$$L_{shortcut} = \sum_{j=1}^{n_D} \Big\| \Delta D_j^{(t)} - \Big( E_{self}(D_j) + \sum_{i=1}^{n_O-1} E(O_i, D_j) \Big)\,\alpha^{(t)} \Big\|_2^2$$

where n_O denotes the total number of object mask matrices, n_D denotes the number of independent objects to be predicted, j denotes the j-th object to be predicted, t denotes the t-th time and t+1 the (t+1)-th time, ΔD_j^(t) ∈ R² denotes the motion vector of the object to be predicted D_j between time t and time t+1, E_self(D_j) ∈ R^(2×n_α) denotes the effect of the object to be predicted itself, E(O_i, D_j) ∈ R^(2×n_α) denotes the effect of the i-th reference object on the j-th object to be predicted, n_α denotes the number of behaviors, and α^(t) ∈ {0, 1}^(n_α) denotes the behavior at time t.
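Under the form of L_shortcut reconstructed above, and with assumed shapes and names, a sketch of this early-feedback regression loss could be:

```python
import torch

def shortcut_loss(delta, E_self, E_refs, alpha):
    """delta: (B, 2) observed displacement of the object between t and t+1;
    E_self: (B, 2, n_a); E_refs: list of (B, 2, n_a); alpha: (B, n_a) one-hot behaviors."""
    E_total = E_self + torch.stack(E_refs).sum(0)                 # combine all effects
    pred = torch.bmm(E_total, alpha.unsqueeze(-1)).squeeze(-1)    # behavior-selected motion, (B, 2)
    return ((delta - pred) ** 2).sum(-1).mean()                   # squared-error regression loss
```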
3) For the step of predicting the motion of the object to be predicted based on the preset behavior and the determined effects, and obtaining the next frame image by combining the extracted time-invariant background with the predicted motion of the object to be predicted:

i) L_prediction denotes the image prediction error; in the present application, an l2 pixel loss is used to limit the image prediction error:

$$\hat{I}_D^{(t+1)} = \mathrm{STN}\big(M_{D_j} \odot I^{(t)},\, \Delta D_j^{(t)}\big), \qquad
\hat{I}_{bg}^{(t+1)} = \big(1 - \mathrm{STN}(M_{D_j},\, \Delta D_j^{(t)})\big) \odot I_{bg}^{(t)},$$
$$\hat{I}^{(t+1)} = \hat{I}_D^{(t+1)} + \hat{I}_{bg}^{(t+1)}, \qquad
L_{prediction} = \big\| I^{(t+1)} - \hat{I}^{(t+1)} \big\|_2^2$$

where I^(t+1) denotes the pixel frame at time t+1, Î^(t+1) denotes the predicted pixel frame at time t+1, Î_D^(t+1) denotes the predicted pixel frame of the object to be predicted at time t+1, Î_bg^(t+1) denotes the predicted frame of the reference objects (i.e., the background) at time t+1, STN denotes a spatial transformer network, and ⊙ denotes multiplication of array elements.
ii) L_reconstruction denotes the reconstruction error; in the present application, a similar l2 pixel loss is used to reconstruct the current frame image:

$$L_{reconstruction} = \big\| I^{(t)} - \hat{I}^{(t)} \big\|_2^2$$

where Î^(t) denotes the current frame image reconstructed from the object mask matrices and the extracted background.
iii) L_consistency denotes a consistency error used to describe the consistency of the pixel changes when the object moves.
In summary, the first convolutional neural network, the second convolutional neural network, and the third convolutional neural network described in the present application are adjusted and trained based on the total loss function obtained by giving different weights to the loss functions and combining them. That is, the first convolutional neural network, the second convolutional neural network, and the third convolutional neural network described in the present application are trained based on the total loss in equation (9).
$$L_{total} = L_{shortcut} + \lambda_p L_{prediction} + \lambda_r L_{reconstruction} + \lambda_c L_{consistency} + \lambda_{bg} L_{background} + \lambda_e L_{entropy} \quad \text{(formula 9)}$$

where λ_p, λ_r, λ_c, λ_bg and λ_e denote the weights.
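A sketch of combining the individual terms with the weights of formula 9 (the weight values below are placeholders, not values given by the patent) could be:

```python
def total_loss(losses, weights=None):
    """losses: dict of scalar tensors keyed by 'shortcut', 'prediction', 'reconstruction',
    'consistency', 'background' and 'entropy'; weights: optional dict of lambda values."""
    w = weights or {'prediction': 1.0, 'reconstruction': 1.0, 'consistency': 1.0,
                    'background': 1.0, 'entropy': 0.1}            # placeholder lambda values
    return (losses['shortcut']
            + w['prediction'] * losses['prediction']
            + w['reconstruction'] * losses['reconstruction']
            + w['consistency'] * losses['consistency']
            + w['background'] * losses['background']
            + w['entropy'] * losses['entropy'])
```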
In addition, for the case where the current frame image is obtained based on external input data with prior knowledge, an additional candidate-region error L_candidate is introduced when training the neural networks, where M̂ denotes the candidate dynamic region.
In addition, when object detection is performed, the object mask matrix to be predicted including the object to be predicted and the reference object mask matrices including the reference objects are determined from the current frame image based on appearance; when dynamic prediction is performed, they can also be determined based on the behaviors and the inter-object relations. The results determined in these two ways can therefore be compared with and compensate each other, so as to determine more accurate object mask matrices to be predicted and reference object mask matrices.
The application also provides a dynamic prediction system. Referring to fig. 7, fig. 7 is a schematic structural diagram of a dynamic prediction system according to an embodiment of the present disclosure, and as shown in the drawing, the dynamic prediction system includes an obtaining unit 11, an object detecting unit 12, and a prediction unit 13.
The acquiring unit 11 is used for acquiring a current frame image. Wherein the current frame image is relative to a next frame image described later. In this example, the current frame image refers to the image I (t) at the t-th time, and the next frame image refers to the image I (t +1) at the t + 1-th time to be predicted. In the following description, the current frame image i (t) is represented by an example scene image shown in fig. 1.
In addition, in some embodiments, the current frame image may be obtained based on raw data. In other embodiments, the current frame image may be obtained based on external input data with prior knowledge, where the prior knowledge refers to information known in advance. In this example, the external input data with prior knowledge may include dynamic region information, obtained for example by foreground detection, which facilitates the later-described determination of the object mask matrix to be predicted: in that step, the determination can be concentrated within the dynamic region by referring to the dynamic region information, thereby improving the recognition rate.
The object detection unit 12 is configured to determine an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image.
The object to be predicted refers to a movable object whose dynamics need to be predicted in the current scene, for example the agent D shown in fig. 1; owing to its movable motion attribute, the object to be predicted may also be referred to as a dynamic object. The reference objects refer to the other objects in the current scene apart from the object to be predicted. In some embodiments, the objects in the current scene may be divided into static objects and dynamic objects based on their motion attributes; in such a case, the reference objects may include the static objects and any dynamic objects other than the dynamic object serving as the object to be predicted. Taking fig. 1 as an example, the reference objects may include the ladder A, the wall B and the space C shown in fig. 1, where the ladder A is stationary in the current scene and allows the object to be predicted to move up and down and translate left and right at positions coinciding with the ladder A, the wall B is stationary in the current scene and prevents the object to be predicted from moving toward the position of the wall B, and the space C is stationary in the current scene and allows the object to be predicted to move through it in various directions. The ladder A, the wall B and the space C may also be referred to as static objects owing to their motionless nature. In addition, if an agent D' were also included in fig. 1, the agent D' would, like the agent D, be a movable dynamic object, and the agent D' would also be a reference object for the agent D taken as the object to be predicted. In view of this, for the agent D as the object to be predicted, the corresponding reference objects include the ladder A, the wall B, the space C and the agent D'.
In some embodiments, in the case that an independent dynamic object is included in the example scenario, the dynamic object is an object to be predicted, and one or more reference objects are corresponding to the object to be predicted, and the one or more reference objects are referred to as a group of reference objects corresponding to the object to be predicted, and taking fig. 1 as an example, the object to be predicted is an agent D, and the group of reference objects includes a ladder a, a wall B, and a space C. In other embodiments, in the case that two or more independent dynamic objects are included in the example scene, the two or more dynamic objects may be predicted separately, and then two or more objects to be predicted correspond to the two or more dynamic objects, and two or more groups of reference objects correspond to the two or more objects to be predicted, where each group of reference objects includes other objects except the objects to be predicted in the current frame image. For example, in a case where two agents D are included in the example scene and the movement pattern of each agent D in the current scene is similar, the two agents D are respectively represented as a first object to be predicted and a second object to be predicted, and there is a first set of reference objects corresponding to the first object to be predicted, the first set of reference objects including a ladder a, a wall B, a space C, and the second object to be predicted. A second set of reference objects is associated with the second object to be predicted, the second set of reference objects including a ladder a, a wall B, a space C, and the first object to be predicted.
It should be noted that the above-mentioned object to be predicted and the reference object are only examples, and those skilled in the art may determine the corresponding object to be predicted and the reference object based on different application scenarios, which is not described herein again.
In addition, the mask matrix of the object to be predicted is a mask matrix which is obtained after the current frame image is shielded and only includes the object to be predicted, and the mask matrix of the reference object is a mask matrix which is obtained after the current frame image is shielded and only includes the reference object. Wherein the mask matrix of the object represents a probability that each pixel of the image belongs to the object, the probability being a number between 0 and 1, wherein 0 represents that the probability of belonging to the object is 0 and 1 represents that the probability of belonging to the object is 1. For convenience of description, the object mask matrix to be predicted and the reference object mask matrix are collectively referred to as an object mask matrix. In addition, based on the above-mentioned condition that one object to be predicted corresponds to one group of reference objects, correspondingly, the one object mask matrix to be predicted corresponds to one group of reference object mask matrices.
In some embodiments, the object mask matrix to be predicted is determined based on the object to be predicted, and the reference object mask matrix is determined based on the association relationship between the reference object and the motion of the object to be predicted or the type of the reference object. That is, the object mask matrix to be predicted is determined object-specifically, and the reference object mask matrix is determined class-specifically.
For example, regarding the mask matrix of the object to be predicted, a corresponding mask matrix of the object to be predicted is generated for the object to be predicted, for example, the agent D, and in the case where there are a plurality of agents D, a plurality of mask matrices of the object to be predicted corresponding to each agent D are generated.
Regarding the reference object mask matrix, various types of reference object mask matrices corresponding to the association relationship are generated according to the association relationship between the reference object and the motion of the corresponding object to be predicted, wherein the association relationship is the influence of the reference object on the movement of the object to be predicted. That is, the incidence relation may be divided based on the influence of the reference object on the object to be predicted. The influence depends on the motion state of the reference object relative to the object to be predicted and the motion properties of the reference object. Taking fig. 1 as an example, the example scenario shown in fig. 1 includes a static object ladder a capable of climbing an object to be predicted, a static object wall B preventing the object to be predicted from moving, and a static object space C capable of dropping the object to be predicted, and although a plurality of ladders a, walls B, and spaces C are shown in the example scenario, a ladder class reference object mask matrix corresponding to a reference object such as a ladder a may be generated based on a motion relationship of the reference object with respect to the object to be predicted, and similarly, a wall class reference object mask matrix corresponding to a reference object such as a wall B may be generated, and a space class reference object mask matrix corresponding to a reference object such as a space C may be generated. In addition, if a color flag is further included in the example scenario in fig. 1 as a static object, and the color flag only represents a final destination to which the agent D needs to reach but does not affect the movement of the agent D, a color flag type reference object mask matrix may be generated based on the motion relationship of the reference object with respect to the object to be predicted. In addition, if the example scenario in fig. 1 further includes an obstacle, which is also a static object for preventing the movement of the object to be predicted, based on the motion relationship between the reference object and the object to be predicted, a blocking-type reference object mask matrix corresponding to the reference object such as the wall B and the reference object such as the obstacle may be generated. Furthermore, if the scene shown in fig. 1 includes two agents, i.e., two dynamic objects, then the two agents can be predicted separately, in which case the two agents are referred to as a first object to be predicted and a second object to be predicted separately. When the first object to be predicted is predicted, a second object to be predicted, which is a dynamic object, is used as a reference object of the first object to be predicted. When the second object to be predicted is predicted, the first object to be predicted, which is a dynamic object, is used as a reference object of the second object to be predicted. In view of this, there are a first to-be-predicted object mask matrix, a second to-be-predicted object mask matrix, a first set of reference object mask matrices corresponding to the first to-be-predicted object mask matrix, and a second set of reference object mask matrices corresponding to the second to-be-predicted object mask matrix. 
The first group of reference object mask matrixes comprise a ladder type reference object mask matrix, a wall type reference object mask matrix (or a blocking type reference object mask matrix), a space type reference object mask matrix, a color flag type reference object mask matrix and a second object mask matrix to be predicted. The second set of reference object mask matrices includes a ladder class reference object mask matrix, a wall class reference object mask matrix (or a blocking class reference object mask matrix), a spatial class reference object mask matrix, a color flag class reference object mask matrix, and a first to-be-predicted object mask matrix.
Alternatively, the reference object mask matrix may be determined according to the type of the reference object. For example, in the case where an obstacle is included in the exemplary scene in fig. 1 as described above, the obstacle is also a static object for preventing the movement of the object to be predicted but is of a different type from the wall B, a wall class reference object mask matrix corresponding to a reference object of the wall B type and an obstacle class reference object mask matrix corresponding to a reference object of the obstacle type may be generated based on the type of the reference object, respectively.
For the sake of simplicity, the present application is described by taking an example that an exemplary scene includes one object to be predicted, the object to be predicted corresponds to a group of reference objects, and a reference object mask matrix including the reference objects is determined based on the association relationship between the reference objects and the motion of the object to be predicted, but the present application is not limited thereto. Those skilled in the art should understand that, based on different application scenarios, the present application may also be applied to a case that includes a plurality of objects to be predicted and a plurality of groups of reference objects corresponding to the objects to be predicted, and details are not repeated here.
In an embodiment, the object detection unit 12 obtains an object to be predicted from a sequence image by using, for example, a foreground detection method, obtains a reference object by feature recognition based on a feature of a reference object in an application scene input in advance, and performs mask matrix processing on a current frame image including the object to be predicted and the reference object by using a mask matrix to obtain an object mask matrix to be predicted and a reference object mask matrix. The mask matrix of the object to be predicted and the mask matrix of the reference object respectively comprise position information of the mask matrix in the current frame image, so that the positions of the mask matrix of the object to be predicted and the mask matrix of the reference object relative to the current frame image can be determined through the position information. In an example, the reference object mask matrix and the object mask matrix to be predicted are obtained by performing a mask matrix operation on a current frame image in an original size.
In another embodiment, the object detection unit 12 is configured to determine an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image using a pre-trained first convolutional neural network. In an example, the first convolutional neural network may include a plurality of convolutional neural networks that are identical in structure but different in weight. The current frame image may be input into each convolutional neural network trained in advance, output layers of each convolutional neural network are interconnected with each other via channels to form a fully connected feature map, and then a pixel-wise softmax layer is followed to obtain an object mask matrix to be predicted specific to an object and a reference object mask matrix specific to a class, wherein the number of the convolutional neural networks may be determined based on the number of the object mask matrices. Taking fig. 1 as an example, according to the exemplary scenario shown in fig. 1, the object mask matrix includes one object mask matrix to be predicted and three reference object mask matrices, and thus, four corresponding object mask matrices may be obtained through four convolutional neural networks, which may have the same structure but different weights. If the example scenario shown in FIG. 1 includes two dynamic objects, agent D, then, in contrast, the corresponding five object mask matrices are obtained by five convolutional neural networks, which may have the same structure but different weights.
For example, referring to fig. 3, fig. 3 is a schematic structural diagram of a convolutional neural network used in the dynamic prediction method of the present application in an embodiment. As shown in the figure, the structure of the convolutional neural network may be a multi-layer convolution plus full convolution structure, where I(t) denotes the current frame image, a solid arrow denotes convolution plus an activation function, a dashed arrow denotes amplification plus a full connection, and a long-short dashed arrow denotes a copy plus a full connection; in this example, ReLU is used as the activation function. Letting Conv(F, K, S) denote a convolutional layer with F filters, a convolution kernel of K and a step size of S, R() denote an activation function layer (i.e., a ReLU layer), and BN() denote a batch normalization layer, the convolutional layers shown in fig. 3 may be represented as R(BN(Conv(64, 5, 2))), R(BN(Conv(64, 3, 1))), R(BN(Conv(32, 1, 1))), R(BN(Conv(1, 3, 1))).
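For illustration, one branch of such a network can be sketched in PyTorch as follows, using the R(BN(Conv(F, K, S))) notation above; the input channel count and the padding values are our assumptions.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k, s):
    # R(BN(Conv(F, K, S))): convolution, then batch normalization, then ReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU())

mask_branch = nn.Sequential(
    conv_block(3, 64, 5, 2),     # R(BN(Conv(64, 5, 2)))
    conv_block(64, 64, 3, 1),    # R(BN(Conv(64, 3, 1)))
    conv_block(64, 32, 1, 1),    # R(BN(Conv(32, 1, 1)))
    conv_block(32, 1, 3, 1),     # R(BN(Conv(1, 3, 1))) -> one channel of the feature map
)
```

One such branch would be instantiated per object mask matrix, with the branch outputs concatenated channel-wise and passed through the pixel-wise softmax layer described earlier.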
It should be noted that the structure and the parameters of the convolutional neural network are only examples, and those skilled in the art may modify and modify the structure and the parameters of the convolutional neural network based on the object to be predicted and the reference object included in different application scenarios, which is not described herein again.
The prediction unit 13 is configured to predict the motion of the object to be predicted based on the relationship between the object mask matrix to be predicted and the reference object mask matrix and the preset behavior.
Wherein the preset behavior is preset based on an application scenario. The preset behavior may output, for example, in a form of coding, one or more behaviors that control the motion of the object to be predicted, such as "behavior 1", "behavior 2", and the like, facing the machine. In addition, the behavior may refer to a corresponding specific behavior in a specific application scenario. Taking fig. 1 as an example, in the example scenario shown in fig. 1, the preset behaviors may include behaviors 1 to 5, where, applied in the scenario, behaviors 1 to 5 represent up, down, left, right, and no operation, respectively. The preset behavior can be set by means of encoding, for example one-hot encoding.
In the present application, the prediction unit 13 predicts the motion of the object to be predicted with a preset behavior and a relationship between the mask matrix of the object to be predicted and the mask matrix of the reference object. In some embodiments, the prediction unit 13 may also predict the motion of all objects including the object to be predicted and the reference object based on the relationship between the mask matrix of the object to be predicted and the mask matrix of the reference object and the preset behavior, but the prediction method is more computationally inefficient than predicting the motion of the object to be predicted only.
Taking fig. 1 as an example, in the application scenario shown in fig. 1, the object mask matrix to be predicted is a first object mask matrix to be predicted including an agent D, and the reference object mask matrices are a first type reference object mask matrix including a ladder a, a second type reference object mask matrix including a wall B, and a third type reference object mask matrix including a space C, respectively, where the first type reference object mask matrix, the second type reference object mask matrix, and the third type reference object mask matrix are collectively referred to as a first group of reference object mask matrices corresponding to the first object mask matrix to be predicted. The relation between the mask matrix of the object to be predicted and the mask matrix of the reference object comprises the action of the mask matrix of the first type of reference object on the mask matrix of the object to be predicted based on the preset behavior, the action of the mask matrix of the second type of reference object on the mask matrix of the object to be predicted based on the preset behavior and the action of the mask matrix of the third type of reference object on the mask matrix of the object to be predicted based on the preset behavior.
In addition, the predicted motion of the object to be predicted may include, but is not limited to, a direction of motion of the object to be predicted, a moving distance, position information of the object to be predicted after the motion, and the like.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a prediction unit in the dynamic prediction system according to an embodiment of the present invention, and as shown in the drawing, the prediction unit 13 may include a clipping module 131, an action determining module 132, and a prediction module 133.
The cropping module 131 is configured to crop the reference object mask matrix according to a preset view window size with a position of an object to be predicted in the object mask matrix to be predicted as a center to obtain a cropped reference object mask matrix.
The position of the object to be predicted is defined as the expected position under the object mask matrix to be predicted. For example, for the j-th object to be predicted D_j, its position p_{D_j}^(t) can be expressed by the following formula (1):

$$p_{D_j}^{(t)} = \frac{\sum_{u=1}^{H} \sum_{v=1}^{W} (u, v)\, M_{D_j}(u, v)}{\sum_{u=1}^{H} \sum_{v=1}^{W} M_{D_j}(u, v)} \quad \text{(formula 1)}$$

where H and W denote the height and width of the image respectively, and M_{D_j} denotes the object mask matrix to be predicted of the j-th object to be predicted.
The view window size is the maximum effective range over which the inter-object relations can be represented, where the inter-object relations refer to the relations between the object to be predicted and the reference objects. The view window size may be preset empirically by the technician. Assuming that the view window size is w, the view window B_w of size w centered at p_{D_j}^(t) is expressed by the following formula (2):

$$B_w\big(p_{D_j}^{(t)}\big) = \Big\{ (u, v) : \big|u - p_{D_j,u}^{(t)}\big| \le \tfrac{w}{2},\; \big|v - p_{D_j,v}^{(t)}\big| \le \tfrac{w}{2} \Big\} \quad \text{(formula 2)}$$

That is, based on formula (1) and formula (2), the reference object mask matrices are cropped according to B_w with the position of the object to be predicted as the center. In one example, the cropping may be implemented by bilinear sampling. In addition, when the preset view window size equals the original input image size, this is regarded as no cropping. In practical applications, the relations between objects usually obey a locality principle, so the present application introduces this locality principle through the cropping process, thereby focusing the dynamic influence on the object to be predicted on the relations between the object to be predicted and the objects adjacent to it.
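A sketch of computing the expected position from the mask matrix and cropping a reference mask around it might look as follows; it uses a simple integer-window crop in place of bilinear sampling, and the helper names are our own.

```python
import torch
import torch.nn.functional as F

def expected_position(mask):
    """mask: (H, W) object-to-be-predicted mask matrix with entries in [0, 1]."""
    H, W = mask.shape
    u = torch.arange(H, dtype=mask.dtype).view(H, 1)
    v = torch.arange(W, dtype=mask.dtype).view(1, W)
    total = mask.sum() + 1e-8
    return (u * mask).sum() / total, (v * mask).sum() / total   # (row, column) expectation

def crop_window(ref_mask, center, w):
    """Crop a w x w window of the reference mask centered on the object position."""
    padded = F.pad(ref_mask.unsqueeze(0).unsqueeze(0), (w // 2,) * 4).squeeze()
    cu, cv = int(round(float(center[0]))), int(round(float(center[1])))
    return padded[cu:cu + w, cv:cv + w]
```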
In one embodiment, the action determination module 132 is configured to determine an action of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a pre-trained second convolutional neural network. In another embodiment, the action determining module 132 may be further configured to add location information to the obtained cropped reference object mask matrix and determine an action of the reference object represented by the cropped reference object mask matrix on the object to be predicted based on a second convolutional neural network trained in advance.
That is, the cropped reference object mask matrices obtained via the cropping module 131 may be input into the pre-trained second convolutional neural network, where the second convolutional neural network may include a plurality of convolutional neural networks having the same structure but different weights. Alternatively, position information may first be added to the obtained cropped reference object mask matrices, and the cropped reference object mask matrices together with the xy coordinate maps are then input into the pre-trained second convolutional neural network. Adding position information to the cropped reference object mask matrices makes the subsequent processing more sensitive to position. For example, the cropped reference object mask matrix is concatenated with constant xy coordinate maps to add spatial information to the network, thereby increasing sensitivity to positional changes and reducing symmetry.
The second convolutional neural network is used to determine the effect of the reference objects on the motion of the object to be predicted. Suppose that in an application scenario n_O denotes the total number of object mask matrices and n_D denotes the number of objects to be predicted; then for the (n_O - 1) × n_D object pairs, the second convolutional neural network comprises (n_O - 1) × n_D convolutional neural networks in total. Taking fig. 1 as an example, in the scenario shown in fig. 1 the object mask matrices include one object mask matrix to be predicted and three reference object mask matrices, where the object mask matrix to be predicted is the one including the agent D and the reference object mask matrices are, respectively, a first-type reference object mask matrix including the ladder A, a second-type reference object mask matrix including the wall B and a third-type reference object mask matrix including the space C; thus, the effects of the three types of reference objects on the object to be predicted can be obtained by three convolutional neural networks, one per reference object mask matrix. Similarly, if the application scenario includes two dynamic objects, i.e., two agents D, the two dynamic objects can be predicted separately and are denoted the first object to be predicted and the second object to be predicted respectively. The application scenario then has five object mask matrices in total, namely the first predicted object mask matrix, the second predicted object mask matrix, the first-type reference object mask matrix, the second-type reference object mask matrix and the third-type reference object mask matrix. Accordingly, the first group of reference object mask matrices corresponding to the first predicted object mask matrix includes: the second predicted object mask matrix, the first-type reference object mask matrix, the second-type reference object mask matrix and the third-type reference object mask matrix; for the first predicted object there are therefore four corresponding convolutional neural networks, which constitute a first group of convolutional neural networks. Likewise, the second group of reference object mask matrices corresponding to the second predicted object mask matrix includes: the first predicted object mask matrix, the first-type reference object mask matrix, the second-type reference object mask matrix and the third-type reference object mask matrix; for the second predicted object there are four corresponding convolutional neural networks, which constitute a second group of convolutional neural networks. In summary, the second convolutional neural network then includes two groups of eight convolutional neural networks, corresponding respectively to the two objects to be predicted.
For convenience of description, the example scenario includes one object to be predicted, and a corresponding set of convolutional neural networks is taken as an example for description, but the present application is not limited thereto, and it should be understood by those skilled in the art that in the case of including two or more objects to be predicted, and correspondingly including two or more sets of convolutional neural networks, the processing may be performed in parallel to obtain the effect of each set of reference objects on the corresponding object to be predicted respectively.
In the example illustrated in fig. 1, the object mask matrices include one object mask matrix to be predicted and three reference object mask matrices, and the second convolutional neural network accordingly includes three convolutional neural networks, which may have the same structure but different weights. In one specific implementation, the structure of each convolutional neural network is similar to that shown in fig. 3: the connection sequence is R(BN(Conv(16, 3, 2))), R(BN(Conv(32, 3, 2))), R(BN(Conv(64, 3, 2))), R(BN(Conv(128, 3, 2))), and the last convolutional layer is reshaped and then fully connected, in sequence, to a 128-dimensional hidden layer and a 2-dimensional output layer.
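A sketch of one such relation network, with our assumed input channel count, window size and padding, could be written as follows; how the 2-dimensional effect output is produced per preset behavior (for example, one such output per behavior) is likewise an assumption left open here.

```python
import torch.nn as nn

class EffectNet(nn.Module):
    def __init__(self, in_ch=3, window=32):
        super().__init__()
        layers, prev = [], in_ch                    # in_ch: cropped mask plus xy coordinate maps
        for c in (16, 32, 64, 128):                 # R(BN(Conv(c, 3, 2))) blocks
            layers += [nn.Conv2d(prev, c, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c), nn.ReLU()]
            prev = c
        self.conv = nn.Sequential(*layers)
        feat = window // 16                         # spatial size after four stride-2 convolutions
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(128 * feat * feat, 128), nn.ReLU(),
                                  nn.Linear(128, 2))

    def forward(self, x):                           # x: (B, in_ch, window, window)
        return self.head(self.conv(x))              # 2-D effect vector for one reference class
```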
It should be noted that the structure and the parameters of the convolutional neural network are only examples, and those skilled in the art may modify and modify the structure and the parameters of the convolutional neural network based on the object to be predicted, the reference object, and the effect of the reference object on the object to be predicted based on the preset behavior, which are included in different application scenarios, and are not described in detail herein.
Here, the effect is based on an effect of a behavior-based movement of the object to be predicted by the reference object learned by the convolutional neural network. For example, in the case where the object to be predicted is currently located at a ladder and the entered behavior is upward, the effect, i.e., the vector that represents the effect, represents that the object to be predicted moves a set distance upward along the ladder. For another example, in a case that the object to be predicted is currently located at the left side of the flag and the input behavior is right, since the flag has no influence on the motion of the object to be predicted, the effect of the corresponding flag class reference object mask on the object mask to be predicted may be represented as 0.
In addition, the action may also include the action of the preset object to be predicted itself. For example, an object to be predicted, which is set in advance based on an application scene, moves a certain distance to the right in any case, and thus, in consideration of the effect of the reference object on the object to be predicted, the effect of the object to be predicted itself is also taken into consideration to finally determine the movement of the object to be predicted based on the behavior. The effects are represented by vectors, for example.
The prediction module 133 is configured to predict the motion of the object to be predicted based on the predetermined behavior and the determined effect.
In one embodiment, the prediction module is configured to add the effects of the reference objects on the object to be predicted, obtained by the respective convolutional neural networks from the respective cropped reference object mask matrices, to the effect of the object to be predicted itself, and to multiply the sum by the preset behavior encoded, for example, with one-hot coding, so as to obtain the dynamic prediction of the object to be predicted.
In another embodiment, the prediction module is configured to multiply the effect of each reference object on the object to be predicted, obtained by the corresponding convolutional neural network from the corresponding cropped reference object mask matrix, and the effect of the object to be predicted itself, each by the preset behavior encoded, for example, with one-hot coding, and then to add the products so as to obtain the dynamic prediction of the object to be predicted.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a prediction unit in the dynamic prediction system according to another embodiment of the present invention. With reference to fig. 1, as shown in fig. 9, in one specific implementation, after the cropping module in the prediction unit receives the object mask matrices, including the object mask matrix to be predicted and the reference object mask matrices determined by the object detection unit, the cropping module determines the position of the object to be predicted in the object mask matrix to be predicted and, taking this position as the center, crops the reference object mask matrices according to the preset view window to obtain the cropped reference object mask matrices. The action determining module then adds position information to the cropped reference object mask matrices through the xy coordinate maps and determines, based on the pre-trained second convolutional neural network, the effect of each reference object represented by each cropped reference object mask matrix on the object to be predicted. Finally, the prediction module sums the effects of the reference objects on the object to be predicted with the preset effect of the object to be predicted itself and takes the dot product with the preset behavior to predict the motion of the object to be predicted.
In summary, the dynamic prediction system of the present application uses the object detection unit to divide the current frame image acquired by the acquiring unit into an object to be predicted and reference objects, and uses the prediction unit to predict the motion of the object to be predicted based on the preset behavior and the relationship between the reference object mask matrices and the object mask matrix to be predicted. This improves the generalization capability of dynamic prediction, and because the objects are represented by mask matrices, the prediction process can be interpreted.
In practical applications, in some cases, it is necessary to predict not only the motion of an object to be predicted but also the next frame image. In view of the above, referring to fig. 10, fig. 10 is a schematic structural diagram of another embodiment of the dynamic prediction system of the present application, and as shown in the drawing, the dynamic prediction system includes an obtaining unit 91, an object detecting unit 92, a predicting unit 93 and an extracting unit 94.
The acquiring unit 91 is configured to acquire a current frame image. The acquiring unit 91 is the same as or similar to the acquiring unit 11 in the previous example and will not be described in detail here.
The extraction unit 94 is used to extract a time-invariant background from the current frame image.
The time-invariant background is an image formed by an object that does not change with time in the image. In some embodiments, the extraction unit 94 may obtain the image background by, for example, foreground detection. In other embodiments, the extraction unit 94 may be configured to extract the time-invariant background from the current frame image based on a pre-trained third convolutional neural network. For example, the structure of the third convolutional neural network includes, but is not limited to: full convolution, convolution deconvolution, residual error networks (ResNet), Unet, etc.
In one example, the third convolutional neural network is arranged as a convolution-deconvolution structure. Referring to fig. 6, fig. 6 is a schematic structural diagram of a third convolutional neural network used in the dynamic prediction method of the present application in another embodiment, where the third convolutional neural network is a codec (encoder-decoder) structure. I(t) denotes the current frame image and Ibg(t) denotes the current background image; solid arrows denote convolution plus an activation function, dashed arrows denote reconstruction, single-dot chain arrows denote full connections, and two-dot chain arrows denote deconvolution plus an activation function. In this example, the activation function is ReLU. For all convolutions and deconvolutions, the convolution kernel, step size and number of channels are set to 3, 2 and 64 respectively, and the dimension of the hidden layer between the encoder and the decoder is 128. In addition, for training across a large number of environments, the number of convolution channels may be set to 128 to improve the effect of background separation. The activation function ReLU of the last deconvolution layer may also be replaced with a tanh function to output values in the range of -1 to 1.
The object detection unit 92 is configured to determine an object mask matrix to be predicted including an object to be predicted and reference object mask matrices including reference objects based on the current frame image. The object detection unit 92 is the same as or similar to the object detection unit 12 in the previous example and will not be described in detail here.
The prediction unit 93 is configured to predict the motion of the object to be predicted based on the preset behavior and the relationship between the object mask matrix to be predicted and the reference object mask matrices, and to obtain the next frame image by combining the extracted time-invariant background with the predicted motion of the object to be predicted. The manner in which the prediction unit 93 predicts the motion of the object to be predicted is the same as or similar to that of the prediction unit 13 in the previous example, and will not be described in detail here.
In addition, regarding the embodiment where the prediction unit 93 is configured to obtain the next frame image by combining the extracted time-invariant background and the predicted motion of the object to be predicted, the next frame image is the image I (t +1) at the t +1 th time corresponding to the current frame image I (t). In one embodiment, the prediction unit 93 performs a spatial transform process using a Spatial Transform Network (STN) to obtain a next frame image based on the current frame image, the background image, the object mask matrix, and the predicted motion of the object to be predicted. Specifically, on the one hand, the prediction unit 93 performs spatial transform processing by using the first spatial transform network and performs a negation operation to perform a multiplication operation with the extracted time-invariant background image based on the mask matrix of the object to be predicted and the predicted motion of the object to be predicted, so as to obtain the background image at the t +1 th time, where the multiplication operation refers to performing an array element multiplication algorithm. On the other hand, the prediction unit 93 performs spatial transform processing using the second spatial transform network to obtain an object image at the t +1 th time, based on the object mask matrix to be predicted, the current frame image, and the predicted motion of the object to be predicted. And performing an addition operation on the background image at the t +1 th moment and the object image at the t +1 th moment to obtain an image at the t +1 th moment, namely a next frame image, wherein the addition operation refers to performing an array element addition algorithm. Similarly, in the case where the application scene includes two objects to be predicted, the prediction unit performs dynamic prediction on the two objects to be predicted respectively and displays the results on the next frame image at the same time.
When the prediction unit 93 has obtained the next frame image, that next frame image, i.e., the predicted image, may be supplied to the acquiring unit 91 as a new current frame image, thereby looping the entire process of predicting the movement of the object to be predicted.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a dynamic prediction system according to another embodiment of the present application. As shown in the figure, in one specific implementation, the dynamic prediction system may be an end-to-end deep neural network that contains a plurality of convolutional neural networks, takes the current frame image and the behavior as inputs, and, after passing through the trained networks, outputs the predicted next frame image. With reference to fig. 1, as shown in fig. 11, the deep neural network takes the current frame image I(t) and the behavior as inputs. On the one hand, the extracting unit extracts the background image Ibg(t) from the current frame image I(t) based on the third convolutional neural network. On the other hand, the object detection unit uses the first convolutional neural network to determine, based on the current frame image I(t), the object mask matrices, including the reference object mask matrices shown in the upper part and the object mask matrix to be predicted shown in the lower part. The prediction unit then uses the second convolutional neural network to predict the motion of the object to be predicted based on the behavior and the relationship between the object mask matrices. Next, on the one hand, as indicated by the two-dot chain line in the figure, an STN performs spatial transformation based on the object mask matrix to be predicted and the predicted motion, a negation operation is applied, and the result is multiplied with the extracted time-invariant background image to obtain the background image at time t+1, where the multiplication refers to an array element multiplication algorithm. On the other hand, as indicated by the dotted line in the figure, an STN performs spatial transformation based on the object mask matrix to be predicted multiplied with the current frame image, combined with the predicted motion of the object to be predicted, to obtain the object image at time t+1. Finally, the background image at time t+1 and the object image at time t+1 are added to obtain the image at time t+1, i.e., the next frame image I(t+1), where the addition refers to an array element addition algorithm. In addition, the deep neural network may take the next frame image I(t+1) as a new frame image to predict the image I(t+2) at time t+2, and by looping in this way obtain the whole process of the movement of the object to be predicted. Because the dynamic prediction system adopts an integral end-to-end neural network, training and use of the network operate on it as a whole, reducing manual intervention while achieving good prediction performance.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an embodiment of the apparatus of the present application, and as shown in the drawing, the apparatus includes a storage device 21 and a processing device 22.
The storage means 21 is used to store at least one program. The programs include respective programs that are called by the processing means 22 to perform the steps of acquiring, determining, extracting, predicting and the like described later. The storage means includes, but is not limited to, high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. In certain embodiments, the storage means may also include memory remote from the one or more processors, for example network-attached storage accessed via RF circuitry or external ports and a communication network (not shown), where the communication network may be the Internet, one or more intranets, local area networks (LANs), wide area networks (WANs), storage area networks (SANs), etc., or a suitable combination thereof. A memory controller may control access to the storage means by other components of the device, such as the CPU and peripheral interfaces.
The processing means 22 are connected to the storage means 21. The processing means may comprise one or more processors. The processing device is operatively coupled to volatile memory and/or non-volatile memory in the storage device. The processing means may execute instructions stored in the memory and/or non-volatile storage device to perform operations in the device, such as analyzing the acquired current frame and predicting motion of an object to be predicted. As such, the processor may include one or more general purpose microprocessors, one or more application specific processors (ASICs), one or more Digital Signal Processors (DSPs), one or more field programmable logic arrays (FPGAs), or any combinations thereof. In one example, the processing device is connected to the storage device via a data line. The processing device interacts with the storage device through a data reading and writing technology. Wherein the data reading and writing technology includes but is not limited to: high speed/low speed data interface protocols, database read and write operations, etc.
The processing device 22 is configured to call the at least one program to perform any of the dynamic prediction methods described above. The dynamic prediction method comprises the following steps: first, an image i (t) at time t is acquired based on raw data. Then, based on the convolutional neural network trained in advance, the following operations are performed with the current frame image i (t) and preset behaviors as inputs: 1) extracting a time-invariant background from the image I (t); 2) an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object are determined based on the current frame image i (t). The mask matrix of the object to be predicted is determined based on the object to be predicted, and the mask matrix of the reference object is determined based on the incidence relation between the reference object and the motion of the object to be predicted; 3) and predicting the motion of the object to be predicted based on the relation between the mask matrix of the object to be predicted and the mask matrix of the reference object and the preset behavior. In one example, a reference object mask matrix is firstly cut according to a preset view window size by taking the position of an object to be predicted in the object mask matrix to be predicted as a center to obtain a cut reference object mask matrix, then, position information is added to the obtained cut reference object mask matrix, the action of the reference object represented by the cut reference object mask matrix on the object to be predicted is determined, and then, the motion of the object to be predicted is predicted based on a preset behavior and the determined action; 4) and combining the extracted time-invariant background and the predicted motion of the object to be predicted to obtain an image I (t +1) at the t +1 th moment. That is, the image I (t +1) at the t +1 th time is output after being processed by the convolutional neural network trained in advance. In addition, the above operation is repeated based on the image I (t +1) to obtain the image I (t +2) at the t +2 th moment, so as to loop to predict the whole process of the movement of the object to be predicted.
Referring to fig. 13, fig. 13 is a schematic structural diagram of another embodiment of the apparatus of the present application. As shown, the apparatus further includes a display device 23, which is connected to the processing device 22. In one example, the processing device is connected to the display device through a data line and interacts with the display device through an interface protocol, which includes, but is not limited to, the HDMI interface protocol, serial interface protocols, and the like.
In some embodiments, the display device is configured to display at least one of the object mask matrix to be predicted, the reference object mask matrix, and the predicted motion data of the object to be predicted. Taking the scenario shown in fig. 1 as an example, the object mask matrix to be predicted is the mask matrix representing the agent D, and the reference object mask matrices are the ladder-type reference object mask matrix representing the ladder A, the wall-type reference object mask matrix representing the wall B, and the space-type reference object mask matrix representing the space C. The predicted motion data of the object to be predicted includes, but is not limited to, the motion trajectory of the object to be predicted, its motion direction and magnitude (for example, moving upward by three pixels), and the next frame image.
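As an illustration only, the displayed motion data could be organized as follows; the field names and shapes are assumptions chosen to mirror the description above, not a data structure defined by this application.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PredictedMotion:
    direction: str          # e.g. "up"
    magnitude_px: int       # e.g. 3, i.e. the object moves upward by three pixels
    trajectory: np.ndarray  # (N, 2) predicted (row, col) positions of the object to be predicted
    next_frame: np.ndarray  # predicted next frame image, shape (H, W, 3)

result = PredictedMotion("up", 3, np.zeros((5, 2)), np.zeros((84, 84, 3)))
```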
In some embodiments, the dynamic prediction process can be observed more intuitively by displaying object mask images rather than object mask matrices. In view of this, the processing device is further configured to generate an object mask image to be predicted and a reference object mask image based on the current frame image, the object mask matrix to be predicted, and the reference object mask matrix; the display device is further configured to display the object mask image to be predicted and/or the reference object mask image. Taking the scenario shown in fig. 1 as an example, the object mask image to be predicted is an image including the agent D, generated by the processing device based on the current frame image and the object mask matrix to be predicted. The reference object mask images are a ladder-type reference object mask image including the ladder A, a wall-type reference object mask image including the wall B, and a space-type reference object mask image including the space C, generated by the processing device based on the current frame image and the corresponding reference object mask matrices. The display device may display the object mask image to be predicted, all of the reference object mask images, some of the reference object mask images, or a combination thereof, according to demand.
In addition, the processing device is further configured to generate an object mask image based on the object mask image to be predicted and the reference object mask image; the display device is further configured to display the object mask image. In an embodiment, the processing device superimposes the object mask image to be predicted and the corresponding reference object mask images according to user requirements to generate the object mask image, which is then output through the display device. Taking the scenario shown in fig. 1 as an example, the processing device superimposes the object mask image to be predicted including the agent D, the ladder-type reference object mask image including the ladder A, and the wall-type reference object mask image including the wall B to obtain the object mask image, so that the user can visually observe the position of the object to be predicted relative to the ladder and the wall on the display device. Thus, by displaying the object mask matrices and the object mask images, the display device enables the user to understand the relative positional relationship between the object to be predicted and the reference objects, as well as the motion of the object to be predicted, by viewing the images and the corresponding values, and to interpret the dynamic model both visually and semantically.
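A minimal sketch of how such mask images could be produced for display is given below, assuming mask matrices with values in [0, 1] and an RGB current frame; the helper names mask_image and combined_mask_image and the example masks are illustrative assumptions rather than the processing pipeline disclosed in this application.

```python
import numpy as np

def mask_image(frame, mask):
    """Keep only the pixels of the current frame covered by the given mask matrix."""
    return frame * mask[..., None]                           # (H, W, 3) * (H, W, 1)

def combined_mask_image(frame, masks):
    """Superimpose several mask images (e.g. agent D, ladder A, wall B) into one object mask image."""
    combined = np.clip(np.sum(np.stack(masks), axis=0), 0.0, 1.0)
    return mask_image(frame, combined)

frame = np.random.rand(84, 84, 3)                            # current frame image
agent_mask = np.zeros((84, 84)); agent_mask[40:44, 10:14] = 1.0    # object to be predicted
ladder_mask = np.zeros((84, 84)); ladder_mask[20:60, 30:34] = 1.0  # ladder-type reference object
overlay = combined_mask_image(frame, [agent_mask, ladder_mask])    # displayed object mask image
```

Each mask image keeps only the pixels of the current frame covered by the corresponding mask matrix, and the combined object mask image superimposes the selected mask images so that the position of the object to be predicted relative to the ladder and the wall can be read off the display directly.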
It should be noted that, from the above description of the embodiments, those skilled in the art can clearly understand that part or all of the present application can be implemented by software in combination with a necessary general-purpose hardware platform. Based on such understanding, the present application also provides a computer-readable storage medium storing at least one program which, when executed, implements any of the dynamic prediction methods described above.
With this understanding in mind, the technical solutions of the present application, and/or the portions thereof that contribute to the prior art, may be embodied in the form of a software product. The software product may include one or more machine-readable media having stored thereon machine-executable instructions that, when executed by one or more machines such as a computer, a network of computers, or other electronic devices, cause the one or more machines to perform operations in accordance with embodiments of the present application, for example to perform each step of the dynamic prediction method described above. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other types of media/machine-readable media suitable for storing machine-executable instructions. The storage medium may be located in the apparatus or in a third-party server, such as a server providing an application store. The specific application store is not limited here; examples include the Xiaomi app store, the Huawei app store, and the Apple App Store.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.

Claims (25)

1. A method of dynamic prediction, comprising the steps of:
acquiring a current frame image;
determining an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image;
predicting the motion of the object to be predicted based on the relationship between the object mask matrix to be predicted and the reference object mask matrix and a preset behavior, wherein the predicting the motion of the object to be predicted based on the relationship between the object mask matrix to be predicted and the reference object mask matrix and the preset behavior comprises:
clipping the reference object mask matrix by taking the position of the object to be predicted in the object mask matrix to be predicted as a center according to the size of a preset view window to obtain a clipped reference object mask matrix;
determining the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a pre-trained second convolutional neural network;
predicting the motion of the object to be predicted based on a preset behavior and the determined effect.
2. The dynamic prediction method according to claim 1, wherein the mask matrix for the object to be predicted is determined based on the object to be predicted, and the mask matrix for the reference object is determined based on a correlation between the reference object and the motion of the object to be predicted or a type of the reference object.
3. The dynamic prediction method according to claim 1, wherein the step of determining an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image comprises: and determining an object mask matrix to be predicted comprising an object to be predicted and a reference object mask matrix comprising a reference object based on the current frame image by using a pre-trained first convolutional neural network.
4. The dynamic prediction method of claim 1, further comprising the steps of: adding position information to the obtained clipped reference object mask matrix and determining the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a second convolutional neural network trained in advance.
5. The dynamic prediction method according to claim 1, wherein the effect further comprises a preset effect of the object to be predicted itself.
6. The dynamic prediction method of claim 1, further comprising the steps of:
extracting a time-invariant background from the current frame image;
and combining the extracted time-invariant background and the predicted motion of the object to be predicted to obtain a next frame image.
7. The dynamic prediction method of claim 3, further comprising the steps of:
extracting a time-invariant background from the current frame image based on a pre-trained third convolutional neural network;
and combining the extracted time-invariant background and the predicted motion of the object to be predicted to obtain a next frame image.
8. The dynamic prediction method of claim 7, wherein the third convolutional neural network is configured as a convolutional deconvolution structure.
9. The dynamic prediction method of claim 7, wherein the first convolutional neural network, the second convolutional neural network, and the third convolutional neural network are obtained through unified training according to a loss function.
10. The dynamic prediction method of claim 1, wherein the current frame image is obtained based on raw data or external input data with a priori knowledge.
11. A dynamic prediction system, comprising:
the acquisition unit is used for acquiring a current frame image;
an object detection unit for determining an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image;
a prediction unit, configured to predict a motion of the object to be predicted based on a relationship between the object mask matrix to be predicted and the reference object mask matrix and a preset behavior, wherein the prediction unit includes:
a clipping module, configured to clip the reference object mask matrix by taking the position of the object to be predicted in the object mask matrix to be predicted as a center according to the size of a preset view window, so as to obtain a clipped reference object mask matrix;
an effect determination module, configured to determine the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a pre-trained second convolutional neural network;
a prediction module, configured to predict the motion of the object to be predicted based on a preset behavior and the determined effect.
12. The dynamic prediction system of claim 11, wherein the mask matrix for the object to be predicted is determined based on the object to be predicted, and the mask matrix for the reference object is determined based on the association relationship between the reference object and the motion of the object to be predicted or the type of the reference object.
13. The dynamic prediction system of claim 11, wherein the object detection unit is configured to determine an object mask matrix to be predicted including an object to be predicted and a reference object mask matrix including a reference object based on the current frame image using a pre-trained first convolutional neural network.
14. The dynamic prediction system of claim 11, wherein the effect determination module is configured to add position information to the obtained clipped reference object mask matrix and determine the effect of the reference object represented by the clipped reference object mask matrix on the object to be predicted based on a pre-trained second convolutional neural network.
15. The dynamic prediction system of claim 11, wherein the effect further comprises a preset effect of the object to be predicted itself.
16. The dynamic prediction system of claim 13, further comprising:
an extraction unit, configured to extract a time-invariant background from the current frame image;
the prediction unit is further configured to obtain a next frame image in combination with the extracted time-invariant background and the predicted motion of the object to be predicted.
17. The dynamic prediction system of claim 16, wherein the extraction unit is configured to extract a time-invariant background from the current frame image based on a pre-trained third convolutional neural network; the prediction unit is further configured to obtain a next frame image in combination with the extracted time-invariant background and the predicted motion of the object to be predicted.
18. The dynamic prediction system of claim 17, wherein the third convolutional neural network is configured as a convolutional deconvolution structure.
19. The dynamic prediction system of claim 17, wherein the first convolutional neural network, the second convolutional neural network, and the third convolutional neural network are obtained through unified training according to a loss function.
20. The dynamic prediction system of claim 11, wherein the current frame image is obtained based on raw data or external input data with a priori knowledge.
21. A computer-readable storage medium storing at least one program, wherein the at least one program when executed implements the dynamic prediction method of any one of claims 1-10.
22. A dynamic prediction apparatus, comprising:
storage means for storing at least one program;
processing means, coupled to the storage means, for invoking the at least one program to perform the dynamic prediction method of any of claims 1-10.
23. The dynamic prediction apparatus of claim 22, further comprising a display device for displaying at least one of the object mask matrix to be predicted, the reference object mask matrix, and the predicted motion data of the object to be predicted.
24. The dynamic prediction device of claim 23, wherein the processing means is further configured to generate an object mask image to be predicted and a reference object mask image based on the current frame image, the object mask matrix to be predicted and the reference object mask matrix; the display device is further configured to display the object mask image to be predicted and/or the reference object mask image.
25. The dynamic prediction device of claim 24, wherein the processing means is further configured to generate an object mask image based on the object mask image to be predicted and the reference object mask image; the display device is further configured to display the object mask image.
CN201810348528.6A 2018-04-18 2018-04-18 Dynamic prediction method, system and applicable equipment Active CN108537820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810348528.6A CN108537820B (en) 2018-04-18 2018-04-18 Dynamic prediction method, system and applicable equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810348528.6A CN108537820B (en) 2018-04-18 2018-04-18 Dynamic prediction method, system and applicable equipment

Publications (2)

Publication Number Publication Date
CN108537820A CN108537820A (en) 2018-09-14
CN108537820B true CN108537820B (en) 2021-02-09

Family

ID=63481487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810348528.6A Active CN108537820B (en) 2018-04-18 2018-04-18 Dynamic prediction method, system and applicable equipment

Country Status (1)

Country Link
CN (1) CN108537820B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046654A (en) * 2019-03-25 2019-07-23 东软集团股份有限公司 A kind of method, apparatus and relevant device of identification classification influence factor
CN111246091B (en) * 2020-01-16 2021-09-03 北京迈格威科技有限公司 Dynamic automatic exposure control method and device and electronic equipment
US11636796B2 (en) * 2020-12-03 2023-04-25 Black Sesame Technologies Inc. Subject segmentation for motion control and textured display
CN113194334B (en) * 2021-04-16 2023-06-06 厦门智瞳科技有限公司 Privacy-protecting image processing method, privacy-protecting image processing device, terminal and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101369224B1 (en) * 2007-03-28 2014-03-05 삼성전자주식회사 Method and apparatus for Video encoding and decoding using motion compensation filtering
JP5485969B2 (en) * 2011-11-07 2014-05-07 株式会社Nttドコモ Moving picture predictive coding apparatus, moving picture predictive coding method, moving picture predictive coding program, moving picture predictive decoding apparatus, moving picture predictive decoding method, and moving picture predictive decoding program
CN103440667B (en) * 2013-07-19 2016-08-10 杭州师范大学 The automaton that under a kind of occlusion state, moving target is stably followed the trail of
CN107346538A (en) * 2016-05-06 2017-11-14 株式会社理光 Method for tracing object and equipment
CN107784282B (en) * 2017-10-24 2020-04-03 北京旷视科技有限公司 Object attribute identification method, device and system

Also Published As

Publication number Publication date
CN108537820A (en) 2018-09-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20181019

Address after: Room 601, No. 6, Qi Min Road, Xianlin Street, Qixia District, Nanjing, Jiangsu, 210046, China

Applicant after: Turing Artificial Intelligence Research Institute (Nanjing) Co., Ltd.

Address before: 100084 Tsinghua Yuan, Beijing, Haidian District

Applicant before: Tsinghua University

GR01 Patent grant