WO2022023806A1 - Method, apparatus, electronic device, medium, and program for detecting scene information - Google Patents

Method, apparatus, electronic device, medium, and program for detecting scene information (场景信息的检测方法、装置、电子设备、介质和程序)

Info

Publication number
WO2022023806A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
feature
scene
end point
target
Prior art date
Application number
PCT/IB2020/059587
Other languages
English (en)
French (fr)
Inventor
张明远
吴金易
金代圣
赵海宇
伊帅
Original Assignee
商汤国际私人有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 商汤国际私人有限公司
Priority to JP2022529946A (published as JP2023504387A)
Priority to KR1020227017414A (published as KR20220075442A)
Publication of WO2022023806A1

Classifications

    • G06V 20/10 — Scenes; Scene-specific elements; Terrestrial scenes
    • G06V 20/50 — Scenes; Scene-specific elements; Context or environment of the image
    • G06F 18/213 — Pattern recognition; Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/24 — Pattern recognition; Classification techniques
    • G06F 18/253 — Pattern recognition; Fusion techniques of extracted features
    • G06N 3/02 — Computing arrangements based on biological models; Neural networks
    • G06V 10/40 — Image or video recognition or understanding; Extraction of image or video features
    • G06V 10/454 — Local feature extraction; Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/457 — Local feature extraction; Connectivity analysis, e.g. edge linking, connected component analysis or slices
    • G06V 10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 — Image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Definitions

  • TECHNICAL FIELD The present application relates to computer vision technology, and relates to, but is not limited to, a method, apparatus, electronic device, computer-readable storage medium, and computer program for detecting scene information.
  • Background Art: With the continuous development of deep learning technology, scene understanding algorithms can acquire the scene information contained in scene images; for example, the scene information may be which objects are included in the scene image, or what relationships exist between the objects in the scene image, i.e., understanding what events are happening in the scene image.
  • the embodiments of the present application provide at least a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program for detecting scene information.
  • An embodiment of the present application provides a method for detecting scene information. The method includes: obtaining an aggregated feature to be propagated according to the node features of each auxiliary node connected to a target node in a scene heterogeneous graph, where the feature dimension of the aggregated feature is Cy*1, Cy is the channel dimension of the aggregated feature, and Cy is the same as the channel dimension of the node feature of the target node; the scene heterogeneous graph includes at least two kinds of heterogeneous nodes, including the auxiliary nodes and the target node obtained based on a scene image; updating the node feature of the target node based on the aggregated feature; and obtaining the scene information in the scene image according to the updated node feature of the target node.
  • In some embodiments, updating the node feature of the target node based on the aggregated feature includes: for each channel of the aggregated feature, performing feature update processing, using the channel feature of that channel, on all feature positions of the corresponding channel in the node feature of the target node.
  • In some embodiments, obtaining the aggregated feature to be propagated according to the node features of each auxiliary node connected to the target node in the scene heterogeneous graph includes: obtaining at least one of a reweighting vector and a residual vector as the aggregated feature according to those node features; updating the node feature of the target node based on the aggregated feature includes at least one of the following: multiplying each channel of the node feature of the target node by the reweighting vector, and adding the residual vector to each channel of the node feature of the target node.
  • In some embodiments, obtaining at least one of a reweighting vector and a residual vector as the aggregated feature includes: mapping the values of the residual vector to a predetermined numerical interval, as the aggregated feature, through an activation function and the standard deviation of the node feature of the target node.
  • In some embodiments, the target node includes an object group node, where the object group includes two objects in the scene image; obtaining the scene information in the scene image according to the updated node feature of the target node includes: obtaining a prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node.
  • In some embodiments, the scene heterogeneous graph includes an information transmission chain with one of the object group nodes as its end point. The information transmission chain includes at least two directed edge groups, and each directed edge group includes multiple directed edges whose multiple starting points point to the same end point; the starting points and end points in the information transmission chain include at least two kinds of the heterogeneous nodes. Obtaining the aggregated feature to be propagated according to the node features of each auxiliary node connected to the target node, and updating the node feature of the target node based on the aggregated feature, include: for a first directed edge group in the at least two directed edge groups, using the same first end point pointed to by the first directed edge group as the target node, obtaining the aggregated feature according to the node features of the starting points connected to the first end point, and updating the node feature of the first end point based on the aggregated feature; the first end point also serves as one of the starting points of a second directed edge group in the at least two directed edge groups; for the second directed edge group, the same second end point pointed to by the second directed edge group is used as the target node, the aggregated feature is obtained according to the node features of the starting points connected to the second end point, and the node feature of the second end point is updated based on the aggregated feature.
  • In some embodiments, the starting point and end point of one of the at least two directed edge groups include one of the following: the starting points include the pixel nodes obtained by extracting features from the scene image, and the end point is an object node extracted from the scene image; or, both the starting point and the end point include object nodes extracted from the scene image; or, the starting point includes an object node extracted from the scene image and the end point includes the object group node; or, the starting point includes the object group node and the end point includes the object node.
  • In some embodiments, the auxiliary nodes include multiple pixel nodes, and the method further includes: performing feature extraction on the scene image to obtain multiple feature maps of different sizes; scaling the multiple feature maps to the same size and then fusing them to obtain a fused feature map; and obtaining the node features of the multiple pixel nodes according to the fused feature map.
  • In some embodiments, obtaining a prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node includes: obtaining a predicted initial classification confidence according to the node feature of the object group node, where the initial classification confidence includes the initial classification confidence of the object group node corresponding to each predetermined relationship category; obtaining the confidence that the two objects in the object group node correspond to a target predetermined relationship category according to the initial classification confidence corresponding to the target predetermined relationship category and the object detection confidences corresponding to the two objects respectively; and, if the confidence is greater than or equal to a preset confidence threshold, confirming that the predicted relationship between the two objects in the object group node is the target predetermined relationship category. A runnable miniature of this flow is sketched below.
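  • To make the claimed flow concrete, the following is a runnable miniature with deliberately simplified stand-ins: aggregation is an attention-free mean projection, and the update is the plain channel-wise addition of the Cy*1 aggregated feature. All names, shapes, and stand-in operations are illustrative assumptions, not the application's implementation.

```python
import torch

C_Y = 8                                            # channel dim of target nodes

def aggregate(aux_feats: list) -> torch.Tensor:
    # Collapse each auxiliary feature (any shape) to C_Y values, then average;
    # a stand-in for the learned linear transforms described in the embodiments.
    proj = [f.flatten()[:C_Y] for f in aux_feats]
    return torch.stack(proj).mean(dim=0)           # (C_Y,) aggregated feature

def channel_update(target: torch.Tensor, agg: torch.Tensor) -> torch.Tensor:
    # Broadcast one value per channel over all of that channel's positions.
    return target + agg.unsqueeze(1)

pixels = [torch.randn(3, 5) for _ in range(4)]     # heterogeneous auxiliary nodes
obj = torch.randn(C_Y, 6)                          # object node, first target
obj = channel_update(obj, aggregate(pixels))
pair = torch.randn(C_Y, 6)                         # object group node, final target
pair = channel_update(pair, aggregate([obj]))      # its feature now carries pixel info
```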
  • An embodiment of the present application provides a method for detecting scene information, where the method is executed by an image processing device. The method includes: acquiring a scene image collected by an image acquisition device; processing the scene image according to the detection method provided by any embodiment of the present application; and outputting the scene information in the scene image.
  • An embodiment of the present application provides an apparatus for detecting scene information. The apparatus includes: a feature processing module, configured to obtain the aggregated feature to be propagated according to the node features of each auxiliary node connected to the target node in the scene heterogeneous graph, where the feature dimension of the aggregated feature is Cy*1, Cy is the channel dimension of the aggregated feature, and Cy is the same as the channel dimension of the node feature of the target node; the scene heterogeneous graph includes at least two kinds of heterogeneous nodes, including the auxiliary nodes and the target node obtained based on the scene image; a feature update module, configured to update the node feature of the target node based on the aggregated feature; and an information determination module, configured to obtain the scene information in the scene image according to the updated node feature of the target node.
  • In some embodiments, when the feature update module is configured to update the node feature of the target node based on the aggregated feature, it includes: according to the channel feature of each channel of the aggregated feature, performing feature update processing with the channel feature on all feature positions of the corresponding channel in the node feature of the target node.
  • In some embodiments, the feature processing module is specifically configured to obtain at least one of a reweighting vector and a residual vector as the aggregated feature according to the node features of each auxiliary node connected to the target node in the scene heterogeneous graph; the feature update module is specifically configured to multiply each channel of the node feature of the target node by the reweighting vector, and/or to add the residual vector to each channel of the node feature of the target node.
  • In some embodiments, when the feature processing module is configured to obtain at least one of a reweighting vector and a residual vector as the aggregated feature, it includes: mapping the values of the residual vector to a predetermined numerical interval, as the aggregated feature, through an activation function and the standard deviation of the node feature of the target node.
  • In some embodiments, the target node includes an object group node, where the object group includes two objects in the scene image; the information determination module is specifically configured to obtain the prediction result of the relationship between the two objects in the object group node based on the updated node feature of the object group node.
  • In some embodiments, the scene heterogeneous graph includes an information transmission chain with one of the object group nodes as its end point; the information transmission chain includes at least two directed edge groups, and each directed edge group includes multiple directed edges whose multiple starting points point to the same end point; the starting points and end points in the information transmission chain include at least two kinds of the heterogeneous nodes.
  • The feature processing module is configured to: for a first directed edge group in the at least two directed edge groups, use the same first end point pointed to by the first directed edge group as the target node, and obtain the aggregated feature according to the node features of the starting points connected to the first end point; the first end point simultaneously serves as one of the starting points of a second directed edge group in the at least two directed edge groups; for the second directed edge group, the same second end point pointed to by the second directed edge group is used as the target node, and the aggregated feature is obtained according to the node features of the starting points connected to the second end point. The feature update module is configured to: update the node feature of the first end point based on the aggregated feature obtained from the node features of the starting points connected to the first end point, and update the node feature of the second end point based on the aggregated feature obtained from the node features of the starting points connected to the second end point.
  • In some embodiments, the starting point and end point of one of the at least two directed edge groups include one of the following: the starting points include the pixel nodes obtained by extracting features from the scene image, and the end point is an object node extracted from the scene image; or, both the starting point and the end point include object nodes extracted from the scene image; or, the starting point includes an object node extracted from the scene image and the end point includes the object group node; or, the starting point includes the object group node and the end point includes the object node.
  • In some embodiments, the auxiliary nodes include multiple pixel nodes, and the feature processing module is further configured to: perform feature extraction on the scene image to obtain multiple feature maps of different sizes; scale the multiple feature maps to the same size and then fuse them to obtain a fused feature map; and obtain the node features of the multiple pixel nodes according to the fused feature map.
  • In some embodiments, when the information determination module is configured to obtain a prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node, it includes: obtaining the predicted initial classification confidence according to the node feature of the object group node, where the initial classification confidence includes the initial classification confidence of the object group node corresponding to each predetermined relationship category; and obtaining, according to the initial classification confidence corresponding to a target predetermined relationship category and the object detection confidences corresponding to the two objects in the object group node respectively, the confidence that the two objects in the object group node correspond to the target predetermined relationship category.
  • An embodiment of the present application provides an apparatus for detecting scene information, and the apparatus is applied to an image processing device.
  • The apparatus includes: an image acquisition module, configured to acquire a scene image collected by an image acquisition device; and an information output module, configured to process the scene image according to the detection method of any embodiment of the present application and output the scene information in the scene image.
  • An embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store computer-readable instructions, and the processor is configured to invoke the computer-readable instructions to implement the detection method of any embodiment of the present application.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the detection method of any embodiment of the present application is implemented.
  • An embodiment of the present application provides a computer program, including computer-readable code; when the computer-readable code is executed in an electronic device, a processor in the electronic device executes the detection method of any embodiment of the present application.
  • The method, apparatus, electronic device, computer-readable storage medium, and computer program for detecting scene information provided by the embodiments of the present application transmit channel-level information between different nodes when node features are updated, so that information can be transmitted between heterogeneous nodes; this makes it possible to integrate various types of information to detect scene information, thereby making scene information detection more accurate.
  • FIG. 1 shows a method for detecting scene information provided by at least one embodiment of the present application;
  • FIG. 2 shows a schematic diagram of a feature update principle provided by at least one embodiment of the present application;
  • FIG. 3 shows another method for detecting scene information provided by at least one embodiment of the present application;
  • FIG. 4 shows a schematic diagram of a scene heterogeneous graph provided by at least one embodiment of the present application;
  • FIG. 5 shows a scene information detection apparatus provided by at least one embodiment of the present application;
  • FIG. 6 shows another scene information detection apparatus provided by at least one embodiment of the present application.
  • The scene information includes, but is not limited to: recognizing the target objects contained in the scene image, detecting what an object in the scene image is doing, detecting the relationships between different objects in the scene image, identifying the information contained in the scene image according to its content, and so on.
  • images of the scene may be captured by an image capture device.
  • The scene may be a place where there is a need for automatic analysis of scene information: for example, a place where urban security risks such as violent fights often occur, where image acquisition equipment such as surveillance cameras may be installed; or, for example, a shopping place such as a supermarket that wants to automatically collect images of customers shopping and analyze which products customers are more interested in, where image collection devices such as surveillance cameras can also be installed.
  • The scene image can be either a single-frame image or a partial image frame of a video stream.
  • the scene image may be transmitted to an image processing device for performing image analysis and processing.
  • The image processing device analyzes the image and finally outputs the scene information in the scene image; for example, the scene information may be that some people in the image are fighting.
  • The target scene content to be identified and detected is usually obtained with the aid of partial information in the scene.
  • This process involves the feature update process of fusing the auxiliary information.
  • a variety of auxiliary information is fused together to predict and identify the target through feature update.
  • An embodiment of the present application provides a method for detecting scene information; the method provides a feature update mechanism, features are updated by this mechanism, and the scene information is detected according to the updated features.
  • By performing image processing such as feature extraction on a scene image to be recognized (for example, a captured image of a tennis court), a plurality of nodes can be obtained, and these nodes can form a graph network. In the embodiments of the present application, this graph network is referred to as a scene heterogeneous graph.
  • The plurality of nodes in the scene heterogeneous graph include at least two types of heterogeneous nodes, where heterogeneous nodes refer to nodes that differ in node feature shape (feature shape) and/or node feature distribution (feature distribution).
  • Which heterogeneous nodes are specifically included in the foregoing scene heterogeneous graph may be determined according to the actual processing target, which is not limited in this embodiment. It should be noted that the scene heterogeneous graph in this embodiment is allowed to include multiple types of heterogeneous nodes so as to integrate richer information for scene understanding, and directed-edge connections can be established between nodes in the graph; the feature of the starting point of a directed edge is merged into the feature of its end point, so as to realize the optimized update of the end point's feature.
  • the nodes in the graph may include different nodes such as object nodes (objects, which may be people or objects), pixel nodes, and the like.
  • the nodes in the graph may also include nodes corresponding to human key points. It is possible to connect edges between key points of the same person or between the same key points of different people, and these key points can be connected to the nodes corresponding to the human detection frame. Through the information transfer between nodes with connected edges, the human body feature can be optimized and updated, so that the human action posture can be better captured according to the updated human body feature.
  • the nodes in the graph may include pixel nodes and object nodes, and a scene at a moment may also be condensed into a moment node corresponding to the moment.
  • the moment node can be connected to a pixel node to optimize the feature representation of each pixel position at each moment, or the moment node can be connected to a specific object node for optimization.
  • If the scene understanding task is also expected to take into account some more holistic environmental factors, such as overall lighting conditions and weather, nodes corresponding to these holistic factors can also be added to the graph.
  • the nodes included in the scene heterogeneous graph can be determined according to the specific scene understanding task, and the present embodiment allows the graph to include multiple heterogeneous nodes.
  • The following describes, with reference to FIG. 1, the processing of scene information detection according to the scene heterogeneous graph, which may include: Step 100: Obtain the aggregated feature to be propagated according to the node features of each auxiliary node connected to the target node in the scene heterogeneous graph.
  • the feature dimension of the aggregated feature is Cy*1, where the Cy is the channel dimension of the aggregated feature, and the Cy is the same as the channel dimension of the node feature of the target node.
  • the scene heterogeneous graph includes at least two kinds of heterogeneous nodes, and the at least two kinds of heterogeneous nodes include: the auxiliary node and the target node obtained by feature extraction on the scene image.
  • Both the target node and the auxiliary nodes may be obtained based on the scene image; for example, target detection may be performed on the scene image, and an object in the image (such as a person or a thing) may be detected, thereby generating a node corresponding to the object, which can be an auxiliary node.
  • two objects in the scene image may also be formed into an object group (eg, a person and a tennis ball), and a node corresponding to the object group may be generated, which may be a target node.
  • auxiliary nodes can also be obtained in other ways, for example, the time information when the scene image is collected, the lighting condition information, etc. These information can also correspond to a node, which can be an auxiliary node.
  • These pieces of information can be encoded and fused into the node features of the corresponding auxiliary nodes. It can be seen that after a scene image is obtained, the above-mentioned target node and auxiliary nodes can be generated based on the scene image, and these nodes further constitute a scene heterogeneous graph.
  • the at least two heterogeneous nodes may include four types of nodes: node A, node B, node C, and node D, and the number of nodes of each type may be multiple.
  • The following node connection relationships may be included in the scene heterogeneous graph: for example, multiple nodes A are connected to one of the nodes B, where node A is the starting point of a directed edge and node B is the end point; then, in this step, the multiple nodes A are the auxiliary nodes and node B is the target node.
  • The aggregated feature to be propagated can be obtained according to the node features of the auxiliary nodes; the feature dimension of the aggregated feature is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node.
  • For example, if the node feature of the target node has 256 channels, the aggregated feature may be a 256-dimensional vector.
  • the node feature of the target node mentioned above may be a kind of information obtained based on at least a part of the image content of the scene image, and the node feature incorporates image information of the object corresponding to the target node in the scene image.
  • Step 102 Based on the aggregation feature, update the node feature of the target node.
  • The aggregated feature is obtained by synthesizing the node features of the auxiliary nodes corresponding to the target node, and it represents the influence of the auxiliary nodes on the node feature update of the target node; this is equivalent to transmitting the information of the image content corresponding to each auxiliary node to the object corresponding to the target node, so that the node feature of the target node fuses the image content corresponding to the auxiliary nodes.
  • the channel dimensions of the aggregation feature and the node feature are the same.
  • the update method is also channel-wise information update. Specifically, according to the channel feature of each channel of the aggregated features, feature update processing is performed on all feature positions corresponding to the channel among the node features of the target node by using the channel feature.
  • Take the case where the aggregated feature is a 256-dimensional vector as an example; please refer to FIG. 2. An aggregated feature {p1, p2, p3, ..., p256} can be calculated, and this aggregated feature is a 256-dimensional vector. The channels of the node feature can then be updated one by one; for example, as shown in FIG. 2, when the first channel of the target node is updated, the first vector element p1 is taken out of the aggregated-feature vector and added to all feature positions in the first channel of the target node.
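  • A minimal sketch of this channel-wise broadcast, assuming a target node feature of shape (C, L), i.e. C channels with L feature positions each (the shapes are illustrative, not from the application):

```python
import torch

def channelwise_update(node_feat: torch.Tensor, agg: torch.Tensor) -> torch.Tensor:
    # node_feat: (C, L) target node feature; agg: (C,) aggregated feature, C*1.
    C, L = node_feat.shape
    assert agg.shape == (C,), "aggregated feature must have one value per channel"
    # Broadcast each channel's value p_c to all L feature positions of channel c.
    return node_feat + agg.unsqueeze(1)            # (C, L) + (C, 1)

f_y = torch.randn(256, 7 * 7)                      # e.g. a 256-channel node feature
p = torch.randn(256)                               # aggregated feature {p1, ..., p256}
f_y_new = channelwise_update(f_y, p)
```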
  • Step 104: Obtain the scene information in the scene image according to the updated node feature of the target node. Steps 100 and 102 above take the update of one target node as an example; in actual implementation, the process of detecting the scene information from the scene image may involve multiple such feature updates.
  • For example, after the features of the nodes B that point to a common node are updated according to the features of multiple nodes A, these nodes B can together update the feature of the node C they point to based on their node features; the update method is the same as in FIG. 2.
  • the updated node feature of the target node may be used to finally obtain scene information in the scene image.
  • The updated target node here may be the target node that is updated last (that is, the end point of the last directed edge, which no longer serves as a starting point pointing to other nodes), or it may be some selected nodes in the scene heterogeneous graph, which is not limited in this embodiment.
  • The way of obtaining the scene information and the specific scene information can be determined according to actual business requirements. For example, if the actual business goal is to predict the relationships between objects in the scene, a multi-layer perceptron can be used to predict the relationship category between objects according to the updated node features.
  • The method for detecting scene information in this embodiment transmits channel-level information between different nodes when updating node features, so that information can be transmitted between heterogeneous nodes and multiple types of information can be integrated to detect scene information, making scene information detection more accurate.
  • FIG. 3 illustrates another method for detecting scene information. Based on the method in FIG. 1, it illustrates a specific form of the channel-level information. As shown in FIG. 3, the method may include the following processing: Step 300: According to the node features of each auxiliary node connected to the target node in the scene heterogeneous graph, obtain at least one of a reweighting vector and a residual vector as the aggregated feature.
  • the aggregation feature obtained according to the node features of the multiple auxiliary nodes may be at least one of a reweighting vector and a residual vector.
  • For example, a channel-wise reweighting vector and a channel-wise residual vector may be calculated.
  • The node features of the auxiliary nodes are first transformed into influence parameters on the node feature of the target node, and then the influence parameters of the different auxiliary nodes are combined. There can be many ways of combining; for example, they can be summed with attention weights, or combined by a multi-layer perceptron.
  • Two calculation methods of the reweighting vector and the residual vector are illustrated as follows, but it should be understood that the specific calculation method is not limited to this:
  • In the first calculation method, the reweighting vector $w_y$ and the residual vector $b_y$ can be calculated according to the following formulas:

$$w_y = \sum_{x \in N(y)} a_{x,y} \, H_w f_x, \qquad b_y = \sum_{x \in N(y)} a_{x,y} \, H_b f_x$$

where $H_w$ and $H_b$ are two linear transformation matrices, which can be used to transform the node feature of an auxiliary node of dimension $C' \times L'$ into a feature of channel dimension $C_y$, and $f_x$ represents the node feature of the auxiliary node; $a_{x,y}$ is the attention weight, which can be calculated by the following formula:

$$a_{x,y} = \operatorname{softmax}_{x \in N(y)} \left( \frac{\langle H_q f_y, \; H_k f_x \rangle}{\sqrt{d_k}} \right)$$

where $H_q$ and $H_k$ are two linear transformation matrices, which can be used to transform the node feature $f_x$ of the auxiliary node and the node feature $f_y$ of the target node into features of the same dimension $d_k$; $d_k$ is a hyperparameter, which can be set according to the situation; and $\langle \cdot , \cdot \rangle$ is the inner product of two vectors.
  • In the second calculation method, the reweighting vector and the residual vector can also be calculated with a multi-layer perceptron instead of the attention weights; the linear transformation matrices there play roles similar to $H_q$ and $H_k$ in the previous calculation method and can be used to turn $f_x$ and $f_y$ into the same dimension $d_k$.
  • Step 302: Based on the aggregated feature, update the node feature of the target node, including at least one of the following: multiplying each channel of the node feature of the target node by the reweighting vector, or adding the residual vector to each channel of the node feature of the target node.
  • The update can be expressed as:

$$f_y' = \operatorname{Conv}_{C_y \times C_y} \big( \operatorname{sigmoid}(w_y) \odot ( f_y + \sigma(f_y) \odot \tanh(b_y) ) \big) \tag{7}$$

  • Here the target node is $y$ and its feature dimension is $C_y \times L_y$, where $C_y$ is the channel dimension and $L_y$ is the feature size of each channel of the target node; the feature of the target node before the update is $f_y$, and the new feature after the update is $f_y'$. It is assumed that there are M directed edges pointing to the target node $y$; the starting points of these M directed edges are M auxiliary nodes, the set composed of these M auxiliary nodes is $N(y)$, and the feature dimension of each auxiliary node is $C' \times L'$.
  • The aggregated features are obtained from the node features of the M auxiliary nodes and then passed to the target node $y$ to obtain the updated new feature $f_y'$.
  • $w_y$ and $b_y$ can be obtained in the two ways exemplified in step 300, and the dimensions of these two vectors are both $C_y \times 1$.
  • the operations represented by the formula include:
  • $\sigma(f_y)$ computes the standard deviation of each channel of $f_y$; it is a vector of length $C_y \times 1$, and each element represents the standard deviation of the $L_y$ position values of $f_y$ on the corresponding channel.
  • Conv is a 1-dimensional convolution operation; the size of the convolution kernel is 1, and the numbers of input channels and output channels are both $C_y$.
  • The residual vector $\sigma(f_y) \odot \tanh(b_y)$ is "broadcast" to all feature positions of each channel of $f_y$, that is, $f_y + \sigma(f_y) \odot \tanh(b_y)$. Then, each channel of $f_y$ is multiplied by the reweighting vector; specifically, in the formula, the values at all feature positions on each channel are multiplied by the reweighting vector transformed by the sigmoid activation function. Finally, the information of each channel is fused through the convolution operation to obtain the updated feature.
  • The above formula takes the calculation of both the reweighting vector and the residual vector as an example for illustration, and there may be various deformation forms in actual implementation. For example, the reweighting vector $w_y$ or the residual vector $b_y$ may be omitted, or the convolution operation Conv may be omitted. For another example, the size of the convolution kernel of the convolution operation may be changed, or the reweighting vector and the residual vector may be convolved first and then propagated to each channel of $f_y$.
  • When the aggregated feature is integrated into the node feature of the target node, in addition to the multiplication and addition operations in the above formula example, other forms can also be used, such as division, subtraction, or multiple nesting (for example, adding first and then multiplying), etc.
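  • The following is a minimal PyTorch sketch of formula (7) together with the attention-based aggregation of step 300. The module name, the use of the per-channel mean of $f_y$ as the attention query, and all layer shapes are assumptions for illustration, not the application's exact implementation:

```python
import torch
import torch.nn as nn

class ChannelMessagePassing(nn.Module):
    # f_y' = Conv(sigmoid(w_y) * (f_y + std(f_y) * tanh(b_y))), formula (7).
    def __init__(self, c_aux: int, l_aux: int, c_y: int, d_k: int = 64):
        super().__init__()
        self.d_k = d_k
        self.H_w = nn.Linear(c_aux * l_aux, c_y)   # aux feature -> reweighting space
        self.H_b = nn.Linear(c_aux * l_aux, c_y)   # aux feature -> residual space
        self.H_q = nn.Linear(c_y, d_k)             # query from the target node
        self.H_k = nn.Linear(c_aux * l_aux, d_k)   # keys from the auxiliary nodes
        self.conv = nn.Conv1d(c_y, c_y, kernel_size=1)  # fuse channel information

    def forward(self, f_y: torch.Tensor, f_x: torch.Tensor) -> torch.Tensor:
        # f_y: (C_y, L_y) target feature; f_x: (M, C'*L') flattened aux features.
        q = self.H_q(f_y.mean(dim=1))                    # (d_k,) assumed query
        k = self.H_k(f_x)                                # (M, d_k)
        a = torch.softmax(k @ q / self.d_k ** 0.5, 0)    # attention over M aux nodes
        w_y = (a.unsqueeze(1) * self.H_w(f_x)).sum(0)    # (C_y,) reweighting vector
        b_y = (a.unsqueeze(1) * self.H_b(f_x)).sum(0)    # (C_y,) residual vector
        std = f_y.std(dim=1, keepdim=True)               # (C_y, 1) per-channel std
        out = torch.sigmoid(w_y).unsqueeze(1) * (f_y + std * torch.tanh(b_y).unsqueeze(1))
        return self.conv(out.unsqueeze(0)).squeeze(0)    # back to (C_y, L_y)

mp = ChannelMessagePassing(c_aux=64, l_aux=49, c_y=256)
f_y_new = mp(torch.randn(256, 49), torch.randn(5, 64 * 49))
```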
  • The scene information detection method of this embodiment has the following effects. First, by transmitting channel-level information between different nodes when updating node features, information can be transmitted between heterogeneous nodes, so that various types of information can be integrated to detect scene information, making scene information detection more accurate. In addition, transmitting only channel-level information reduces the amount of information transmission, enabling fast information transfer between heterogeneous nodes; it also makes it unnecessary to pre-compress the node features of different heterogeneous nodes, so that the original content of the node features is fully retained, and because the original features need not be irreversibly compressed, the method can easily be applied to different frameworks and has wide applicability.
  • The optimization effect on the target node is therefore better, and scene information detection based on the final target node feature is more accurate.
  • The value range of the residual vector is also constrained by the standard deviation of the target node's feature, so that the updated feature will not deviate greatly from the feature distribution of the feature before the update, thereby alleviating the influence of the differing feature distributions of heterogeneous nodes on the update of the target node.
  • The information transmission mechanism between heterogeneous nodes realizes information transfer between heterogeneous nodes with different feature dimensions through the transmission of channel-level information, and limits the value range of the residual vector through the standard deviation, which reduces the influence of heterogeneous nodes with different feature distributions on the feature distribution of the target node. The mechanism thus realizes information transfer between heterogeneous nodes, so that the target node features can be optimized with a richer variety of node features, making scene information detection based on the optimized target node features more accurate.
  • the detection method of scene information will be described below by taking object relationship detection in a scene image as an example.
  • In this example, the detected scene information is the relationship between two objects in the scene image, where the two objects are a person and a thing; that is, the task is to identify the relationship between people and objects (Human-Object Interaction Detection, HOI detection for short), for example, a person playing ball.
  • FIG. 4 illustrates a scene heterogeneous graph constructed from scene images during HOI detection.
  • The scene heterogeneous graph illustrated here includes three types of nodes as an example: pixel nodes, object nodes, and object group nodes; in other optional embodiments, the heterogeneous graph may also include other types of nodes.
  • Pixel nodes Vpix: one specific implementation method may be to use an FPN to perform feature extraction on the scene image to obtain multiple feature maps, where the multiple feature maps respectively have different sizes; then, the multiple feature maps are scaled to the same size, and a convolution layer is used for fusion to obtain a fused feature map; finally, according to the fused feature map, the node features of multiple pixel nodes are obtained.
  • For example, the feature dimension of the fused feature map is 256 * H * W, where 256 is the channel dimension, and H and W represent the height and width of the feature map, respectively.
  • the scene heterogeneous graph may contain H*W nodes used to represent pixels, that is, pixel nodes, and the dimension of each pixel node is 256.
  • The fused feature map can contain not only many low-semantic and local features (from the high-resolution maps), but also much high-semantic information and many global features (from the low-resolution maps), so that the pixel nodes can integrate richer image content, which helps to improve the detection accuracy of the subsequent scene information.
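  • A minimal sketch of this multi-scale fusion; the level count, channel widths, and 7x7 target size are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_pyramid(feature_maps, out_hw=(7, 7)) -> torch.Tensor:
    # Resize every pyramid level to one size, concatenate, fuse with a 1x1 conv.
    resized = [F.interpolate(f, size=out_hw, mode="bilinear", align_corners=False)
               for f in feature_maps]                       # each: (1, C_i, 7, 7)
    stacked = torch.cat(resized, dim=1)                     # (1, sum(C_i), 7, 7)
    fuse = nn.Conv2d(stacked.shape[1], 256, kernel_size=1)  # fusion conv layer
    fused = fuse(stacked)                                   # (1, 256, 7, 7)
    # One 256-dim node feature per spatial position -> H*W pixel nodes.
    return fused.flatten(2).squeeze(0).t()                  # (H*W, 256)

levels = [torch.randn(1, 256, s, s) for s in (56, 28, 14, 7)]  # mock FPN outputs
pixel_nodes = fuse_pyramid(levels)                             # (49, 256)
```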
  • Object nodes Vinst: for example, Faster R-CNN can be used to process the scene image, detecting the categories and positions of all objects in the scene image, and the RoI Align algorithm can be used to extract the features of each object. Assuming that the detection algorithm detects N objects in the scene, there will be N object nodes in the scene heterogeneous graph representing different objects, and the feature dimension of each object node is 256 * 7 * 7. The object may be, for example, a person, a ball, a horse, and the like. Alternatively, in other examples, features can be extracted from the content in the object detection frame through a deep convolutional neural network such as ResNet50.
  • Object group nodes Vpair: assuming that there are N objects in the scene image, N*(N-1) object group nodes can be formed. For two object nodes O1 and O2, "O1-O2" is one object group node, whose subject is O1 and whose object is O2; "O2-O1" is another object group node, whose subject is O2 and whose object is O1. The feature of each object group node is determined by the features of three regions.
  • For example, the positions of the objects corresponding to the two object nodes included in the object group node are (ax1, ay1, ax2, ay2) and (bx1, by1, bx2, by2), where ax1 is the abscissa of the upper-left corner of the detection frame of the first object, ay1 is the ordinate of the upper-left corner of the detection frame of the first object, ax2 is the abscissa of the lower-right corner of the detection frame of the first object, ay2 is the ordinate of the lower-right corner of the detection frame of the first object, bx1 is the abscissa of the upper-left corner of the detection frame of the second object, by1 is the ordinate of the upper-left corner of the detection frame of the second object, bx2 is the abscissa of the lower-right corner of the detection frame of the second object, and by2 is the ordinate of the lower-right corner of the detection frame of the second object.
  • The RoI Align algorithm is used to extract features for three regions: (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2), and the union region (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2)).
  • The feature dimension obtained for each region after the RoI Align algorithm is 256 * 7 * 7, so three feature maps of 256 * 7 * 7 are obtained.
  • After concatenating the three feature maps along the channel dimension, a feature map with a dimension of 768 * 7 * 7 can be obtained, which is used as the node feature of the object group node.
  • the scene heterogeneous graph will contain these N * ( N - 1 ) object group nodes, and the feature dimension of each object group node is 768 * 7 * 7.
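  • A sketch of building one object group node feature with torchvision's RoI Align; the spatial_scale and the feature-map size are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_align

def pair_node_feature(fmap, box_a, box_b) -> torch.Tensor:
    # fmap: (1, 256, H, W); each box: (4,) as (x1, y1, x2, y2) in image coordinates.
    union = torch.stack([torch.min(box_a[0], box_b[0]), torch.min(box_a[1], box_b[1]),
                         torch.max(box_a[2], box_b[2]), torch.max(box_a[3], box_b[3])])
    # RoIs as (batch_index, x1, y1, x2, y2): subject box, object box, union box.
    rois = torch.cat([torch.zeros(3, 1), torch.stack([box_a, box_b, union])], dim=1)
    feats = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=1 / 16)  # (3, 256, 7, 7)
    return feats.reshape(-1, 7, 7)   # concatenate channels -> (768, 7, 7)

fmap = torch.randn(1, 256, 50, 50)
f_pair = pair_node_feature(fmap, torch.tensor([10., 10., 120., 200.]),
                           torch.tensor([90., 40., 300., 220.]))
```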
  • [Edge construction method 1] Connect all pixel nodes to all object group nodes, obtaining H * W * N * (N - 1) directed edges; connect edges between all object nodes, obtaining N * (N - 1) directed edges; and connect all object nodes to their corresponding object group nodes (that is, the object group nodes whose subject or object is that object), obtaining 2 * N * (N - 1) directed edges.
  • [Edge construction method 2] The node features of the pixel nodes are not directly transmitted to the object group nodes, but are first transmitted to the object nodes and then from the object nodes to the object group nodes. In this way, the object nodes serve as bridges; since their number is relatively small, this can reduce the amount of information transmission and improve transmission efficiency.
  • The edges connected between nodes are directed edges. For example, if one of the pixel nodes Vpix is connected to an object node Vinst, the directed edge points from the pixel node Vpix to the object node Vinst; the starting point is the pixel node Vpix, and the end point is the object node Vinst.
  • the number of pixel nodes, object nodes and object group nodes may be multiple, and correspondingly, the number of the above three types of directed edges may also be multiple.
  • The union of these three kinds of directed edges forms the edge set of the scene heterogeneous graph. In addition, when creating directed edges, one is not limited to the two methods listed above; they can be adjusted.
  • For example, the edges between object nodes can be deleted; or, when there are human keypoint nodes, edges between the human keypoint nodes and the object nodes (human detection frames) can be added. A sketch of method 1 as index pairs follows.
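  • The node-id layout below ([pixel nodes | object nodes | object group nodes]) is an assumed convention for illustration:

```python
def build_edges(num_pix: int, n_obj: int):
    pairs = [(i, j) for i in range(n_obj) for j in range(n_obj) if i != j]
    pix0, obj0, grp0 = 0, num_pix, num_pix + n_obj   # id offsets per node type
    edges = []
    # Pixel -> every object group node: H*W * N*(N-1) directed edges.
    edges += [(pix0 + p, grp0 + g) for p in range(num_pix) for g in range(len(pairs))]
    # Object -> object: N*(N-1) directed edges.
    edges += [(obj0 + i, obj0 + j) for i, j in pairs]
    # Subject and object -> their group node: 2 * N*(N-1) directed edges.
    for g, (i, j) in enumerate(pairs):
        edges += [(obj0 + i, grp0 + g), (obj0 + j, grp0 + g)]
    return edges

e = build_edges(num_pix=49, n_obj=3)   # 49*6 + 6 + 12 = 312 directed edges
```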
  • The object group node can also be connected back to the object nodes to perform multiple rounds of optimization. For example, after the node feature of a certain object group node Vpair is updated, it in turn serves as a starting point to update the connected object nodes, and after the object nodes are updated, the object group node Vpair is updated again.
  • the final node feature to be acquired is the feature of the object group node, so as to obtain the object relationship prediction result according to the node feature of the object group node. Therefore, there is an information transmission chain with the object group node as the final end point in the scene heterogeneous graph.
  • FIG. 4 is only a schematic illustration; the number of nodes in an actual implementation will be larger.
  • the information transmission chain includes three directed edge groups:
  • In the first directed edge group, the object node 42 is taken as the target node, the pixel nodes 43, 44, and 45 are taken as the auxiliary nodes, and the node feature of the object node 42 is updated according to the node features of these auxiliary nodes.
  • The update method can be based on the aforementioned formulas; for example, a reweighting vector and a residual vector are obtained, the channel dimension of these vectors is the same as the channel dimension of the object node 42, and a channel-level update is performed on the object node 42. In the subsequent directed edge groups, the updated nodes serve in turn as starting points; the update method is likewise based on the aforementioned formulas and will not be described again.
  • The node features of the end points in the directed edge groups can be updated one by one in sequence, each directed edge group converging from its starting points to its end point, until the node feature of the object group node is finally updated.
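  • A sketch of propagating along such a chain in order, pixels to objects to object groups; here `update` stands for the channel-level update described above, and the dictionary-based graph layout is an assumption:

```python
def run_chain(pixel_feats, object_feats, pair_feats, pix_to_obj, obj_to_pair, update):
    # First directed edge group: every object aggregates its connected pixels.
    for o, pix_ids in pix_to_obj.items():
        object_feats[o] = update(object_feats[o], [pixel_feats[p] for p in pix_ids])
    # Later groups: each object group node aggregates its (already updated) objects.
    for g, obj_ids in obj_to_pair.items():
        pair_feats[g] = update(pair_feats[g], [object_feats[o] for o in obj_ids])
    return pair_feats
```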
  • After the node feature of the object group node is updated, a prediction result of the relationship between the two objects in the object group node, that is, the relationship prediction of the HOI, can be obtained.
  • For example, the initial classification confidence can be obtained according to the following formula:

$$s_y = \operatorname{MLP}(f_y')$$

where MLP is a multi-layer perceptron, and $s_y$ is the vector of initial classification confidences obtained according to the node feature $f_y'$ of the updated object group node. The initial classification confidence includes the confidence of the object group node corresponding to each predetermined relationship category; the dimension of the vector is $C_{class} + 1$, where $C_{class}$ is the number of predetermined relationship categories and the extra 1 corresponds to "no action".
  • For example, one of the two objects corresponding to the object group node is a person and the other is a tennis ball; the relationship between the two is "hit", that is, the person plays tennis.
  • $s_y$ includes the confidence of each relationship category; $s_y^{(c)}$ is the confidence value in the vector $s_y$ corresponding to the predetermined relationship category $c$; and $s_a$ and $s_b$ are the object detection confidences corresponding to the two objects in the object group node, respectively; for example, $s_a$ is the detection confidence of the human frame and $s_b$ is the detection confidence of the object frame. The confidence that the two objects correspond to the relationship category $c$ can then be obtained from these values, for example as the product $s = s_y^{(c)} \cdot s_a \cdot s_b$.
  • An object detector can be used to detect objects in a scene image, such as a human body or a thing, obtaining a corresponding human frame or object frame; the object detector also outputs detection scores, which can be called the object detection confidence. Since the detection frame is not perfect and there may be false or inaccurate detections, the detection frame also has a confidence level, that is, the above-mentioned object detection confidence. In actual implementation, a threshold for the object relationship prediction result can be set; for a certain object group node, if the final prediction result reaches this threshold, it is confirmed that the corresponding relationship exists between the two objects of the object group node.
  • Taking a scene image as an example, all pairs in the scene image can be traversed; for example, all people and objects are paired to generate object group nodes. The confidence of each object group node corresponding to each predetermined relationship category is obtained according to the above method, and the object group nodes whose confidence is higher than the threshold are confirmed as the HOI relationships identified in the scene image.
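  • A sketch of this final scoring step; the multiplicative combination mirrors the example formula above and is an assumption where the text leaves the combination unspecified:

```python
import torch

def score_pairs(s_y: torch.Tensor, s_a: torch.Tensor, s_b: torch.Tensor,
                threshold: float = 0.5):
    # s_y: (P, C_class + 1) initial confidences, last column = "no action";
    # s_a, s_b: (P,) detection confidences of subject and object for each pair.
    s = s_y[:, :-1] * (s_a * s_b).unsqueeze(1)     # (P, C_class) final confidences
    keep = s >= threshold
    return [(p, c, s[p, c].item())                 # (pair index, category, score)
            for p, c in keep.nonzero().tolist()]

relations = score_pairs(torch.rand(6, 11), torch.rand(6), torch.rand(6))
```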
  • The detection of the HOI relationship in the above-mentioned embodiments may have various applications. For example, in abnormal behavior detection in a smart city, it can be used to better determine whether a violent incident between people is occurring, or whether someone is smashing a store, and so on.
  • FIG. 5 provides an exemplary apparatus for detecting scene information.
  • the apparatus may include: a feature processing module 51 , a feature updating module 52 and an information determining module 53 .
  • The feature processing module 51 is configured to obtain the aggregated feature to be propagated according to the node features of each auxiliary node connected to the target node in the scene heterogeneous graph; the feature dimension of the aggregated feature is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node. The scene heterogeneous graph includes at least two kinds of heterogeneous nodes, including the auxiliary nodes and the target node obtained based on the scene image.
  • the feature updating module 52 is configured to update the node feature of the target node based on the aggregation feature.
  • the information determination module 53 is configured to obtain scene information in the scene image according to the updated node feature of the target node.
  • In some embodiments, when the feature update module 52 is configured to update the node feature of the target node based on the aggregated feature, it includes: performing feature update processing, with the channel feature of each channel of the aggregated feature, on all feature positions of the corresponding channel in the node feature of the target node.
  • In some embodiments, the feature processing module 51 is specifically configured to obtain at least one of a reweighting vector and a residual vector as the aggregated feature based on the node features of each auxiliary node connected to the target node in the scene heterogeneous graph.
  • The feature update module 52 is specifically configured to multiply each channel of the node feature of the target node by the reweighting vector, and/or to add the residual vector to each channel of the node feature of the target node.
  • In some embodiments, when the feature processing module 51 is configured to obtain at least one of a reweighting vector and a residual vector as the aggregated feature, it includes: mapping the values of the residual vector to a predetermined numerical interval, as the aggregated feature, through an activation function and the standard deviation of the node feature of the target node.
  • In some embodiments, the target node includes an object group node, where the object group includes two objects in the scene image; the information determination module 53 is specifically configured to obtain the prediction result of the relationship between the two objects in the object group node based on the updated node feature of the object group node.
  • In some embodiments, the scene heterogeneous graph includes an information transmission chain with one of the object group nodes as its end point; the information transmission chain includes at least two directed edge groups, and each directed edge group includes multiple directed edges whose multiple starting points point to the same end point; the starting points and end points in the information transmission chain include at least two kinds of the heterogeneous nodes.
  • The feature processing module 51 is configured to: for a first directed edge group in the at least two directed edge groups, use the same first end point pointed to by the first directed edge group as the target node, and obtain the aggregated feature according to the node features of the starting points connected to the first end point; the first end point simultaneously serves as one of the starting points of a second directed edge group in the at least two directed edge groups; for the second directed edge group, the same second end point pointed to by the second directed edge group is used as the target node, and the aggregated feature is obtained according to the node features of the starting points connected to the second end point.
  • the feature updating module 52 is configured to: update the node feature of the first endpoint based on the aggregated feature obtained from the node features connecting the respective starting points of the first endpoint; and based on the nodes connecting the respective starting points of the second endpoint The aggregated feature obtained from the feature updates the node feature of the second endpoint.
  • In some embodiments, the starting point and end point of one of the at least two directed edge groups include one of the following: the starting points include the pixel nodes obtained by extracting features from the scene image, and the end point is an object node extracted from the scene image; or, both the starting point and the end point include object nodes extracted from the scene image; or, the starting point includes an object node extracted from the scene image and the end point includes the object group node; or, the starting point includes the object group node and the end point includes the object node.
  • the auxiliary nodes include: multiple pixel nodes; the feature processing module 51 is further configured to: perform feature extraction according to the scene image to obtain multiple feature maps, the multiple feature maps have different sizes respectively; the multiple feature maps are scaled to the same size and then fused to obtain a fused feature map; and according to the fused feature map, node features of a plurality of the pixel nodes are obtained.
  • the information determination module 53 is further configured to: obtain a predicted initial classification confidence according to the node features of the object group node, where the initial classification confidence includes the initial classification confidence of the object group node for each predetermined relationship category; obtain, from the initial classification confidence corresponding to a target predetermined relationship category among the predetermined relationship categories and the object detection confidences respectively corresponding to the two objects in the object group node, the confidence that the two objects in the object group node correspond to the target predetermined relationship category; and, if the confidence is greater than or equal to a preset confidence threshold, confirm that the prediction result of the relationship between the two objects in the object group node is the target predetermined relationship category.
  • FIG. 6 provides another exemplary apparatus for detecting scene information. The apparatus is applied to an image processing device. As shown in FIG. 6, the apparatus includes: an image acquisition module 61 and an information output module 62.
  • the image acquisition module 61 is configured to acquire the scene image collected by the image acquisition device; the information output module 62 is configured to process the scene image according to the detection method of any embodiment of the present application, and output the scene information in the scene image.
  • the embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • one or more embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • An embodiment of the present application further provides a computer-readable storage medium, where a computer program may be stored on the storage medium, and when the program is executed by a processor, the method for detecting scene information described in any embodiment of the present application is implemented.
  • An embodiment of the present application further provides an electronic device, the electronic device including: a memory and a processor, where the memory is configured to store computer-readable instructions, and the processor is configured to invoke the computer instructions to implement the method for detecting scene information described in any embodiment of the present application.
  • Embodiments of the subject matter described in this application may be implemented as one or more computer programs, i.e., as one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus.
  • alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processing and logic flow may also be performed by a dedicated logic circuit such as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), and the apparatus may also be implemented as a dedicated logic circuit.
  • Computers suitable for the execution of a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from read only memory and/or random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer will be operatively coupled to such mass storage devices to receive data from them or transfer data to them, or both.
  • however, a computer need not have such devices.
  • the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices, magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks, where the semiconductor memory devices may be Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and flash memory devices.
  • the processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
  • multitasking and parallel processing may be advantageous.
  • the separation of various system modules and components in the above-described embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product, or packaged into multiple software products.
  • specific embodiments of the subject matter have been described.
  • Other embodiments are within the scope of the appended claims.
  • the actions recited in the claims can be performed in a different order and still achieve desirable results.
  • the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results.
  • multitasking and parallel processing may be advantageous.
  • Embodiments of the present application provide a method, apparatus, electronic device, computer-readable storage medium, and computer program for detecting scene information. The method may include: obtaining, according to the node features of the auxiliary nodes connected to a target node in a scene heterogeneous graph, an aggregated feature to be propagated whose feature dimension is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node; the scene heterogeneous graph includes at least two kinds of heterogeneous nodes: auxiliary nodes and a target node obtained based on the scene image; updating the node feature of the target node based on the aggregated feature; and obtaining the scene information of the scene image according to the updated node feature of the target node. A minimal sketch of this flow follows.
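As a minimal sketch of the overall flow, assuming a PyTorch-style implementation; every helper name here (build_scene_graph, aggregate, update, predict) is an illustrative assumption rather than part of the published application:

```python
def detect_scene_info(scene_image, model):
    # Build the scene heterogeneous graph: auxiliary nodes (e.g. pixel nodes)
    # plus target nodes (e.g. object and object-group nodes).
    graph = model.build_scene_graph(scene_image)
    for target in graph.target_nodes:
        # Aggregate neighbour node features into a Cy*1 vector whose channel
        # dimension Cy matches that of the target node's feature.
        agg = model.aggregate(graph.in_neighbors(target), target)
        # Channel-wise update of the target node feature.
        target.feature = model.update(target.feature, agg)
    # Read scene information off the updated target node features.
    return model.predict(graph.target_nodes)
```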

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a method and apparatus for detecting scene information, and an electronic device. The method may include: obtaining, according to the node features of the auxiliary nodes connected to a target node in a scene heterogeneous graph, an aggregated feature to be propagated whose feature dimension is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node; the scene heterogeneous graph includes at least two kinds of heterogeneous nodes: auxiliary nodes and a target node obtained based on a scene image; updating the node feature of the target node based on the aggregated feature; and obtaining the scene information of the scene image according to the updated node feature of the target node.

Description

Method, Apparatus, Electronic Device, Medium, and Program for Detecting Scene Information

CROSS-REFERENCE TO RELATED APPLICATION: This application is filed on the basis of, and claims priority to, Chinese patent application No. 202010739363.2 filed on July 28, 2020, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD: This application relates to computer vision technology, and relates to, but is not limited to, a method, apparatus, electronic device, computer-readable storage medium, and computer program for detecting scene information.

BACKGROUND: With the continuous development of deep learning technology, scene understanding algorithms can obtain the scene information contained in a scene image; for example, the scene information may be which objects the scene image contains, or what relationships exist between the objects in the scene image, i.e., an understanding of what event is happening in the scene image. Because the information contained in a scene image is complex and diverse, and for reasons such as computational cost, existing scene understanding algorithms can often only use one type of information in the scene image to assist scene understanding, so the detection accuracy of the resulting scene information needs to be improved.

SUMMARY: In view of this, embodiments of the present application provide at least a method, apparatus, electronic device, computer-readable storage medium, and computer program for detecting scene information.

An embodiment of the present application provides a method for detecting scene information, including: obtaining, according to the node features of the auxiliary nodes connected to a target node in a scene heterogeneous graph, an aggregated feature to be propagated, where the feature dimension of the aggregated feature is Cy*1, Cy is the channel dimension of the aggregated feature, and Cy is the same as the channel dimension of the node feature of the target node; the scene heterogeneous graph includes at least two kinds of heterogeneous nodes, the at least two kinds of heterogeneous nodes including the auxiliary nodes and the target node obtained based on the scene image; updating the node feature of the target node based on the aggregated feature; and obtaining the scene information in the scene image according to the updated node feature of the target node.

In some embodiments, updating the node feature of the target node based on the aggregated feature includes: performing, according to the channel feature of each channel of the aggregated feature, feature update processing using that channel feature on all feature positions of the corresponding channel of the node feature of the target node.

In some embodiments, obtaining the aggregated feature to be propagated according to the node features of the auxiliary nodes connected to the target node in the scene heterogeneous graph includes: obtaining at least one of a re-weighting vector and a residual vector as the aggregated feature according to the node features of the auxiliary nodes connected to the target node; and updating the node feature of the target node based on the aggregated feature includes: multiplying each channel of the node feature of the target node by the re-weighting vector, and/or adding the residual vector to each channel of the node feature of the target node.

In some embodiments, obtaining at least one of a re-weighting vector and a residual vector as the aggregated feature includes: mapping the value of the residual vector to a predetermined numerical interval, via an activation function and the standard deviation of the node feature of the target node, as the aggregated feature.

In some embodiments, the target node includes an object group node, where the object group includes two objects in the scene image; and obtaining the scene information in the scene image according to the updated node feature of the target node includes: obtaining a prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node.

In some embodiments, the scene heterogeneous graph includes an information transmission chain that ends at one of the object group nodes; the information transmission chain includes at least two directed edge groups, each directed edge group including a plurality of directed edges pointing from a plurality of start points to the same end point; and the start points and end points in the information transmission chain include at least two kinds of the heterogeneous nodes. Obtaining the aggregated feature to be propagated according to the node features of the auxiliary nodes connected to the target node, and updating the node feature of the target node based on the aggregated feature, includes: for a first directed edge group in the at least two directed edge groups, taking the same first end point pointed to by the first directed edge group as the target node, obtaining an aggregated feature according to the node features of the start points connected to the first end point, and updating the node feature of the first end point based on the aggregated feature, where the first end point also serves as one of the start points of a second directed edge group in the at least two directed edge groups; and, for the second directed edge group, taking the same second end point pointed to by the second directed edge group as the target node, obtaining an aggregated feature according to the node features of the start points connected to the second end point, and updating the node feature of the second end point based on that aggregated feature.

In some embodiments, the start points and end points of one of the at least two directed edge groups include one of the following: the start points include pixel nodes obtained by feature extraction from the scene image, and the end point is an object node extracted from the scene image; or, the start points and end points both include object nodes extracted from the scene image; or, the start points include object nodes extracted from the scene image, and the end point includes the object group node; or, the start points include the object group node, and the end point includes the object node.

In some embodiments, the auxiliary nodes include multiple pixel nodes, and the method further includes: performing feature extraction on the scene image to obtain multiple feature maps of different sizes; scaling the multiple feature maps to the same size and fusing them to obtain a fused feature map; and obtaining the node features of the multiple pixel nodes according to the fused feature map.

In some embodiments, obtaining the prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node includes: obtaining a predicted initial classification confidence according to the node feature of the object group node, the initial classification confidence including the initial classification confidence of the object group node for each predetermined relationship category; obtaining, from the initial classification confidence corresponding to a target predetermined relationship category among the predetermined relationship categories and the object detection confidences respectively corresponding to the two objects in the object group node, the confidence that the two objects in the object group node correspond to the target predetermined relationship category; and, if the confidence is greater than or equal to a preset confidence threshold, confirming that the prediction result of the relationship between the two objects is the target predetermined relationship category.

An embodiment of the present application provides a method for detecting scene information, executed by an image processing device, including: acquiring a scene image collected by an image acquisition device; and processing the scene image according to the detection method provided by any embodiment of the present application, and outputting the scene information in the scene image.

An embodiment of the present application provides an apparatus for detecting scene information, including: a feature processing module configured to obtain, according to the node features of the auxiliary nodes connected to a target node in a scene heterogeneous graph, an aggregated feature to be propagated, where the feature dimension of the aggregated feature is Cy*1, Cy is the channel dimension of the aggregated feature, and Cy is the same as the channel dimension of the node feature of the target node, and where the scene heterogeneous graph includes at least two kinds of heterogeneous nodes: the auxiliary nodes and the target node obtained based on the scene image; a feature updating module configured to update the node feature of the target node based on the aggregated feature; and an information determination module configured to obtain the scene information in the scene image according to the updated node feature of the target node.

In some embodiments, the feature updating module, when configured to update the node feature of the target node based on the aggregated feature, is configured to: perform, according to the channel feature of each channel of the aggregated feature, feature update processing using that channel feature on all feature positions of the corresponding channel of the node feature of the target node.

In some embodiments, the feature processing module is specifically configured to obtain at least one of a re-weighting vector and a residual vector as the aggregated feature according to the node features of the auxiliary nodes connected to the target node in the scene heterogeneous graph; and the feature updating module is specifically configured to multiply each channel of the node feature of the target node by the re-weighting vector, and/or add the residual vector to each channel of the node feature of the target node.

In some embodiments, the feature processing module, when configured to obtain at least one of a re-weighting vector and a residual vector as the aggregated feature, is configured to: map the value of the residual vector to a predetermined numerical interval, via an activation function and the standard deviation of the node feature of the target node, as the aggregated feature.

In some embodiments, the target node includes an object group node, where the object group includes two objects in the scene image; and the information determination module is specifically configured to obtain the prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node.

In some embodiments, the scene heterogeneous graph includes an information transmission chain that ends at one of the object group nodes; the information transmission chain includes at least two directed edge groups, each directed edge group including a plurality of directed edges pointing from a plurality of start points to the same end point; and the start points and end points in the information transmission chain include at least two kinds of the heterogeneous nodes. The feature processing module is configured to: for a first directed edge group in the at least two directed edge groups, take the same first end point pointed to by the first directed edge group as the target node and obtain an aggregated feature according to the node features of the start points connected to the first end point, where the first end point also serves as one of the start points of a second directed edge group in the at least two directed edge groups; and, for the second directed edge group, take the same second end point pointed to by the second directed edge group as the target node and obtain an aggregated feature according to the node features of the start points connected to the second end point. The feature updating module is configured to: update the node feature of the first end point based on the aggregated feature obtained from the node features of the start points connected to the first end point; and update the node feature of the second end point based on the aggregated feature obtained from the node features of the start points connected to the second end point.
In some embodiments, the start points and end points of one of the at least two directed edge groups include one of the following: the start points include pixel nodes obtained by feature extraction from the scene image, and the end point is an object node extracted from the scene image; or, the start points and end points both include object nodes extracted from the scene image; or, the start points include object nodes extracted from the scene image, and the end point includes the object group node; or, the start points include the object group node, and the end point includes the object node.

In some embodiments, the auxiliary nodes include multiple pixel nodes, and the feature processing module is further configured to: perform feature extraction on the scene image to obtain multiple feature maps of different sizes; scale the multiple feature maps to the same size and fuse them to obtain a fused feature map; and obtain the node features of the multiple pixel nodes according to the fused feature map.

In some embodiments, the information determination module, when configured to obtain the prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node, is configured to: obtain a predicted initial classification confidence according to the node feature of the object group node, the initial classification confidence including the initial classification confidence of the object group node for each predetermined relationship category; obtain, from the initial classification confidence corresponding to a target predetermined relationship category among the predetermined relationship categories and the object detection confidences respectively corresponding to the two objects in the object group node, the confidence that the two objects in the object group node correspond to the target predetermined relationship category; and, if the confidence is greater than or equal to a preset confidence threshold, confirm that the prediction result of the relationship between the two objects is the target predetermined relationship category.

An embodiment of the present application provides an apparatus for detecting scene information, applied to an image processing device, including: an image acquisition module configured to acquire a scene image collected by an image acquisition device; and an information output module configured to process the scene image according to the detection method of any embodiment of the present application and output the scene information in the scene image.

An embodiment of the present application provides an electronic device, including a memory and a processor, where the memory is configured to store computer-readable instructions and the processor is configured to invoke the computer instructions to implement the detection method of any embodiment of the present application.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the detection method of any embodiment of the present application.

An embodiment of the present application provides a computer program including computer-readable code, where, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the detection method of any embodiment of the present application.

In the method, apparatus, electronic device, computer-readable storage medium, and computer program for detecting scene information provided by the embodiments of the present application, channel-level information is transmitted between different nodes when node features are updated, so that information can be passed between heterogeneous nodes; multiple types of information can thus be fused for scene information detection, making scene information detection more accurate.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.

BRIEF DESCRIPTION OF THE DRAWINGS: To explain the technical solutions in one or more embodiments of the present application or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in one or more embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort. FIG. 1 shows a method for detecting scene information provided by at least one embodiment of the present application; FIG. 2 is a schematic diagram of the principle of a feature update provided by at least one embodiment of the present application; FIG. 3 shows another method for detecting scene information provided by at least one embodiment of the present application; FIG. 4 is a schematic diagram of a scene heterogeneous graph provided by at least one embodiment of the present application; FIG. 5 shows an apparatus for detecting scene information provided by at least one embodiment of the present application; FIG. 6 shows another apparatus for detecting scene information provided by at least one embodiment of the present application.

DETAILED DESCRIPTION: To help those skilled in the art better understand the technical solutions in one or more embodiments of the present application, the technical solutions are described below clearly and completely with reference to the drawings in one or more embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on one or more embodiments of the present application without creative effort shall fall within the protection scope of the present application.

Computer vision technology can perform image processing on a scene image of a scene to obtain understood information about the content of that scene, which may be called scene information. The scene information includes, but is not limited to: recognizing the target objects contained in the scene image, detecting what the objects in the scene image are doing, detecting the relationships between different objects in the scene image, recognizing the information implied in the image from its content, and so on.

In some embodiments, the scene image may be collected by an image acquisition device. The scene may be any place where there is a need to automatically analyze scene information: for example, places with urban safety risks such as frequent violent fights may be equipped with image acquisition devices such as surveillance cameras; or, if a shopping venue such as a supermarket wants to automatically collect images of customers shopping and analyze which goods interest them most, image acquisition devices such as surveillance cameras may likewise be installed in the supermarket. The scene image may be a single frame, or some of the image frames of a video stream.

After the scene image is collected, it can be transmitted to an image processing device for image analysis; the image processing device can analyze the image collected by the image acquisition device according to the scene information detection method provided later in the embodiments of the present application, and finally output the scene information in the scene image. For example, the scene information may be that some people in the image are fighting. Of course, these are only examples, and actual implementations are not limited to the cases listed above.

When a scene image is processed to obtain scene information, some information in the scene is usually relied on as an aid to obtain the target scene content to be recognized and detected; this involves a feature update process that fuses the auxiliary information, through which multiple kinds of auxiliary information are fused to jointly predict the recognition target.

Embodiments of the present application provide a method for detecting scene information that provides a way of updating features; features are updated in the way provided by the method, and scene information is detected from the updated features.

First, by performing image processing such as feature extraction on the scene image to be recognized (for example, a collected image of a tennis court), multiple nodes can be obtained; these nodes can form a graph network, which this embodiment calls a scene heterogeneous graph. The multiple nodes in the scene heterogeneous graph include at least two types of heterogeneous nodes, where heterogeneous nodes are nodes that differ in aspects such as node feature shapes and node feature distributions.

Which heterogeneous nodes the scene heterogeneous graph specifically includes can be determined according to the actual processing goal, which this embodiment does not restrict. Note that the scene heterogeneous graph in this embodiment is allowed to include multiple types of heterogeneous nodes, so as to fuse richer information for scene understanding; directed edges can be established between the nodes in the graph, and the feature of the start point of a directed edge is fused into the feature of the end point of that edge, to optimize and update the end point's feature.

For example, if the scene information to be obtained is the relationship between the humans and objects in the image, the nodes in the graph may include different nodes such as object nodes (an object may be a human or a thing) and pixel nodes. As another example, in a different scene understanding task, besides human nodes and pixel nodes, the graph may also include nodes corresponding to human keypoints. Edges may connect the keypoints of the same person, or the same keypoint of different people, and these keypoints may be connected to the node corresponding to a human detection box. Information passing between connected nodes can optimize and update the human feature, so that a person's action and posture are better captured from the updated human feature. As yet another example, in a further scene understanding task, the graph nodes may include pixel nodes and object nodes, and the scene at one moment of time may be condensed into a moment node corresponding to that moment. The moment node can be connected to the pixel nodes to optimize the feature representation of each pixel position at each moment, or connected to a specific object node for optimization. Furthermore, if the scene understanding task is also expected to take more holistic environmental factors into account, such as overall lighting conditions, weather, and similar factors and features, nodes corresponding to these holistic factors can also be added to the graph.

In short, the nodes included in the scene heterogeneous graph can be determined according to the specific scene understanding task, and this embodiment allows the graph to include multiple kinds of heterogeneous nodes.
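Purely as an illustration of the data structure (the class and field names below are assumptions for this sketch, not part of the application), a scene heterogeneous graph with typed nodes of differing feature shapes and directed edges might be held as follows:

```python
from dataclasses import dataclass, field
import torch

@dataclass
class HeteroNode:
    kind: str              # e.g. "pixel", "object", "object_group", "moment"
    feature: torch.Tensor  # shape may differ per kind, e.g. (256,) vs (768, 7, 7)

@dataclass
class SceneHeteroGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # directed (start_idx, end_idx)

    def add_edge(self, start: int, end: int):
        # Information flows from the start node into the end node's feature.
        self.edges.append((start, end))

    def in_neighbors(self, end: int):
        # Start points of all directed edges that terminate at `end`.
        return [s for (s, e) in self.edges if e == end]
```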
FIG. 1 below describes the processing of scene information detection based on the scene heterogeneous graph, which may include the following steps.

Step 100: obtain, according to the node features of the auxiliary nodes connected to a target node in the scene heterogeneous graph, an aggregated feature to be propagated. Here, the feature dimension of the aggregated feature is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node. The scene heterogeneous graph includes at least two kinds of heterogeneous nodes, including the auxiliary nodes and the target node obtained by feature extraction from the scene image. Both the target node and the auxiliary nodes can be obtained based on the scene image: for example, target detection may be performed on the scene image, and when an object in the image (e.g., a person or a thing) is detected, a node corresponding to that object is generated, which may be an auxiliary node. As another example, two objects in the scene image may be combined into an object group (e.g., a person and a tennis ball), and a node corresponding to that object group is generated, which may be a target node. Some auxiliary nodes may also be obtained in other ways; for example, the time of image capture or the lighting conditions may each correspond to a node, which may be an auxiliary node, and this information can later be encoded and fused into the node feature of that auxiliary node. Thus, once a scene image is obtained, the above target node and auxiliary nodes can be generated based on it, and these nodes in turn form the scene heterogeneous graph.

For example, the at least two kinds of heterogeneous nodes may include four types of nodes, node A, node B, node C, and node D, and there may be several nodes of each type. The scene heterogeneous graph may include node connections such as the following: multiple nodes A connected to one node B, with the nodes A as the start points of directed edges and the node B as the end point; the target node and auxiliary nodes of this step may then be, respectively, the node B and the multiple nodes A.

In this step, the aggregated feature to be propagated can be obtained according to the node features of the auxiliary nodes, and the feature dimension of the aggregated feature is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node. For example, if the node feature of the target node has 256 channels, the aggregated feature may be a 256-dimensional vector. The node feature of the target node mentioned above may be information obtained from at least part of the image content of the scene image; image information of the object corresponding to the target node in the scene image is fused into this node feature. It is precisely because image information is fused into the node feature that scene information can be predicted from it, yielding the scene information implied in the scene image.

Step 102: update the node feature of the target node based on the aggregated feature. The aggregated feature is obtained by combining the node features of the auxiliary nodes corresponding to the target node; it expresses the influence of the auxiliary nodes on the update of the target node's node feature, which amounts to transmitting the information of the image content corresponding to the auxiliary nodes to the object corresponding to the target node, so that the image content corresponding to the auxiliary nodes is fused into the target node's node feature. In this step, the aggregated feature and the node feature have the same channel dimension, and the update is likewise a channel-wise information update. Specifically, according to the channel feature of each channel of the aggregated feature, feature update processing is performed, using that channel feature, on all feature positions of the corresponding channel of the target node's node feature.

For example, still taking a target node whose node feature has 256 channels and an aggregated feature that is a 256-dimensional vector: as shown in FIG. 2, an aggregated feature {p1, p2, p3, ..., p256}, a 256-dimensional vector, can be computed from the node features of several auxiliary nodes A. Each channel of the node feature of target node B has 7*7 = 49 feature positions, and the node feature can be updated channel by channel. For instance, as shown in FIG. 2, to update the first channel of the target node, the first vector element p1 is taken from the aggregated feature vector and added to all feature positions of the first channel ("add" is used as the example here; in some embodiments, other operations such as "multiply" are possible), realizing the feature update of all feature positions of the first channel; FIG. 2 shows the +p1 operation at some of the feature positions. Similarly, to update the second channel of the target node, the second vector element of the aggregated feature vector is added to all feature positions of the second channel.

Step 104: obtain the scene information in the scene image according to the updated node feature of the target node. Steps 100 and 102 above take one update of a target node as an example; in practice, the process of detecting scene information from a scene image may involve many such feature updates. For instance, after the feature of a node B jointly pointed to by several nodes A is updated from their features, that node B may, together with other nodes B, update the feature of a jointly pointed-to node C based on the node features of these nodes B, in the same way as in FIG. 2. After at least one feature update of this embodiment, the updated node feature of the target node can be used to finally obtain the scene information in the scene image. In the case of multiple feature updates, the updated target node here may be the finally updated target node (i.e., the last end point of the directed edges, which no longer serves as a start point pointing to other nodes), or some selected nodes of the scene heterogeneous graph; this embodiment does not restrict it. In addition, the way of obtaining the scene information, and the specific scene information, can be determined by actual business needs: for example, if the actual business goal is to predict the relationships between objects in the scene, a multilayer perceptron can predict the relationship categories between objects from the updated node features.

In the scene information detection method of this embodiment, channel-level information is transmitted between different nodes when node features are updated, so information can be passed between heterogeneous nodes; multiple types of information can thus be fused for scene information detection, making the detection more accurate.
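Assuming a channels-first tensor layout as in PyTorch, the broadcast of FIG. 2, applying the i-th element of the aggregated vector to all 7*7 feature positions of the i-th channel, reduces to a one-line operation (the "add" variant is shown, as in the figure; "multiply" would use * instead):

```python
import torch

f_y = torch.randn(256, 7, 7)  # target node feature: 256 channels, 7*7 positions each
p = torch.randn(256)          # aggregated feature {p1, ..., p256}, one value per channel

# Reshaping to (256, 1, 1) broadcasts p_i over all 49 positions of channel i.
f_y_updated = f_y + p.view(256, 1, 1)
```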
FIG. 3 illustrates another method for detecting scene information, which, on the basis of the method of FIG. 1, exemplifies a concrete form of the channel information. As shown in FIG. 3, the method may include the following processing.

Step 300: obtain, according to the node features of the auxiliary nodes connected to the target node in the scene heterogeneous graph, at least one of a re-weighting vector and a residual vector as the aggregated feature.

In this step, the aggregated feature obtained from the node features of the auxiliary nodes may be at least one of a re-weighting vector and a residual vector: there may be only a re-weighting vector, only a residual vector, or both vectors may be computed. Let $w_y$ denote the channel-wise re-weighting vector and $b_y$ the channel-wise residual vector. When computing these two vectors, a function can first yield the influence parameters of each auxiliary node's node feature on the target node's node feature, and the influence parameters of the different auxiliary nodes are then merged; the merging can also take several forms, for example a weighted sum, or a multilayer perceptron. Two ways of computing the re-weighting vector and the residual vector are exemplified below, though it should be understood that the computation is not limited to these. In some embodiments, the re-weighting vector and the residual vector can be computed according to the following formulas:

$$w_y = \sum_{x \in N(y)} a_x \cdot (H_w f_x), \qquad b_y = \sum_{x \in N(y)} a_x \cdot (H_b f_x)$$
where $H_w$ and $H_b$ are two linear transformation matrices, which can be used to turn the node feature of an auxiliary node, of dimension C'*L', into a feature whose channel dimension is Cy, and $f_x$ denotes the node feature of the auxiliary node. $a_x$ is the attention weight, which can be computed by the following formula:
$$a_x = \operatorname{softmax}_{x \in N(y)}\left(\frac{\langle H_1 f_y,\ H_2 f_x\rangle}{\sqrt{d_k}}\right)$$
where $H_1$ and $H_2$ are two linear transformation matrices, which can be used to turn the node feature $f_x$ of the auxiliary node and the node feature $f_y$ of the target node into features of the same dimension $d_k$; here $d_k$ is a hyperparameter that can be set as appropriate, and $\langle \cdot, \cdot \rangle$ denotes the function computing the inner product of two vectors. In some embodiments, the re-weighting vector and the residual vector can also be computed according to the following formula:

$$w_y,\; b_y = \sum_{x \in N(y)} \mathrm{MLP}\big([\,H_x f_x \,;\, H_y f_y\,]\big)$$

where $H_x$ and $H_y$ play a role similar to $H_1$ and $H_2$ in the previous computation scheme and can be used to turn $f_x$ and $f_y$ into features of the same dimension $d_k$; $[\,;\,]$ denotes concatenation, i.e., directly joining two vectors together; and MLP is a multilayer perceptron whose specific parameters can be set flexibly. The two schemes above exemplify how the re-weighting vector $w_y$ and the residual vector $b_y$ are obtained; both vectors have dimension Cy*1.

Step 302: update the node feature of the target node based on the aggregated feature, including at least one of the following: multiplying each channel of the node feature of the target node by the re-weighting vector, or adding the residual vector to each channel of the node feature of the target node.

In this step, there are also several ways to update the node feature of the target node from the aggregated feature. One example update formula is:

$$f_y' = \mathrm{Conv}\Big(\mathrm{sigmoid}(w_y) \odot \big(f_y + \sigma(f_y) \otimes \tanh(b_y)\big)\Big) \tag{7}$$

where the target node is $y$ and its feature dimension is Cy*Ly, with Cy the channel dimension and Ly the feature size of each channel of the target node; the pre-update feature of the target node is $f_y$ and the updated new feature is $f_y'$; and it is assumed that M directed edges point to the target node $y$, the start points of these M directed edges being M auxiliary nodes that form the set $N(y)$, each with feature dimension C'*L'. Through the above formula, the aggregated feature obtained from the node features of the M auxiliary nodes is propagated to the target node $y$ to obtain the updated new feature $f_y'$. First, $w_y$ and $b_y$ can be obtained in the two ways exemplified in step 300, and both vectors have dimension Cy*1. Referring again to the above formula, the operations it represents include:
1) The sigmoid activation function maps $w_y$ to the interval (0, 1); and the activation function tanh, together with the standard deviation $\sigma(f_y)$ of the pre-update node feature $f_y$ of the target node, maps the value of the residual vector $b_y$ to the predetermined numerical interval [-stand, +stand]. Here $\sigma(f_y)$ denotes the per-channel standard deviation of $f_y$: it is a vector of length Cy*1, each entry of which is the standard deviation of the Ly position values of $f_y$ in the corresponding channel. Conv is a 1-D convolution operation with kernel size 1, whose numbers of input and output channels are both Cy.

2) The residual term $\sigma(f_y) \otimes \tanh(b_y)$ is "broadcast" to all feature positions of every channel of $f_y$, i.e., $f_y + \sigma(f_y) \otimes \tanh(b_y)$. Then the values of each channel of $f_y$ are multiplied by the re-weighting vector; concretely, all feature positions of a channel are multiplied by the sigmoid-transformed re-weighting vector. Finally, the convolution operation fuses the information of the channels to obtain the updated feature.

The above formula is explained taking the case where both the re-weighting vector and the residual vector are computed; in actual implementation there are many possible variants. For example, the re-weighting vector $w_y$ may be omitted, or the residual vector $b_y$, or the convolution operation Conv. As further examples, the kernel size of the convolution may be changed, or $w_y$ and $b_y$ may first be convolved and then propagated to the channels of $f_y$. Moreover, when fusing the aggregated feature into the node feature of the target node, operations other than the multiplication and addition of the example formula are possible, such as division, subtraction, or nestings of several operations (for example, addition followed by multiplication).
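A runnable sketch of formula (7) under the assumptions stated above (PyTorch; the attention weights $a_x$ are taken as precomputed; all layer sizes are illustrative rather than prescribed by the application):

```python
import torch
import torch.nn as nn

class ChannelWiseUpdate(nn.Module):
    def __init__(self, c_aux, l_aux, c_y):
        super().__init__()
        self.h_w = nn.Linear(c_aux * l_aux, c_y)  # H_w: auxiliary feature -> Cy
        self.h_b = nn.Linear(c_aux * l_aux, c_y)  # H_b: auxiliary feature -> Cy
        self.conv = nn.Conv1d(c_y, c_y, kernel_size=1)  # 1-D conv, kernel size 1

    def forward(self, f_y, f_xs, attn):
        # f_y: (Cy, Ly) target node feature; f_xs: (M, C'*L') flattened
        # auxiliary node features; attn: (M,) precomputed attention weights a_x.
        w_y = (attn[:, None] * self.h_w(f_xs)).sum(dim=0)  # (Cy,)
        b_y = (attn[:, None] * self.h_b(f_xs)).sum(dim=0)  # (Cy,)
        std = f_y.std(dim=1)  # per-channel standard deviation sigma(f_y), (Cy,)
        # Formula (7): sigmoid(w_y) gates (f_y + sigma(f_y) * tanh(b_y)), then Conv.
        out = torch.sigmoid(w_y)[:, None] * (f_y + (std * torch.tanh(b_y))[:, None])
        return self.conv(out[None]).squeeze(0)  # fuse channels, output (Cy, Ly)
```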
The scene information detection method of this embodiment has the following effects. First, transmitting channel-level information between different nodes when updating node features allows information to be passed between heterogeneous nodes, so that multiple types of information can be fused for scene information detection, making the detection more accurate; transmitting only channel-level information also reduces the amount of information transmitted, enabling fast information transmission between heterogeneous nodes; and it avoids pre-compressing the information of the node features of different heterogeneous nodes, fully preserving the original content of the node features. Since no irreversible compression of the original features is required, the mechanism can easily be applied in different frameworks and has wide applicability. Second, obtaining channel-level re-weighting and residual vectors and propagating them to the target node improves the optimization of the target node, making the final scene information detection based on the target node more accurate. Third, in this embodiment the value range of the residual vector is also constrained by the standard deviation of the target node feature, so that the updated new feature does not drift far from the feature distribution of the pre-update feature, mitigating the effect of the differing feature distributions of heterogeneous nodes on the target node update. As the points above show, the information transmission mechanism between heterogeneous nodes provided by this embodiment realizes information transfer between heterogeneous nodes of different feature dimensions through channel-level transmission, and limits the value range of the residual vector by the standard deviation to reduce the influence of heterogeneous nodes with different feature distributions on the feature distribution of the target node; the mechanism thus realizes information transfer between heterogeneous nodes, enabling richer node features of multiple kinds to optimize the target node feature, which in turn makes scene information detection based on the optimized target node feature more accurate.

The detection of object relationships in a scene image is described below as an example of the method for detecting scene information. In the following embodiment, the detected scene information is the relationship between two objects in the scene image; taking the case where the two objects are a human and an object, this is recognizing the relationship between a human and an object (Human-Object Interaction Detection, abbreviated HOI detection), e.g., a person hitting a ball.

See the example of FIG. 4, which illustrates a scene heterogeneous graph built from a scene image for HOI detection. This embodiment takes a graph containing three kinds of nodes as an example: pixel nodes, object nodes, and object group nodes; in other optional embodiments, the graph may also contain other types of nodes. One way of obtaining the node features of the above three node types is exemplified below, though actual implementation is not limited to it, and the node features may also be obtained in other ways.

Pixel nodes Vpix: one concrete implementation uses an FPN to extract features from the scene image, obtaining multiple feature maps of different sizes; the multiple feature maps are then scaled to the same size and fused through one convolutional layer to obtain a fused feature map; finally, the node features of the pixel nodes are obtained from the fused feature map. For example, the fused feature map may have feature dimension 256*H*W, where 256 is the channel dimension and H and W are the height and width of the feature map. The scene heterogeneous graph can therefore contain H*W nodes representing pixels, i.e., pixel nodes, each of dimension 256. In this approach, by fusing feature maps of different sizes, the fused feature map contains both many low-level semantic and local features (from the high-resolution maps) and many high-level semantic and global features (from the low-resolution maps), so richer image content can be fused into the pixel nodes, helping to improve the accuracy of subsequent scene information detection.

Object nodes Vinst: for example, Faster R-CNN can process the scene image to detect the categories and positions of all objects in the scene image, and the RoI Align algorithm is used to extract the feature of each object. If the detection algorithm detects N objects in the scene, the scene heterogeneous graph will contain N object nodes representing the different objects, each with feature dimension 256*7*7. An object node may be, for example, a person, a ball, or a horse. Alternatively, in other examples, features may be extracted from the content of an object detection box by a deep convolutional neural network such as ResNet50.

Object group nodes Vpair: with N objects in the scene image, N*(N-1) object group nodes can be formed. For two object nodes O1 and O2, "O1-O2" is one object group node whose subject is O1 and whose object is O2, while "O2-O1" is another object group node whose subject is O2 and whose object is O1. The feature of each object group node is determined by the features of three regions. Specifically, let the positions of the objects corresponding to the two object nodes of an object group node be (ax1, ay1, ax2, ay2) and (bx1, by1, bx2, by2), where ax1 is the x-coordinate of the top-left corner of the first object's detection box, ay1 the y-coordinate of that corner, ax2 the x-coordinate of the bottom-right corner of the first object's detection box, ay2 the y-coordinate of that corner, and bx1, by1, bx2, by2 the corresponding top-left and bottom-right coordinates of the second object's detection box. RoI Align features are then extracted for three regions: (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2), and (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2)). Each region yields a feature of dimension 256*7*7 after RoI Align, so three 256*7*7 feature maps are obtained; concatenation yields a feature map of dimension 768*7*7, which serves as the node feature of the object group node. The scene heterogeneous graph will therefore contain these N*(N-1) object group nodes, each with feature dimension 768*7*7.
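A sketch of the pixel-node feature construction, assuming FPN-style feature maps are already in hand; concatenation followed by a 1x1 convolution is one plausible reading of "fused through one convolutional layer", and the sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixel_node_features(fpn_maps, out_size=(7, 7), channels=256):
    # fpn_maps: list of (1, 256, Hi, Wi) feature maps of different sizes.
    resized = [F.interpolate(m, size=out_size, mode="bilinear",
                             align_corners=False) for m in fpn_maps]
    # Fuse through one convolutional layer (here: concatenate, then 1x1 conv;
    # in practice this layer would be trained with the rest of the network).
    fuse = nn.Conv2d(channels * len(fpn_maps), channels, kernel_size=1)
    fused = fuse(torch.cat(resized, dim=1))  # (1, 256, H, W)
    # One pixel node per spatial position, each with a 256-d node feature.
    return fused.flatten(2).squeeze(0).t()   # (H*W, 256)
```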
Having determined the node features of the three kinds of nodes, the directed edges connecting the various nodes still need to be established. Edges between different heterogeneous nodes can also be built in several flexible ways; two schemes are exemplified below.

[Edge scheme 1]: Connect every pixel node to every object group node, giving H*W*N*(N-1) directed edges. Connect all object nodes pairwise, giving N*(N-1) directed edges. Connect every object node to its corresponding object group nodes (i.e., the object group nodes in which it is the subject or the object), giving 2*N*(N-1) directed edges.
[Edge scheme 2]: Connect every pixel node to every object node, giving H*W*N directed edges. Connect all object nodes pairwise, giving N*(N-1) directed edges. Connect every object node to its corresponding object group nodes (i.e., the object group nodes in which it is the subject or the object), giving 2*N*(N-1) directed edges. In this construction, the node features of the pixel nodes are not transmitted directly to the object group nodes; they are first transmitted to the object nodes, which in turn transmit them to the object group nodes. The object nodes serve as a bridge; since there are relatively few of them, this reduces the amount of information transmitted and improves transmission efficiency. As described in both schemes, the edges connecting nodes are directed: for example, if a pixel node Vpix is connected to an object node Vinst, the directed edge points from the pixel node Vpix to the object node Vinst, with the pixel node Vpix as the start point and the object node Vinst as the end point. There may be multiple pixel nodes, object nodes, and object group nodes, and correspondingly multiple directed edges of each of the above three types. The sets of the three types of directed edges can be denoted as:
$$\mathcal{E}_{\mathrm{pix2inst}} = \{\, v_{\mathrm{pix}} \to v_{\mathrm{inst}} \,\}, \qquad \mathcal{E}_{\mathrm{inst2inst}} = \{\, v_{\mathrm{inst}} \to v_{\mathrm{inst}}' \,\}, \qquad \mathcal{E}_{\mathrm{inst2pair}} = \{\, v_{\mathrm{inst}} \to v_{\mathrm{pair}} \,\}$$
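Edge scheme 2 can be sketched as follows (the tuple-based encoding and index conventions are assumptions for illustration; each ordered pair (a, b) stands for the object group node "a-b"):

```python
from itertools import permutations

def build_edges_scheme2(num_pixels, num_objects):
    pix2inst, inst2inst, inst2pair = [], [], []
    for p in range(num_pixels):                # H*W pixel nodes
        for o in range(num_objects):
            pix2inst.append(("pix", p, "inst", o))    # H*W*N edges in total
    for a, b in permutations(range(num_objects), 2):  # ordered pairs
        inst2inst.append(("inst", a, "inst", b))      # N*(N-1) edges in total
        # The ordered pair (a, b) is the object group node "a-b"; both its
        # subject a and its object b connect to it: 2*N*(N-1) edges in total.
        inst2pair.append(("inst", a, "pair", (a, b)))
        inst2pair.append(("inst", b, "pair", (a, b)))
    return pix2inst, inst2inst, inst2pair
```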
Moreover, when establishing the directed edges, the construction is not limited to the two schemes listed above and can be adjusted. For example, the edges between object nodes can be removed; or, when there are nodes for human keypoints, edges from the human keypoint nodes to the object nodes (human detection boxes) can be added. As another example, the object group nodes can be connected back to the object nodes for multiple rounds of optimization: after the node feature of an object group node Vpair is updated, it serves as a start point to further update a connected object node, and after that object node is updated, it in turn updates the object group node Vpair again. However the directed edges are established, when the scene heterogeneous graph updates node features, the node features ultimately to be obtained are those of the object group nodes, so that the prediction result of the object relationship can be derived from the node features of the object group nodes. The scene heterogeneous graph therefore contains information transmission chains whose final end points are object group nodes. As shown in FIG. 4 (which is only a simple illustration; an actual implementation would have more nodes), taking object group node 41 as an example, the information transmission chain includes three directed edge groups:
(First directed edge group): with object node 42 as the target node and pixel nodes 43, 44, and 45 as the auxiliary nodes, the node feature of object node 42 is updated according to the node features of the auxiliary nodes. The update can follow the formulas above: for example, compute the re-weighting vector and the residual vector, whose channel dimension is the same as that of object node 42, and update object node 42 channel-wise.
(Second directed edge group): with object node 46 as the target node and pixel nodes 47 and 48 as the auxiliary nodes, the node feature of object node 46 is updated according to the node features of the auxiliary nodes, again following the formulas above, which are not repeated here.
(Third directed edge group): with object group node 41 as the target node and object nodes 42 and 46 as the auxiliary nodes, the node feature of object group node 41 is updated according to the node features of the auxiliary nodes. As above, in a scene heterogeneous graph containing many heterogeneous nodes, the node features of the end points of the directed edge groups can be updated one by one in order, each directed edge group aggregating from its start points to its end point, until the node feature of the object group node is finally updated.
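The chain of FIG. 4 then amounts to running the same aggregate-and-update step group by group; here `aggregate` and `update` stand for the operations of steps 100 and 102 and are assumed to be given:

```python
def run_transmission_chain(graph, edge_groups, aggregate, update):
    # edge_groups is ordered along the chain, e.g.
    # [pixels -> object 42, pixels -> object 46, objects -> object group 41];
    # every group shares a single end point.
    for group in edge_groups:
        starts = [graph.nodes[s].feature for s in group.start_points]
        end = graph.nodes[group.end_point]
        agg = aggregate(starts, end.feature)    # Cy*1 aggregated feature
        end.feature = update(end.feature, agg)  # channel-wise update
    return graph  # the object group node's feature is now up to date
```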
After the node feature of the object group node is obtained, the prediction result of the relationship between the two objects in the object group node, i.e., the HOI relationship prediction, can be derived from the updated node feature of the object group node. For example, the initial classification confidence can be obtained according to the following formula:

$$s_y = \mathrm{sigmoid}\big(\mathrm{MLP}(f_y)\big), \quad \forall y \in V_{\mathrm{pair}} \tag{11}$$

As above, MLP is a multilayer perceptron, and $s_y$ is the vector of initial classification confidences obtained from the updated node feature $f_y$ of the object group node; the initial classification confidence includes the confidence of the object group node for each predetermined relationship category. The dimension of the vector $s_y$ is $C_{\mathrm{class}}+1$, where $C_{\mathrm{class}}$ is the number of predetermined relationship categories and the 1 is "no action". For example, of the two objects corresponding to an object group node, one may be a person and the other a tennis ball, and the relationship between them is "hit", i.e., the person hits the tennis ball; "hit" is then one predetermined relationship category. There can likewise be other relationships, and $s_y$ includes the confidence of each relationship. Then, based on the initial classification confidence and the object detection confidences, the prediction result of the relationship between the two objects can be obtained; see the following formula:

$$\mathrm{score}_y^{\,c} = s_h \cdot s_o \cdot s_y^{\,c} \tag{12}$$

where $c$ denotes a predetermined relationship category and $y$ denotes an object group node, and
即该对象组节 点在所述预定关系类别 c上的置信度, 相当于对象组节点中的两个对象之间的关系是所 述预定关系类别 c的概率。而 <可以是 ^向量中对应 c这种预定关系类别的置信度数值, A 和 \分别是对象组节点中两个对象分别对应的对象检测置信度, 比如, 是人体框 的检测置信度, 是物体框的检测置信度。 在实际情况中, 可以通过一个对象检测器
In an actual implementation, a threshold can be set for the prediction result of an object relationship: for a given object group node, only if the final prediction reaches this threshold is it confirmed that the two objects of that object group node have the relationship. Taking one scene image as an example, all pairs in the scene image can be traversed; for instance, all humans and objects are paired to generate object group nodes, and for each object group node, the confidence of that object group node for each predetermined relationship category is obtained in the manner above; object group nodes with a confidence above the threshold are confirmed as HOI relationships recognized from the scene image.

The HOI relationship detection of the above embodiments can have many applications. For example, in abnormal-behavior detection in smart cities, the method helps better determine whether violence between people is occurring, or whether someone is vandalizing a shop. As another example, in a supermarket shopping scene, the method can, by processing images collected in the supermarket, automatically analyze what each person buys and which goods attract their attention.

FIG. 5 provides an exemplary apparatus for detecting scene information. As shown in FIG. 5, the apparatus may include: a feature processing module 51, a feature updating module 52, and an information determination module 53. The feature processing module 51 is configured to obtain, according to the node features of the auxiliary nodes connected to a target node in the scene heterogeneous graph, an aggregated feature to be propagated, whose feature dimension is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node; the scene heterogeneous graph includes at least two kinds of heterogeneous nodes, the at least two kinds of heterogeneous nodes including the auxiliary nodes and the target node obtained based on the scene image. The feature updating module 52 is configured to update the node feature of the target node based on the aggregated feature. The information determination module 53 is configured to obtain the scene information in the scene image according to the updated node feature of the target node.

In some embodiments, the feature updating module 52, when configured to update the node feature of the target node based on the aggregated feature, is configured to perform, according to the channel feature of each channel of the aggregated feature, feature update processing using that channel feature on all feature positions of the corresponding channel of the target node's node feature. In some embodiments, the feature processing module 51 is specifically configured to obtain at least one of a re-weighting vector and a residual vector as the aggregated feature according to the node features of the auxiliary nodes connected to the target node in the scene heterogeneous graph, and the feature updating module 52 is specifically configured to multiply each channel of the target node's node feature by the re-weighting vector and/or add the residual vector to each channel of the target node's node feature. In some embodiments, the feature processing module 51, when configured to obtain at least one of a re-weighting vector and a residual vector as the aggregated feature, is configured to map the value of the residual vector to a predetermined numerical interval, via an activation function and the standard deviation of the target node's node feature, as the aggregated feature. In some embodiments, the target node includes an object group node, the object group including two objects in the scene image, and the information determination module 53 is specifically configured to obtain the prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node.

In some embodiments, the scene heterogeneous graph includes an information transmission chain ending at one of the object group nodes; the information transmission chain includes at least two directed edge groups, each of which includes multiple directed edges pointing from multiple start points to the same end point, and the start points and end points in the chain include at least two kinds of heterogeneous nodes. The feature processing module 51 is configured to: for a first directed edge group of the at least two directed edge groups, take the same first end point pointed to by the first directed edge group as the target node and obtain an aggregated feature according to the node features of the start points connected to the first end point, the first end point also serving as one of the start points of a second directed edge group of the at least two directed edge groups; and, for the second directed edge group, take the same second end point pointed to by the second directed edge group as the target node and obtain an aggregated feature according to the node features of the start points connected to the second end point. The feature updating module 52 is configured to: update the node feature of the first end point based on the aggregated feature obtained from the node features of the start points connected to the first end point, and update the node feature of the second end point based on the aggregated feature obtained from the node features of the start points connected to the second end point.

In some embodiments, the start points and end points of one of the at least two directed edge groups include one of the following: the start points include pixel nodes obtained by feature extraction from the scene image, and the end point is an object node extracted from the scene image; or, the start points and end points both include object nodes extracted from the scene image; or, the start points include object nodes extracted from the scene image, and the end point includes the object group node; or, the start points include the object group node, and the end point includes the object node.

In some embodiments, the auxiliary nodes include multiple pixel nodes, and the feature processing module 51 is further configured to: perform feature extraction on the scene image to obtain multiple feature maps of different sizes; scale the multiple feature maps to the same size and fuse them to obtain a fused feature map; and obtain the node features of the multiple pixel nodes according to the fused feature map.

In some embodiments, the information determination module 53, when configured to obtain the prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node, is configured to: obtain a predicted initial classification confidence according to the node feature of the object group node, the initial classification confidence including the initial classification confidence of the object group node for each predetermined relationship category; obtain, from the initial classification confidence corresponding to a target predetermined relationship category among the predetermined relationship categories and the object detection confidences respectively corresponding to the two objects in the object group node, the confidence that the two objects in the object group node correspond to the target predetermined relationship category; and, if the confidence is greater than or equal to a preset confidence threshold, confirm that the prediction result of the relationship between the two objects is the target predetermined relationship category.

FIG. 6 provides another exemplary apparatus for detecting scene information, applied to an image processing device. As shown in FIG. 6, the apparatus includes: an image acquisition module 61 and an information output module 62. The image acquisition module 61 is configured to acquire a scene image collected by an image acquisition device, and the information output module 62 is configured to process the scene image according to the detection method of any embodiment of the present application and output the scene information in the scene image.

Those skilled in the art should understand that one or more embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

An embodiment of the present application further provides a computer-readable storage medium on which a computer program may be stored, where the program, when executed by a processor, implements the method for detecting scene information described in any embodiment of the present application. An embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory is configured to store computer-readable instructions and the processor is configured to invoke the computer instructions to implement the method for detecting scene information described in any embodiment of the present application.

"And/or" in the embodiments of the present application means having at least one of the two: for example, "A1 and/or A2" covers three schemes: A1, A2, and "A1 and A2". The embodiments of this application are described in a progressive manner; identical or similar parts of the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the data processing device embodiment, being substantially similar to the method embodiment, is described more simply, and the relevant parts may refer to the description of the method embodiment.

Specific embodiments of the present application have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.

Embodiments of the subject matter and the functional operations described in this application can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this application and their structural equivalents,
or in combinations of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., as one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random- or serial-access memory device, or a combination of one or more of them.

The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform the corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special-purpose logic circuitry, such as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), and an apparatus can also be implemented as special-purpose logic circuitry.

Computers suitable for the execution of a computer program include, for example, general-purpose and/or special-purpose microprocessors, or any other kind of central processing unit. Generally, a central processing unit receives instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also includes one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or is operatively coupled to such mass storage devices to receive data from them or transfer data to them, or both. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices, magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks; the semiconductor memory devices may be Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

Although this application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as mainly describing the features of particular embodiments of a particular disclosure. Certain features described in this application in the context of multiple embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may function in certain combinations as described above and may even initially be claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or sequentially, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above-described embodiments should not be understood as requiring such separation in all embodiments; it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

The above are only preferred embodiments of one or more embodiments of the present application and are not intended to limit them; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of one or more embodiments of the present application shall fall within the protection scope of one or more embodiments of the present application.

INDUSTRIAL APPLICABILITY: Embodiments of the present application provide a method, apparatus, electronic device, computer-readable storage medium, and computer program for detecting scene information. The method may include: obtaining, according to the node features of the auxiliary nodes connected to a target node in a scene heterogeneous graph, an aggregated feature to be propagated whose feature dimension is Cy*1, where Cy is the channel dimension of the aggregated feature and is the same as the channel dimension of the node feature of the target node; the scene heterogeneous graph includes at least two kinds of heterogeneous nodes: auxiliary nodes and a target node obtained based on the scene image; updating the node feature of the target node based on the aggregated feature; and obtaining the scene information of the scene image according to the updated node feature of the target node.

Claims

Claims

1. A method for detecting scene information, the method comprising: obtaining, according to node features of auxiliary nodes connected to a target node in a scene heterogeneous graph, an aggregated feature to be propagated, wherein a feature dimension of the aggregated feature is Cy*1, Cy is a channel dimension of the aggregated feature, and Cy is the same as a channel dimension of a node feature of the target node; wherein the scene heterogeneous graph comprises at least two kinds of heterogeneous nodes, the at least two kinds of heterogeneous nodes comprising the auxiliary nodes and the target node obtained based on the scene image; updating the node feature of the target node based on the aggregated feature; and obtaining scene information in the scene image according to the updated node feature of the target node.

2. The method according to claim 1, wherein updating the node feature of the target node based on the aggregated feature comprises: performing, according to a channel feature of each channel of the aggregated feature, feature update processing using the channel feature on all feature positions of the node feature of the target node corresponding to that channel.

3. The method according to claim 1, wherein obtaining the aggregated feature to be propagated according to the node features of the auxiliary nodes connected to the target node in the scene heterogeneous graph comprises: obtaining at least one of a re-weighting vector and a residual vector as the aggregated feature according to the node features of the auxiliary nodes connected to the target node in the scene heterogeneous graph; and updating the node feature of the target node based on the aggregated feature comprises: multiplying each channel of the node feature of the target node by the re-weighting vector, and/or adding the residual vector to each channel of the node feature of the target node.

4. The method according to claim 3, wherein obtaining at least one of a re-weighting vector and a residual vector as the aggregated feature comprises: mapping a value of the residual vector to a predetermined numerical interval, via an activation function and a standard deviation of the node feature of the target node, as the aggregated feature.

5. The method according to any one of claims 1 to 4, wherein the target node comprises an object group node, the object group comprising two objects in the scene image; and obtaining the scene information in the scene image according to the updated node feature of the target node comprises: obtaining, according to the updated node feature of the object group node, a prediction result of a relationship between the two objects in the object group node, the scene information comprising the prediction result.

6. The method according to claim 5, wherein the scene heterogeneous graph comprises an information transmission chain ending at one of the object group nodes, the information transmission chain comprising at least two directed edge groups, each directed edge group comprising a plurality of directed edges pointing from a plurality of start points to a same end point, and the start points and end points in the information transmission chain comprising at least two kinds of the heterogeneous nodes; and obtaining the aggregated feature to be propagated according to the node features of the auxiliary nodes connected to the target node and updating the node feature of the target node based on the aggregated feature comprise: for a first directed edge group of the at least two directed edge groups, taking a same first end point pointed to by the first directed edge group as the target node, obtaining an aggregated feature according to the node features of the start points connected to the first end point, and updating the node feature of the first end point based on the aggregated feature, the first end point also serving as one of the start points of a second directed edge group of the at least two directed edge groups; and, for the second directed edge group, taking a same second end point pointed to by the second directed edge group as the target node, obtaining an aggregated feature according to the node features of the start points connected to the second end point, and updating the node feature of the second end point based on the aggregated feature.

7. The method according to claim 6, wherein the start points and end point of one of the at least two directed edge groups comprise one of the following: the start points comprise pixel nodes obtained by feature extraction from the scene image, and the end point is an object node extracted from the scene image; or the start points and the end point both comprise object nodes extracted from the scene image; or the start points comprise object nodes extracted from the scene image, and the end point comprises the object group node; or the start points comprise the object group node, and the end point comprises the object node.

8. The method according to claim 1, wherein the auxiliary nodes comprise a plurality of pixel nodes, and the method further comprises: performing feature extraction on the scene image to obtain a plurality of feature maps of different sizes; scaling the plurality of feature maps to a same size and fusing them to obtain a fused feature map; and obtaining node features of the plurality of pixel nodes according to the fused feature map.

9. The method according to claim 5, wherein obtaining the prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node comprises: obtaining a predicted initial classification confidence according to the node feature of the object group node, the initial classification confidence comprising an initial classification confidence of the object group node for each predetermined relationship category; obtaining, according to the initial classification confidence corresponding to a target predetermined relationship category among the predetermined relationship categories and object detection confidences respectively corresponding to the two objects in the object group node, a confidence that the two objects in the object group node correspond to the target predetermined relationship category; and, if the confidence is greater than or equal to a preset confidence threshold, confirming that the prediction result of the relationship between the two objects in the object group node is the target predetermined relationship category.

10. A method for detecting scene information, executed by an image processing device, the method comprising: acquiring a scene image collected by an image acquisition device; and processing the scene image according to the detection method of any one of claims 1 to 9 and outputting scene information in the scene image.

11. An apparatus for detecting scene information, the apparatus comprising: a feature processing module configured to obtain, according to node features of auxiliary nodes connected to a target node in a scene heterogeneous graph, an aggregated feature to be propagated, wherein a feature dimension of the aggregated feature is Cy*1, Cy is a channel dimension of the aggregated feature, and Cy is the same as a channel dimension of a node feature of the target node, and wherein the scene heterogeneous graph comprises at least two kinds of heterogeneous nodes, the at least two kinds of heterogeneous nodes comprising the auxiliary nodes and the target node obtained based on the scene image; a feature updating module configured to update the node feature of the target node based on the aggregated feature; and an information determination module configured to obtain scene information in the scene image according to the updated node feature of the target node.

12. The apparatus according to claim 11, wherein the feature updating module, when configured to update the node feature of the target node based on the aggregated feature, is configured to: perform, according to a channel feature of each channel of the aggregated feature, feature update processing using the channel feature on all feature positions of the node feature of the target node corresponding to that channel.

13. The apparatus according to claim 11, wherein the feature processing module is specifically configured to obtain at least one of a re-weighting vector and a residual vector as the aggregated feature according to the node features of the auxiliary nodes connected to the target node in the scene heterogeneous graph; and the feature updating module is specifically configured to multiply each channel of the node feature of the target node by the re-weighting vector, and/or add the residual vector to each channel of the node feature of the target node.

14. The apparatus according to claim 13, wherein the feature processing module, when configured to obtain at least one of a re-weighting vector and a residual vector as the aggregated feature, is configured to: map a value of the residual vector to a predetermined numerical interval, via an activation function and a standard deviation of the node feature of the target node, as the aggregated feature.

15. The apparatus according to any one of claims 11 to 14, wherein the target node comprises an object group node, the object group comprising two objects in the scene image; and the information determination module is specifically configured to obtain, according to the updated node feature of the object group node, a prediction result of a relationship between the two objects in the object group node.

16. The apparatus according to claim 15, wherein the scene heterogeneous graph comprises an information transmission chain ending at one of the object group nodes, the information transmission chain comprising at least two directed edge groups, each directed edge group comprising a plurality of directed edges pointing from a plurality of start points to a same end point, and the start points and end points in the information transmission chain comprising at least two kinds of the heterogeneous nodes; the feature processing module is configured to: for a first directed edge group of the at least two directed edge groups, take a same first end point pointed to by the first directed edge group as the target node and obtain an aggregated feature according to the node features of the start points connected to the first end point, the first end point also serving as one of the start points of a second directed edge group of the at least two directed edge groups; and, for the second directed edge group, take a same second end point pointed to by the second directed edge group as the target node and obtain an aggregated feature according to the node features of the start points connected to the second end point; and the feature updating module is configured to: update the node feature of the first end point based on the aggregated feature obtained from the node features of the start points connected to the first end point, and update the node feature of the second end point based on the aggregated feature obtained from the node features of the start points connected to the second end point.

17. The apparatus according to claim 16, wherein the start points and end point of one of the at least two directed edge groups comprise one of the following: the start points comprise pixel nodes obtained by feature extraction from the scene image, and the end point is an object node extracted from the scene image; or the start points and the end point both comprise object nodes extracted from the scene image; or the start points comprise object nodes extracted from the scene image, and the end point comprises the object group node; or the start points comprise the object group node, and the end point comprises the object node.

18. The apparatus according to claim 11, wherein the auxiliary nodes comprise a plurality of pixel nodes, and the feature processing module is further configured to: perform feature extraction on the scene image to obtain a plurality of feature maps of different sizes; scale the plurality of feature maps to a same size and fuse them to obtain a fused feature map; and obtain node features of the plurality of pixel nodes according to the fused feature map.

19. The apparatus according to claim 15, wherein the information determination module, when configured to obtain the prediction result of the relationship between the two objects in the object group node according to the updated node feature of the object group node, is configured to: obtain a predicted initial classification confidence according to the node feature of the object group node, the initial classification confidence comprising an initial classification confidence of the object group node for each predetermined relationship category; obtain, according to the initial classification confidence corresponding to a target predetermined relationship category among the predetermined relationship categories and object detection confidences respectively corresponding to the two objects in the object group node, a confidence that the two objects in the object group node correspond to the target predetermined relationship category; and, if the confidence is greater than or equal to a preset confidence threshold, confirm that the prediction result of the relationship between the two objects in the object group node is the target predetermined relationship category.

20. An apparatus for detecting scene information, applied to an image processing device, the apparatus comprising: an image acquisition module configured to acquire a scene image collected by an image acquisition device; and an information output module configured to process the scene image according to the detection method of any one of claims 1 to 9 and output scene information in the scene image.

21. An electronic device, comprising: a memory and a processor, wherein the memory is configured to store computer-readable instructions, and the processor is configured to invoke the computer instructions to implement the method of any one of claims 1 to 9, or the method of claim 10.

22. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1 to 9, or the method of claim 10.

23. A computer program comprising computer-readable code, wherein, when the computer-readable code runs in an electronic device, a processor in the electronic device executes the code to implement the method of any one of claims 1 to 9, or the method of claim 10.
PCT/IB2020/059587 2020-07-28 2020-10-13 程序场景信息的检测方法、装置、电子设备、介质和程序 WO2022023806A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022529946A JP2023504387A (ja) 2020-07-28 2020-10-13 シーン情報の検出方法及びその装置、電子機器、媒体並びにプログラム
KR1020227017414A KR20220075442A (ko) 2020-07-28 2020-10-13 시나리오 정보의 검출 방법, 장치, 전자 기기, 매체 및 프로그램

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010739363.2A CN111860403B (zh) 2020-07-28 2020-07-28 场景信息的检测方法和装置、电子设备
CN202010739363.2 2020-07-28

Publications (1)

Publication Number Publication Date
WO2022023806A1 true WO2022023806A1 (zh) 2022-02-03

Family

ID=72948254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2020/059587 WO2022023806A1 (zh) 2020-07-28 2020-10-13 程序场景信息的检测方法、装置、电子设备、介质和程序

Country Status (5)

Country Link
JP (1) JP2023504387A (zh)
KR (1) KR20220075442A (zh)
CN (1) CN111860403B (zh)
TW (1) TWI748720B (zh)
WO (1) WO2022023806A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065587B (zh) * 2021-03-23 2022-04-08 杭州电子科技大学 一种基于超关系学习网络的场景图生成方法

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103118439B (zh) * 2013-01-18 2016-03-23 中国科学院上海微系统与信息技术研究所 基于传感网节点通用中间件的数据融合方法
CN105138963A (zh) * 2015-07-31 2015-12-09 小米科技有限责任公司 图片场景判定方法、装置以及服务器
WO2018099473A1 (zh) * 2016-12-02 2018-06-07 北京市商汤科技开发有限公司 场景分析方法和系统、电子设备
CN108733280A (zh) * 2018-03-21 2018-11-02 北京猎户星空科技有限公司 智能设备的焦点跟随方法、装置、智能设备及存储介质
CN109214346B (zh) * 2018-09-18 2022-03-29 中山大学 基于层次信息传递的图片人体动作识别方法
CN110569437B (zh) * 2019-09-05 2022-03-04 腾讯科技(深圳)有限公司 点击概率预测、页面内容推荐方法和装置
CN110991532B (zh) * 2019-12-03 2022-03-04 西安电子科技大学 基于关系视觉注意机制的场景图产生方法
CN110689093B (zh) * 2019-12-10 2020-04-21 北京同方软件有限公司 一种复杂场景下的图像目标精细分类方法
CN111144577B (zh) * 2019-12-26 2022-04-22 北京百度网讯科技有限公司 异构图之中节点表示的生成方法、装置和电子设备
CN111325258B (zh) * 2020-02-14 2023-10-24 腾讯科技(深圳)有限公司 特征信息获取方法、装置、设备及存储介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALIREZA ZAREIAN; SVEBOR KARAMAN; SHIH-FU CHANG: "Bridging Knowledge Graphs to Generate Scene Graphs", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 January 2020 (2020-01-08), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081574497 *
XU, DANFEI ET AL.: "Scene Graph Generation by Iterative Message Passing", PROCEEDINGS OF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) '17, vol. 1, 26 July 2017 (2017-07-26), pages 3097 - 3106, XP033249656, [retrieved on 20210204], DOI: 10.1109/CVPR.2017.330 *

Also Published As

Publication number Publication date
JP2023504387A (ja) 2023-02-03
TWI748720B (zh) 2021-12-01
TW202205144A (zh) 2022-02-01
CN111860403B (zh) 2024-06-14
CN111860403A (zh) 2020-10-30
KR20220075442A (ko) 2022-06-08

Similar Documents

Publication Publication Date Title
JP6893564B2 (ja) ターゲット識別方法、装置、記憶媒体および電子機器
CN109525859B (zh) 模型训练、图像发送、图像处理方法及相关装置设备
CN111539370A (zh) 一种基于多注意力联合学习的图像行人重识别方法和系统
CN109583391B (zh) 关键点检测方法、装置、设备及可读介质
WO2020228405A1 (zh) 图像处理方法、装置及电子设备
CN112200041B (zh) 视频动作识别方法、装置、存储介质与电子设备
JP6742623B1 (ja) 監視装置、監視方法、及びプログラム
CN112949508A (zh) 模型训练方法、行人检测方法、电子设备及可读存储介质
CN113807361B (zh) 神经网络、目标检测方法、神经网络训练方法及相关产品
WO2022171036A1 (zh) 视频目标追踪方法、视频目标追踪装置、存储介质及电子设备
CN112800276B (zh) 视频封面确定方法、装置、介质及设备
CN113537254B (zh) 图像特征提取方法、装置、电子设备及可读存储介质
CN113487608A (zh) 内窥镜图像检测方法、装置、存储介质及电子设备
CN114565668A (zh) 即时定位与建图方法及装置
WO2022023806A1 (zh) 程序场景信息的检测方法、装置、电子设备、介质和程序
CN114071015B (zh) 一种联动抓拍路径的确定方法、装置、介质及设备
CN115035158A (zh) 目标跟踪的方法及装置、电子设备和存储介质
JP7001149B2 (ja) データ提供システムおよびデータ収集システム
JP7001150B2 (ja) 識別システム、モデル再学習方法およびプログラム
CN111429185B (zh) 人群画像预测方法、装置、设备及存储介质
CN115035596B (zh) 行为检测的方法及装置、电子设备和存储介质
CN114820723A (zh) 一种基于联合检测和关联的在线多目标跟踪方法
CN114140744A (zh) 基于对象的数量检测方法、装置、电子设备及存储介质
JP6981553B2 (ja) 識別システム、モデル提供方法およびモデル提供プログラム
CN115546708A (zh) 目标检测方法及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20947354

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022529946

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20227017414

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20947354

Country of ref document: EP

Kind code of ref document: A1