WO2020151247A1 - Image analysis method and system - Google Patents

Image analysis method and system

Info

Publication number
WO2020151247A1
WO2020151247A1 (PCT/CN2019/107126)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
target subject
vector
image
target
Prior art date
Application number
PCT/CN2019/107126
Other languages
English (en)
French (fr)
Inventor
郑鹏鹏
李嘉豪
金鑫
涂丹丹
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to EP19911852.2A priority Critical patent/EP3893197A4/en
Publication of WO2020151247A1 publication Critical patent/WO2020151247A1/zh
Priority to US17/365,089 priority patent/US12100209B2/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5854Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using shape and object relationship
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/587Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition

Definitions

  • This application relates to the field of image processing, and in particular to an image analysis method and system.
  • the task of image description is to generate a corresponding text description for a given image.
  • Image description can automatically extract information from images, and generate corresponding text descriptions based on the automatically extracted information, thereby realizing the transformation from images to knowledge.
  • the image description can generate a text description such as "a man surfing at sea" for the image shown in FIG. 1A.
  • However, image description can only describe images with low-level semantics: it can describe single-subject single actions (for example, a man surfing at sea in FIG. 1A) or multi-subject single actions (for example, a group of students doing morning exercises in FIG. 1B), but it cannot describe an image with panoramic semantics, that is, the relationships between multiple subjects, between subjects and actions, and between actions and actions (for example, in FIG. 1C, a man sees a woman being knocked down by a car).
  • This application provides an image analysis method and system, which can perform panoramic semantic description of images.
  • an image analysis method including:
  • obtaining influencing factors of a panoramic semantic description, where the influencing factors include the own features of h target subjects in each frame of t frames of images and the relationship vector features between the h target subjects; the own features of each target subject include a location feature, an attribute feature, and a posture feature, where t and h are natural numbers greater than 1, the location feature represents the location of the corresponding target subject in the image, the attribute feature represents an attribute of the corresponding target subject, the posture feature represents the action of the corresponding target subject, and the relationship vector feature represents the relationship between two target subjects;
  • obtaining a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes descriptions of the relationships between target subjects, between target subjects and actions, and between actions and actions.
  • the above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in the multi-frame images and the relationship vector features between the multiple target subjects, reflecting the relationships between multiple subjects, between subjects and actions, and between actions and actions in the images.
  • In a possible implementation, the obtaining of the influencing factors of the panoramic semantic description includes:
  • the extraction of the location feature, the extraction of the attribute feature, the extraction of the posture feature, and the extraction of the relationship vector feature are performed through the same convolutional neural network.
  • Because the extractions share one convolutional neural network, the feature vector extracted earlier can be reused when extracting the attribute feature, the posture feature, and the relationship vector feature, which avoids extracting the feature vector multiple times and thereby reduces the amount of calculation. That is, there is no need to perform one feature vector extraction when extracting the location feature, another when extracting the attribute feature, another when extracting the posture feature, and yet another when extracting the relationship vector feature.
  • the feature vector V_i is pooled according to the target subject a and the target subject b in the image I_i to obtain the feature vector v_{a,b} corresponding to the target subject a and the target subject b, where a and b are both natural numbers, 0 < i ≤ t, 1 ≤ a, b ≤ h, and the feature vector V_i is extracted from the image I_i;
  • the relationship vector feature between the target subject a and the target subject b in the image I_i is then calculated as w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), where sigmoid() is the sigmoid function, v_{a,b} is the feature vector corresponding to the target subject a and the target subject b, v_{a,a} is the feature vector corresponding to the target subject a, and w() is the inner product function.
  • The obtaining of the panoramic semantic description according to the influencing factors includes: extracting a first semantic description according to the location feature; extracting a second semantic description according to the attribute feature and the first semantic description; extracting a third semantic description according to the posture feature and the second semantic description; and extracting the panoramic semantic description according to the relationship vector feature and the third semantic description.
  • The same recurrent neural network is used to perform the extraction of the first semantic description, the second semantic description, the third semantic description, and the panoramic semantic description.
  • an image analysis system including a feature extraction module and a panoramic semantic model
  • the feature extraction module is used to obtain the influencing factors of the panoramic semantic description, where the influencing factors include the own features of h target subjects in each frame of t frames of images and the relationship vector features between the h target subjects; the own features include location features, attribute features, and posture features, where t and h are natural numbers greater than 1, the location features are used to indicate the location of the corresponding target subject in the image, the attribute features are used to represent the attributes of the corresponding target subject, the posture features are used to represent the action of the corresponding target subject, and the relationship vector features are used to represent the relationship between target subjects;
  • the panoramic semantic model is used to obtain a panoramic semantic description according to the influencing factors, and the panoramic semantic description includes descriptions of the relationships between target subjects, between target subjects and actions, and between actions and actions.
  • the feature extraction module includes: a feature vector extraction unit, a position feature extraction unit, an attribute feature extraction unit, a posture feature extraction unit, and a relation vector feature unit,
  • the feature vector extraction unit is configured to perform feature extraction on the t frame image to obtain t feature vectors
  • the location feature extraction unit is configured to perform location feature extraction on the t feature vectors to obtain the location feature
  • the attribute feature extraction unit is configured to perform attribute feature extraction on the t feature vectors to obtain the attribute feature
  • the posture feature extraction unit is configured to perform posture feature extraction on the t feature vectors to obtain the posture feature
  • the relation vector feature unit is used to perform relation vector feature extraction on the t feature vectors to obtain the relation vector feature.
  • In a more specific embodiment, the feature extraction module includes a convolutional neural network, and the feature vector extraction unit, the location feature extraction unit, the attribute feature extraction unit, the posture feature extraction unit, and the relationship vector feature extraction unit are integrated in the convolutional neural network.
  • The relationship vector feature extraction unit is used to pool the feature vector V_i according to the target subject a and the target subject b in the image I_i, so as to obtain the feature vector v_{a,b} corresponding to the target subject a and the target subject b, where a and b are both natural numbers, 0 < i ≤ t, 1 ≤ a, b ≤ h;
  • the relationship vector feature between the target subject a and the target subject b in the image I_i is then calculated as w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), where sigmoid() is the sigmoid function, v_{a,b} is the feature vector corresponding to the target subject a and the target subject b, v_{a,a} is the feature vector corresponding to the target subject a, and w() is the inner product function.
  • the panoramic semantic model includes: a first temporal feature extraction unit, a second temporal feature extraction unit, a third temporal feature extraction unit, and a fourth temporal feature extraction unit,
  • the first time sequence feature extraction unit is configured to extract a first semantic description according to the location feature
  • the second time sequence feature extraction unit is configured to extract a second semantic description according to the attribute feature and the first semantic description
  • the third time sequence feature extraction unit is configured to extract a third semantic description according to the posture feature and the second semantic description;
  • the fourth time sequence feature extraction unit is configured to extract the panoramic semantic description according to the relationship vector feature and the third semantic description.
  • the panoramic semantic model includes a recurrent neural network, and the first time series feature extraction unit, the second time series feature extraction unit, the third time series feature extraction unit, and the fourth time series feature extraction unit are each a layer in the recurrent neural network.
  • a computing node including a processor and a memory, where the processor executes the code in the memory to perform the following:
  • obtaining influencing factors of a panoramic semantic description, where the influencing factors include the own features of h target subjects in each frame of t frames of images and the relationship vector features between the h target subjects; the own features of each target subject include a location feature, an attribute feature, and a posture feature, where t and h are natural numbers greater than 1, the location feature represents the location of the corresponding target subject in the image, the attribute feature represents an attribute of the corresponding target subject, the posture feature represents the action of the corresponding target subject, and the relationship vector feature represents the relationship between two target subjects;
  • obtaining a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes descriptions of the relationships between target subjects, between target subjects and actions, and between actions and actions.
  • the above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in the multi-frame images and the relationship vector features between the multiple target subjects, reflecting the relationships between multiple subjects, between subjects and actions, and between actions and actions in the images.
  • the processor is used to execute:
  • the extraction of the location feature, the extraction of the attribute feature, the extraction of the posture feature, and the extraction of the relationship vector feature are performed through the same convolutional neural network.
  • Because the extractions share one convolutional neural network, the feature vector extracted earlier can be reused when extracting the attribute feature, the posture feature, and the relationship vector feature, which avoids extracting the feature vector multiple times and thereby reduces the amount of calculation. That is, there is no need to perform one feature vector extraction when extracting the location feature, another when extracting the attribute feature, another when extracting the posture feature, and yet another when extracting the relationship vector feature.
  • the feature vector V_i is pooled according to the target subject a and the target subject b in the image I_i to obtain the feature vector v_{a,b} corresponding to the target subject a and the target subject b, where a and b are both natural numbers, 0 < i ≤ t, 1 ≤ a, b ≤ h, and the feature vector V_i is extracted from the image I_i;
  • the relationship vector feature between the target subject a and the target subject b in the image I_i is then calculated as w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), where sigmoid() is the sigmoid function, v_{a,b} is the feature vector corresponding to the target subject a and the target subject b, v_{a,a} is the feature vector corresponding to the target subject a, and w() is the inner product function.
  • the processor is used to execute:
  • the first semantic description is extracted according to the location feature; the second semantic description is extracted according to the attribute feature and the first semantic description; the third semantic description is extracted according to the posture feature and the second semantic description; and the panoramic semantic description is extracted according to the relationship vector feature and the third semantic description.
  • the same recurrent neural network is used to perform the extraction of the first semantic description, the second semantic description, the third semantic description, and the panoramic semantic description.
  • a computing node cluster including at least one computing node, where each computing node includes a processor and a memory, and the processor executes the code in the memory to perform the method according to any one of the first aspect.
  • a computer program product is provided; when the computer program product is executed by at least one computing node, the method according to any one of the first aspect is performed.
  • a computer non-transitory storage medium including instructions; when the instructions run on at least one computing node in a computing node cluster, the computing node cluster is caused to execute the method according to any one of the first aspect.
  • FIGS. 1A to 1C are schematic diagrams of some images used for image description
  • FIG. 2 is a schematic diagram of a single frame image used for panoramic semantic description according to an embodiment of this application;
  • FIG. 3 is a schematic diagram of a multi-frame image used for panoramic semantic description according to an embodiment of this application;
  • FIG. 4 is a schematic diagram of feature extraction of location features, attribute features, posture features, and relation vector features involved in this application;
  • FIG. 5 is a schematic diagram of a panoramic semantic model of an embodiment involved in this application.
  • FIG. 6 is a schematic diagram of a panoramic semantic model of another embodiment involved in this application.
  • FIG. 7 is a flowchart of a semantic description method according to an embodiment of this application.
  • FIG. 8 is a schematic structural diagram of a semantic description system of an embodiment provided in this application.
  • FIG. 9 is a schematic structural diagram of a computing node according to an embodiment of this application.
  • FIG. 10 is a schematic structural diagram of a cloud service cluster according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a semantic description system according to another embodiment provided in this application.
  • FIG. 12 is a schematic structural diagram of a semantic description system according to another embodiment provided in this application.
  • Fig. 2 shows a schematic diagram of a single frame image to which the panoramic semantic description of an embodiment of the present application is applicable.
  • the single frame image used for panoramic semantic description in this embodiment usually includes multiple target subjects, where the target subjects may be one or more of people, animals, objects, and so on.
  • the target subjects in the image shown in Fig. 2 include a man, a woman, and a vehicle.
  • Different target subjects can perform different actions, where the actions can be one or more of drinking water, reading a book, doing exercises, playing basketball, kicking a ball, running, swimming, and so on.
  • the man in the figure is looking at the woman, the woman in the figure is falling, and the vehicle in the figure is crashing into the woman.
  • In practical applications, the target subjects can also be other subjects, the number of target subjects can be larger, and the actions of the target subjects can be other actions, which are not specifically limited here.
  • the image analysis system can cut out t frames of images I_1, I_2, ..., I_t from a video in chronological order for panoramic semantic description, where t is a natural number.
  • The images I_1, I_2, ..., I_t all include the same target subjects. For example, the image I_1 includes the target subject 1, the target subject 2, and the target subject 3; the image I_2 also includes the target subject 1, the target subject 2, and the target subject 3; ...; and the image I_t also includes the target subject 1, the target subject 2, and the target subject 3.
  • the time interval between two adjacent frames of images in the foregoing t frames of images may be equal or unequal, which is not specifically limited here.
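  • For illustration, the frame cutting described above can be sketched as follows, assuming OpenCV and equal frame spacing (the helper name and the spacing policy are illustrative, not from the patent):

```python
# Minimal sketch: sample t frames I_1, ..., I_t from a video in chronological
# order. Equal spacing is assumed here; unequal intervals also work.
import cv2

def sample_frames(video_path: str, t: int):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // t, 1)
    frames = []
    for k in range(t):
        cap.set(cv2.CAP_PROP_POS_FRAMES, k * step)  # jump to the k-th sample point
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # the images I_1, I_2, ..., I_t
```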
  • the image analysis system may perform panoramic semantic description of the images I_1 to I_t through the panoramic semantic model.
  • the input variable of the panoramic semantic model is the influencing factor of the panoramic semantic description.
  • The influencing factors of the panoramic semantic description include the own features of each target subject in the images I_1 through I_t (including location features, attribute features, and posture features) and the relationship vector features between the target subjects.
  • the location feature is used to indicate the location of the corresponding target subject in the corresponding image.
  • the location feature can be expressed as (x, y, w, h), where x and y are the abscissa and ordinate of the center point of the target subject in the image, w is the width of the target subject in the image, and h is The height of the target subject in the image.
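  • For illustration, a corner-format detector box (x1, y1, x2, y2) can be converted into this (x, y, w, h) location feature as follows (a hypothetical helper, not part of the patent):

```python
def to_location_feature(x1: float, y1: float, x2: float, y2: float):
    """Convert a corner-format box into the (x, y, w, h) location feature:
    center coordinates plus the width and height of the target subject."""
    w, h = x2 - x1, y2 - y1
    return (x1 + w / 2, y1 + h / 2, w, h)

# A 100x200 box whose top-left corner is at (50, 40):
assert to_location_feature(50, 40, 150, 240) == (100.0, 140.0, 100, 200)
```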
  • Attribute features are used to represent the attributes of the corresponding target subject. There can be many kinds of attribute features, and they usually differ with the target subject. Taking a human target subject as an example, the attribute features of the target subject can include one or more of gender, hairstyle, clothing type, clothing color, height, body type, and so on.
  • the posture feature of the target subject is used to represent the action of the corresponding target subject.
  • There are also many kinds of posture features of the target subject, and they usually differ with the target subject. Taking a human target subject as an example, the posture features of the target subject can include falling, lying down, walking, running, jumping, and so on.
  • the relationship feature vector is a vector representing the relationship between two target subjects.
  • Assuming that each frame of the images I_1, I_2, ..., I_t includes h target subjects, the influencing factors of the panoramic semantic description include, for example:
  • the own features of the h target subjects in the image I_1, including: the location feature P_1^1, the attribute feature, and the posture feature of the target subject 1 in the image I_1; the location feature, the attribute feature, and the posture feature of the target subject 2 in the image I_1; ...; and the location feature, the attribute feature, and the posture feature of the target subject h in the image I_1;
  • the relationship vector features between the h target subjects in the image I_1, including: the relationship vector feature between the target subject 1 and the target subject 2, the relationship vector feature between the target subject 1 and the target subject 3, ..., the relationship vector feature between the target subject 1 and the target subject h, the relationship vector feature between the target subject 2 and the target subject 3, ..., the relationship vector feature between the target subject 2 and the target subject h, ..., and the relationship vector feature between the target subject h-1 and the target subject h, all in the image I_1.
  • the own features of the h target subjects in the image I_2, including: the location feature P_1^2, the attribute feature, and the posture feature of the target subject 1 in the image I_2; the location feature, the attribute feature, and the posture feature of the target subject 2 in the image I_2; ...; and the location feature, the attribute feature, and the posture feature of the target subject h in the image I_2;
  • the relationship vector features between the h target subjects in the image I_2, defined in the same way as those in the image I_1, from the relationship vector feature between the target subject 1 and the target subject 2 up to the relationship vector feature between the target subject h-1 and the target subject h, all in the image I_2.
  • the own features of the h target subjects in the image I_t, including: the location feature P_1^t, the attribute feature, and the posture feature of the target subject 1 in the image I_t; ...; and the location feature, the attribute feature, and the posture feature of the target subject h in the image I_t;
  • the relationship vector features between the h target subjects in the image I_t, including: the relationship vector feature between the target subject 1 and the target subject 2, the relationship vector feature between the target subject 1 and the target subject 3, ..., the relationship vector feature between the target subject 1 and the target subject h, the relationship vector feature between the target subject 2 and the target subject 3, ..., the relationship vector feature between the target subject 2 and the target subject h, ..., and the relationship vector feature between the target subject h-1 and the target subject h, all in the image I_t.
  • the above examples of the influencing factors of the panoramic semantic description are only used as examples. In actual applications, the influencing factors of the panoramic semantic description may also include other influencing factors, which are not specifically limited here.
  • The location features, attribute features, and posture features of each target subject in the images I_1, I_2, ..., I_t, and the relationship vector features between the target subjects, are calculated from the feature vectors V_1, V_2, ..., V_t of the images I_1, I_2, ..., I_t respectively.
  • That is, the location features, attribute features, posture features, and relationship vector features of the target subjects in the image I_1 can be calculated from the feature vector V_1 of the image I_1; those in the image I_2 can be calculated from the feature vector V_2 of the image I_2; ...; and those in the image I_t can be calculated from the feature vector V_t of the image I_t.
  • The feature vectors V_1, V_2, ..., V_t of the images I_1, I_2, ..., I_t can be obtained as follows.
  • Taking the image I_i as an example, the feature vector V_i of the image I_i can be obtained by inputting the image I_i into the feature vector extraction unit, where i is a natural number and 1 ≤ i ≤ t.
  • the feature vector extraction unit may sequentially include: an input layer, a convolutional calculation layer, a pooling layer, and a fully connected layer.
  • The input of the input layer is the image I_i, and the output of the input layer equals its input; that is, the input layer performs no processing on the input. In other embodiments, the input layer may perform normalization and other preprocessing on the input, which is not specifically limited here.
  • The generation process of each convolutional feature map u_l in the convolutional calculation layer can be expressed as u_l = f(conv(I, K_l, valid) + b_l), where conv represents the convolution operation on the image I using the convolution kernel K_l, valid represents the padding method, b_l represents the offset value, u_l represents the result of the convolution calculation, and f() represents the activation function.
  • The generation process of each pooled image β_l in the pooling layer is as follows: β_l = maxPool(u_l), where maxPool represents the max pooling operation.
  • The convolution kernel K_l (including its elements, size, and step size), the offset values b_l, f(), and β_l can be set manually according to the features to be extracted (location features, attribute features, posture features, and relationship vector features), the size of the image I_i, and so on.
  • Taking the convolution kernel K_l as an example: the elements of the convolution kernel K_l can use the elements of the Sobel operator; the size of the convolution kernel K_l can be relatively large or relatively small; and the step size of the convolution kernel K_l can likewise be relatively large or relatively small.
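  • As an illustration of the Sobel initialization mentioned above (the use of PyTorch is an assumption, not the patent's implementation):

```python
# Minimal sketch: a convolution kernel K_l whose elements are the Sobel
# operator's elements (horizontal-gradient variant).
import torch
import torch.nn as nn

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():
    conv.weight.copy_(sobel_x.view(1, 1, 3, 3))  # K_l elements = Sobel operator
```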
  • The above feature vector extraction unit is merely an example. In practical applications, the feature vector extraction unit may also take other forms, for example, it may include more convolutional calculation layers, more pooling layers, padding of the image I_i, and so on, which is not specifically limited here.
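  • A minimal sketch of such a feature vector extraction unit (input layer, convolutional calculation layer, pooling layer, fully connected layer); the layer sizes and the feature dimension are illustrative assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class FeatureVectorExtractor(nn.Module):
    """Input layer -> convolutional calculation layer -> pooling layer ->
    fully connected layer, producing the feature vector V_i of an image I_i."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, padding=0)  # u_l = f(conv(I, K_l, valid) + b_l)
        self.act = nn.ReLU()                                    # f(): activation function
        self.pool = nn.AdaptiveMaxPool2d((7, 7))                # pooling layer (max pooling)
        self.fc = nn.Linear(64 * 7 * 7, feat_dim)               # fully connected layer

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W); the input layer passes the image through unchanged
        u = self.act(self.conv(image))
        return self.fc(self.pool(u).flatten(1))  # feature vector V_i

V_i = FeatureVectorExtractor()(torch.randn(1, 3, 224, 224))  # shape: (1, 512)
```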
  • The location feature extraction unit can be expressed as y_1 = g_1(x_1), where x_1 may be the feature vector V_i of the image I_i, y_1 may be the location features of the h target subjects in the image I_i, and g_1() is the mapping relationship between the feature vector V_i and the location features. g_1() can be obtained by training with a large number of known images and known location features of target subjects.
  • As shown in FIG. 4, the attribute features of each target subject can be obtained as follows: taking the image I_i as an example, assuming that the image I_i includes h target subjects, the attribute features of the h target subjects in the image I_i can be obtained by inputting the feature vector V_i into the attribute feature extraction unit, where i is a natural number and 1 ≤ i ≤ t.
  • The attribute feature extraction unit can be expressed as y_2 = g_2(x_1), where x_1 may be the feature vector V_i of the image I_i, y_2 may be the attribute features of the h target subjects in the image I_i, and g_2() is the mapping relationship between the feature vector V_i and the attribute features. g_2() can be obtained by training with a large number of known images and known attribute features of target subjects.
  • The extraction of the attribute features of the h target subjects in the other images I_1, I_2, ..., I_t is similar to that in the image I_i, and is not repeated here.
  • As shown in FIG. 4, the posture features of each target subject in the images I_1, I_2, ..., I_t can be obtained as follows: taking the image I_i as an example, assuming that the image I_i includes h target subjects, the posture features of the h target subjects in the image I_i can be obtained by inputting the feature vector V_i into the posture feature extraction unit, where i is a natural number and 1 ≤ i ≤ t.
  • The posture feature extraction unit can be expressed as y_3 = g_3(x_1), where x_1 may be the feature vector V_i of the image I_i, y_3 may be the posture features of the h target subjects in the image I_i, and g_3() is the mapping relationship between the feature vector V_i and the posture features. g_3() can be obtained by training with a large number of known images and known posture features of target subjects.
  • The extraction of the posture features of the h target subjects in the other images I_1, I_2, ..., I_t is similar to that in the image I_i, and is not repeated here.
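  • A minimal sketch of the mappings y_1 = g_1(x_1), y_2 = g_2(x_1), and y_3 = g_3(x_1) as three heads sharing one feature vector V_i; the output dimensions (numbers of attributes and postures) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FeatureHeads(nn.Module):
    """g_1, g_2, g_3: predict the location, attribute, and posture features of
    the h target subjects from the shared feature vector V_i."""
    def __init__(self, feat_dim: int = 512, h: int = 3,
                 n_attrs: int = 16, n_poses: int = 8):
        super().__init__()
        self.g1 = nn.Linear(feat_dim, h * 4)        # (x, y, w, h) per target subject
        self.g2 = nn.Linear(feat_dim, h * n_attrs)  # attribute features
        self.g3 = nn.Linear(feat_dim, h * n_poses)  # posture features

    def forward(self, v_i: torch.Tensor):
        return self.g1(v_i), self.g2(v_i), self.g3(v_i)

# Because all heads consume the same V_i, the feature vector is extracted once
# per image rather than once per feature type, as noted earlier.
```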
  • The relationship vector features between the target subjects in the images I_1, I_2, ..., I_t can be obtained as follows: taking the image I_i as an example, assuming that the image I_i includes h target subjects, the relationship vector feature w_{a,b} between the target subject a and the target subject b in the image I_i can be calculated by the relation vector feature extraction unit, where i, a, b are natural numbers, 1 ≤ i ≤ t, and 1 ≤ a, b ≤ h:
  • the relation vector feature extraction unit is used to perform ROI (region of interest) pooling according to the target subject a and the target subject b, so as to obtain the feature vector v_{a,b} corresponding to the target subject a and the target subject b;
  • the relation vector feature extraction unit is used to perform ROI pooling according to the target subject a, so as to obtain the feature vector v_{a,a} corresponding to the target subject a;
  • the relation vector feature extraction unit is used to calculate the relationship vector feature according to the following formula: w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), where sigmoid() is the sigmoid function, v_{a,b} is the feature vector corresponding to the target subject a and the target subject b, v_{a,a} is the feature vector corresponding to the target subject a, and w() is the inner product function. The inner product function w() can be obtained through training with a large number of known target subjects and known feature vectors.
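  • A minimal sketch of this computation, with three assumptions not spelled out in the patent: the subjects' boxes come from the location features, v_{a,b} is pooled over the union region of the subjects a and b, and the trained inner product w() is shown as a plain dot product:

```python
import torch
from torchvision.ops import roi_pool

def relation_vector(feature_map, box_a, box_b):
    """w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})).
    feature_map: (1, C, H, W); boxes: (x1, y1, x2, y2) in feature-map coords."""
    union = torch.tensor([[0, min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
                           max(box_a[2], box_b[2]), max(box_a[3], box_b[3])]],
                         dtype=torch.float)                            # joint region of a and b
    roi_a = torch.tensor([[0, *box_a]], dtype=torch.float)
    v_ab = roi_pool(feature_map, union, output_size=(3, 3)).flatten()  # v_{a,b}
    v_aa = roi_pool(feature_map, roi_a, output_size=(3, 3)).flatten()  # v_{a,a}
    return torch.sigmoid(torch.dot(v_ab, v_aa))                        # w_{a,b}
```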
  • The above-mentioned feature vector extraction, location feature extraction, attribute feature extraction, posture feature extraction, and relation vector feature extraction can be implemented by different convolutional neural networks (CNN), or can be integrated in the same convolutional neural network.
  • the convolutional neural network may include VGGNet, ResNet, FPNet, etc., which are not specifically limited here.
  • feature vector extraction, location feature extraction, attribute feature extraction, posture feature extraction, and relation vector feature extraction are integrated in the same convolutional neural network, feature vector extraction, location feature extraction, attribute feature extraction, posture feature extraction and relation Vector feature extraction can be implemented in different layers in a convolutional neural network.
  • The influencing factors of the panoramic semantic description (the location features of each target subject in the images I_1, I_2, ..., I_t, the attribute features of each target subject in the images I_1, I_2, ..., I_t, the posture features of each target subject in the images I_1, I_2, ..., I_t, and the relationship vector features between the target subjects in the images I_1, I_2, ..., I_t) all influence the panoramic semantic description: the location features of each target subject can provide a first semantic description of the positions of the respective target subjects; combining the attribute features of each target subject with the first semantic description yields a second semantic description that incorporates the attributes of each target subject; combining the posture features of each target subject with the second semantic description yields a third semantic description; finally, combining the relationship vector features between the target subjects with the third semantic description yields the panoramic semantic description.
  • Taking Figure 3 as an example, the influencing factors can produce the panoramic semantic description as follows: first, from the location features of the man, the woman, and the vehicle in the images I_1, I_2, ..., I_t, a first semantic description such as "object A and object B are on the left of object C" can be obtained; then, combining the attribute features of the man, the woman, and the vehicle with the first semantic description, a second semantic description such as "the woman and the car are on the left of the man" can be obtained; after that, combining the posture features of the man, the woman, and the vehicle with the second semantic description, a third semantic description can be obtained; finally, combining the relationship vector features of the images I_1, I_2, ..., I_t with the third semantic description, the panoramic semantic description "the man on the right saw the woman on the left being hit by a car" can be obtained.
  • The panoramic semantic model can be expressed as y = Panorama(x), where x is the influencing factors of the panoramic semantic description, y is the panoramic semantic description, and Panorama() is the mapping relationship between the influencing factors of the panoramic semantic description and the panoramic semantic description. Panorama() can be obtained through training with a large number of known influencing factors of panoramic semantic descriptions and known panoramic semantic descriptions.
  • the panoramic semantic model may be as shown in Figure 5,
  • The position features of the h target subjects in the images I_1, I_2, ..., I_t are input into the time sequence feature extraction unit 1 to obtain the first semantic description;
  • the attribute features of the h target subjects in the images I_1, I_2, ..., I_t, combined with the first semantic description, are input into the time sequence feature extraction unit 2 to obtain the second semantic description;
  • the posture features of the h target subjects in the images I_1, I_2, ..., I_t, combined with the second semantic description, are input into the time sequence feature extraction unit 3 to obtain the third semantic description;
  • the relationship vector features of the h target subjects in the images I_1, I_2, ..., I_t, combined with the third semantic description, are input into the time sequence feature extraction unit 4 to obtain the panoramic semantic description.
  • the extraction of the first semantic description, the second semantic description, the third semantic description and the panoramic semantic description can be implemented by different Recurrent Neural Networks (RNN), or by the same recurrent neural network.
  • the recurrent neural network may include a long short-term memory model (LSTM), a bidirectional long short-term memory model (BiLSTM), etc., which are not specifically limited here.
  • When the extraction of the first semantic description, the second semantic description, the third semantic description, and the panoramic semantic description is implemented in the same recurrent neural network, the extractions can be implemented in different layers of the recurrent neural network.
  • the panoramic semantic model may be as shown in Fig. 6:
  • Taking the first time sequence feature extraction unit as an example: at each time step i (1 ≤ i ≤ t), the unit takes the location features P_i of the h target subjects in the image I_i and the previous output value h_{1,i-1}, and performs a computation of the standard LSTM form. The previous output value h_{1,i-1} is first used to calculate the forgetting value f_{1i}: f_{1i} = σ(W_{1f}·[h_{1,i-1}, P_i] + b_{1f}), where σ() is the sigmoid function, b_{1f} is the offset value, and W_{1f} is the weight matrix.
  • The update value u_{1i} = σ(W_{1u}·[h_{1,i-1}, P_i] + b_{1u}) and the candidate state c̃_{1i} = tanh(W_{1c}·[h_{1,i-1}, P_i] + b_{1c}) are then calculated, where tanh is the activation function, and the cell state is updated as c_{1i} = f_{1i}∘c_{1,i-1} + u_{1i}∘c̃_{1i}.
  • Finally, the output value h_{1i} = o_{1i}∘tanh(c_{1i}) is calculated, where o_{1i} = σ(W_{1o}·[h_{1,i-1}, P_i] + b_{1o}).
  • The above h_{11} to h_{1t} can constitute the first semantic description. The initial output value h_{10} and the offset values b_{1f} to b_{1o} can be set manually; the weight matrices W_{1f} to W_{1o} are all obtained through training with a large number of known first semantic descriptions and known location features of target subjects.
  • The second time sequence feature extraction unit performs the same computation with the attribute features of the h target subjects in the image I_i, combined with the first semantic description, as input; its outputs h_{21} to h_{2t} constitute the second semantic description, with initial output value h_{20}.
  • The third time sequence feature extraction unit likewise takes the posture features of the h target subjects in the image I_i, combined with the second semantic description, as input; for example, its previous output value h_{3,i-1} is used to calculate the forgetting value f_{3i}. Its outputs h_{31} to h_{3t} constitute the third semantic description, with initial output value h_{30}.
  • The fourth time sequence feature extraction unit takes the relationship vector features of the h target subjects in the image I_i, combined with the third semantic description, as input. The above h_{41} to h_{4t} can constitute the panoramic semantic description. The initial output value h_{40} and the offset values can be set manually; the weight matrices are all obtained by training with a large number of known panoramic semantic descriptions, known third semantic descriptions, and the relationship vector features of known target subjects.
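  • A minimal sketch of this four-stage cascade, assuming PyTorch nn.LSTM stages and illustrative feature dimensions; stage k consumes its feature sequence concatenated with the semantic description (hidden sequence) produced by stage k-1:

```python
import torch
import torch.nn as nn

class PanoramicSemanticModel(nn.Module):
    def __init__(self, dims, hidden: int = 256):
        # dims: sizes of the (location, attribute, posture, relation) inputs
        super().__init__()
        sizes = [dims[0]] + [d + hidden for d in dims[1:]]
        self.stages = nn.ModuleList(
            nn.LSTM(s, hidden, batch_first=True) for s in sizes)

    def forward(self, loc, attr, pose, rel):
        # each input: a (batch, t, dim) sequence over the t frames
        sem, _ = self.stages[0](loc)                             # first semantic description
        sem, _ = self.stages[1](torch.cat([attr, sem], dim=-1))  # second semantic description
        sem, _ = self.stages[2](torch.cat([pose, sem], dim=-1))  # third semantic description
        sem, _ = self.stages[3](torch.cat([rel, sem], dim=-1))   # panoramic: h_41 ... h_4t
        return sem

t = 5
model = PanoramicSemanticModel(dims=(12, 48, 24, 9))
out = model(torch.randn(1, t, 12), torch.randn(1, t, 48),
            torch.randn(1, t, 24), torch.randn(1, t, 9))  # shape: (1, t, 256)
```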
  • FIG. 7 is a schematic flowchart of an image analysis method provided by the present application.
  • the image analysis method of this embodiment includes the following steps:
  • S101: The image analysis system acquires the influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each frame of the t frames of images and the relationship vector features between the h target subjects; the own features of each target subject include a location feature, an attribute feature, and a posture feature, where t and h are natural numbers greater than 1.
  • the location feature is used to indicate the location of the corresponding target subject in the image.
  • the location feature can be expressed as (x, y, w, h), where x and y are the abscissa and ordinate of the center point of the target subject in the image, w is the width of the target subject in the image, and h is The height of the target subject in the image.
  • There can be many kinds of attribute features, and they usually differ with the target subject. Taking a human target subject as an example, the attribute features of the target subject can include one or more of gender, hairstyle, clothing type, clothing color, height, body shape, and so on.
  • There are also many kinds of posture features, and they usually differ with the target subject. Taking a human target subject as an example, the posture features of the target subject may include one or more of falling, lying down, walking, running, jumping, and so on.
  • the relationship feature vector is a vector representing the relationship between two target subjects.
  • S102 The image analysis system obtains a panoramic semantic description according to the influencing factors.
  • the panoramic semantic model reflects the mapping relationship between the influencing factors and the panoramic semantic description.
  • The panoramic semantic model can be expressed as y = Panorama(x), where x is the influencing factors of the panoramic semantic description, y is the panoramic semantic description, and Panorama() is the mapping relationship between the influencing factors of the panoramic semantic description and the panoramic semantic description. Panorama() can be obtained through training with a large number of known influencing factors of panoramic semantic descriptions and known panoramic semantic descriptions.
  • The panoramic semantic description can describe the relationships between target subjects, between target subjects and actions, and between actions and actions.
  • Specifically, feature extraction is performed on the t frames of images to obtain t feature vectors; location feature extraction is performed on the t feature vectors to obtain the location features; attribute feature extraction is performed on the t feature vectors to obtain the attribute features; posture feature extraction is performed on the t feature vectors to obtain the posture features; and relation vector feature extraction is performed on the t feature vectors to obtain the relationship vector features.
  • The above-mentioned feature vector extraction, location feature extraction, attribute feature extraction, posture feature extraction, and relation vector feature extraction can be implemented by different convolutional neural networks (CNN), or can be integrated in the same convolutional neural network, which is not specifically limited here.
  • When feature vector extraction, location feature extraction, attribute feature extraction, posture feature extraction, and relation vector feature extraction are integrated in the same convolutional neural network, each extraction can be implemented as a layer in the convolutional neural network.
  • Specifically, the first semantic description is extracted based on the location features; the second semantic description is extracted based on the attribute features and the first semantic description; the third semantic description is extracted based on the posture features and the second semantic description; and the panoramic semantic description is extracted based on the relationship vector features and the third semantic description.
  • The extraction of the first to third semantic descriptions and the panoramic semantic description can be performed by different recurrent neural networks (RNN), or can be integrated in the same recurrent neural network, which is not specifically limited here.
  • the recurrent neural network may include a long short-term memory model (LSTM), a bidirectional long short-term memory model (BiLSTM), etc., which are not specifically limited here.
  • When the extraction of the first to third semantic descriptions and the panoramic semantic description is integrated in the same recurrent neural network, they can be extracted through different layers of the recurrent neural network.
  • For brevity, this embodiment does not expand on the definitions of images, target subjects, panoramic semantic descriptions, etc., nor does it reintroduce the feature vectors, position features, attribute features, posture features, and relation vector features and their extraction methods; for details, please refer to FIG. 4 and the related descriptions. Likewise, this embodiment does not introduce the panoramic semantic model and how to use it to obtain the panoramic semantic description of images in detail; for details, please refer to FIG. 5, FIG. 6, and the related descriptions.
  • the above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in the multi-frame images and the relationship vector features between the multiple target subjects, reflecting the relationships between multiple subjects, between subjects and actions, and between actions and actions in the images.
  • the image analysis system of the embodiment of the present application includes a feature extraction module 510 and a panoramic semantic description module 520.
  • the feature extraction module 510 includes: a feature vector extraction unit 511, a position feature extraction unit 512, an attribute feature extraction unit 513, a posture feature extraction unit 514, and a relation vector feature unit 515.
  • the panoramic semantic description module 520 includes a first time series feature extraction unit 522, a second time series feature extraction unit 523, a third time series feature extraction unit 524, and a fourth time series feature extraction unit 525.
  • The feature extraction module 510 is used to obtain the influencing factors of the panoramic semantic description, where the influencing factors include the own features of h target subjects in each frame of t frames of images and the relationship vector features between the h target subjects; the own features include location features, attribute features, and posture features, where t and h are natural numbers greater than 1, the location features are used to indicate the location of the corresponding target subject in the image, the attribute features are used to represent the attributes of the corresponding target subject, the posture features are used to represent the action of the corresponding target subject, and the relationship vector features are used to represent the relationship between target subjects;
  • the panoramic semantic description module 520 is configured to input the influencing factors into a panoramic semantic model to obtain a panoramic semantic description, wherein the panoramic semantic model reflects the mapping relationship between the influencing factor and the panoramic semantic description,
  • The panoramic semantic description can describe the relationships between target subjects, between target subjects and actions, and between actions and actions.
  • the location feature is used to indicate the location of the corresponding target subject in the image.
  • the location feature can be expressed as (x, y, w, h), where x and y are the abscissa and ordinate of the center point of the target subject in the image, w is the width of the target subject in the image, and h is The height of the target subject in the image.
  • There can be many kinds of attribute features, and they usually differ with the target subject. Taking a human target subject as an example, the attribute features of the target subject can include one or more of gender, hairstyle, clothing type, clothing color, height, body shape, and so on.
  • There are also many kinds of posture features, and they usually differ with the target subject. Taking a human target subject as an example, the posture features of the target subject may include one or more of falling, lying down, walking, running, jumping, and so on.
  • the relationship feature vector is a vector representing the relationship between two target subjects.
  • the panoramic semantic model reflects the mapping relationship between the influencing factors and the panoramic semantic description.
  • The panoramic semantic model can be expressed as:
  • y = Panorama(x)
  • where x is the influencing factors of the panoramic semantic description, y is the panoramic semantic description, and Panorama() is the mapping relationship between the influencing factors and the panoramic semantic description.
  • Panorama() can be trained on a large number of known influencing factors paired with known panoramic semantic descriptions.
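  • The patent does not fix a concrete implementation of Panorama(); the following PyTorch-style sketch is only a non-authoritative illustration of the interface implied above, a trainable mapping fitted on pairs of known influencing factors x and known panoramic semantic descriptions y. All module, dimension, and vocabulary choices are assumptions.

```python
import torch
from torch import nn

class Panorama(nn.Module):
    """Illustrative stand-in for y = Panorama(x): maps a sequence of
    per-frame influencing-factor vectors to description token logits."""
    def __init__(self, factor_dim, vocab_size, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(factor_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):              # x: (batch, t, factor_dim)
        out, _ = self.rnn(x)
        return self.head(out)          # (batch, t, vocab_size)

# Training on known (x, y) pairs, as the text describes:
model = Panorama(factor_dim=64, vocab_size=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(8, 5, 64)                      # known influencing factors
y = torch.randint(0, 1000, (8, 5))             # known description tokens
loss = nn.functional.cross_entropy(model(x).flatten(0, 1), y.flatten())
loss.backward()
opt.step()
```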
  • In some designs, the feature extraction module 510 includes a convolutional neural network in which the feature vector extraction unit 511, the position feature extraction unit 512, the attribute feature extraction unit 513, the posture feature extraction unit 514, and the relation vector feature extraction unit 515 are integrated.
  • More generally, these units may be different convolutional neural networks (Convolutional Neural Networks, CNN) or may be integrated in the same convolutional neural network, which is not specifically limited here.
  • the convolutional neural network may include VGGNet, ResNet, FPNet, etc., which are not specifically limited here.
  • When the feature vector extraction unit 511, the position feature extraction unit 512, the attribute feature extraction unit 513, the posture feature extraction unit 514, and the relation vector feature extraction unit 515 are integrated in the same convolutional neural network, each of these units may be a layer in that convolutional neural network.
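  • A minimal sketch of this design choice follows, with all layer sizes and head dimensions being illustrative assumptions: one shared convolutional backbone produces the feature vector once, and the position, attribute, posture, and relation heads each consume it as their own layer, so the feature vector is not re-extracted per task.

```python
import torch
from torch import nn

class SharedFeatureExtractor(nn.Module):
    """One CNN backbone (feature vector extraction unit) with four heads,
    mirroring units 511-515 integrated in a single network."""
    def __init__(self, feat_dim=256, h=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim))
        self.position = nn.Linear(feat_dim, 4 * h)    # (x, y, w, h) per subject
        self.attribute = nn.Linear(feat_dim, 32 * h)
        self.posture = nn.Linear(feat_dim, 16 * h)
        self.relation = nn.Linear(feat_dim, 8 * h * h)

    def forward(self, img):
        v = self.backbone(img)        # extracted once, reused by every head
        return (self.position(v), self.attribute(v),
                self.posture(v), self.relation(v))

feats = SharedFeatureExtractor()(torch.randn(1, 3, 224, 224))
```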
  • The first time series feature extraction unit 522 is used to extract a first semantic description based on the position features; the second time series feature extraction unit is used to extract a second semantic description based on the attribute features and the first semantic description; the third time series feature extraction unit is used to extract a third semantic description based on the posture features and the second semantic description; and the fourth time series feature extraction unit is used to extract the panoramic semantic description based on the relation vector features and the third semantic description. A sketch of this cascade follows.
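  • The sketch below assumes each time series feature extraction unit is an LSTM and that each stage conditions on the previous stage's output by concatenation; the patent does not specify the combination operator, so that wiring is an assumption.

```python
import torch
from torch import nn

class PanoramicSemanticModel(nn.Module):
    """Four stacked time-series units: positions -> first description;
    + attributes -> second; + postures -> third; + relations -> panoramic."""
    def __init__(self, pos_d, attr_d, pose_d, rel_d, hidden=128):
        super().__init__()
        self.u1 = nn.LSTM(pos_d, hidden, batch_first=True)
        self.u2 = nn.LSTM(attr_d + hidden, hidden, batch_first=True)
        self.u3 = nn.LSTM(pose_d + hidden, hidden, batch_first=True)
        self.u4 = nn.LSTM(rel_d + hidden, hidden, batch_first=True)

    def forward(self, pos, attr, pose, rel):       # each: (batch, t, dim)
        s1, _ = self.u1(pos)                                # first description
        s2, _ = self.u2(torch.cat([attr, s1], dim=-1))      # second description
        s3, _ = self.u3(torch.cat([pose, s2], dim=-1))      # third description
        s4, _ = self.u4(torch.cat([rel, s3], dim=-1))       # panoramic description
        return s4

m = PanoramicSemanticModel(4, 32, 16, 8)
out = m(torch.randn(2, 5, 4), torch.randn(2, 5, 32),
        torch.randn(2, 5, 16), torch.randn(2, 5, 8))        # (2, 5, 128)
```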
  • In some designs, the panoramic semantic model includes a recurrent neural network, and the first time series feature extraction unit, the second time series feature extraction unit, the third time series feature extraction unit, and the fourth time series feature extraction unit are each a layer in that recurrent neural network.
  • the aforementioned first time series feature extraction unit to the fourth time series feature extraction unit may be different Recurrent Neural Networks (RNN) respectively, or they may be integrated in the same Recurrent Neural Network, which is not specifically limited here.
  • The recurrent neural network may include a long short-term memory model (Long Short-Term Memory, LSTM), a bidirectional long short-term memory model (BiLSTM), and so on, which are not specifically limited here.
  • When integrated in the same recurrent neural network, the first to fourth time series feature extraction units may each be a layer in that network.
  • For brevity, this embodiment does not expand on the definitions of images, target subjects, panoramic semantic descriptions, and so on; please refer to FIG. 2, FIG. 3, and the related descriptions.
  • This embodiment also does not introduce the feature vectors, position features, attribute features, posture features, and relation vector features or their extraction methods; please refer to FIG. 4 and the related descriptions.
  • In addition, the embodiment of the present application does not give a detailed introduction to the panoramic semantic model or how to use it to produce a panoramic semantic description of an image; for details, please refer to FIG. 5, FIG. 6, and the related descriptions.
  • The above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in multi-frame images, together with the relation vector features between those target subjects, thereby better reflecting the relationships between subjects, between subjects and actions, and between actions in the images.
  • The image analysis system of the present application can be implemented in a computing node or on a cloud computing infrastructure, which is not specifically limited here. The following respectively introduces how to implement the image analysis system on a computing node and on a cloud computing infrastructure.
  • the computing node 100 may include a processor 110 and a memory 120.
  • The processor 110 is used to run the feature extraction module 111 and the panoramic semantic model 112.
  • The memory 120 is used to store semantic descriptions, features, images, and so on (121).
  • The computing node 100 also provides two external interfaces, namely a management interface 140 for maintenance personnel of the semantic description system and a user interface 150 for users.
  • The interfaces can take various forms, such as a web interface, a command line tool, or a REST interface.
  • The management interface is used by maintenance personnel to input, for training the panoramic semantic model: a large number of images used for panoramic semantic description; a large number of known panoramic semantic descriptions, known third semantic descriptions, and known relation vector features of target subjects; a large number of known third semantic descriptions, known second semantic descriptions, and known posture features of target subjects; a large number of known second semantic descriptions, known first semantic descriptions, and known attribute features of target subjects; and a large number of known first semantic descriptions and known position features of target subjects.
  • The user interface is used by the user to input the images from which a panoramic semantic description needs to be extracted, and the panoramic semantic description is output to the user through the user interface.
  • It should be understood that the computing node 100 is only an example provided by the embodiment of the present application; the computing node 100 may have more or fewer components than shown, may combine two or more components, or may be implemented with a different configuration of components.
  • the cloud computing infrastructure may be a cloud service cluster 200.
  • the cloud service cluster 200 is composed of nodes and a communication network between the nodes.
  • the aforementioned node may be a computing node or a virtual machine running on the computing node.
  • Nodes can be divided into two categories according to their functions: computing nodes 210 and storage nodes 220.
  • the computing node 210 is used to run the feature extraction module 211 and the panoramic semantic model 212.
  • The storage node 220 is used to store semantic descriptions, features, images, and so on (221).
  • The cloud service cluster 200 also provides two external interfaces, namely a management interface 240 for maintenance personnel of the question and answer engine and a user interface 250 for users.
  • The interfaces can take various forms, such as a web interface, a command line tool, or a REST interface.
  • The management interface is used by maintenance personnel to input, for training the panoramic semantic model: a large number of images used for panoramic semantic description; a large number of known panoramic semantic descriptions, known third semantic descriptions, and known relation vector features of target subjects; a large number of known third semantic descriptions, known second semantic descriptions, and known posture features of target subjects; a large number of known second semantic descriptions, known first semantic descriptions, and known attribute features of target subjects; and a large number of known first semantic descriptions and known position features of target subjects.
  • The user interface is used by the user to input the images from which a panoramic semantic description needs to be extracted, and the panoramic semantic description is output to the user through the user interface.
  • It should be understood that the cloud service cluster 200 is only an example provided in the embodiment of the present application; the cloud service cluster 200 may have more or fewer components than shown, may combine two or more components, or may be implemented with a different configuration of components.
  • FIG. 11 is a schematic structural diagram of a semantic description system of another embodiment provided in this application.
  • the semantic description system shown in FIG. 8 can be implemented in the computing node 300 shown in FIG. 9.
  • the computing node 300 in this embodiment includes one or more processors 311, a communication interface 312, and a memory 313.
  • the processor 311, the communication interface 312, and the memory 313 may be connected through a bus 324.
  • The processor 311 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an ASIC (Application Specific Integrated Circuit), and so on.
  • The processor 311 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 313, which enable the computing node 300 to provide a wide variety of services.
  • The processor 311 can execute programs or process data to perform at least a part of the methods discussed herein.
  • the processor 311 can run the feature extraction module and the panoramic semantic model as shown in FIG. 8.
  • the communication interface 312 may be a wired interface (for example, an Ethernet interface) for communicating with other computing nodes or users.
  • The memory 313 may include volatile memory (Volatile Memory), such as random access memory (Random Access Memory, RAM); the memory may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); the memory may also include a combination of the foregoing types of memory.
  • the memory 313 may store program codes and program data.
  • the program code includes feature extraction module code and panoramic semantic model code.
  • Program data includes, for training the panoramic semantic model: a large number of images used for panoramic semantic description; a large number of known panoramic semantic descriptions, known third semantic descriptions, and known relation vector features of target subjects; a large number of known third semantic descriptions, known second semantic descriptions, and known posture features of target subjects; a large number of known second semantic descriptions, known first semantic descriptions, and known attribute features of target subjects; and a large number of known first semantic descriptions and known position features of target subjects.
  • The processor 311 is configured to execute the following steps by calling the program code in the memory 313:
  • The processor 311 is configured to obtain the influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each frame of the t frames of images and the relation vector features between the h target subjects.
  • The own features of each target subject include position features, attribute features, and posture features, where t and h are natural numbers greater than 1.
  • The processor 311 is configured to obtain a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes a description of the relationships between target subjects, between a target subject and a target subject's action, and between the actions of target subjects.
  • For brevity, this embodiment does not expand on the definitions of images, target subjects, panoramic semantic descriptions, and so on; please refer to FIG. 2, FIG. 3, and the related descriptions.
  • This embodiment also does not introduce the feature vectors, position features, attribute features, posture features, and relation vector features or their extraction methods; please refer to FIG. 4 and the related descriptions.
  • In addition, the embodiment of the present application does not give a detailed introduction to the panoramic semantic model or how to use it to produce a panoramic semantic description of an image; for details, please refer to FIG. 5, FIG. 6, and the related descriptions.
  • FIG. 12 is a schematic structural diagram of a semantic description system according to another embodiment provided in this application.
  • the semantic description system of this embodiment can be implemented in a cloud service cluster as shown in FIG. 10.
  • the cloud service cluster includes at least one computing node 410 and at least one storage node 420.
  • the computing node 410 includes one or more processors 411, a communication interface 412, and a memory 413.
  • the processor 411, the communication interface 412, and the memory 413 may be connected through a bus 424.
  • The processor 411 includes one or more general-purpose processors.
  • A general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an ASIC (Application Specific Integrated Circuit), and so on; it can be a dedicated processor used only for the computing node 410 or can be shared with other computing nodes 410.
  • The processor 411 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 413, which enable the computing node 410 to provide a wide variety of services. For example, the processor 411 can execute programs or process data to perform at least a part of the methods discussed herein.
  • the processor 411 can run the feature extraction module and the panoramic semantic model as shown in Fig. 8.
  • the communication interface 412 may be a wired interface (for example, an Ethernet interface) for communicating with other computing nodes or users.
  • The communication interface 412 may adopt a protocol family on top of TCP/IP, for example, the RAAS protocol, the remote function call (Remote Function Call, RFC) protocol, the simple object access protocol (Simple Object Access Protocol, SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on.
  • The memory 413 may include volatile memory (Volatile Memory), such as random access memory (Random Access Memory, RAM); the memory may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); the memory may also include a combination of the foregoing types of memory.
  • The storage node 420 includes one or more processors 421, a communication interface 422, and a memory 423; the processor 421, the communication interface 422, and the memory 423 may be connected through a bus 424.
  • The processor 421 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on; it can be a dedicated processor used only for the storage node 420 or can be shared with other storage nodes 420.
  • The processor 421 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 423, which enable the storage node 420 to provide a wide variety of services. For example, the processor 421 can execute programs or process data to perform at least a part of the methods discussed herein.
  • the communication interface 422 may be a wired interface (for example, an Ethernet interface) for communicating with other computing devices or users.
  • The storage node 420 includes one or more storage controllers 421 and a storage array 425; the storage controller 421 and the storage array 425 may be connected through a bus 426.
  • The storage controller 421 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on; it can be a dedicated processor used only for a single storage node 420 or can be shared with the computing node 410 or other storage nodes 420. It can be understood that, in this embodiment, each storage node includes a storage controller; in other embodiments, multiple storage nodes may share a storage controller, which is not specifically limited here.
  • the memory array 425 may include multiple memories.
  • the memory may be a non-volatile memory, such as ROM, flash memory, HDD or SSD memory, and may also include a combination of the above types of memory.
  • The storage array may be composed of multiple HDDs or multiple SSDs, or the storage array may be composed of both HDDs and SSDs.
  • Multiple memories are combined in different ways with the assistance of the storage controller 421 to form a memory group, thereby providing higher storage performance than a single memory as well as data backup capability.
  • the memory array 425 may include one or more data centers. Multiple data centers can be set up at the same location, or at different locations, and there is no specific limitation here.
  • the memory array 425 may store program codes and program data.
  • the program code includes feature extraction module code and panoramic semantic model code.
  • Program data includes, for training the panoramic semantic model: a large number of images used for panoramic semantic description; a large number of known panoramic semantic descriptions, known third semantic descriptions, and known relation vector features of target subjects; a large number of known third semantic descriptions, known second semantic descriptions, and known posture features of target subjects; a large number of known second semantic descriptions, known first semantic descriptions, and known attribute features of target subjects; and a large number of known first semantic descriptions and known position features of target subjects.
  • The computing node 410 is used to execute the following steps by calling the program code in the storage node 420:
  • The computing node 410 is used to obtain the influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each frame of the t frames of images and the relation vector features between the h target subjects.
  • The own features of each target subject include position features, attribute features, and posture features, where t and h are natural numbers greater than 1.
  • The computing node 410 is used to obtain a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes a description of the relationships between target subjects, between a target subject and a target subject's action, and between the actions of target subjects.
  • For brevity, this embodiment does not expand on the definitions of images, target subjects, panoramic semantic descriptions, and so on; please refer to FIG. 2, FIG. 3, and the related descriptions.
  • This embodiment also does not introduce the feature vectors, position features, attribute features, posture features, and relation vector features or their extraction methods; please refer to FIG. 4 and the related descriptions.
  • In addition, the embodiment of the present application does not give a detailed introduction to the panoramic semantic model or how to use it to produce a panoramic semantic description of an image; for details, please refer to FIG. 5, FIG. 6, and the related descriptions.
  • The above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in multi-frame images, together with the relation vector features between those target subjects, thereby better reflecting the relationships between subjects, between subjects and actions, and between actions in the images.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center integrated with one or more available media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, a storage disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a Solid State Disk (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

This application provides an image analysis method, including: obtaining influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each of the t frames of images and the relation vector features between the h target subjects, and the own features of each target subject include a position feature, an attribute feature, and a posture feature, where t and h are natural numbers greater than 1; and obtaining a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes a description of the relationships between target subjects, between a target subject and a target subject's action, and between the actions of target subjects.

Description

Image analysis method and system — Technical Field

This application relates to the field of image processing, and in particular to an image analysis method and system.

Background

The task of image captioning is to generate a corresponding text description for a given image. Image captioning can automatically extract information from an image and generate a corresponding text description from that information, thereby converting images into knowledge. For example, image captioning can generate the text description "a man is surfing at sea" for the image shown in FIG. 1A.

At present, image captioning can only produce a low-level semantic description of an image; that is, it can only describe a single subject performing a single action (for example, a man surfing at sea in FIG. 1A) or multiple subjects performing a single action (for example, a group of students doing morning exercises in FIG. 1B). It cannot produce a panoramic semantic description of an image; that is, it cannot describe the relationships between multiple subjects, between subjects and actions, and between actions (for example, in FIG. 1C, a man sees a woman knocked down by a car).

Summary

This application provides an image analysis method and system capable of producing a panoramic semantic description of an image.

According to a first aspect, an image analysis method is provided, including:

obtaining influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each of the t frames of images and the relation vector features between the h target subjects; the own features of each target subject include a position feature, an attribute feature, and a posture feature, where t and h are natural numbers greater than 1; the position feature represents the position of the corresponding target subject in the image, the attribute feature represents the attributes of the corresponding target subject, the posture feature represents the action of the corresponding target subject, and the relation vector feature represents the relationship between target subjects;

obtaining a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes a description of the relationships between target subjects, between a target subject and a target subject's action, and between the actions of target subjects.
The above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in multi-frame images, together with the relation vector features between those target subjects, thereby better reflecting the relationships between subjects, between subjects and actions, and between actions in the images.

In some possible designs, obtaining the influencing factors of the panoramic semantic description includes:

performing feature extraction on the t frames of images to obtain t feature vectors;

performing position feature extraction on the t feature vectors to obtain the position features;

performing attribute feature extraction on the t feature vectors to obtain the attribute features;

performing posture feature extraction on the t feature vectors to obtain the posture features;

performing relation vector feature extraction on the t feature vectors to obtain the relation vector features.

In some possible designs, the same convolutional neural network performs the extraction of the position features, the attribute features, the posture features, and the relation vector features.

In the above solution, because a single convolutional neural network performs all of these extractions, the feature vectors extracted beforehand can be reused for each of them, avoiding repeated feature vector extraction and thereby reducing computation. That is, there is no need to perform one feature vector extraction for the position features, another for the attribute features, another for the posture features, and yet another for the relation vector features.

In some possible designs, region-of-interest pooling is performed on feature vector i according to target subject a and target subject b in image i, so as to obtain the feature vector v_{a,b} corresponding to target subjects a and b, where i, a, and b are natural numbers, 0 < i ≤ t, 1 ≤ a, b ≤ h, and feature vector i is extracted from image i;

region-of-interest pooling is performed according to target subject a, so as to obtain the feature vector v_{a,a} corresponding to target subject a;

the relation vector feature between target subject a and target subject b in image i is then computed according to a formula rendered only as images in the source (not reproduced here), in which w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), sigmoid() is the S-shaped function, v_{a,b} is the feature vector corresponding to target subjects a and b, v_{a,a} is the feature vector corresponding to target subject a, and w() is the inner product function.

In some possible designs, obtaining the panoramic semantic description according to the influencing factors includes:

extracting a first semantic description according to the position features;

extracting a second semantic description according to the attribute features and the first semantic description;

extracting a third semantic description according to the posture features and the second semantic description;

extracting the panoramic semantic description according to the relation vector features and the third semantic description.

In some possible designs, the same recurrent neural network performs the extraction of the first semantic description, the second semantic description, and the third semantic description.
According to a second aspect, an image analysis system is provided, including a feature extraction module and a panoramic semantic model.

The feature extraction module is configured to obtain the influencing factors of the panoramic semantic description, where the influencing factors include the own features of h target subjects in each of t frames of images and the relation vector features between the h target subjects; the own features include position features, attribute features, and posture features, where t and h are natural numbers greater than 1; the position feature is used to represent the position of the corresponding target subject in the image, the attribute feature is used to represent the attributes of the corresponding target subject, the posture feature is used to represent the action of the corresponding target subject, and the relation vector feature is used to represent the relationship between target subjects.

The panoramic semantic model is configured to obtain a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes a description of the relationships between target subjects, between subjects and actions, and between actions.

In some possible designs, the feature extraction module includes a feature vector extraction unit, a position feature extraction unit, an attribute feature extraction unit, a posture feature extraction unit, and a relation vector feature unit:

the feature vector extraction unit is configured to perform feature extraction on the t frames of images to obtain t feature vectors;

the position feature extraction unit is configured to perform position feature extraction on the t feature vectors to obtain the position features;

the attribute feature extraction unit is configured to perform attribute feature extraction on the t feature vectors to obtain the attribute features;

the posture feature extraction unit is configured to perform posture feature extraction on the t feature vectors to obtain the posture features;

the relation vector feature unit is configured to perform relation vector feature extraction on the t feature vectors to obtain the relation vector features.

In some possible designs, the feature extraction module includes a convolutional neural network, and the feature vector extraction unit, the position feature extraction unit, the attribute feature extraction unit, the posture feature extraction unit, and the relation vector feature extraction unit are integrated in the convolutional neural network.

In some possible designs, the relation vector feature extraction unit is configured to perform region-of-interest pooling on feature vector i according to target subject a and target subject b in image i, so as to obtain the feature vector v_{a,b} corresponding to target subjects a and b, where i, a, and b are natural numbers and 0 < i ≤ t, 1 ≤ a, b ≤ h; to perform region-of-interest pooling according to target subject a, so as to obtain the feature vector v_{a,a} corresponding to target subject a; and to compute the relation vector feature between target subject a and target subject b in image i according to a formula rendered only as images in the source (not reproduced here), in which w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), sigmoid() is the S-shaped function, v_{a,b} is the feature vector corresponding to target subjects a and b, v_{a,a} is the feature vector corresponding to target subject a, and w() is the inner product function.

In some possible designs, the panoramic semantic model includes a first time series feature extraction unit, a second time series feature extraction unit, a third time series feature extraction unit, and a fourth time series feature extraction unit:

the first time series feature extraction unit is configured to extract a first semantic description according to the position features;

the second time series feature extraction unit is configured to extract a second semantic description according to the attribute features and the first semantic description;

the third time series feature extraction unit is configured to extract a third semantic description according to the posture features and the second semantic description;

the fourth time series feature extraction unit is configured to extract the panoramic semantic description according to the relation vector features and the third semantic description.

In some possible designs, the panoramic semantic model includes a recurrent neural network, and the first, second, third, and fourth time series feature extraction units are each a layer in the recurrent neural network.

According to a third aspect, a computing node is provided, including a processor and a memory, the processor executing:

obtaining influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each of the t frames of images and the relation vector features between the h target subjects; the own features of each target subject include a position feature, an attribute feature, and a posture feature, where t and h are natural numbers greater than 1; the position feature represents the position of the corresponding target subject in the image, the attribute feature represents the attributes of the corresponding target subject, the posture feature represents the action of the corresponding target subject, and the relation vector feature represents the relationship between target subjects;

obtaining a panoramic semantic description according to the influencing factors, where the panoramic semantic description includes a description of the relationships between target subjects, between a target subject and a target subject's action, and between the actions of target subjects.

The above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in multi-frame images, together with the relation vector features between those target subjects, thereby better reflecting the relationships between subjects, between subjects and actions, and between actions in the images.

The possible designs of the third aspect repeat those of the first aspect: the processor is configured to perform the five extractions (feature vectors and position, attribute, posture, and relation vector features); the same convolutional neural network may perform the four feature extractions, reusing the previously extracted feature vectors and thereby reducing computation; the relation vector feature between target subjects a and b in image i is computed via region-of-interest pooling and w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})) as described above; the processor extracts the first, second, and third semantic descriptions and the panoramic semantic description in cascade; and the same recurrent neural network may perform the extraction of the first, second, and third semantic descriptions.

According to a fourth aspect, a computing node cluster is provided, including at least one computing node, each computing node including a processor and a memory, the processor executing code in the memory to perform the method according to any implementation of the first aspect.

According to a fifth aspect, a computer program product is provided; when the computer program product is read and executed by a computer, the method according to any implementation of the first aspect is performed.

According to a sixth aspect, a non-transitory computer storage medium is provided, including instructions that, when run on at least one computing node of a computing node cluster, cause the computing node cluster to perform the method according to any implementation of the first aspect.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of this application or in the background more clearly, the following briefly introduces the accompanying drawings used in the embodiments or the background.

FIG. 1A to FIG. 1C are schematic diagrams of some images used for image captioning;
FIG. 2 is a schematic diagram of a single-frame image used for panoramic semantic description according to an embodiment of this application;
FIG. 3 is a schematic diagram of multi-frame images used for panoramic semantic description according to an embodiment of this application;
FIG. 4 is a schematic diagram of the extraction of the position features, attribute features, posture features, and relation vector features in this application;
FIG. 5 is a schematic diagram of a panoramic semantic model according to an embodiment of this application;
FIG. 6 is a schematic diagram of a panoramic semantic model according to another embodiment of this application;
FIG. 7 is a flowchart of a semantic description method according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a semantic description system according to an implementation provided in this application;
FIG. 9 is a schematic structural diagram of a computing node according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a cloud service cluster according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of a semantic description system according to another implementation provided in this application;
FIG. 12 is a schematic structural diagram of a semantic description system according to yet another implementation provided in this application.
Detailed Description

The terms used in the embodiments of this application are only for explaining specific embodiments of the present invention and are not intended to limit the present invention.

First, the single image used for panoramic semantic description in the embodiments of this application is described in detail.

FIG. 2 shows a schematic diagram of a single-frame image used for panoramic semantic description according to an implementation applicable to the embodiments of this application. The single-frame image used for panoramic semantic description in this embodiment usually includes multiple target subjects, where a target subject may be one or more of a person, an animal, an object, and so on. Taking FIG. 2 as an example, the target subjects in the image shown in FIG. 2 include a man, a woman, and a vehicle. Different target subjects can perform different actions, where an action may be one or more of drinking water, reading, doing exercises, playing basketball, playing football, running, swimming, and so on. In FIG. 2, the man's action is looking at the woman, the woman's action is falling down, and the vehicle's action is crashing into the woman. It can be understood that FIG. 2 is merely an example; in practical applications, the target subjects may be other subjects, the number of target subjects may be larger, the actions of the target subjects may be other actions, and so on, which are not specifically limited here.

In a specific embodiment of this application, as shown in FIG. 3, the image analysis system can cut out, in chronological order, t frames of images I_1, I_2, …, I_t from a video for panoramic semantic description, where t is a natural number. The images I_1, I_2, …, I_t all include the same target subjects; for example, image I_1 includes target subject 1, target subject 2, and target subject 3; image I_2 includes target subject 1, target subject 2, and target subject 3; …; and image I_t also includes target subject 1, target subject 2, and target subject 3. It can be understood that the time intervals between adjacent frames among the t frames may or may not be equal, which is not specifically limited here.

In a specific embodiment of this application, the image analysis system can produce a panoramic semantic description of image I_t through the panoramic semantic model, whose input variables are the influencing factors of the panoramic semantic description. The influencing factors include the own features (position features, attribute features, and posture features) of each target subject in images I_1 to I_t and the relation vector features between the target subjects.

The position feature is used to represent the position of the corresponding target subject in the corresponding image and can be expressed as (x, y, w, h), where x and y are the abscissa and ordinate of the center point of the target subject in the image, w is the width of the target subject in the image, and h is the height of the target subject in the image. The attribute feature is used to represent the attributes of the corresponding target subject; attribute features can include many kinds, and different target subjects usually have different attribute features. Taking a human target subject as an example, the attribute features can include one or more of gender, hairstyle, clothing type, clothing color, height, body shape, and so on. The posture feature is used to represent the action of the corresponding target subject; posture features likewise include many kinds, and different target subjects usually have different posture features. Taking a human target subject as an example, the posture features can include one or more of falling, lying down, walking, running, jumping, and so on. The relation feature vector is a vector representing the relationship between two target subjects.
Taking the case where each of the images I_1, I_2, …, I_t includes h target subjects as an example, the influencing factors of the panoramic semantic description specifically include, for each image I_i (i = 1, 2, …, t), the following (the per-feature symbols below reconstruct formula images that are not reproduced in the source):

the own features of the h target subjects in image I_i, namely the position features P_1^i, P_2^i, …, P_h^i, the attribute features A_1^i, A_2^i, …, A_h^i, and the posture features S_1^i, S_2^i, …, S_h^i, where the position feature P_a^i, attribute feature A_a^i, and posture feature S_a^i constitute the own features of target subject a in image I_i (a = 1, 2, …, h); and

the relation vector features between the h target subjects in image I_i, namely V_{1,2}^i, V_{1,3}^i, …, V_{1,h}^i, V_{2,3}^i, …, V_{2,h}^i, …, V_{h-1,h}^i, where V_{a,b}^i is the relation vector feature between target subject a and target subject b in image I_i.

It should be understood that the above examples of influencing factors are merely illustrative; in practical applications, the influencing factors of the panoramic semantic description may also include other influencing factors, which are not specifically limited here.
In a specific embodiment of this application, the position features, attribute features, and posture features of the target subjects in images I_1, I_2, …, I_t, as well as the relation vector features between the target subjects, can be computed from the feature vectors V_1, V_2, …, V_t of the respective images. That is, the features of the target subjects in image I_1 can be computed from the feature vector V_1 of image I_1, those in image I_2 from V_2, …, and those in image I_t from V_t.

As shown in FIG. 4, the feature vectors V_1, V_2, …, V_t of images I_1, I_2, …, I_t can be obtained as follows. Taking image I_i as an example, the feature vector V_i can be obtained by inputting image I_i into the feature vector extraction unit, where i is a natural number and 1 ≤ i ≤ t. The feature vector extraction unit can include, in order, an input layer, a convolution layer, a pooling layer, and a fully connected layer.

Input layer: assume the input of the input layer is image I_i and the output equals the input, that is, no processing is applied to the input. For simplicity it is assumed here that the input layer applies no processing; in practical applications, normalization and other processing can be applied at the input layer, which is not specifically limited here.

Convolution layer: the image I_i output by the input layer is the input of the convolution layer, and convolution with n convolution kernels K_l (l = 1, 2, …, n) generates n feature maps a_l (l = 1, 2, …, n), where each feature map a_l is generated as follows:

C_l = conv2(I, K_l, 'valid') + b_l
u_l = C_l
a_l = f(u_l)

where conv2 denotes the convolution of image I with kernel K_l, 'valid' is the padding mode, b_l is a bias value, u_l is the result of the convolution computation, and f() is the activation function; the present invention uses the relu function.

Pooling layer: the n feature maps a_l output by the convolution layer are the input of the pooling layer, and after pooling through the pooling window, n pooled maps b_l (l = 1, 2, …, n) are generated, where each pooled map b_l is generated as follows:

b_l = maxPool(a_l)

where maxPool denotes max pooling.

Fully connected layer: the n pooled maps b_l (l = 1, 2, …, n) are unrolled in order into vectors and concatenated in order into one long vector, which serves as the input of the fully connected layer; the output of the fully connected layer is the feature vector V_i of image I_i.

Among the parameters of the above feature vector extraction unit, the convolution kernels K_l (including their elements, size, and stride), the bias values b_l, f(), and β_l can be set manually according to the features to be extracted (position features, attribute features, posture features, and relation vector features), the size of image I_i, and so on. Taking kernel K_l as an example, when the features to be extracted are position features, the elements of K_l can be those of a Sobel operator; when image I_i is relatively large, the size of K_l can also be relatively large, and conversely, when image I_i is relatively small, the size of K_l can also be relatively small; likewise, the stride of K_l can be relatively large for a large image I_i and relatively small for a small one.

It should be understood that the above feature vector extraction unit is merely an example; in practical applications, the feature vector extraction unit may take other forms, for example, including more convolution layers and more pooling layers, or padding image I_i, which is not specifically limited here.

For brevity, only the extraction of the feature vector V_i of image I_i is stated above; in fact, the feature vectors V_1, V_2, …, V_t of images I_1, I_2, …, I_t are all extracted in a manner similar to that of V_i, which is not repeated here.
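The following Python sketch illustrates the feature vector extraction unit just described: an input layer that passes the image through, one 'valid' convolution with bias, relu activation, max pooling, and a fully connected layer producing V_i. The kernel count, kernel size, image size, and output dimension are assumptions, and PyTorch is used only for illustration.

```python
import torch
from torch import nn

class FeatureVectorUnit(nn.Module):
    """Input -> conv ('valid' padding, bias b_l) -> relu (a_l = f(u_l)) ->
    max pool (b_l = maxPool(a_l)) -> flatten pooled maps into one long
    vector -> fully connected layer whose output is the feature vector V_i."""
    def __init__(self, n_kernels=8, feat_dim=128, img_size=64):
        super().__init__()
        self.conv = nn.Conv2d(3, n_kernels, kernel_size=5)  # no padding: 'valid'
        self.pool = nn.MaxPool2d(2)
        side = (img_size - 5 + 1) // 2                      # 30 for img_size=64
        self.fc = nn.Linear(n_kernels * side * side, feat_dim)

    def forward(self, img):                                 # img: (1, 3, H, W)
        a = torch.relu(self.conv(img))                      # feature maps a_l
        b = self.pool(a)                                    # pooled maps b_l
        return self.fc(b.flatten(1))                        # feature vector V_i

V_i = FeatureVectorUnit()(torch.randn(1, 3, 64, 64))        # shape (1, 128)
```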
As shown in FIG. 4, the position features of the target subjects in images I_1, I_2, …, I_t can be obtained as follows. Taking image I_i as an example and assuming image I_i includes h target subjects, the position features P_1^i, …, P_h^i of the h target subjects in image I_i can be obtained by inputting the feature vector V_i into the position feature extraction unit, where i is a natural number and 1 ≤ i ≤ t. The position feature extraction unit can be expressed as:

y_1 = g_1(x_1)

where x_1 can be the feature vector V_i of image I_i, y_1 can be the position features P_1^i, …, P_h^i of the h target subjects in image I_i, and g_1() is the mapping between the feature vector V_i and the position features, which can be obtained by training on a large number of known images and known position features of target subjects. For brevity, only the extraction of the position features of the h target subjects in image I_i is stated above; in fact, the position features of the h target subjects of each of the images I_1, I_2, …, I_t are extracted in a similar manner, which is not repeated here.

As shown in FIG. 4, the attribute features of the target subjects in images I_1, I_2, …, I_t can be obtained as follows. Taking image I_i as an example and assuming image I_i includes h target subjects, the attribute features A_1^i, …, A_h^i of the h target subjects in image I_i can be obtained by inputting the feature vector V_i into the attribute feature extraction unit, where i is a natural number and 1 ≤ i ≤ t. The attribute feature extraction unit can be expressed as:

y_2 = g_2(x_1)

where x_1 can be the feature vector V_i of image I_i, y_2 can be the attribute features of the h target subjects in image I_i, and g_2() is the mapping between the feature vector V_i and the attribute features, which can be obtained by training on a large number of known images and known attribute features of target subjects. The attribute features of the target subjects of the other images are extracted in a similar manner, which is not repeated here.

As shown in FIG. 4, the posture features of the target subjects in images I_1, I_2, …, I_t can be obtained as follows. Taking image I_i as an example and assuming image I_i includes h target subjects, the posture features S_1^i, …, S_h^i of the h target subjects in image I_i can be obtained by inputting the feature vector V_i into the posture feature extraction unit, where i is a natural number and 1 ≤ i ≤ t. The posture feature extraction unit can be expressed as:

y_3 = g_3(x_1)

where x_1 can be the feature vector V_i of image I_i, y_3 can be the posture features of the h target subjects in image I_i, and g_3() is the mapping between the feature vector V_i and the posture features, which can be obtained by training on a large number of known images and known posture features of target subjects. The posture features of the target subjects of the other images are extracted in a similar manner, which is not repeated here. (The per-feature symbols above reconstruct formula images that are not reproduced in the source.)
As shown in FIG. 4, the relation vector features between the target subjects in images I_1, I_2, …, I_t can be obtained as follows. Taking image I_i as an example and assuming image I_i includes h target subjects, the relation vector features between the h target subjects in image I_i include V_{a,b}^i for each pair of target subjects, and each relation vector feature V_{a,b}^i can be computed by the relation vector feature extraction unit, where i, a, and b are natural numbers, 1 ≤ i ≤ t, and 1 ≤ a, b ≤ h:

the relation vector feature extraction unit is used to perform region-of-interest pooling (ROI pooling) according to target subject a and target subject b, so as to obtain the feature vector v_{a,b} corresponding to target subjects a and b;

the relation vector feature extraction unit is used to perform ROI pooling according to target subject a, so as to obtain the feature vector v_{a,a} corresponding to target subject a;

the relation vector feature extraction unit is used to compute the relation vector feature according to a formula rendered only as images in the source (not reproduced here), in which w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), sigmoid() is the S-shaped function, v_{a,b} is the feature vector corresponding to target subjects a and b, v_{a,a} is the feature vector corresponding to target subject a, and w() is the inner product function. w_{a,b} can be obtained by training on a large number of known target subjects and known feature vectors.

For brevity, only the extraction of the relation vector features between the h target subjects in image I_i is stated above; in fact, the relation vector features between the h target subjects of each of the images I_1, I_2, …, I_t are extracted in a similar manner, which is not repeated here.
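The sketch below illustrates the pairwise weight computation stated above, w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})) with w() the inner product. Because the final combination formula is not reproduced in the source, the last step (scaling the pair feature by the weight) is an assumption rather than the patent's own formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relation_vector_feature(v_ab, v_aa):
    """v_ab: ROI-pooled feature for the pair (a, b); v_aa: for subject a alone.
    w_ab follows the stated formula; the weighted pair feature is assumed."""
    w_ab = sigmoid(np.dot(v_ab, v_aa))   # w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a}))
    return w_ab * v_ab                   # assumed combination step

rng = np.random.default_rng(0)
print(relation_vector_feature(rng.normal(size=8), rng.normal(size=8)).shape)
```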
The above feature vector extraction, position feature extraction, attribute feature extraction, posture feature extraction, and relation vector feature extraction can be implemented by different convolutional neural networks (Convolutional Neural Networks, CNN) or be integrated in the same convolutional neural network, which is not specifically limited here. The convolutional neural network can include VGGNet, ResNet, FPNet, and so on, which are not specifically limited here. When the five extractions are integrated in one convolutional neural network, they can be implemented by different layers of that network.

In a specific embodiment of this application, the influencing factors affect the panoramic semantic description as follows: the position features of the target subjects in images I_1, …, I_t can provide a first semantic description about the positions of the target subjects relative to one another; combining the attribute features of the target subjects with the first semantic description yields a second semantic description that incorporates the subjects' attributes; combining the posture features of the target subjects with the second semantic description yields a third semantic description; and finally, combining the relation vector features between the target subjects with the third semantic description yields the panoramic semantic description.

Taking the example shown in FIG. 3: first, from the position features of the man, the woman, and the vehicle in images I_1, …, I_t, the first semantic description "object A and object B are on the left of object C" can be obtained; then, combining the attribute features of the man, the woman, and the vehicle with the first semantic description yields the second semantic description "the woman and the car are on the left of the man"; next, combining the posture features of the three with the second semantic description yields the third semantic description; and finally, combining the relation vector features with the third semantic description yields the panoramic semantic description "the man on the right sees the woman on the left knocked down by the car".

It should be understood that the example shown in FIG. 3 is merely a specific example; in other embodiments, panoramic semantic descriptions can be produced for other images, which is not specifically limited here.

In a specific embodiment of this application, the panoramic semantic model can be expressed as:

y = Panorama(x)

where x is the influencing factors of the panoramic semantic description, y is the panoramic semantic description, and Panorama() is the mapping between the influencing factors and the panoramic semantic description. Panorama() can be obtained by training on a large number of known influencing factors and known panoramic semantic descriptions. In a specific embodiment, the panoramic semantic model can be as shown in FIG. 5:
the position features of the h target subjects in images I_1, I_2, …, I_t are input into time series feature extraction unit 1 to obtain the first semantic description;

the attribute features of the h target subjects in images I_1, I_2, …, I_t, combined with the first semantic description, are input into time series feature extraction unit 2 to obtain the second semantic description;

the posture features of the h target subjects in images I_1, I_2, …, I_t, combined with the second semantic description, are input into time series feature extraction unit 3 to obtain the third semantic description;

the relation vector features between the h target subjects in images I_1, I_2, …, I_t, combined with the third semantic description, are input into time series feature extraction unit 4 to obtain the panoramic semantic description.

It can be understood that the extraction of the first semantic description, the second semantic description, the third semantic description, and the panoramic semantic description can be implemented by different recurrent neural networks (Recurrent Neural Networks, RNN) or by the same recurrent neural network, which is not specifically limited here. The recurrent neural network can include a long short-term memory model (Long Short-Term Memory, LSTM), a bidirectional long short-term memory model (BiLSTM), and so on, which are not specifically limited here. When the extraction of the first, second, and third semantic descriptions and the panoramic semantic description is implemented in the same recurrent neural network, they can be implemented by different layers of that recurrent neural network.
Taking the case where time series feature extraction units 1 to 4 are all LSTMs as an example, in a specific embodiment the panoramic semantic model can be as shown in FIG. 6. The following is the computation in neuron 1, neuron 2, …, neuron t of the first time series feature extraction unit. In the source these equations are rendered as formula images; they are reconstructed here in standard LSTM notation, with x_j denoting the position features of the h target subjects in image I_j, h_{1,0} the initial output value, and C_{1,0} the initial input value.

In neuron j (j = 1, 2, …, t):

First, the forget value f_{1,j-1} is computed from the position features x_j and the previous output value h_{1,j-1}:

f_{1,j-1} = σ(W_f · [h_{1,j-1}, x_j] + b_f)

where σ() is the sigmoid function, b_f is a bias value, and W_f is a weight matrix.

Then, the input value C_{1,j} is computed from the position features x_j, the previous input value C_{1,j-1}, the previous output value h_{1,j-1}, and the forget value f_{1,j-1}:

i_{1,j-1} = σ(W_i · [h_{1,j-1}, x_j] + b_i)
C~_{1,j} = tanh(W_C · [h_{1,j-1}, x_j] + b_C)
C_{1,j} = f_{1,j-1} ∘ C_{1,j-1} + i_{1,j-1} ∘ C~_{1,j}

where σ() is the sigmoid function, tanh is the activation function, W_i and W_C are weight matrices, and b_i and b_C are bias values.

Finally, the output value h_{1,j} is computed from the position features x_j, the previous output value h_{1,j-1}, and the input value C_{1,j}:

o_{1,j-1} = σ(W_o · [h_{1,j-1}, x_j] + b_o)
h_{1,j} = o_{1,j-1} · tanh(C_{1,j})

where σ() is the sigmoid function, tanh is the activation function, W_o is a weight matrix, and b_o is a bias value.

The outputs h_{1,1} to h_{1,t} constitute the first semantic description.

It can be understood that the initial output value h_{1,0}, the initial input value C_{1,0}, and the bias values can be set manually, while the weight matrices are obtained by training on a large number of known first semantic descriptions and known position features of target subjects.
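The per-neuron computation above is the standard LSTM cell; the NumPy sketch below writes it out once, using the first unit's inputs (per-frame position features) as an example. The dimensions and random weights are assumptions, and the second to fourth units would reuse the same cell with their respective inputs and parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_neuron(x_j, h_prev, C_prev, W, b):
    """One neuron of a time series feature extraction unit; W/b hold the
    weight matrices and bias values named in the text."""
    z = np.concatenate([h_prev, x_j])
    f = sigmoid(W["f"] @ z + b["f"])          # forget value f_{1,j-1}
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate state
    C = f * C_prev + i * C_tilde              # input value C_{1,j}
    o = sigmoid(W["o"] @ z + b["o"])          # output gate o_{1,j-1}
    h = o * np.tanh(C)                        # output value h_{1,j}
    return h, C

d, hdim = 4, 8                                # assumed feature/state sizes
rng = np.random.default_rng(1)
W = {k: rng.normal(size=(hdim, hdim + d)) for k in "fiCo"}
b = {k: np.zeros(hdim) for k in "fiCo"}
h, C = np.zeros(hdim), np.zeros(hdim)         # h_{1,0}, C_{1,0}
for x in rng.normal(size=(5, d)):             # h_{1,1} ... h_{1,t} form the
    h, C = lstm_neuron(x, h, C, W, b)         # first semantic description
```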
The computations in neuron 1 to neuron t of the second, third, and fourth time series feature extraction units follow exactly the same pattern as the first unit's equations above, each with its own weight matrices, bias values, initial output value h_{k,0}, and initial input value C_{k,0} (k = 2, 3, 4), and with the per-frame inputs replaced respectively by the attribute features of the h target subjects in each image (second unit), the posture features of the h target subjects (third unit), and the relation vector features between the h target subjects (fourth unit). In the source these computations are written out neuron by neuron as formula images structurally identical to the first unit's equations and are not repeated here.

The outputs h_{2,1} to h_{2,t} constitute the second semantic description; the outputs h_{3,1} to h_{3,t} constitute the third semantic description; and the outputs h_{4,1} to h_{4,t} constitute the panoramic semantic description.

It can be understood that, for each unit, the initial output and input values and the bias values can be set manually. The weight matrices of the second unit are obtained by training on a large number of known second semantic descriptions, known first semantic descriptions, and known attribute features of target subjects; those of the third unit on a large number of known third semantic descriptions, known second semantic descriptions, and known posture features of target subjects; and those of the fourth unit on a large number of known panoramic semantic descriptions, known third semantic descriptions, and known relation vector features of target subjects.
As shown in FIG. 7, FIG. 7 is a schematic flowchart of an image analysis method provided in this application. The image analysis method of this implementation includes the following steps:

S101: The image analysis system obtains the influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each frame of the t frames of images and the relation vector features between the h target subjects; the own features of each target subject include a position feature, an attribute feature, and a posture feature, where t and h are natural numbers greater than 1.

In a specific embodiment of this application, the position feature is used to represent the position of the corresponding target subject in the image and can be expressed as (x, y, w, h), where x and y are the abscissa and ordinate of the center point of the target subject in the image, w is the width of the target subject in the image, and h is the height of the target subject in the image.

In a specific embodiment of this application, the attribute features can include many kinds, and different target subjects usually have different attribute features. Taking a human target subject as an example, the attribute features can include one or more of gender, hairstyle, clothing type, clothing color, height, body shape, and so on.

In a specific embodiment of this application, the posture features of the target subject likewise include many kinds, and different target subjects usually have different posture features. Taking a human target subject as an example, the posture features can include one or more of falling, lying down, walking, running, jumping, and so on.

In a specific embodiment of this application, the relation feature vector is a vector representing the relationship between two target subjects.

S102: The image analysis system obtains a panoramic semantic description according to the influencing factors.

In a specific embodiment of this application, the panoramic semantic model reflects the mapping relationship between the influencing factors and the panoramic semantic description. The panoramic semantic model can be expressed as:

y = Panorama(x)

where x is the influencing factors of the panoramic semantic description, y is the panoramic semantic description, and Panorama() is the mapping between the influencing factors and the panoramic semantic description. Panorama() can be obtained by training on a large number of known influencing factors and known panoramic semantic descriptions.

In a specific embodiment of this application, the panoramic semantic description can describe the relationships between target subjects, between target subjects and actions, and between actions.

In a specific embodiment of this application, feature extraction is performed on the t frames of images to obtain t feature vectors; position feature extraction is performed on the t feature vectors to obtain the position features; attribute feature extraction is performed on the t feature vectors to obtain the attribute features; posture feature extraction is performed on the t feature vectors to obtain the posture features; and relation vector feature extraction is performed on the t feature vectors to obtain the relation vector features.

In a specific embodiment of this application, the above feature vector extraction, position feature extraction, attribute feature extraction, posture feature extraction, and relation vector feature extraction can be performed by different convolutional neural networks (Convolutional Neural Networks, CNN) or be integrated in the same convolutional neural network, which is not specifically limited here. When they are integrated in the same convolutional neural network, each extraction can be a layer of the convolutional neural network.

In a specific embodiment of this application, a first semantic description is extracted according to the position features; a second semantic description is extracted according to the attribute features and the first semantic description; a third semantic description is extracted according to the posture features and the second semantic description; and the panoramic semantic description is extracted according to the relation vector features and the third semantic description.

In a specific embodiment of this application, the first to third semantic descriptions and the panoramic semantic description can be extracted by different recurrent neural networks (Recurrent Neural Networks, RNN) or be extracted in the same recurrent neural network, which is not specifically limited here. The recurrent neural network can include a long short-term memory model (LSTM), a bidirectional long short-term memory model (BiLSTM), and so on, which are not specifically limited here. When the first to third semantic descriptions and the panoramic semantic description are extracted in the same recurrent neural network, they can be extracted by different layers of the recurrent neural network.

For brevity, this embodiment does not expand on the definitions of images, target subjects, panoramic semantic descriptions, and so on; please refer to FIG. 2 and FIG. 3 and the related descriptions. This embodiment also does not introduce the feature vectors, position features, attribute features, posture features, and relation vector features or their extraction methods; please refer to FIG. 4 and the related descriptions. In addition, this embodiment does not give a detailed introduction to the panoramic semantic model or how to use it to produce a panoramic semantic description of images; please refer to FIG. 5 and FIG. 6 and the related descriptions.

The above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in multi-frame images, together with the relation vector features between those target subjects, thereby better reflecting the relationships between subjects, between subjects and actions, and between actions in the images.
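A minimal sketch of the two-step flow (S101 then S102) follows; the callables stand in for the feature extraction module and the trained panoramic semantic model, and the wiring between them is an assumption.

```python
def analyze(frames, feature_extractor, panorama_model):
    """frames: iterable of images. S101 gathers the per-frame influencing
    factors; S102 maps them to the panoramic semantic description."""
    factors = [feature_extractor(f) for f in frames]   # S101
    return panorama_model(factors)                     # S102

description = analyze(
    frames=["I_1", "I_2", "I_3"],
    feature_extractor=lambda f: {"frame": f},          # placeholder module
    panorama_model=lambda xs: f"description from {len(xs)} frames")
print(description)
```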
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of an image analysis system according to an implementation provided in this application. The image analysis system of this embodiment includes a feature extraction module 510 and a panoramic semantic description module 520. The feature extraction module 510 includes a feature vector extraction unit 511, a position feature extraction unit 512, an attribute feature extraction unit 513, a posture feature extraction unit 514, and a relation vector feature unit 515. The panoramic semantic description module 520 includes a first time series feature extraction unit 522, a second time series feature extraction unit 523, a third time series feature extraction unit 524, and a fourth time series feature extraction unit 525.

The feature extraction module 510 is used to obtain the influencing factors of the panoramic semantic description, where the influencing factors include the own features of h target subjects in each frame of t frames of images and the relation vector features between the h target subjects; the own features include position features, attribute features, and posture features, where t and h are natural numbers greater than 1; the position feature is used to represent the position of the corresponding target subject in the image, the attribute feature is used to represent the attributes of the corresponding target subject, the posture feature is used to represent the action of the corresponding target subject, and the relation vector feature is used to represent the relationship between target subjects.

The panoramic semantic description module 520 is used to input the influencing factors into the panoramic semantic model to obtain the panoramic semantic description, where the panoramic semantic model reflects the mapping relationship between the influencing factors and the panoramic semantic description, and the panoramic semantic description can describe the relationships between target subjects, between target subjects and actions, and between actions.

In a specific embodiment of this application, the position feature can be expressed as (x, y, w, h), where x and y are the abscissa and ordinate of the center point of the target subject in the image, w is the width of the target subject in the image, and h is the height of the target subject in the image. The attribute features can include many kinds, and different target subjects usually have different attribute features; taking a human target subject as an example, they can include one or more of gender, hairstyle, clothing type, clothing color, height, body shape, and so on. The posture features likewise include many kinds; taking a human target subject as an example, they can include one or more of falling, lying down, walking, running, jumping, and so on. The relation feature vector is a vector representing the relationship between two target subjects.

In a specific embodiment of this application, the panoramic semantic model reflects the mapping relationship between the influencing factors and the panoramic semantic description and can be expressed as y = Panorama(x), where x is the influencing factors, y is the panoramic semantic description, and Panorama() is the mapping between them; Panorama() can be obtained by training on a large number of known influencing factors and known panoramic semantic descriptions.

In a specific embodiment of this application, the feature vector extraction unit 511 is used to perform feature extraction on the t frames of images to obtain t feature vectors; the position feature extraction unit 512 is used to perform position feature extraction on the t feature vectors to obtain the position features; the attribute feature extraction unit 513 is used to perform attribute feature extraction on the t feature vectors to obtain the attribute features; the posture feature extraction unit 514 is used to perform posture feature extraction on the t feature vectors to obtain the posture features; and the relation vector feature unit 515 is used to perform relation vector feature extraction on the t feature vectors to obtain the relation vector features.

In a specific embodiment of this application, the feature extraction module 510 includes a convolutional neural network, and the feature vector extraction unit 511, the position feature extraction unit 512, the attribute feature extraction unit 513, the posture feature extraction unit 514, and the relation vector feature extraction unit 515 are integrated in the convolutional neural network. These units may be different convolutional neural networks (Convolutional Neural Networks, CNN) or be integrated in the same convolutional neural network, which is not specifically limited here; the convolutional neural network can include VGGNet, ResNet, FPNet, and so on. When units 511 to 515 are integrated in the same convolutional neural network, each unit can be a layer of the convolutional neural network.

In a specific embodiment of this application, the first time series feature extraction unit 522 is used to extract the first semantic description according to the position features; the second time series feature extraction unit is used to extract the second semantic description according to the attribute features and the first semantic description; the third time series feature extraction unit is used to extract the third semantic description according to the posture features and the second semantic description; and the fourth time series feature extraction unit is used to extract the panoramic semantic description according to the relation vector features and the third semantic description.

In a specific embodiment of this application, the panoramic semantic model includes a recurrent neural network, and the first to fourth time series feature extraction units are each a layer of the recurrent neural network. These units may be different recurrent neural networks (Recurrent Neural Networks, RNN) or be integrated in the same recurrent neural network, which is not specifically limited here; the recurrent neural network can include a long short-term memory model (LSTM), a bidirectional long short-term memory model (BiLSTM), and so on.

For brevity, this embodiment does not expand on the definitions of images, target subjects, panoramic semantic descriptions, and so on; please refer to FIG. 2 and FIG. 3 and the related descriptions. This embodiment also does not introduce the feature vectors, position features, attribute features, posture features, and relation vector features or their extraction methods; please refer to FIG. 4 and the related descriptions. In addition, this embodiment does not give a detailed introduction to the panoramic semantic model or how to use it to produce a panoramic semantic description of images; please refer to FIG. 5 and FIG. 6 and the related descriptions.

The above solution can obtain a higher-level panoramic semantic description based on the position features, attribute features, and posture features of multiple target subjects in multi-frame images, together with the relation vector features between those target subjects, thereby better reflecting the relationships between subjects, between subjects and actions, and between actions in the images.
The image analysis system of this application may be implemented on a compute node or on a cloud computing infrastructure; this is not specifically limited here. The following describes how to implement the image analysis system on a compute node and on a cloud computing infrastructure, respectively.
As shown in FIG. 9, a compute node 100 may include a processor 110 and a memory 120. The processor is configured to run a feature extraction module 111 and a panoramic semantic model 112. The memory 120 is configured to store semantic descriptions, features, images 121, and the like. The compute node 100 also provides two external interfaces: a management interface 140 for maintenance personnel of the semantic description system, and a user interface 150 for users. The interfaces may take various forms, such as a web interface, a command-line tool, or a REST interface.
In specific embodiments of this application, the management interface allows maintenance personnel to input: a large number of images for panoramic semantic description; a large number of known panoramic semantic descriptions together with known third semantic descriptions and known relation vector features of target subjects; a large number of known third semantic descriptions together with known second semantic descriptions and known posture features of target subjects; a large number of known second semantic descriptions together with known first semantic descriptions and known attribute features of target subjects; and a large number of known first semantic descriptions together with known position features of target subjects, all for training the panoramic semantic model.
In specific embodiments of this application, the user interface allows a user to input the images from which a panoramic semantic description is to be extracted, and the panoramic semantic description is output to the user through the user interface.
It should be understood that the compute node 100 is merely one example provided by the embodiments of this application; the compute node 100 may have more or fewer components than shown, may combine two or more components, or may be implemented with a different configuration of components.
As shown in FIG. 10, the cloud computing infrastructure may be a cloud service cluster 200. The cloud service cluster 200 consists of nodes and the communication network between the nodes. A node may be a compute node or a virtual machine running on a compute node. By function, the nodes fall into two classes: compute nodes 210 and storage nodes 220. A compute node 210 is configured to run a feature extraction module 211 and a panoramic semantic model 212. A storage node 220 is configured to store semantic descriptions, features, images, and the like 221. The cloud service cluster 200 also provides two external interfaces: a management interface 240 for maintenance personnel of the semantic description system, and a user interface 250 for users. The interfaces may take various forms, such as a web interface, a command-line tool, or a REST interface.
In specific embodiments of this application, the management interface allows maintenance personnel to input: a large number of images for panoramic semantic description; a large number of known panoramic semantic descriptions together with known third semantic descriptions and known relation vector features of target subjects; a large number of known third semantic descriptions together with known second semantic descriptions and known posture features of target subjects; a large number of known second semantic descriptions together with known first semantic descriptions and known attribute features of target subjects; and a large number of known first semantic descriptions together with known position features of target subjects, all for training the panoramic semantic model.
In specific embodiments of this application, the user interface allows a user to input the images from which a panoramic semantic description is to be extracted, and the panoramic semantic description is output to the user through the user interface.
It should be understood that the cloud service cluster 200 is merely one example provided by the embodiments of this application; the cloud service cluster 200 may have more or fewer components than shown, may combine two or more components, or may be implemented with a different configuration of components.
Referring to FIG. 11, FIG. 11 is a schematic structural diagram of a semantic description system according to another embodiment of this application. The semantic description system shown in FIG. 8 may be implemented in the compute node 300 shown in FIG. 11. The compute node 300 of this embodiment includes one or more processors 311, a communication interface 312, and a memory 313, which may be connected by a bus 324.
The processor 311 includes one or more general-purpose processors. A general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (Central Processing Unit, CPU), a microprocessor, a microcontroller, a main processor, a controller, an ASIC (Application Specific Integrated Circuit), and so on. The processor 311 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 313, which enable the compute node 300 to provide a wide variety of services. For example, the processor 311 can execute programs or process data to perform at least part of the methods discussed herein. The processor 311 may run the feature extraction module and the panoramic semantic model shown in FIG. 8.
The communication interface 312 may be a wired interface (for example, an Ethernet interface) for communicating with other compute nodes or users.
The memory 313 may include volatile memory (Volatile Memory), such as random access memory (Random Access Memory, RAM); it may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); and it may also include a combination of the above kinds of memory. The memory 313 may store program code and program data. The program code includes feature extraction module code and panoramic semantic model code. The program data includes: a large number of images for panoramic semantic description; a large number of known panoramic semantic descriptions together with known third semantic descriptions and known relation vector features of target subjects; a large number of known third semantic descriptions together with known second semantic descriptions and known posture features of target subjects; a large number of known second semantic descriptions together with known first semantic descriptions and known attribute features of target subjects; and a large number of known first semantic descriptions together with known position features of target subjects, all for training the panoramic semantic model.
The processor 311 invokes the program code in the memory 313 to perform the following steps:
the processor 311 is configured to obtain the influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each of the t frames of images and the relation vector features between the h target subjects, and the own features of each target subject include position features, attribute features, and posture features, t and h being natural numbers greater than 1;
the processor 311 is configured to obtain a panoramic semantic description from the influencing factors, where the panoramic semantic description includes descriptions of the relationships between target subjects, between target subjects and the actions of target subjects, and between the actions of target subjects.
For brevity, this embodiment does not expand on the definitions of the image, the target subjects, the panoramic semantic description, and so on; for details, see FIG. 2 and FIG. 3 and the related descriptions of the definitions of the image, the target subjects, the panoramic semantic model, and the panoramic semantic description. Nor does this embodiment introduce the feature vectors, position features, attribute features, posture features, and relation vector features or how they are extracted; for details, see FIG. 4 and the related description. In addition, this embodiment does not describe in detail the panoramic semantic model or how it is used to produce a panoramic semantic description of images; for details, see FIG. 5, FIG. 6, and the related descriptions.
Referring to FIG. 12, FIG. 12 is a schematic structural diagram of a semantic description system according to yet another embodiment of this application. The semantic description system of this embodiment may be implemented in the cloud service cluster shown in FIG. 10. The cloud service cluster includes at least one compute node 410 and at least one storage node 420.
The compute node 410 includes one or more processors 411, a communication interface 412, and a memory 413, which may be connected by a bus 424.
The processor 411 includes one or more general-purpose processors. A general-purpose processor may be any type of device capable of processing electronic instructions, including a central processing unit (Central Processing Unit, CPU), a microprocessor, a microcontroller, a main processor, a controller, an ASIC (Application Specific Integrated Circuit), and so on. It may be a dedicated processor used only by the compute node 410, or may be shared with other compute nodes 410. The processor 411 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 413, which enable the compute node 410 to provide a wide variety of services. For example, the processor 411 can execute programs or process data to perform at least part of the methods discussed herein. The processor 411 may run the feature extraction module and the panoramic semantic model shown in FIG. 8.
The communication interface 412 may be a wired interface (for example, an Ethernet interface) for communicating with other compute nodes or users. When the communication interface 412 is a wired interface, it may use a protocol family on top of TCP/IP, such as the RAAS protocol, the Remote Function Call (RFC) protocol, the Simple Object Access Protocol (SOAP), the Simple Network Management Protocol (SNMP), the Common Object Request Broker Architecture (CORBA) protocol, distributed protocols, and so on.
The memory 413 may include volatile memory (Volatile Memory), such as random access memory (Random Access Memory, RAM); it may also include non-volatile memory (Non-Volatile Memory), such as read-only memory (Read-Only Memory, ROM), flash memory (Flash Memory), a hard disk drive (Hard Disk Drive, HDD), or a solid-state drive (Solid-State Drive, SSD); and it may also include a combination of the above kinds of memory.
The storage node 420 includes one or more processors 421, a communication interface 422, and a memory 423, which may be connected by a bus 424.
The processor 421 includes one or more general-purpose processors. A general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on. It may be a dedicated processor used only by the storage node 420, or may be shared with other storage nodes 420. The processor 421 executes various types of digitally stored instructions, such as software or firmware programs stored in the memory 423, which enable the storage node 420 to provide a wide variety of services. For example, the processor 421 can execute programs or process data to perform at least part of the methods discussed herein.
The communication interface 422 may be a wired interface (for example, an Ethernet interface) for communicating with other computing devices or users.
The storage node 420 includes one or more storage controllers 421 and a storage array 425, which may be connected by a bus 426.
The storage controller 421 includes one or more general-purpose processors. A general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on. It may be a dedicated processor used only by a single storage node 420, or may be shared with the compute nodes 410 or other storage nodes 420. It should be understood that in this embodiment each storage node includes one storage controller; in other embodiments, multiple storage nodes may share one storage controller, which is not specifically limited here.
The storage array 425 may include multiple memories. A memory may be non-volatile memory, such as ROM, flash memory, an HDD, or an SSD, and may also be a combination of the above kinds of memory. For example, the storage array may consist of multiple HDDs or multiple SSDs, or of HDDs and SSDs. With the assistance of the storage controller 421, the multiple memories are combined in different ways to form memory groups, thereby providing higher storage performance than a single memory and supporting data backup. Optionally, the storage array 425 may include one or more data centers; the data centers may be located at the same site or at different sites, which is not specifically limited here. The storage array 425 may store program code and program data. The program code includes feature extraction module code and panoramic semantic model code. The program data includes: a large number of images for panoramic semantic description; a large number of known panoramic semantic descriptions together with known third semantic descriptions and known relation vector features of target subjects; a large number of known third semantic descriptions together with known second semantic descriptions and known posture features of target subjects; a large number of known second semantic descriptions together with known first semantic descriptions and known attribute features of target subjects; and a large number of known first semantic descriptions together with known position features of target subjects, all for training the panoramic semantic model.
The compute node 410 invokes the program code stored in the storage node 420 to perform the following steps:
the compute node 410 is configured to obtain the influencing factors of t frames of images, where the influencing factors include the own features of h target subjects in each of the t frames of images and the relation vector features between the h target subjects, and the own features of each target subject include position features, attribute features, and posture features, t and h being natural numbers greater than 1;
the compute node 410 is configured to obtain a panoramic semantic description from the influencing factors, where the panoramic semantic description includes descriptions of the relationships between target subjects, between target subjects and the actions of target subjects, and between the actions of target subjects.
For brevity, this embodiment does not expand on the definitions of the image, the target subjects, the panoramic semantic description, and so on; for details, see FIG. 2 and FIG. 3 and the related descriptions of the definitions of the image, the target subjects, the panoramic semantic model, and the panoramic semantic description. Nor does this embodiment introduce the feature vectors, position features, attribute features, posture features, and relation vector features or how they are extracted; for details, see FIG. 4 and the related description. In addition, this embodiment does not describe in detail the panoramic semantic model or how it is used to produce a panoramic semantic description of images; for details, see FIG. 5, FIG. 6, and the related descriptions.
The above solution can derive a higher-level panoramic semantic description from the position features, attribute features, and posture features of multiple target subjects in multiple frames of images, together with the relation vector features between those target subjects, and thereby better express the relationships between subjects, between subjects and actions, and between actions in the images.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable media may be magnetic media (for example, floppy disks, storage disks, magnetic tape), optical media (for example, DVD), semiconductor media (for example, a solid-state disk, Solid State Disk (SSD)), and so on.

Claims (14)

  1. An image analysis method, comprising:
    obtaining influencing factors of t frames of images, wherein the influencing factors comprise own features of h target subjects in each of the t frames of images and relation vector features between the h target subjects, the own features of each target subject comprise a position feature, an attribute feature, and a posture feature, t and h are natural numbers greater than 1, the position feature indicates the position of the corresponding target subject in the image, the attribute feature indicates an attribute of the corresponding target subject, the posture feature indicates an action of the corresponding target subject, and the relation vector feature indicates a relationship between target subjects;
    obtaining a panoramic semantic description from the influencing factors, wherein the panoramic semantic description comprises descriptions of relationships between target subjects, between target subjects and actions of target subjects, and between actions of target subjects.
  2. The method according to claim 1, wherein obtaining the influencing factors of the panoramic semantic description of the t frames of images comprises:
    performing feature extraction on the t frames of images to obtain t feature vectors;
    performing position feature extraction on the t feature vectors to obtain the position features;
    performing attribute feature extraction on the t feature vectors to obtain the attribute features;
    performing posture feature extraction on the t feature vectors to obtain the posture features;
    performing relation vector feature extraction on the t feature vectors to obtain the relation vector features.
  3. The method according to claim 2, wherein the extraction of the position features, the attribute features, the posture features, and the relation vector features is performed by one and the same convolutional neural network.
  4. The method according to claim 2 or 3, wherein performing relation vector feature extraction on the t feature vectors to obtain the relation vector features comprises:
    performing region-of-interest pooling on feature vector i according to target subject a and target subject b in image i to obtain a feature vector v_{a,b,i} corresponding to target subject a and target subject b, wherein a and b are natural numbers, 0 < i ≤ t, 1 ≤ a, b ≤ h, and feature vector i is extracted from image i;
    performing region-of-interest pooling according to target subject a to obtain a feature vector v_{a,a} corresponding to target subject a;
    calculating the relation vector feature between target subject a and target subject b in image i according to the following formulas:
    [The two formulas are rendered only as images in the source (PCTCN2019107126-appb-100001, PCTCN2019107126-appb-100002) and are not recoverable from the text.]
    wherein w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), sigmoid() is the sigmoid (S-shaped) function, v_{a,b} is the feature vector corresponding to target subject a and target subject b, v_{a,a} is the feature vector corresponding to target subject a, and w() is the inner product function.
  5. The method according to any one of claims 1 to 4, wherein obtaining the panoramic semantic description from the influencing factors comprises:
    extracting a first semantic description from the position features;
    extracting a second semantic description from the attribute features and the first semantic description;
    extracting a third semantic description from the posture features and the second semantic description;
    extracting the panoramic semantic description from the relation vector features and the third semantic description.
  6. The method according to claim 5, wherein
    the extraction of the first semantic description, the second semantic description, and the third semantic description is performed by one and the same recurrent neural network.
  7. An image analysis system, comprising a feature extraction module and a panoramic semantic model,
    wherein the feature extraction module is configured to obtain influencing factors of a panoramic semantic description, the influencing factors comprising own features of h target subjects in each of t frames of images and relation vector features between the h target subjects, the own features comprising position features, attribute features, and posture features, t and h being natural numbers greater than 1, the position features indicating the positions of the corresponding target subjects in the images, the attribute features indicating attributes of the corresponding target subjects, the posture features indicating actions of the corresponding target subjects, and the relation vector features indicating relationships between target subjects;
    and the panoramic semantic model is configured to obtain the panoramic semantic description from the influencing factors, the panoramic semantic description comprising descriptions of relationships between target subjects, between target subjects and actions, and between actions.
  8. The system according to claim 7, wherein the feature extraction module comprises a feature vector extraction unit, a position feature extraction unit, an attribute feature extraction unit, a posture feature extraction unit, and a relation vector feature unit,
    the feature vector extraction unit is configured to perform feature extraction on the t frames of images to obtain t feature vectors;
    the position feature extraction unit is configured to perform position feature extraction on the t feature vectors to obtain the position features;
    the attribute feature extraction unit is configured to perform attribute feature extraction on the t feature vectors to obtain the attribute features;
    the posture feature extraction unit is configured to perform posture feature extraction on the t feature vectors to obtain the posture features;
    the relation vector feature unit is configured to perform relation vector feature extraction on the t feature vectors to obtain the relation vector features.
  9. The system according to claim 8, wherein the feature extraction module comprises a convolutional neural network, and the feature vector extraction unit, the position feature extraction unit, the attribute feature extraction unit, the posture feature extraction unit, and the relation vector feature extraction unit are integrated into the convolutional neural network.
  10. The system according to claim 8 or 9, wherein
    the relation vector feature extraction unit is configured to: perform region-of-interest pooling on feature vector i according to target subject a and target subject b in image i to obtain a feature vector v_{a,b,i} corresponding to target subject a and target subject b, wherein a and b are natural numbers, 0 < i ≤ t, and 1 ≤ a, b ≤ h; perform region-of-interest pooling according to target subject a to obtain a feature vector v_{a,a} corresponding to target subject a; and calculate the relation vector feature between target subject a and target subject b in image i according to the following formulas:
    [The two formulas are rendered only as images in the source (PCTCN2019107126-appb-100003, PCTCN2019107126-appb-100004) and are not recoverable from the text.]
    wherein w_{a,b} = sigmoid(w(v_{a,b}, v_{a,a})), sigmoid() is the sigmoid (S-shaped) function, v_{a,b} is the feature vector corresponding to target subject a and target subject b, v_{a,a} is the feature vector corresponding to target subject a, and w() is the inner product function.
  11. The system according to any one of claims 7 to 10, wherein the panoramic semantic model comprises a first temporal feature extraction unit, a second temporal feature extraction unit, a third temporal feature extraction unit, and a fourth temporal feature extraction unit,
    the first temporal feature extraction unit is configured to extract a first semantic description from the position features;
    the second temporal feature extraction unit is configured to extract a second semantic description from the attribute features and the first semantic description;
    the third temporal feature extraction unit is configured to extract a third semantic description from the posture features and the second semantic description;
    the fourth temporal feature extraction unit is configured to extract the panoramic semantic description from the relation vector features and the third semantic description.
  12. The system according to claim 11, wherein the panoramic semantic model comprises a recurrent neural network, and the first temporal feature extraction unit, the second temporal feature extraction unit, the third temporal feature extraction unit, and the fourth temporal feature extraction unit are each a layer of the recurrent neural network.
  13. A compute node cluster, comprising at least one compute node, each compute node comprising a processor and a memory, wherein the processor executes the code in the memory to perform the method according to any one of claims 1 to 6.
  14. A non-transitory computer storage medium, comprising instructions that, when run on at least one compute node of a compute node cluster, cause the compute node cluster to perform the method according to any one of claims 1 to 6.
PCT/CN2019/107126 2019-01-23 2019-09-21 Image analysis method and system WO2020151247A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19911852.2A EP3893197A4 (en) 2019-01-23 2019-09-21 IMAGE ANALYSIS METHOD AND SYSTEM
US17/365,089 US12100209B2 (en) 2019-01-23 2021-07-01 Image analysis method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910065251.0 2019-01-23
CN201910065251.0A 2019-01-23 2019-01-23 Image analysis method and system CN111476838A (zh)

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/365,089 Continuation US12100209B2 (en) 2019-01-23 2021-07-01 Image analysis method and system

Publications (1)

Publication Number Publication Date
WO2020151247A1 true WO2020151247A1 (zh) 2020-07-30

Family

ID=71735877

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/107126 2019-01-23 2019-09-21 Image analysis method and system WO2020151247A1 (zh)

Country Status (3)

Country Link
EP (1) EP3893197A4 (zh)
CN (1) CN111476838A (zh)
WO (1) WO2020151247A1 (zh)


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5278770B2 (ja) * 2007-02-08 2013-09-04 Behavioral Recognition Systems, Inc. Behavior recognition system
CN104966052A (zh) * 2015-06-09 2015-10-07 南京邮电大学 Group behavior recognition method based on attribute feature representation
US9811765B2 (en) * 2016-01-13 2017-11-07 Adobe Systems Incorporated Image captioning with weak supervision
CN107391505B (zh) * 2016-05-16 2020-10-23 腾讯科技(深圳)有限公司 An image processing method and system
CN106169065B (zh) * 2016-06-30 2019-12-24 联想(北京)有限公司 An information processing method and electronic device
CN106446782A (zh) * 2016-08-29 2017-02-22 北京小米移动软件有限公司 Image recognition method and apparatus
CN107122416B (zh) * 2017-03-31 2021-07-06 北京大学 A Chinese event extraction method
CN107391646B (zh) * 2017-07-13 2020-04-10 清华大学 A method and apparatus for extracting semantic information from video images
CN108304846B (zh) * 2017-09-11 2021-10-22 腾讯科技(深圳)有限公司 Image recognition method, apparatus, and storage medium
CN108875494A (zh) * 2017-10-17 2018-11-23 北京旷视科技有限公司 Video structuring method, apparatus, system, and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880692A (zh) * 2012-09-19 2013-01-16 上海交通大学 A retrieval-oriented semantic description and detection modeling method for surveillance video
EP2993618A1 (en) * 2014-09-04 2016-03-09 Xerox Corporation Domain adaptation for image classification with class priors
CN106650617A (zh) * 2016-11-10 2017-05-10 江苏新通达电子科技股份有限公司 A pedestrian anomaly recognition method based on probabilistic latent semantic analysis
CN108197589A (zh) * 2018-01-19 2018-06-22 北京智能管家科技有限公司 Semantic understanding method, apparatus, device, and storage medium for dynamic human postures
CN108416776A (zh) * 2018-03-16 2018-08-17 京东方科技集团股份有限公司 Image recognition method, image recognition apparatus, computer product, and readable storage medium
CN108509880A (zh) * 2018-03-21 2018-09-07 南京邮电大学 A semantic recognition method for person behavior in video
CN108510012A (zh) * 2018-05-04 2018-09-07 四川大学 A fast target detection method based on multi-scale feature maps
CN108960330A (zh) * 2018-07-09 2018-12-07 西安电子科技大学 Remote-sensing image semantic generation method based on a fast regional convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3893197A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764897A (zh) * 2022-03-29 2022-07-19 深圳市移卡科技有限公司 Behavior recognition method and apparatus, terminal device, and storage medium

Also Published As

Publication number Publication date
EP3893197A4 (en) 2022-02-23
EP3893197A1 (en) 2021-10-13
CN111476838A (zh) 2020-07-31
US20210326634A1 (en) 2021-10-21

Similar Documents

Publication Publication Date Title
US12062158B2 (en) Image denoising method and apparatus
US20210183022A1 (en) Image inpainting method and apparatus, computer device, and storage medium
WO2021043168A1 (zh) 行人再识别网络的训练方法、行人再识别方法和装置
US20210319258A1 (en) Method and apparatus for training classification task model, device, and storage medium
WO2019196633A1 (zh) 一种图像语义分割模型的训练方法和服务器
CN111670457B (zh) 动态对象实例检测、分割和结构映射的优化
CN109685819B (zh) 一种基于特征增强的三维医学图像分割方法
WO2021043273A1 (zh) 图像增强方法和装置
WO2020107847A1 (zh) 基于骨骼点的跌倒检测方法及其跌倒检测装置
CN110276411A (zh) 图像分类方法、装置、设备、存储介质和医疗电子设备
WO2022001372A1 (zh) 训练神经网络的方法、图像处理方法及装置
WO2022179581A1 (zh) 一种图像处理方法及相关设备
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
TWI753588B (zh) 人臉屬性識別方法、電子設備和電腦可讀儲存介質
WO2021018251A1 (zh) 图像分类方法及装置
WO2021073311A1 (zh) 图像识别方法、装置、计算机可读存储介质及芯片
WO2021227787A1 (zh) 训练神经网络预测器的方法、图像处理方法及装置
CN111433812A (zh) 动态对象实例检测、分割和结构映射的优化
WO2022111387A1 (zh) 一种数据处理方法及相关装置
CN112200041A (zh) 视频动作识别方法、装置、存储介质与电子设备
WO2020042126A1 (zh) 一种对焦装置、方法及相关设备
WO2022179603A1 (zh) 一种增强现实方法及其相关设备
WO2022052782A1 (zh) 图像的处理方法及相关设备
WO2023165361A1 (zh) 一种数据处理方法及相关设备
CN106874922B (zh) 一种确定业务参数的方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19911852

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019911852

Country of ref document: EP

Effective date: 20210705

NENP Non-entry into the national phase

Ref country code: DE