WO2021103731A1 - Semantic segmentation method, model training method and device - Google Patents

Semantic segmentation method, model training method and device

Info

Publication number
WO2021103731A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
frame
model
network layer
video frame
Prior art date
Application number
PCT/CN2020/113206
Other languages
English (en)
French (fr)
Inventor
裴仁静
邵滨
郝磊
许松岑
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021103731A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Definitions

  • This application relates to the field of computer technology, in particular to a semantic segmentation method, model training method and device.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Semantic segmentation is a basic task in computer vision.
  • In semantic segmentation, the visual input needs to be divided into different semantically interpretable categories.
  • Semantic interpretability means that the classification categories are meaningful in the real world.
  • image semantic segmentation enables us to have a more detailed understanding of images. This kind of understanding is very important in many fields such as autonomous driving, robotics, and image search engines.
  • The embodiments of the present application provide a semantic segmentation method, a model training method, and a device, which are used to improve the stability of the segmentation result of a video frame.
  • A first aspect of the present application provides a method for semantic segmentation of video frames, including: obtaining a first video frame and a second video frame in a first video frame sequence, where the first video frame is different from the second video frame; and inputting the first video frame and the second video frame respectively into an image segmentation model, where the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model.
  • The convolutional neural network model includes an input layer, an output layer, and multiple network layers located between the input layer and the output layer. Each of the multiple network layers is used to perform feature extraction on its input data, and the intermediate network layer is the network layer whose output feature map has the smallest resolution among the multiple network layers.
  • The method further includes: obtaining a first feature map of the first video frame output by a first image segmentation network layer, where the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model; obtaining a second feature map of the second video frame output by a second image segmentation network layer, where the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model; and inputting the first feature map and the second feature map into a first inter-frame fusion model to generate a semantic segmentation image of the second video frame, where the first inter-frame fusion model is a neural network model.
  • The first feature map of the first video frame can provide temporal information for the second video frame, while the second feature map of the second video frame output by the second image segmentation network layer can provide the spatial information of the second video frame to a greater extent. Generating the semantic segmentation image of the second video frame from the first feature map and the second feature map therefore helps to use the temporal information to improve the stability of the semantic segmentation of the second video frame while maintaining the segmentation accuracy of a single video frame.
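  • As an illustration only, a minimal PyTorch sketch of this data flow might look as follows; SegNet, FusionNet, and all channel sizes and layer choices are hypothetical stand-ins rather than the claimed architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegNet(nn.Module):
    """Toy encoder-decoder image segmentation model (stand-in)."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())   # intermediate layer: smallest resolution
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1)

    def forward(self, x):
        f_enc = self.enc2(self.enc1(x))   # "first image segmentation network layer" output: semantic-rich, low resolution
        f_dec = self.dec1(f_enc)          # "second image segmentation network layer" output: spatial-rich, higher resolution
        return f_enc, f_dec, self.dec2(f_dec)

class FusionNet(nn.Module):
    """Toy first inter-frame fusion model: fuses frame t-1's low-resolution
    feature map with frame t's high-resolution feature map."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.head = nn.Conv2d(64 + 32, num_classes, 3, padding=1)

    def forward(self, feat_prev_lowres, feat_cur_highres):
        up = F.interpolate(feat_prev_lowres, size=feat_cur_highres.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.head(torch.cat([up, feat_cur_highres], dim=1))

seg, fusion = SegNet(), FusionNet()
frame_prev, frame_cur = torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)
feat_prev, _, _ = seg(frame_prev)              # first feature map (first video frame, encoder side)
_, feat_cur, _ = seg(frame_cur)                # second feature map (second video frame, decoder side)
seg_logits_cur = fusion(feat_prev, feat_cur)   # semantic segmentation of the second video frame
print(seg_logits_cur.shape)                    # torch.Size([1, 21, 64, 64])
```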
  • In a possible implementation, inputting the first feature map and the second feature map into a fusion network model to generate the semantic segmentation image of the second video frame includes: inputting the first feature map into a first neighboring frame prediction model, where the first neighboring frame prediction model is used to predict information of an adjacent video frame, the adjacent video frame and the video frame to which the feature map input into the first neighboring frame prediction model belongs belong to the same video frame sequence, and the first neighboring frame prediction model is a kind of convolutional neural network model; obtaining a first compressed feature map of the first feature map output by a first neighboring frame prediction network layer, where the first neighboring frame prediction network layer is the intermediate network layer of the first neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighboring frame prediction model; and inputting the first compressed feature map and the second feature map into a second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • In a possible implementation, the first neighboring frame prediction model is obtained by training based on a first sample set with annotation information, and a first sample is any sample in the first sample set. The first sample is the feature map of a third video frame output by the first image segmentation network layer, and the annotation information of the first sample is the feature map of a fourth video frame output by the first image segmentation network layer. The third video frame and the fourth video frame are different video frames in the same video frame sequence, the first video frame is in a first time-sequence direction of the second video frame, and the third video frame is in the first time-sequence direction of the fourth video frame.
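  • A hedged sketch of how such a sample set could be assembled is given below: each sample is one frame's feature map from the chosen encoder-side layer, and its annotation is the adjacent frame's feature map from the same layer. The segmentation model interface follows the toy SegNet sketched above and is an assumption, not this application's implementation.

```python
import torch

def build_prediction_pairs(seg_model, frames):
    """frames: list of (3, H, W) tensors from one video frame sequence."""
    with torch.no_grad():
        feats = [seg_model(f.unsqueeze(0))[0] for f in frames]   # encoder-side feature maps
    # sample = feature map of frame t, annotation = feature map of frame t+1 (same time-sequence direction)
    return [(feats[t], feats[t + 1]) for t in range(len(feats) - 1)]
```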
  • In a possible implementation, the method further includes: obtaining a third feature map of the second video frame output by the first image segmentation network layer. Inputting the first compressed feature map and the second feature map into the second inter-frame fusion model to generate the semantic segmentation image of the second video frame then includes: inputting the third feature map into a second neighboring frame prediction model, where the second neighboring frame prediction model is used to predict information of an adjacent video frame, the adjacent video frame and the video frame to which the feature map input into the second neighboring frame prediction model belongs belong to the same video frame sequence, and the second neighboring frame prediction model is a kind of convolutional neural network model; obtaining a second compressed feature map of the third feature map output by a second neighboring frame prediction network layer, where the second neighboring frame prediction network layer is the intermediate network layer of the second neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighboring frame prediction model; and inputting the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • In a possible implementation, the second neighboring frame prediction model is obtained by training based on a second sample set with annotation information, and a second sample is any sample in the second sample set. The second sample is the feature map of a fifth video frame output by the first image segmentation network layer, and the annotation information of the second sample is the feature map of a sixth video frame output by the first image segmentation network layer. The fifth video frame and the sixth video frame are different video frames in the same video frame sequence, the first video frame is in the first time-sequence direction of the second video frame, and the sixth video frame is in the first time-sequence direction of the fifth video frame.
  • In a possible implementation, the second inter-frame fusion model is obtained by training based on a third sample set with annotation information, and a third sample is any sample in the third sample set. The third sample includes a compressed feature map of a fourth feature map output by the first neighboring frame prediction network layer, a compressed feature map of a fifth feature map output by the second neighboring frame prediction network layer, and a sixth feature map of an eighth video frame output by the second image segmentation network layer. The fourth feature map is the feature map of a seventh video frame output by the first image segmentation network layer, the fifth feature map is the feature map of the eighth video frame output by the first image segmentation network layer, and the seventh video frame and the eighth video frame are different video frames in the same video frame sequence. The annotation information of the third sample is an annotated semantic segmentation image of the eighth video frame.
  • In a possible implementation, the method further includes: obtaining a fourth feature map output by the first image segmentation network layer. Inputting the first feature map and the second feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame then includes: inputting the first feature map, the second feature map, and the fourth feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • A second aspect of the present application provides a model training method, including: obtaining a first frame and a second frame in the same video frame sequence, and a semantic segmentation image of the second frame; and inputting the first frame and the second frame respectively into an image segmentation model, where the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model.
  • The convolutional neural network model includes an input layer, an output layer, and multiple network layers located between the input layer and the output layer. Each of the multiple network layers is used to perform feature extraction on its input data, and the intermediate network layer is the network layer whose output feature map has the smallest resolution among the multiple network layers.
  • The method further includes: obtaining a first feature map of the first frame output by a first image segmentation network layer, where the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model; obtaining a second feature map of the second frame output by a second image segmentation network layer, where the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model; and, using the semantic segmentation image of the second frame as annotation information, inputting the first feature map and the second feature map into a first inter-frame fusion model and updating the parameters of the first inter-frame fusion model, where the first inter-frame fusion model is a neural network model.
  • In a possible implementation, the method further includes: obtaining a third feature map of the second frame output by the first image segmentation network layer. Inputting the first feature map and the second feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model then includes: inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model.
  • In a possible implementation, inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model includes: using the third feature map as annotation information, inputting the first feature map into a first neighboring frame prediction model, and updating the parameters of the first neighboring frame prediction model, where the first neighboring frame prediction model is a kind of convolutional neural network model; and, based on the first neighboring frame prediction model satisfying a first constraint condition, inputting the first feature map into the first neighboring frame prediction model, obtaining a first compressed feature map of the first feature map output by a first neighboring frame prediction network layer, where the first neighboring frame prediction network layer is the intermediate network layer of the first neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighboring frame prediction model, and then, using the semantic segmentation image of the second frame as annotation information, inputting the first compressed feature map and the second feature map into a second inter-frame fusion model and updating the parameters of the second inter-frame fusion model.
  • In a possible implementation, inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model further includes: using the first feature map as annotation information, inputting the third feature map into a second neighboring frame prediction model, and updating the parameters of the second neighboring frame prediction model, where the second neighboring frame prediction model is a kind of convolutional neural network model; and, based on the second neighboring frame prediction model satisfying a second constraint condition, inputting the third feature map into the second neighboring frame prediction model, obtaining a second compressed feature map of the third feature map output by a second neighboring frame prediction network layer, where the second neighboring frame prediction network layer is the intermediate network layer of the second neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighboring frame prediction model, and then, using the semantic segmentation image of the second frame as annotation information, inputting the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model and updating the parameters of the second inter-frame fusion model.
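  • The staged scheme above can be pictured with the following minimal training sketch; the module definitions, the mean-squared-error loss for stage one, and the constraint check are illustrative assumptions rather than the concrete training recipe of this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pred_net = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, 64, 3, padding=1))      # toy neighboring frame prediction model
fusion2 = nn.Conv2d(64 + 32, 21, 3, padding=1)                 # toy second inter-frame fusion model (expects 64 + 32 input channels)
opt1 = torch.optim.Adam(pred_net.parameters(), lr=1e-3)
opt2 = torch.optim.Adam(fusion2.parameters(), lr=1e-3)

def stage1_step(first_feat, third_feat):
    """Stage 1: the third feature map (second frame, encoder side) is the annotation."""
    loss = F.mse_loss(pred_net(first_feat), third_feat)
    opt1.zero_grad(); loss.backward(); opt1.step()
    return loss.item()

def stage2_step(first_feat, second_feat_highres, seg_label):
    """Stage 2 (run once stage 1 meets its constraint, e.g. loss below a threshold):
    the annotated segmentation of the second frame is the annotation."""
    with torch.no_grad():
        compressed = pred_net[0](first_feat)                    # "compressed feature map" from an internal layer
    up = F.interpolate(compressed, size=second_feat_highres.shape[-2:],
                       mode="bilinear", align_corners=False)
    logits = fusion2(torch.cat([up, second_feat_highres], dim=1))
    loss = F.cross_entropy(logits, seg_label)                   # seg_label: (N, H, W) class indices
    opt2.zero_grad(); loss.backward(); opt2.step()
    return loss.item()
```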
  • A third aspect of the present application provides a semantic segmentation device, including: a video frame acquisition module, configured to acquire a first video frame and a second video frame in a first video frame sequence, where the first video frame and the second video frame are different; and a feature map acquisition module, configured to input the first video frame and the second video frame respectively into an image segmentation model, where the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model.
  • The convolutional neural network model includes an input layer, an output layer, and multiple network layers located between the input layer and the output layer. Each of the multiple network layers is used to perform feature extraction on its input data, and the intermediate network layer is the network layer whose output feature map has the smallest resolution among the multiple network layers.
  • The feature map acquisition module is further configured to acquire a first feature map of the first video frame output by a first image segmentation network layer, where the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model, and to acquire a second feature map of the second video frame output by a second image segmentation network layer, where the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model.
  • The device further includes a fusion module, configured to input the first feature map and the second feature map into a first inter-frame fusion model to generate a semantic segmentation image of the second video frame, where the first inter-frame fusion model is a neural network model.
  • In a possible implementation, the fusion module is configured to: input the first feature map into a first neighboring frame prediction model, where the first neighboring frame prediction model is used to predict information of adjacent video frames, the adjacent video frame and the video frame to which the feature map input into the first neighboring frame prediction model belongs belong to the same video frame sequence, and the first neighboring frame prediction model is a kind of convolutional neural network model; acquire a first compressed feature map of the first feature map output by a first neighboring frame prediction network layer, where the first neighboring frame prediction network layer is the intermediate network layer of the first neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighboring frame prediction model; and input the first compressed feature map and the second feature map into a second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • In a possible implementation, the first neighboring frame prediction model is obtained by training based on a first sample set with annotation information, and a first sample is any sample in the first sample set. The first sample is the feature map of a third video frame output by the first image segmentation network layer, and the annotation information of the first sample is the feature map of a fourth video frame output by the first image segmentation network layer. The third video frame and the fourth video frame are different video frames in the same video frame sequence, the first video frame is in a first time-sequence direction of the second video frame, and the third video frame is in the first time-sequence direction of the fourth video frame.
  • In a possible implementation, the feature map acquisition module is further configured to: after inputting the second video frame into the image segmentation model, acquire a third feature map of the second video frame output by the first image segmentation network layer. The fusion module is further configured to: input the third feature map into a second neighboring frame prediction model, where the second neighboring frame prediction model is used to predict information of adjacent video frames, the adjacent video frame and the video frame to which the feature map input into the second neighboring frame prediction model belongs belong to the same video frame sequence, and the second neighboring frame prediction model is a kind of convolutional neural network model; acquire a second compressed feature map of the third feature map output by a second neighboring frame prediction network layer, where the second neighboring frame prediction network layer is the intermediate network layer of the second neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighboring frame prediction model; and input the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • In a possible implementation, the second neighboring frame prediction model is obtained by training based on a second sample set with annotation information, and a second sample is any sample in the second sample set. The second sample is the feature map of a fifth video frame output by the first image segmentation network layer, and the annotation information of the second sample is the feature map of a sixth video frame output by the first image segmentation network layer. The fifth video frame and the sixth video frame are different video frames in the same video frame sequence, the first video frame is in the first time-sequence direction of the second video frame, and the sixth video frame is in the first time-sequence direction of the fifth video frame.
  • In a possible implementation, the second inter-frame fusion model is obtained by training based on a third sample set with annotation information, and a third sample is any sample in the third sample set. The third sample includes a compressed feature map of a fourth feature map output by the first neighboring frame prediction network layer, a compressed feature map of a fifth feature map output by the second neighboring frame prediction network layer, and a sixth feature map of an eighth video frame output by the second image segmentation network layer. The fourth feature map is the feature map of a seventh video frame output by the first image segmentation network layer, the fifth feature map is the feature map of the eighth video frame output by the first image segmentation network layer, and the seventh video frame and the eighth video frame are different video frames in the same video frame sequence. The annotation information of the third sample is an annotated semantic segmentation image of the eighth video frame.
  • In a possible implementation, after inputting the first video frame into the image segmentation model, the feature map acquisition module is further configured to acquire a fourth feature map of the first video frame output by the first image segmentation network layer. Inputting the first feature map and the second feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame then includes: inputting the first feature map, the second feature map, and the fourth feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • A fourth aspect of the present application provides a model training device, including: a sample acquisition module, configured to acquire a first frame and a second frame in the same video frame sequence, and a semantic segmentation image of the second frame; and a feature map acquisition module, configured to input the first frame and the second frame respectively into an image segmentation model, where the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model.
  • The convolutional neural network model includes an input layer, an output layer, and multiple network layers located between the input layer and the output layer, and the intermediate network layer is the network layer whose output feature map has the smallest resolution among the multiple network layers.
  • The feature map acquisition module is further configured to acquire a first feature map of the first frame output by a first image segmentation network layer, where the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model, and to acquire a second feature map of the second frame output by a second image segmentation network layer, where the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model.
  • The device further includes a training module, configured to use the semantic segmentation image of the second frame as annotation information, input the first feature map and the second feature map into a first inter-frame fusion model, and update the parameters of the first inter-frame fusion model, where the first inter-frame fusion model is a neural network model.
  • In a possible implementation, after inputting the second frame into the image segmentation model, the feature map acquisition module is further configured to acquire a third feature map of the second frame output by the first image segmentation network layer, and the training module is configured to input the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and update the parameters of the first inter-frame fusion model.
  • In a possible implementation, the training module is configured to: use the third feature map as annotation information, input the first feature map into a first neighboring frame prediction model, and update the parameters of the first neighboring frame prediction model, where the first neighboring frame prediction model is a kind of convolutional neural network model; and, based on the first neighboring frame prediction model satisfying a first constraint condition, input the first feature map into the first neighboring frame prediction model, obtain a first compressed feature map of the first feature map output by a first neighboring frame prediction network layer, where the first neighboring frame prediction network layer is the intermediate network layer of the first neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighboring frame prediction model, and then, using the semantic segmentation image of the second frame as annotation information, input the first compressed feature map and the second feature map into a second inter-frame fusion model and update the parameters of the second inter-frame fusion model.
  • In a possible implementation, the training module is further configured to: use the first feature map as annotation information, input the third feature map into a second neighboring frame prediction model, and update the parameters of the second neighboring frame prediction model, where the second neighboring frame prediction model is a kind of convolutional neural network model; and, based on the second neighboring frame prediction model satisfying a second constraint condition, input the third feature map into the second neighboring frame prediction model, obtain a second compressed feature map of the third feature map output by a second neighboring frame prediction network layer, where the second neighboring frame prediction network layer is the intermediate network layer of the second neighboring frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighboring frame prediction model, and then, using the semantic segmentation image of the second frame as annotation information, input the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model and update the parameters of the second inter-frame fusion model.
  • A fifth aspect of the present application provides a video call method, including: a first terminal device collects a first local video frame through its image acquisition module; the first terminal device receives a peer video frame collected by a second terminal device through its image acquisition module; the first terminal device generates a second local video frame according to the first local video frame, where a first image area of the first local video frame displays a first image, a second image area of the second local video frame displays a second image, the first image and the second image are different, and a third image area of the first local video frame and a fourth image area of the second local video frame both display a third image; and the first terminal device simultaneously displays the peer video frame and the second local video frame on its display screen.
  • In a possible implementation, the method further includes: the first terminal device sends the second local video frame to the second terminal device.
  • In a possible implementation, the first terminal device generating the second local video frame according to the first local video frame includes: the first terminal device generates the second local video frame according to the first local video frame in response to a user's switching instruction, where the switching instruction is used to instruct the first terminal device to switch the first image in the first local video frame to the second image.
  • In a possible implementation, the semantic types corresponding to the first image and the third image are different.
  • In a possible implementation, the first terminal device generating the second local video frame according to the first local video frame includes: the first terminal device generates, according to the method described in the first aspect or any possible implementation of the first aspect, a semantic segmentation image of the second local video frame according to the first local video frame and a third local video frame, where the third local video frame and the first local video frame are different video frames in the same video frame sequence collected by the first terminal device; and the first terminal device generates the second local video frame according to the semantic segmentation image and the first local video frame.
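  • A hedged sketch of the background-switching step itself: given a semantic segmentation image of the local frame, the person region (the shared "third image") is kept and the background is replaced. The person class index and array layout are assumptions made for the example.

```python
import numpy as np

def switch_background(local_frame, seg_map, new_background, person_class=15):
    """local_frame, new_background: (H, W, 3) uint8 images; seg_map: (H, W) class indices."""
    mask = (seg_map == person_class)[..., None]          # True where the person is
    return np.where(mask, local_frame, new_background)   # person kept, background replaced

# second_local_frame = switch_background(first_local_frame, seg_image, new_bg)
```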
  • A sixth aspect of the embodiments of the present application provides a video call device, including: an image acquisition module, configured to collect a first local video frame; a communication module, configured to receive a peer video frame collected by a second terminal device through its image acquisition module; a background switching module, configured to generate a second local video frame according to the first local video frame, where a first image area of the first local video frame displays a first image, a second image area of the second local video frame displays a second image, the first image and the second image are different, and a third image area of the first local video frame and a fourth image area of the second local video frame both display a third image; and a display module, configured to simultaneously display the peer video frame and the second local video frame on the display screen of the first terminal device.
  • In a possible implementation, the communication module is further configured to send the second local video frame to the second terminal device.
  • In a possible implementation, the background switching module is configured to generate the second local video frame according to the first local video frame in response to a user's switching instruction, where the switching instruction is used to instruct the first terminal device to switch the first image in the first local video frame to the second image.
  • In a possible implementation, the semantic types corresponding to the first image and the third image are different.
  • In a possible implementation, the background switching module is configured to generate, according to the method described in the first aspect or any possible implementation of the first aspect, a semantic segmentation image of the second local video frame according to the first local video frame and a third local video frame, where the third local video frame and the first local video frame are different video frames in the same video frame sequence collected by the first terminal device, and to generate the second local video frame according to the semantic segmentation image and the first local video frame.
  • A seventh aspect of the embodiments of the present application provides a computer device, including a processor and a memory. When the processor runs the computer instructions stored in the memory, it executes the method described in any possible implementation of the first aspect, the second aspect, or the fifth aspect.
  • An eighth aspect of the embodiments of the present application provides a computer-readable storage medium, including instructions that, when run on a computer, cause the computer to execute the method described in any possible implementation of the first aspect, the second aspect, or the fifth aspect.
  • A ninth aspect of the embodiments of the present application provides a computer program product, including instructions that, when run on a computer, cause the computer to execute the method described in any possible implementation of the first aspect, the second aspect, or the fifth aspect.
  • A tenth aspect of the embodiments of the present application provides a computer system, including a terminal device and a server. The terminal device is used to send collected video to the server, and the server is used to execute the semantic segmentation method provided in the first aspect of the embodiments of the present application and return the generated semantic segmentation result to the terminal device.
  • FIG. 1 is a schematic diagram of an artificial intelligence main framework provided by an embodiment of this application.
  • FIG. 2A is a schematic diagram of an application environment provided by an embodiment of this application.
  • FIG. 2B is a schematic structural diagram of a terminal cloud system provided by an embodiment of this application.
  • FIG. 3A is a schematic diagram of an embodiment of a semantic segmentation method provided by an embodiment of this application.
  • FIG. 3B is a schematic diagram of a detailed step of step 305 in the embodiment corresponding to FIG. 3A.
  • FIG. 3C is a schematic diagram of another detailed step of step 305 in the embodiment corresponding to FIG. 3A.
  • FIG. 4A is a schematic structural diagram of a convolutional neural network provided by an embodiment of this application.
  • FIG. 4B is a schematic diagram of another convolutional neural network structure provided by an embodiment of this application.
  • FIG. 5A is a schematic diagram of an embodiment of a model training method provided by an embodiment of this application.
  • FIG. 5B is a schematic diagram of a detailed step of step 505 in the method embodiment provided in FIG. 5A.
  • FIG. 5C is a schematic diagram of another detailed step of step 505 in the method embodiment provided in FIG. 5A.
  • FIG. 6 is a schematic diagram of an embodiment of a video call method provided by an embodiment of this application.
  • FIGS. 7A-7D are schematic diagrams of an application scenario of a video call method provided by an embodiment of this application.
  • FIGS. 8A-8C are schematic diagrams of an application scenario of the semantic segmentation method provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of this application.
  • FIG. 10 is a schematic diagram of an embodiment of a semantic segmentation device provided by an embodiment of this application.
  • FIG. 11 is a schematic diagram of an embodiment of a model training device provided by an embodiment of this application.
  • FIG. 12 is a schematic diagram of an embodiment of a video call device provided by an embodiment of this application.
  • FIG. 13 is a schematic diagram of an embodiment of a computer device of this application.
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of the artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • Intelligent Information Chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensing process of "data-information-knowledge-wisdom".
  • The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform. Computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes distributed computing frameworks, networks, and related platform guarantees and support, which can include cloud storage and computing, interconnection networks, and so on. Sensors communicate with the outside world to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for computation.
  • The data at the layer above the infrastructure indicates the data sources in the field of artificial intelligence. The data involves graphics, images, speech, and text, as well as Internet-of-Things data from traditional devices, including business data of existing systems and sensed data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and other methods.
  • Machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, training, and so on.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to carry out machine thinking and solve problems according to reasoning control strategies; the typical functions are search and matching.
  • Decision-making refers to the process of making decisions after reasoning over intelligent information, and usually provides functions such as classification, ranking, and prediction.
  • After the above data processing, some general capabilities can be formed based on the results of the data processing, such as an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, and so on.
  • Intelligent products and industrial applications refer to the products and applications of the artificial intelligence system in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart healthcare, smart security, autonomous driving, safe city, smart terminals, and so on.
  • Semantic segmentation refers to calling the semantic segmentation model to perform semantic recognition on images or video frames, and segmenting and predicting the categories of objects in the image according to the recognition results.
  • Video is composed of still pictures; these still pictures are called frames or video frames.
  • The semantic segmentation model can perform semantic recognition on each pixel in an image or video frame, perform category prediction on each pixel according to the recognition result, and generate a semantic segmentation image.
  • A semantic segmentation image classifies each pixel in the video frame and thereby realizes semantic annotation of the video frame. The semantic segmentation image includes one or more target regions segmented by semantic recognition: the same target region corresponds to predicted pixels of the same category and is labeled with the same identifier (for example, a color), while different target regions correspond to predicted pixels of different categories and are generally labeled with different identifiers (for example, colors). It should be noted that the embodiments of the present application do not limit the classification unit: the semantic classification may be pixel-by-pixel classification or image-block classification, where one image block includes multiple pixels.
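  • A small illustrative sketch of this labeling convention (not taken from the application): pixels predicted as the same category receive the same color identifier, and different categories receive different colors.

```python
import numpy as np

def colorize(seg_map, palette=None):
    """seg_map: (H, W) array of per-pixel class indices -> (H, W, 3) color image."""
    num_classes = int(seg_map.max()) + 1
    if palette is None:                                   # one deterministic color per class
        rng = np.random.default_rng(0)
        palette = rng.integers(0, 256, size=(num_classes, 3), dtype=np.uint8)
    return palette[seg_map]                               # same class -> same color

seg_map = np.array([[0, 0, 1], [2, 1, 1]])
print(colorize(seg_map).shape)                            # (2, 3, 3)
```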
  • The embodiments of the present application provide a method for semantic segmentation of video frames. The semantic segmentation method is based on an artificial intelligence model (referred to as a semantic segmentation model) and, supported by the computing power provided by the infrastructure in FIG. 1, performs data processing on input video frames to generate semantic segmentation results, for example, semantic segmentation images of the video frames. This enables functions such as background blur or background replacement of video frames, live broadcast production, movie or animation production, region-wise optimization of video frames, and recognition of objects in video frames, and can be applied to fields such as smart terminals, autonomous driving, and smart healthcare.
  • FIG. 2A is a schematic diagram of the implementation scenario of the semantic segmentation method provided in the embodiment of the present application.
  • the embodiment of the present application provides a system architecture 200.
  • the data collection device 260 is used to collect a video frame sequence and store it in the database 230, and the training device 220 generates a semantic segmentation model 201 based on the video frame sequence maintained in the database 230.
  • the semantic segmentation model 201 obtained by the training device 220 can be applied to different systems or devices.
  • the execution device 210 is configured with an I/O interface 212 to perform data interaction with external devices, and the "user" can input data to the I/O interface 212 through the client device 240.
  • the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
  • the calculation module 211 uses the semantic segmentation model 201 to perform semantic segmentation on the input video frame sequence to obtain a semantic segmentation image sequence.
  • the I/O interface 212 returns the processing result (ie, the obtained semantic segmentation image sequence) to the client device 240 and provides it to the user.
  • The user can manually specify the input data to the execution device 210, for example, by operating in the interface provided by the I/O interface 212.
  • the client device 240 can automatically input data to the I/O interface 212 and obtain the result. If the client device 240 automatically inputs data and needs the user's authorization, the user can set the corresponding authority in the client device 240. The user can view the result output by the execution device 210 on the client device 240, and the specific presentation form may be a specific manner such as display, sound, and action.
  • FIG. 2A is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • In FIG. 2A, the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 may also be placed in the execution device 210.
  • the execution device 210 may be provided in a server.
  • an embodiment of the present invention provides a system architecture 300.
  • The execution device 210 is implemented by one or more servers and, optionally, cooperates with other computing devices, such as data storage, routers, load balancers, and other devices; the execution device 210 can be arranged at one physical site or distributed across multiple physical sites.
  • the execution device 210 may use the data in the data storage system 250 or call the program code in the data storage system 250 to implement the method of the embodiment of the present application.
  • Each local device can represent any computing device, such as personal computers, computer workstations, smart phones, tablets, smart cameras, smart cars or other types of cellular phones, media consumption devices, wearable devices, set-top boxes, game consoles, etc.
  • Each user's local device can interact with the execution device 210 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • one or more aspects of the execution device 210 may be implemented by each local device.
  • the local device 2401 may provide the execution device 210 with local data or feed back calculation results.
  • both the client device 240 and the execution device 210 may be set in the terminal device.
  • the client device 240 may send a semantic segmentation request to the execution device 210, and the semantic segmentation request may include a segment of video (or video frame sequence) to be semantically segmented.
  • the execution device 210 is configured to sequentially generate a semantic segmentation image sequence of a video frame sequence by executing the semantic segmentation method provided in the embodiment of the present application, and return the obtained semantic segmentation image sequence to the client device 240.
  • All functions of the execution device 210 can also be implemented by a local device.
  • the local device 2401 implements the functions of the execution device 210 and provides services for its own users, or provides services for users of the local device 2402.
  • The client device 240 can obtain the video frame sequence to be semantically segmented and, by executing the semantic segmentation method provided in the embodiments of the present application, sequentially generate the semantic segmentation image sequence of the video frame sequence.
  • the client device 240 may execute corresponding scene or field applications based on the obtained semantic segmentation image sequence, such as smart terminals, unmanned driving, and medical fields. The following are examples of applications in the field of smart terminals:
  • An image acquisition device (such as a camera) and a processing chip may be provided on a smart terminal (such as a mobile phone). When the user turns on the video call function of the smart terminal, the image acquisition device can capture a scene video frame sequence; for example, the scene video frame sequence includes a portrait. The processing chip performs real-time semantic segmentation of the scene video frame sequence based on a semantic segmentation model to obtain a semantic segmentation image sequence. The generated semantic segmentation image sequence can provide a real-time semantic mask for the video frames, so that different target regions can be optimized separately.
  • The user can also turn on the multi-object recognition function of the smart terminal. The processing chip performs real-time semantic segmentation of the scene video frame sequence based on the semantic segmentation model, and the generated semantic segmentation image includes multiple target regions. The smart terminal can then identify the category of the object corresponding to each target region, such as a cup or a chair, bringing the user a strong visual experience.
  • the semantic segmentation method provided by the embodiments of the present application is not limited to the aforementioned scenes or fields.
  • the semantic segmentation method provided by the embodiments of the present application is introduced below. Taking the method applied to a computer device as an example, the computer device is provided with an execution device 210.
  • the computer device can be a terminal device, or a server, or a system composed of a terminal device and a server.
  • an embodiment of the semantic segmentation method of the present application may include the following steps:
  • the first video frame sequence may be a video acquired in real time by an image acquisition device of a computer device, or may be a video acquired from a network.
  • The first video frame and the second video frame are different video frames. Specifically, they can be two adjacent video frames in the first video frame sequence, or two video frames separated by one or more video frames in the first video frame sequence.
  • the first video frame and the second video frame in the first video frame sequence may be input into a pre-trained image segmentation model, respectively.
  • the image segmentation model is a semantic segmentation model for an image, which is used to perform semantic segmentation on an input image and output a semantic segmentation image of the image.
  • an image segmentation model may be trained based on multiple images with annotation information, and the annotation information of the image may be an annotated semantic segmentation image of the image (referred to as annotated semantic segmentation image).
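  • A minimal sketch of this kind of supervised training, assuming a placeholder model and data loader: the annotation is the annotated semantic segmentation image and the loss is per-pixel cross-entropy.

```python
import torch
import torch.nn.functional as F

def train_image_segmentation(model, loader, epochs=1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for image, annotated_seg in loader:     # annotated_seg: (N, H, W) class indices
            logits = model(image)               # (N, num_classes, H, W)
            loss = F.cross_entropy(logits, annotated_seg)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```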
  • the semantic segmentation model may be a convolutional neural network model, and the semantic segmentation model may include an input layer, a multi-layer network layer, and an output layer. Each network layer in the multi-layer network layer of the semantic segmentation model is used to extract the features of the input image or feature map and output the feature map.
  • The multiple network layers of a general semantic segmentation model can be regarded as an encoder-decoder structure. That is, the network layers close to the input layer act as the encoder and down-sample the input video frame or feature map, so that the resolution of the output feature map is smaller than that of the input; the network layers close to the output layer act as the decoder and up-sample the input feature map, so that the resolution of the output feature map is larger than that of the input.
  • The network layer whose output feature map has the smallest resolution among the multiple network layers is referred to as the intermediate network layer. The intermediate network layer and the network layers located between the input layer and the intermediate network layer are used to down-sample the input image or feature map, and the network layers located between the intermediate network layer and the output layer are used to up-sample the input feature map. For an image semantic segmentation model, the intermediate network layer and the network layers between the input layer and the intermediate network layer extract the semantic features of the image but gradually lose its spatial information, so the feature map output by the intermediate network layer provides the most semantic features of the input image; the network layers between the intermediate network layer and the output layer progressively restore spatial information in their output feature maps, so it can be considered that the closer a network layer is to the output layer, the more spatial information its output feature map can provide.
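  • A toy illustration of this resolution behaviour (not the model claimed here): the feature-map resolution shrinks through the encoder-side layers and grows again through the decoder-side layers, so the intermediate network layer is simply the one whose output has the smallest resolution.

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([
    nn.Conv2d(3, 16, 3, stride=2, padding=1),             # down-sample (encoder side)
    nn.Conv2d(16, 32, 3, stride=2, padding=1),            # down-sample -> smallest resolution (intermediate layer)
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),   # up-sample (decoder side)
    nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1),    # up-sample
])

x = torch.randn(1, 3, 64, 64)
resolutions = []
for layer in layers:
    x = layer(x)
    resolutions.append(x.shape[-2] * x.shape[-1])
intermediate_idx = min(range(len(resolutions)), key=resolutions.__getitem__)
print(resolutions, "intermediate layer index:", intermediate_idx)   # [1024, 256, 1024, 4096] -> index 1
```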
  • After the first video frame is input, the image segmentation model can process it and output the semantic segmentation image of the first video frame. Specifically, the input layer of the image segmentation model may preprocess the first video frame, for example, standardize it so that the red, green, and blue (RGB) components of the video frame are mapped to conform to a normal distribution. The first network layer of the image segmentation model extracts features from the first video frame and outputs a feature map; the second network layer extracts features from the feature map output by the first network layer and outputs a feature map; and so on, until the last (or bottom) network layer extracts features from the feature map output by the previous network layer and outputs a feature map. Finally, the output layer of the image segmentation model processes the feature map output by the bottom network layer and outputs the semantic segmentation image of the first video frame.
  • After inputting the first video frame into the image segmentation model, the computer device can obtain the feature map (called the first feature map) output by a certain network layer (called the first image segmentation network layer). In order to obtain as many semantic features as possible, the first image segmentation network layer may be the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model.
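  • One common way to obtain such an internal feature map, sketched below with a two-layer stand-in model rather than the actual image segmentation model, is to register a forward hook on the chosen layer and run an ordinary forward pass.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1),   # stand-in "first image segmentation network layer"
    nn.Conv2d(16, 8, 3, stride=2, padding=1),
)

captured = {}
handle = model[0].register_forward_hook(
    lambda module, inputs, output: captured.setdefault("first_feature_map", output.detach())
)
with torch.no_grad():
    model(torch.randn(1, 3, 64, 64))            # ordinary forward pass on the first video frame
first_feature_map = captured["first_feature_map"]
handle.remove()
print(first_feature_map.shape)                  # torch.Size([1, 16, 32, 32])
```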
  • Similarly, the image segmentation model can process the second video frame and output the semantic segmentation image of the second video frame; for the specific process, refer to the above description of the processing of the first video frame.
  • After inputting the second video frame into the image segmentation model, the computer device can obtain the feature map (called the second feature map) output by a certain network layer (called the second image segmentation network layer). In order for the second feature map to provide as much spatial information of the second video frame as possible, the second image segmentation network layer may be any network layer located between the intermediate network layer and the output layer of the image segmentation model.
  • The execution order of step 303 and step 304 is not limited.
  • the computer device can input the first feature map and the second feature map into the first inter-frame fusion model to generate a semantic segmentation image of the second video frame.
  • the first inter-frame fusion model can be a trained neural network model.
  • the first inter-frame fusion model can be trained in an end-to-end manner, or part of the network layer in the neural network model can be trained first. After that, other network layers are trained.
  • the first feature map of the first video frame can provide the timing information for the second video frame.
• the second feature map of the second video frame output by the second image segmentation network layer can provide the spatial information of the second video frame to a greater extent.
• generating the semantic segmentation image of the second video frame according to the first feature map and the second feature map is beneficial to using the timing information to improve the stability of semantic segmentation of the second video frame while maintaining the segmentation accuracy of a single video frame.
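• A possible shape of such a first inter-frame fusion model is sketched below (assumed PyTorch; the channel sizes, names, and the concatenate-then-convolve design are illustrative assumptions, not the embodiment's actual architecture): the first feature map is resized to the second feature map's resolution, the two are fused by convolutions, and a per-pixel argmax yields the segmentation of the second video frame.

```python
# Illustrative sketch (assumed PyTorch) of a first inter-frame fusion model combining a
# timing cue (first feature map) with a spatial cue (second feature map).
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterFrameFusion(nn.Module):
    def __init__(self, c_first=64, c_second=16, num_classes=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_first + c_second, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 3, padding=1),
        )

    def forward(self, first_fm, second_fm, out_size):
        # bring the low-resolution, semantics-rich map up to the spatial map's resolution
        first_up = F.interpolate(first_fm, size=second_fm.shape[-2:],
                                 mode="bilinear", align_corners=False)
        logits = self.fuse(torch.cat([first_up, second_fm], dim=1))
        # upsample the per-pixel class scores back to the frame resolution
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

fusion = InterFrameFusion()
logits = fusion(first_feature_map, second_feature_map, out_size=(128, 128))
mask2 = logits.argmax(dim=1)   # semantic segmentation image of the second video frame
```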
• the image segmentation model may be a convolutional neural network (CNN), which is a deep neural network with a convolutional structure.
• the convolutional neural network is a deep learning architecture.
  • the deep learning architecture refers to the use of machine learning algorithms to perform multiple levels of learning at different levels of abstraction.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network responds to overlapping regions in the input image.
  • a convolutional neural network (CNN) 400 may include an input layer 410, a multilayer network layer 420, and an output layer 430.
• the multilayer network layer 420 may include a convolutional layer and a hidden layer, and optionally may also include a pooling layer.
  • the first layer is the input layer 410
  • the last layer is the output layer 430
  • the number of layers in the middle is the network layer 420.
  • the network layer 420 will be introduced below.
  • the network layer 420 may include layers 421-426.
  • layer 421 is a convolutional layer
  • layer 422 is a pooling layer
  • layer 423 is a convolutional layer
  • layer 424 is a pooling layer.
  • 425 is a convolutional layer
  • 426 is a pooling layer; in another implementation, 421 and 422 are convolutional layers, 423 is a pooling layer, 424 and 425 are convolutional layers, and 426 is a pooling layer. That is, the output of the convolutional layer can be used as the input of the subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 421 can include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
• the convolution operator can be a weight matrix, and this weight matrix is usually predefined. In the process of performing convolution on the image, the weight matrix usually moves along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix and the depth dimension of the input image are the same.
• the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices with the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features in the image; for example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur out unwanted noise in the image. The multiple weight matrices have the same dimensions, the feature maps extracted by these weight matrices also have the same dimensions, and the extracted feature maps with the same dimensions are then merged to form the output of the convolution operation.
• the weight values in these weight matrices need to be obtained through extensive training in practical applications, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 400 to make correct predictions.
• the initial convolutional layer (such as 421) often extracts more general features, which can also be called low-level features; as the depth of the convolutional neural network 400 increases, the features extracted by the subsequent convolutional layers (for example, 426) become more and more complex, such as high-level semantic features, and features with higher-level semantics are more applicable to the problem to be solved.
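• For illustration only, the following small sketch (assumed PyTorch) applies several weight matrices (kernels) to an input with a chosen stride (here 1) and stacks their outputs along the depth dimension; the kernel values are assumptions.

```python
# Small sketch (assumed PyTorch) of the convolution operator described above: an
# edge-extracting kernel and two other illustrative kernels are applied with stride 1,
# and the single-depth outputs are stacked to form the depth dimension of the result.
import torch
import torch.nn.functional as F

image = torch.randn(1, 1, 8, 8)                      # one-channel input image
edge_kernel = torch.tensor([[[-1., 0., 1.],
                             [-2., 0., 2.],
                             [-1., 0., 1.]]])        # extracts vertical edge information
blur_kernel = torch.full((1, 3, 3), 1. / 9.)         # smooths out local noise
identity = torch.zeros(1, 3, 3)
identity[0, 1, 1] = 1.                               # passes the pixel through unchanged

weights = torch.stack([edge_kernel, blur_kernel, identity])   # 3 kernels, same dimensions
out = F.conv2d(image, weights, stride=1, padding=1)
print(out.shape)   # torch.Size([1, 3, 8, 8]): one output channel per weight matrix
```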
  • the first image segmentation network layer may be a convolutional layer, for example, the last convolutional layer.
• it can also be a multi-layer convolutional layer followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
  • the first image segmentation network layer may be a pooling layer, for example, the last pooling layer.
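• The sketch below (assumed PyTorch) illustrates the average pooling and maximum pooling operators halving the spatial size of a feature map.

```python
# Sketch (assumed PyTorch) of the pooling operators described above: each output pixel
# is the average or the maximum of a 2x2 sub-region of the input feature map.
import torch
import torch.nn.functional as F

feature_map = torch.arange(16.).reshape(1, 1, 4, 4)
avg = F.avg_pool2d(feature_map, kernel_size=2)   # average value per 2x2 region
mx = F.max_pool2d(feature_map, kernel_size=2)    # maximum value per 2x2 region
print(avg.shape, mx.shape)                       # torch.Size([1, 1, 2, 2]) twice
```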
• the convolutional neural network 400 is not yet able to output the required output information, such as semantic segmentation images, because, as described above, the convolutional layers and pooling layers only extract features and reduce the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 400 needs to use hidden layers to generate one output or a group of outputs of the required classes. Therefore, the network layer can include multiple hidden layers (427, 428 to 429 as shown in Figure 4A), and the parameters contained in the multiple hidden layers can be pre-trained according to the relevant training data of the specific task type; for example, the task type can include image recognition, image classification, and so on.
• after the hidden layers, the final layer of the entire convolutional neural network 400 is the output layer 430.
  • the output layer 430 has a loss function similar to the classification cross-entropy, which is specifically used to calculate the prediction error.
  • the convolutional neural network 400 shown in FIG. 4A is only used as an example of a convolutional neural network.
  • the convolutional neural network may also exist in the form of other network models, for example,
  • the multiple convolutional layers or pooling layers shown in FIG. 4B are in parallel, and the respectively extracted feature maps are input to the hidden layer for processing.
  • the first image segmentation network layer is used to downsample the input video frame or feature map
  • the second image segmentation network layer is used to upsample the input feature map
  • step 305 may specifically include the following steps:
  • the first neighboring frame prediction model is used to predict the information of neighboring video frames, and the neighboring video frame and the video frame to which the feature map input to the first neighboring frame prediction model belongs belong to the same video frame sequence.
  • the first neighboring frame prediction model may be a convolutional neural network model, and the first neighboring frame prediction model may include an input layer, a multi-layer network layer, and an output layer. Each network layer in the multi-layer network layer of the first neighboring frame prediction model is used to perform feature extraction on the input image or feature map, and output the feature map.
• the multi-layer network layer of the first adjacent frame prediction model can be considered as an encoder-decoder structure; that is, the network layers close to the input layer in the multi-layer network layer are equivalent to the encoder and can down-sample the input video frame or feature map so that the resolution of the output feature map is smaller than the resolution of the input video frame or feature map, while the network layers close to the output layer in the multi-layer network layer are equivalent to the decoder and can up-sample the input feature map so that the resolution of the output feature map is greater than the resolution of the input video frame.
  • the network layer with the smallest resolution of the output feature map in the multilayer network layer is referred to as the intermediate network layer.
• the intermediate network layer and the network layers located between the input layer and the intermediate network layer are used to down-sample the input image or feature map, and the network layers located between the intermediate network layer and the output layer are used to up-sample the input feature map.
  • the first compressed feature map of the first feature map output by the first neighboring frame prediction network layer can be obtained.
• the first adjacent frame prediction network layer may be the intermediate network layer of the first adjacent frame prediction model or any network layer located between the input layer and the intermediate network layer of the first adjacent frame prediction model.
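• A hedged sketch of such a first neighboring frame prediction model is given below (assumed PyTorch; channel sizes are assumptions): a small encoder-decoder maps the feature map of one frame toward the feature map of its neighboring frame, and the output of its bottleneck (intermediate network layer) serves as the compressed feature map carrying timing change information.

```python
# Hedged sketch (assumed PyTorch) of a first neighboring frame prediction model whose
# bottleneck output is used as the compressed feature map.
import torch
import torch.nn as nn

class NeighborFramePredictor(nn.Module):
    def __init__(self, channels=64, compressed=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, compressed, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(compressed, channels, 4, stride=2, padding=1))

    def forward(self, feature_map):
        compressed_fm = self.encoder(feature_map)        # compressed feature map (timing cue)
        predicted_neighbor = self.decoder(compressed_fm)  # predicted neighboring-frame feature map
        return predicted_neighbor, compressed_fm

predictor = NeighborFramePredictor()
_, first_compressed = predictor(first_feature_map)        # timing cue from the first frame
```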
  • the second inter-frame fusion model may be a neural network model.
• the feature map output by the first neighboring frame prediction network layer helps to reflect the timing change information in the first video frame sequence; therefore, the feature map of the second video frame output by the second image segmentation network layer is not required for comparison, and the first feature map can independently provide the timing change information.
• the first neighboring frame prediction model can be obtained by training based on a first sample set with label information; a first sample is any sample in the first sample set, the first sample is the feature map of a third video frame output by the first image segmentation network layer, the label information of the first sample is the feature map of a fourth video frame output by the first image segmentation network layer, and the third video frame and the fourth video frame are different video frames in the same video frame sequence.
• the video frame corresponding to a sample used to train the first neighboring frame prediction model is in the first timing direction of the video frame corresponding to its label information, just as the third video frame is in the first timing direction of the fourth video frame. That is to say, if the first feature map of the first video frame is input into the first neighboring frame prediction model, the resulting feature map will be close to the feature map that the first image segmentation network layer would output for the second video frame; therefore, the first compressed feature map of the first feature map output by the first neighboring frame prediction network layer can reflect the timing change information from the first video frame to the second video frame.
• the first time sequence direction may refer to the pre-order, that is, the moment when the first video frame is shot is before the moment when the second video frame is shot; or the first time sequence direction may refer to the post-order, that is, the moment when the first video frame is shot is after the moment when the second video frame is shot.
• for example, the first time sequence direction may be the pre-order; in this case, the first video frame may be the previous frame of the second video frame.
• feature maps output by two or more network layers in the image segmentation model can be obtained; for example, the first feature map output by the first image segmentation network layer and a fourth feature map output by a third image segmentation network layer can be obtained, and the third image segmentation network layer can be any network layer in the image segmentation model.
  • the second image segmentation network layer may be the first network layer or the penultimate network layer.
• the acquired feature maps can be input into the first inter-frame fusion model; for example, the first feature map, the second feature map, and the fourth feature map are input into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame, which helps to make the final semantic segmentation result more stable and accurate.
• the compressed feature map of each feature map of the first video frame can be obtained by referring to the method corresponding to FIG. 3B, and then the second feature map and each compressed feature map of the first video frame are input into the second inter-frame fusion model.
• for the feature maps output by different network layers, different adjacent frame prediction models can be used; for example, when extracting the timing information of the feature map output by the first network layer, the adjacent frame prediction model used can be trained based on the feature maps of sample video frames output by the first network layer.
  • the semantic segmentation method provided in this embodiment of the present application may further include: acquiring a third feature map of the second video frame output by the first image segmentation network layer.
  • the third feature map can be used to provide timing information.
  • a possible refinement step of step 305 may include:
  • the second neighboring frame prediction model is used to predict the information of neighboring video frames, and the neighboring video frame and the video frame to which the feature map of the second neighboring frame prediction model belongs belong to the same video frame sequence.
• the second neighboring frame prediction model may be a convolutional neural network model, and the second neighboring frame prediction model may include an input layer, a multi-layer network layer, and an output layer.
  • Each network layer in the multi-layer network layer of the second adjacent frame prediction model is used to perform feature extraction on the input image or feature map, and output the feature map.
• the multi-layer network layer of the second adjacent frame prediction model can be considered as an encoder-decoder structure; that is, the network layers close to the input layer in the multi-layer network layer are equivalent to the encoder and can down-sample the input video frame or feature map so that the resolution of the output feature map is smaller than the resolution of the input video frame or feature map, while the network layers close to the output layer in the multi-layer network layer are equivalent to the decoder and can up-sample the input feature map so that the resolution of the output feature map is greater than the resolution of the input video frame.
  • the network layer with the smallest resolution of the output feature map in the multilayer network layer is referred to as the intermediate network layer.
• the intermediate network layer and the network layers located between the input layer and the intermediate network layer are used to down-sample the input image or feature map, and the network layers located between the intermediate network layer and the output layer are used to up-sample the input feature map.
  • the second adjacent frame prediction network layer is an intermediate network layer of the second adjacent frame prediction model or any network layer located between the input layer and the intermediate network layer of the second adjacent frame prediction model.
• the second neighboring frame prediction model is trained based on a second sample set with annotation information; a second sample is any sample in the second sample set, the second sample is the feature map of a fifth video frame output by the first image segmentation network layer, the label information of the second sample is the feature map of a sixth video frame output by the first image segmentation network layer, and the fifth video frame and the sixth video frame are different video frames in the same video frame sequence.
• the first video frame is in the first time sequence direction of the second video frame, and the sixth video frame is in the first time sequence direction of the fifth video frame. That is to say, if the third feature map of the second video frame is input into the second neighboring frame prediction model, the resulting feature map will be close to the first feature map of the first video frame; therefore, the second compressed feature map of the third feature map output by the second neighboring frame prediction network layer can reflect the timing change information from the second video frame to the first video frame.
• the second inter-frame fusion model may be trained based on a third sample set with annotation information; a third sample is any sample in the third sample set, and the third sample includes the compressed feature map of a fourth feature map output by the first adjacent frame prediction network layer, the compressed feature map of a fifth feature map output by the second adjacent frame prediction network layer, and a sixth feature map of an eighth video frame output by the second image segmentation network layer. The fourth feature map is the feature map of a seventh video frame output by the first image segmentation network layer, the fifth feature map is the feature map of the eighth video frame output by the first image segmentation network layer, the seventh video frame and the eighth video frame are different video frames in the same video frame sequence, and the label information of the third sample is the annotated semantic segmentation image of the eighth video frame.
  • the pre-trained neural network model is used to fuse the timing information and spatial information of the second video frame, which is beneficial to improve the accuracy and stability of semantic segmentation of the second video frame.
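• For illustration (assumed PyTorch; all shapes and the concatenation-based design are assumptions), a second inter-frame fusion model could combine the compressed feature map derived from the first frame, the compressed feature map derived from the second frame, and the second feature map of the second frame as follows.

```python
# Hedged sketch (assumed PyTorch) of a second inter-frame fusion model: forward and
# backward compressed feature maps are resized to the second feature map's resolution,
# concatenated with it, and convolved into per-pixel class scores for the second frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondInterFrameFusion(nn.Module):
    def __init__(self, c_fwd=16, c_bwd=16, c_spatial=16, num_classes=2):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_fwd + c_bwd + c_spatial, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 3, padding=1))

    def forward(self, fwd_compressed, bwd_compressed, second_fm, out_size):
        size = second_fm.shape[-2:]
        fwd = F.interpolate(fwd_compressed, size=size, mode="bilinear", align_corners=False)
        bwd = F.interpolate(bwd_compressed, size=size, mode="bilinear", align_corners=False)
        logits = self.fuse(torch.cat([fwd, bwd, second_fm], dim=1))
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)
```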
  • the embodiments of the present application also provide a model training method for training the aforementioned first inter-frame fusion model.
  • the training device 220 trains the first inter-frame fusion model.
• the training process of a neural network model generally requires a large number of samples, and each training iteration requires at least two video frames (called the first frame and the second frame) in the same video; the process of training the semantic segmentation model with the first frame and the second frame is taken as an example below to introduce the model training method provided in the embodiment of the present application.
  • the model training method provided by the embodiment of the present application may include the following steps:
  • the training device 220 can generate corresponding semantic segmentation models 201 based on different data for different targets, so as to provide users with better results.
  • both the first frame and the second frame may include portraits.
  • the client device 240 may also serve as a data collection terminal to store the collected video frame sequence (including the first frame and the second frame) in the database 230.
  • the image segmentation model is a trained model for semantic segmentation of input images.
  • the image segmentation model is a convolutional neural network model.
• the convolutional neural network model includes an input layer, an output layer, and a multi-layer network layer located between the input layer and the output layer; each layer in the multi-layer network layer is used to extract features of the input data, and the intermediate network layer is the network layer whose output feature map has the lowest resolution in the multi-layer network layer.
• the first feature map of the first frame output by the first image segmentation network layer can be obtained, where the first image segmentation network layer can be the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model.
• the second feature map of the second frame output by the second image segmentation network layer can be obtained, where the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model.
  • the first inter-frame fusion model may be a neural network model.
• the embodiment of the present application provides a semantic segmentation model for video frames, including the above-mentioned trained image segmentation model and the first inter-frame fusion model.
  • the embodiment of the present application provides a method for training a first inter-frame fusion model.
• the image segmentation model and the first inter-frame fusion model are used to perform semantic segmentation on video frames, which is beneficial to improving the accuracy and stability of the semantic segmentation results of the video frames.
  • the training process for the semantic segmentation model may be a training process for the first inter-frame fusion model, or may include a training process for the image segmentation model and a training process for the first inter-frame fusion model.
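• A hedged training-loop sketch for the first inter-frame fusion model is shown below (assumed PyTorch, reusing the illustrative ToySegModel and InterFrameFusion from the earlier sketches; data loading and hyperparameters are assumptions): the trained image segmentation model is frozen, and only the fusion model's parameters are updated against the annotated segmentation of the second frame.

```python
# Hedged training sketch (assumed PyTorch) for the first inter-frame fusion model.
import torch
import torch.nn as nn

seg_model = ToySegModel().eval()                  # trained image segmentation model (frozen)
for p in seg_model.parameters():
    p.requires_grad_(False)

fusion = InterFrameFusion()                       # first inter-frame fusion model to be trained
optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()                 # per-pixel classification loss

def training_step(first_frame, second_frame, second_frame_label):
    """first_frame/second_frame: [N, 3, H, W]; second_frame_label: [N, H, W] class indices."""
    with torch.no_grad():
        _, first_fm, _ = seg_model(first_frame)       # first feature map of the first frame
        _, _, second_fm = seg_model(second_frame)     # second feature map of the second frame
    logits = fusion(first_fm, second_fm, out_size=second_frame.shape[-2:])
    loss = criterion(logits, second_frame_label)      # compare with the annotated segmentation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```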
• the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors). These five operations include: 1. raising/reducing the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by "W·x", operation 4 is completed by "+b", and operation 5 is implemented by "a()".
  • W is a weight vector, and each value in the vector represents the weight value of a neuron in the layer of neural network.
  • This vector W determines the space transformation from the input space to the output space above, that is, the weight W of each layer controls how the space is transformed.
• the purpose of training a deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially learning how to control the space transformation, and more specifically, learning the weight matrices.
• the weight vector of the network (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function) or objective function (objective function), an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher its output value (loss), the greater the difference, and the training of the deep neural network then becomes a process of reducing this loss as much as possible.
• when the neural network satisfies the preset constraint conditions, it can be considered that the neural network has completed training, and the neural network at this point can be called a neural network model.
  • the constraint condition can be to reach the preset number of iterations, or the performance of the neural network after adjusting the parameters to reach the preset index, etc.
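• The following tiny worked sketch (assumed PyTorch; dimensions and target values are illustrative) shows the layer operation a(W·x + b), a loss measuring the difference between the predicted value and the target value, and a preset number of iterations as the constraint condition.

```python
# Tiny worked sketch (assumed PyTorch) of a(W.x + b) and loss-driven weight updates.
import torch

W = torch.randn(4, 3, requires_grad=True)   # weight matrix of one layer
b = torch.zeros(4, requires_grad=True)      # bias completing the translation
x = torch.randn(3)
target = torch.ones(4)

optimizer = torch.optim.SGD([W, b], lr=0.1)
for step in range(100):                      # preset number of iterations as the constraint
    y = torch.relu(W @ x + b)                # a(W.x + b): scale/rotate, translate, "bend"
    loss = torch.mean((y - target) ** 2)     # measures predicted value vs. target value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```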
  • the method further includes: acquiring a third feature map of the second frame output by the first image segmentation network layer.
  • a refinement step of step 505 may include: inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model, and updating the parameters of the first inter-frame fusion model.
  • a refinement step of step 505 may include: using the third feature map as annotation information, inputting the first feature map to the first neighboring frame prediction model, and updating the first neighboring frame prediction model. parameter.
• the first neighboring frame prediction model is a convolutional neural network model; when the first neighboring frame prediction model satisfies a first constraint, for example, when the number of training iterations of the first neighboring frame prediction model reaches a preset number of iterations, then, referring to FIG. 5B, a refinement step of step 505 may include:
  • the first adjacent frame prediction network layer is an intermediate network layer of the first adjacent frame prediction model or any network layer located between the input layer and the intermediate network layer of the first adjacent frame prediction model.
• the compressed feature map output by the first neighbor frame prediction network layer in the trained first neighbor frame prediction model can reflect the information that changes over time between different video frames in the same video, that is, timing information. Therefore, without the second feature map, the timing information between the second frame and the first frame can be provided based on the first feature map.
  • a refinement step of step 505 may further include: using the first feature map as annotation information, inputting the third feature map into the second neighboring frame prediction model, and updating the second neighboring frame prediction model Parameters.
  • the second neighboring frame prediction model may be a convolutional neural network model.
  • a refinement step of step 505 may further include:
  • the second adjacent frame prediction network layer is the intermediate network layer of the second adjacent frame prediction model or any network layer located between the input layer and the intermediate network layer of the second adjacent frame prediction model.
• the compressed feature map output by the second neighbor frame prediction network layer in the trained second neighbor frame prediction model can reflect the information that changes over time between different video frames in the same video, that is, timing information. By further compressing the features used to provide timing information, redundant information such as noise can be removed, so that the subsequent second inter-frame fusion model is easier to learn, and the storage pressure of intermediate variables in the computer device is further reduced.
• the terms "first video frame" and "the first frame" do not limit the frame to being the first video frame in the video, and the terms "second video frame" and "the second frame" do not limit the frame to being the second video frame in the video; in addition, "first video frame" and "the first frame" are not limited to being the same frame, and "second video frame" and "the second frame" are not limited to being the same frame.
  • the image segmentation model in the model training method provided in the embodiment of the present application can be understood with reference to the image segmentation model in the semantic segmentation method provided in the embodiment of the present application, and will not be repeated here.
  • the first inter-frame fusion model in the semantic segmentation method provided in the embodiment of the present application can be obtained according to the training method of the first inter-frame fusion model in the model training method provided in the embodiment of the present application.
  • the first neighboring frame prediction model in the foregoing semantic segmentation method embodiment may be obtained according to the training method of the first neighboring frame prediction model in the model training method provided in the embodiment of the present application;
  • the second adjacent frame prediction model in the semantic segmentation method provided in the embodiment of the application can be obtained according to the training method of the second adjacent frame prediction model in the model training method provided in the embodiment of the application;
  • the second inter-frame fusion model in the semantic segmentation method provided in the embodiment of the present application can be obtained according to the training method of the second inter-frame fusion model in the model training method provided in the embodiment of the present application.
  • an embodiment of the video call method of the present application may include the following steps:
  • the first terminal device collects a first local video frame through its image acquisition module.
• the first terminal device can collect images in real time through its image acquisition module (such as a camera) to obtain an image sequence; because of the timing correlation between the images in the image sequence, each collected image can be called a video frame (referred to as a first local video frame).
  • the first terminal device receives the opposite end video frame collected by the second terminal device through its image collection module.
  • the second terminal device can collect video frames in real time, and send the collected video frames to the first terminal device.
  • the first terminal device generates a second local video frame according to the first local video frame.
• the first terminal device may generate the second local video frame according to the first local video frame, where the first image area of the first local video frame displays the first image, the second image area of the second local video frame displays the second image, and the first image is different from the second image.
  • the third image area of the first local video frame and the fourth image area of the second local video frame both display the third image.
  • the third image can be called the foreground, and the first image and the second image can be called the background of the third image; the third image area and the fourth image area can be called the foreground area, the first image area and the second image area It can be called a background area.
• the size of the first image area and the second image area, or their positions relative to the video frame, are not limited to being the same; likewise, the size of the third image area and the fourth image area, or their positions relative to the video frame, are not limited to being the same.
  • the first terminal device simultaneously displays the opposite end video frame and the second local video frame through its display screen.
  • the first terminal device may simultaneously display the opposite end video frame and the second local video frame in different layers.
  • the sizes of the corresponding areas of the two video frames on the display screen are different.
• the first terminal device may generate a frame of fused image based on the opposite end video frame and the second local video frame, where one part of the fused image displays all or part of the image area of the opposite end video frame and another part of the fused image displays all or part of the image area of the second local video frame; alternatively, the first image displayed in the first image area of the first local video frame may be replaced with the second image.
  • the background switching of the video frames collected by the terminal device is beneficial to improve the interest of the video call process and increase user stickiness.
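• One possible way to realize this background switching, sketched below for illustration (assumed PyTorch tensors; the function name and the binary portrait mask are assumptions), is to keep the foreground pixels of the first local video frame where the mask is 1 and paste the chosen second image elsewhere.

```python
# Hedged sketch (assumed PyTorch) of generating the second local video frame: portrait
# (foreground) pixels are kept and background pixels are replaced with the second image.
import torch

def switch_background(first_local_frame, portrait_mask, second_image):
    """first_local_frame, second_image: [3, H, W]; portrait_mask: [H, W] with 1 = portrait."""
    mask = portrait_mask.unsqueeze(0).float()          # broadcast the mask over RGB channels
    return mask * first_local_frame + (1.0 - mask) * second_image

frame = torch.rand(3, 128, 128)
background = torch.rand(3, 128, 128)
mask = (torch.rand(128, 128) > 0.5).long()             # stand-in for the predicted portrait mask
second_local_frame = switch_background(frame, mask, background)
```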
  • the video call method of this application may further include: the first terminal device sends the second local video frame to the second terminal device.
  • the second terminal device can display the second local video frame on the video call interface, so that the callee can see the video image after the background switch.
• step 603 may include: the first terminal device generates the second local video frame according to the first local video frame in response to a switching instruction of the user, where the switching instruction is used to instruct the first terminal device to switch the first image in the first local video frame to the second image.
  • the switching instruction may be generated in response to a user's selection operation on a certain option, and the option is used to prompt to switch the background of the video frame to the second image.
  • the terminal device may provide the user with multiple options for issuing switching instructions, and different options are used to switch the background of the video frame to different images.
  • the switching instruction may be generated in response to the terminal device detecting a change in its posture (for example, the orientation of the camera), and the posture change of the terminal device will cause the background of the portrait in the captured video frame to change.
  • the user can choose to switch the background of the portrait to a set of images, which can be used to embody environmental images in a three-dimensional space.
• the set of images may be shot in the same scene or environment with a shooting device, where different images correspond to different orientations of the camera of the shooting device, and the orientation information of the camera corresponding to each image can be associated and saved.
• the terminal device can automatically select the corresponding image according to its own orientation and switch the background image of the portrait in the currently collected video frame to the selected image, so that the image of the background area changes with the posture of the terminal device; dynamically switching the background image in this way improves the realism of the video after the background switch, as if the person in the video had truly switched into that environment.
  • the semantic types corresponding to the first image and the third image are different.
  • the second local video frame may be generated according to the semantic segmentation result of the first local video frame.
• the first terminal device generates the semantic segmentation image of the first local video frame according to any one of the foregoing semantic segmentation methods provided by this application, based on the first local video frame and a third local video frame, where the third local video frame and the first local video frame are different video frames in the same video frame sequence collected by the first terminal device, for example, the previous frame of the first local video frame; after that, the first terminal device generates the second local video frame according to the semantic segmentation image and the first local video frame.
  • the embodiment of the present application also provides a background switching method of video frames.
• An embodiment of the method may include the following steps: the first terminal device collects a first video frame through its image acquisition module; in response to the first terminal device detecting a change in its own posture, the first terminal device generates a second video frame according to the first video frame, where the first image area of the first video frame displays the first image, the second image area of the second video frame displays the second image, and the first image and the second image are different.
  • the third image area of the first video frame and the fourth image area of the second video frame both display the third image; the first terminal device displays the second video frame through its display screen.
  • the terminal device may store the image and the posture information (or posture change information) of the terminal device in association, and when the first terminal device detects the change of its posture, it can select the image corresponding to the current posture, or The image corresponding to the change information of the posture (such as the change direction or the change speed) is selected, and the selected image is displayed in the second image area (or the background area of the third image) of the second video frame.
  • the user can choose to switch the background of the portrait to a set of images, which can be used to embody environmental images in a three-dimensional space.
• the set of images may be shot in the same scene or environment with a shooting device, where different images correspond to different orientations of the camera of the shooting device, and the orientation information of the camera corresponding to each image can be associated and saved.
• the terminal device can automatically select the corresponding image according to its own orientation and switch the background image of the portrait in the currently collected video frame to the selected image, so that the image of the background area changes with the posture of the terminal device; dynamically switching the background image in this way improves the realism of the video after the background switch, as if the person in the video had truly switched into that environment.
  • User 1 and user 2 can conduct video calls through their respective terminal devices (such as mobile phones).
  • the mobile phone of user 1 is recorded as mobile phone 1
  • the mobile phone of user 2 is recorded as mobile phone 2.
• the video call process can be supported by a system application (such as a phone application) in a mobile phone, or it can be supported by a third-party application (such as a social application).
• mobile phone 1 and mobile phone 2 can separately collect their respective video frame sequences and send them to each other through the Internet, so that the video frame sequences collected by mobile phone 1 and mobile phone 2 are displayed simultaneously on their respective screens, realizing the effect of face-to-face communication.
  • the following takes a video frame in the video frame sequence collected by mobile phone 1 and mobile phone 2 as an example to introduce the display content of mobile phone 1 on its display screen when it executes the video call method provided in this application.
• Mobile phone 1 can send video frame 1 to mobile phone 2 via the Internet, and mobile phone 2 can send video frame 2 to mobile phone 1 via the Internet.
  • Mobile phone 1 can provide user 1 with multiple options for portrait background switching (option 1, option 2, option 3, and option 4).
• when user 1 selects option 1 (the arrow in Figure 7C), mobile phone 1 does not execute the video call method provided in the embodiment of this application, that is, it does not switch the portrait background of video frame 1, and mobile phone 1 displays video frame 1 and video frame 2 on its display at the same time, as shown in Figure 7C; when user 1 selects option 2 (the arrow in Figure 7D), mobile phone 1 executes the video call method provided in the embodiment of the present application, replaces the background area of the portrait in video frame 1 (ie, the first image area) with the image corresponding to option 2 to obtain video frame 1' after the portrait background switching, and then video frame 1' and video frame 2 can be displayed on the display screen at the same time, as shown in FIG. 7D.
  • the application scenario is: when a user uses a smart terminal to record a video or uses a smart terminal to make a video call with others, the smart terminal uses the area corresponding to the portrait as the target area, and uses other areas other than the target area in the video frame as the background. Switch the image corresponding to the background to realize the background switch of the portrait.
  • an embodiment of the semantic segmentation method implemented by the semantic segmentation model 800 in this application may include the following steps:
  • Step 1 Obtain frame 1, frame 2, and frame 3 sequentially through the camera;
  • Step 2 Input frame 1 into the image segmentation model 801, and obtain the feature map 1_1 output by the network layer 1, the feature map 1_4 output by the network layer 4, the feature map 1_6 output by the network layer 6 and the feature map 1_7 output by the network layer 7;
  • Step 3 Input the feature map 1_7 into the second inter-frame fusion model 802 to obtain the semantic segmentation image 1 of frame 1 (mask 1 in the embodiment of this application), for example, generate the mask used in Fig. 7B, refer to Fig. 8B ;
  • Step 4 According to the segmented image of the portrait of frame 1, replace the area outside the target area (ie the background) corresponding to the portrait in frame 1 with the specified image, and obtain frame 1 after switching the background, which is called frame 1';
  • Step 5 Input the feature map 1_1, feature map 1_4, and feature map 1_6 into the first neighboring frame prediction model (represented by a white filled circle in Figure 8A) to obtain compressed feature 1_1a, compressed feature map 1_4a, and compressed feature map 1_6a , And cache;
• the timing relationship between step 3 and step 5 is not limited.
  • Step 6 Input frame 2 into the image segmentation model 801, and obtain the feature map 2_1 output by the network layer 1, the feature map 2_4 output by the network layer 4, the feature map 2_6 output by the network layer 6, and the feature map 2_7 output by the network layer 7;
  • Step 7 Input the feature map 2_1, feature map 2_4, and feature map 2_6 into the first neighboring frame prediction model (indicated by white filled circles in Figure 8A) to obtain compressed feature 2_1a, compressed feature map 2_4a, and compressed feature map 2_6a , And cache;
  • Step 8 Input the feature map 2_1, feature map 2_4, and feature map 2_6 into the second neighboring frame prediction model (represented by a black filled circle in Figure 8A) to obtain compressed feature 2_1b, compressed feature map 2_4b, and compressed feature map 2_6b ;
  • Step 9 Input the compressed feature map 1_1a, compressed feature map 1_4a, compressed feature map 1_6a, compressed feature map 2_1b, compressed feature map 2_4b, compressed feature map 2_6b, and feature map 2_7 into the second inter-frame fusion model 802 to obtain frame 2 Portrait segmented image 2 (mask 2 in the embodiment of this application);
• Step 10 According to the segmented portrait image of frame 2, replace the area other than the target area corresponding to the portrait in frame 2 with the designated background to obtain frame 2 after switching the background, which is called frame 2';
• the timing relationship between step 7 and step 8 is not limited.
  • Step 11 Input frame 3 into the image segmentation model 801, and obtain the feature diagram 3_1 output by the network layer 1, the feature diagram 3_4 output by the network layer 4, the feature diagram 3_6 output by the network layer 6, and the feature diagram 3_7 output by the network layer 7;
  • Step 12 Input the feature map 3_1, feature map 3_4, and feature map 3_6 into the first neighboring frame prediction model (indicated by white filled circles in Figure 8A) to obtain compressed feature 3_1a, compressed feature map 3_4a, and compressed feature map 3_6a , And cache;
  • Step 13 Input the feature map 3_1, feature map 3_4, and feature map 3_6 into the second neighboring frame prediction model (represented by a black filled circle in Figure 8A) to obtain compressed feature 3_1b, compressed feature map 3_4b, and compressed feature map 3_6b ;
• Step 14 Input the compressed feature map 2_1a, compressed feature map 2_4a, compressed feature map 2_6a, compressed feature map 3_1b, compressed feature map 3_4b, compressed feature map 3_6b, and feature map 3_7 into the second inter-frame fusion model 802 to obtain the portrait segmentation image 3 of frame 3 (mask 3 in the embodiment of this application);
  • Step 15 In the segmented portrait image of frame 3, the area outside the target area corresponding to the portrait in frame 3 is replaced with the designated background to obtain frame 3 after switching the background, which is called frame 3'.
• the timing relationship between step 12 and step 13 is not limited.
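• The per-frame flow of steps 1 to 15 can be condensed into the following hedged sketch (Python-style pseudocode; seg_model_801, fwd_predictors, bwd_predictors, fusion_802 and its single_frame call are illustrative stand-ins, the predictors are assumed to return a (prediction, compressed feature) pair as in the earlier sketch, and switch_background is reused from the sketch above).

```python
def process_stream(frames, seg_model_801, fwd_predictors, bwd_predictors,
                   fusion_802, background):
    """Condensed flow of steps 1-15; every model argument is an illustrative stand-in."""
    cached_fwd = None            # compressed feature maps *_1a, *_4a, *_6a of the previous frame
    outputs = []
    for t, frame in enumerate(frames):
        f1, f4, f6, f7 = seg_model_801(frame)          # feature maps of network layers 1, 4, 6, 7
        fwd = [p(f)[1] for p, f in zip(fwd_predictors, (f1, f4, f6))]     # *_1a, *_4a, *_6a
        if cached_fwd is None:
            mask = fusion_802.single_frame(f7)         # frame 1: mask from feature map 1_7 only
        else:
            bwd = [p(f)[1] for p, f in zip(bwd_predictors, (f1, f4, f6))]  # *_1b, *_4b, *_6b
            mask = fusion_802(cached_fwd, bwd, f7)     # e.g. step 9 (frame 2) or step 14 (frame 3)
        outputs.append(switch_background(frame, mask, background))        # frame t' after switching
        cached_fwd = fwd                               # cache for the next frame (steps 5, 7, 12)
    return outputs
```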
  • the first neighboring frame prediction models corresponding to the feature maps output by different network layers may be different.
  • the second inter-frame fusion model 802 can fuse multiple input data.
  • the following continues based on the embodiment corresponding to FIG. 8A, taking step 9 as an example to introduce the data processing process of the second inter-frame fusion model 802.
  • the four-pointed star represents the connection operation, such as the pixel-by-pixel addition operation or the merge (concat) operation.
• the concat operation is used to connect two or more arrays; this method does not change the existing arrays, but only returns a copy of the concatenated arrays.
  • diamonds represent convolution operations. Diamonds with different identified numbers can represent different types of convolution operations.
  • a diamond with a "1" can represent one or more convolutions; a diamond with a "2"
  • FIG. 8C is used to exemplarily introduce the internal structure of the second inter-frame fusion model 802.
  • the second inter-frame fusion model 802 may also include other operations.
• the data output by the last convolution operation may be post-processed, for example, normalization processing such as a normalized exponential function (softmax function).
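• For illustration (assumed PyTorch), the sketch below shows the named operations: a connection operation (pixel-by-pixel addition or concat), a convolution operation, and softmax post-processing of the last convolution's output; shapes are assumptions.

```python
# Small sketch (assumed PyTorch) of the operations named for FIG. 8C.
import torch
import torch.nn as nn
import torch.nn.functional as F

a = torch.randn(1, 8, 32, 32)
b = torch.randn(1, 8, 32, 32)

added = a + b                             # pixel-by-pixel addition (shape preserved)
concatenated = torch.cat([a, b], dim=1)   # concat: channels stacked, inputs left unchanged

conv = nn.Conv2d(16, 2, kernel_size=3, padding=1)
logits = conv(concatenated)               # "diamond": a convolution operation
probs = F.softmax(logits, dim=1)          # normalized exponential function over classes
```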
  • the semantic segmentation method of subsequent video frames can refer to the foregoing steps, which will not be repeated here.
  • the evaluation index of the semantic segmentation method provided in this application is better than that of a low-latency network that uses a single video frame for semantic segmentation (called a single-frame low-latency network).
  • the evaluation index of the semantic segmentation method provided in this application is equivalent to the test result of a large-scale network that uses a single video frame for semantic segmentation (referred to as a single-frame large-scale network model).
  • the semantic segmentation method provided in the embodiments of the present application also optimizes the fragmentation phenomenon of single frame segmentation.
• the fusion network model provided by the embodiment of the present application adds only a small delay, and the number of fixed-point multiply-accumulate operations (Macc) performed per second is less than 90M.
  • IOU is an abbreviation for Intersection over Union, and IOU is a standard for measuring the accuracy of detecting corresponding objects in a specific data set.
  • the evaluation index of the semantic segmentation method provided by this application is better than the network model using the optical flow method for semantic segmentation.
• the FPS in Table 2 is the abbreviation of frames per second.
  • GPU is an abbreviation for graphics processing unit.
  • Table 3 shows the comparison of the results of semantic segmentation on portrait video frames.
  • the evaluation index of the semantic segmentation method provided in this application is better than the network model that uses video object segmentation (VOS) for semantic segmentation.
  • Fig. 9 is a hardware structure diagram of a chip provided by an embodiment of the present invention.
  • the neural network processor 970 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks.
  • the core part of the neural network processor 970 is the arithmetic circuit 903.
  • the arithmetic circuit 903 is controlled by the controller 904 to extract matrix data from the memory and perform multiplication operations.
  • the computing capability required by the method in the embodiment of the present application may be provided by the neural network processor 970 or the neural network processor 970 and the main CPU shown in FIG. 9.
  • the arithmetic circuit 903 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 903 is a two-dimensional systolic array. The arithmetic circuit 903 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 903 is a general-purpose matrix processor.
  • the arithmetic circuit 903 fetches the data corresponding to the matrix B from the weight memory 902 and buffers it on each PE in the arithmetic circuit 903.
  • the arithmetic circuit 903 takes the matrix A data and the matrix B from the input memory 901 to perform matrix operations, and the partial results or final results of the obtained matrix are stored in the accumulator 908.
  • the unified memory 906 is used to store input data and output data.
• the weight data is directly transferred to the weight memory 902 through the storage unit access controller (Direct Memory Access Controller, DMAC) 905.
  • the input data is also transferred to the unified memory 906 through the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 910, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer 909.
  • the bus interface unit 910 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 909 to obtain instructions from the external memory, and is also used for the storage unit access controller 905 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 906 or to transfer the weight data to the weight memory 902 or to transfer the input data to the input memory 901.
  • the vector calculation unit 907 includes multiple arithmetic processing units, and if necessary, further processes the output of the arithmetic circuit 903, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
  • the vector calculation unit 907 can store the processed output vector in the unified buffer 906.
  • the vector calculation unit 907 may apply a nonlinear function to the output of the arithmetic circuit 903, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 907 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 903, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 909 connected to the controller 904 is used to store instructions used by the controller 904;
  • the unified memory 906, the input memory 901, the weight memory 902, and the fetch memory 909 are all On-Chip memories.
  • the external memory is private to the neural network processor hardware architecture.
  • each layer in each neural network model in the embodiment of the present application may be executed by the vector calculation unit 907.
• each functional module can be divided corresponding to each function, or two or more functions can be integrated into one functional module.
  • the above-mentioned integrated functional modules can be implemented either in the form of hardware or in the form of software functional units.
  • FIG. 10 shows a schematic structural diagram of a semantic segmentation device.
  • an embodiment of the semantic segmentation apparatus 1000 of the present application may include:
  • a video frame acquisition module 1001 configured to acquire a first video frame and a second video frame in a first video frame sequence, where the first video frame is different from the second video frame;
• the feature map acquisition module 1002 is configured to input the first video frame and the second video frame into an image segmentation model, where the image segmentation model is used to perform semantic segmentation on the input image, and the image segmentation model is a convolutional neural network model.
  • the convolutional neural network model includes an input layer, an output layer, and a multi-layer network layer located between the input layer and the output layer, each of the multi-layer network layers It is used for feature extraction of input data.
  • the intermediate network layer is the network layer with the smallest resolution of the output feature map in the multi-layer network layer; the feature map acquisition module 1002 is also used to acquire the first image The first feature map of the first video frame output by the segmentation network layer, where the first image segmentation network layer is the intermediate network layer of the image segmentation model or the input layer and the input layer of the image segmentation model. Any network layer between the intermediate network layers; the feature map acquisition module 1002 is further configured to acquire the second feature map of the second video frame output by the second image segmentation network layer, the second The image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model;
• the fusion module 1003 is configured to input the first feature map and the second feature map into a first inter-frame fusion model to generate a semantic segmentation image of the second video frame, where the first inter-frame fusion model is a neural network model.
  • the fusion module 1003 is configured to: input the first feature map into a first neighboring frame prediction model, and the first neighboring frame prediction model is used to predict information about neighboring video frames,
  • the adjacent video frame and the video frame to which the feature map input to the first adjacent frame prediction model belongs belong to the same video frame sequence, and the first adjacent frame prediction model is a kind of the convolutional neural network model;
  • the first neighboring frame prediction model is obtained by training based on a first sample set with annotation information, and the first sample is any sample in the first sample set,
  • the first sample is the feature map of the third video frame output by the first image segmentation network layer, and the label information of the first sample is the feature map of the fourth video frame output by the first image segmentation network layer.
  • the third video frame and the fourth video frame are different video frames in the same video frame sequence.
  • the first video frame is in the first time sequence direction of the second video frame
  • the third video frame is in the first time sequence direction of the fourth video frame.
  • the feature map acquisition module 1002 is further configured to: after the second video frame is input into the image segmentation model, acquire a third feature map of the second video frame output by the first image segmentation network layer; the fusion module is further configured to: input the third feature map into a second adjacent frame prediction model, where the second adjacent frame prediction model is used to predict information of an adjacent video frame, the adjacent video frame and the video frame to which the feature map input into the second adjacent frame prediction model belongs belong to the same video frame sequence, and the second adjacent frame prediction model is a convolutional neural network model; acquire a second compressed feature map of the third feature map output by a second adjacent frame prediction network layer, where the second adjacent frame prediction network layer is the intermediate network layer of the second adjacent frame prediction model or any network layer located between the input layer and the intermediate network layer of the second adjacent frame prediction model; and input the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • the second adjacent frame prediction model is obtained by training based on a second sample set with annotation information, where a second sample is any sample in the second sample set, the second sample is a feature map of a fifth video frame output by the first image segmentation network layer, the annotation information of the second sample is a feature map of a sixth video frame output by the first image segmentation network layer, and the fifth video frame and the sixth video frame are different video frames in the same video frame sequence.
  • the first video frame is in the first time sequence direction of the second video frame, and the sixth video frame is in the first time sequence direction of the fifth video frame.
  • the second inter-frame fusion model is obtained by training based on a third sample set with annotation information, where a third sample is any sample in the third sample set; the third sample includes a compressed feature map of a fourth feature map output by the first adjacent frame prediction network layer, a compressed feature map of a fifth feature map output by the second adjacent frame prediction network layer, and a sixth feature map of an eighth video frame output by the second image segmentation network layer; the fourth feature map is a feature map of a seventh video frame output by the first image segmentation network layer, the fifth feature map is a feature map of the eighth video frame output by the first image segmentation network layer, the seventh video frame and the eighth video frame are different video frames in the same video frame sequence, and the annotation information of the third sample is an annotated semantic segmentation image of the eighth video frame.
  • after inputting the first video frame into the image segmentation model, the feature map acquisition module 1002 is further configured to: acquire a fourth feature map of the first video frame output by the first image segmentation network layer; and the inputting the first feature map and the second feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame includes: inputting the first feature map, the second feature map, and the fourth feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  • FIG. 11 shows a schematic structural diagram of a model training device.
  • an embodiment of the model training apparatus 1100 of the present application may include:
  • the sample acquisition module 1101 is configured to acquire the first frame and the second frame in the same video frame sequence, and the semantic segmentation image of the second frame;
  • the feature map acquisition module 1102 is configured to separately input the first frame and the second frame into an image segmentation model, where the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model; the convolutional neural network model includes an input layer, an output layer, and multiple network layers located between the input layer and the output layer, and each of the multiple network layers is used to perform feature extraction on input data.
  • an intermediate network layer is the network layer, among the multiple network layers, whose output feature map has the smallest resolution; the feature map acquisition module 1102 is further configured to acquire a first feature map of the first frame output by a first image segmentation network layer, where the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model; the feature map acquisition module 1102 is further configured to acquire a second feature map of the second frame output by a second image segmentation network layer, where the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model;
  • the training module 1103 is configured to use the semantic segmentation image of the second frame as annotation information, input the first feature map and the second feature map into a first inter-frame fusion model, and update the parameters of the first inter-frame fusion model, where the first inter-frame fusion model is a neural network model (a training code sketch is given after the description of apparatus 1100 below).
  • after the feature map acquisition module 1102 inputs the second frame into the image segmentation model, it is further configured to: acquire a third feature map of the second frame output by the first image segmentation network layer; the training module 1103 is configured to: input the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model, and update the parameters of the first inter-frame fusion model.
  • the training module 1103 is configured to: use the third feature map as annotation information, input the first feature map into a first adjacent frame prediction model, and update the parameters of the first adjacent frame prediction model.
  • the first adjacent frame prediction model is a convolutional neural network model; based on the first adjacent frame prediction model satisfying a first constraint condition, the training module 1103 is further configured to: input the first feature map into the first adjacent frame prediction model; acquire a first compressed feature map of the first feature map output by a first adjacent frame prediction network layer, where the first adjacent frame prediction network layer is the intermediate network layer of the first adjacent frame prediction model or any network layer located between the input layer and the intermediate network layer of the first adjacent frame prediction model; and use the semantic segmentation image of the second frame as annotation information, input the first compressed feature map and the second feature map into a second inter-frame fusion model, and update the parameters of the second inter-frame fusion model.
  • the training module 1103 is further configured to: use the first feature map as annotation information, input the third feature map into a second adjacent frame prediction model, and update the parameters of the second adjacent frame prediction model.
  • the second adjacent frame prediction model is a convolutional neural network model; based on the second adjacent frame prediction model satisfying a second constraint condition, the training module 1103 is further configured to: input the third feature map into the second adjacent frame prediction model; acquire a second compressed feature map of the third feature map output by a second adjacent frame prediction network layer, where the second adjacent frame prediction network layer is the intermediate network layer of the second adjacent frame prediction model or any network layer located between the input layer and the intermediate network layer of the second adjacent frame prediction model; and use the semantic segmentation image of the second frame as annotation information, input the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model, and update the parameters of the second inter-frame fusion model.
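The following is a minimal training sketch for the flow described for apparatus 1100. It reuses the toy ToySegNet and InterFrameFusion modules from the sketch given after apparatus 1000; the NeighborFramePredictor module, the mean-squared-error and cross-entropy losses and the optimizer handling are illustrative assumptions rather than the training procedure claimed in this application.

```python
# A minimal training sketch: the pretrained segmentation model is frozen and only
# the adjacent-frame predictor and the inter-frame fusion model are updated.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborFramePredictor(nn.Module):
    """Toy adjacent frame prediction model: maps one frame's feature map to the
    predicted feature map of a neighboring frame; its first half plays the role of
    the adjacent frame prediction network layer that yields a compressed feature map."""
    def __init__(self, c=32):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(c, 8, 3, stride=2, padding=1), nn.ReLU())  # compression
        self.up = nn.ConvTranspose2d(8, c, 4, stride=2, padding=1)

    def forward(self, feat):
        compressed = self.down(feat)
        return self.up(compressed), compressed

def train_step(seg_net, fusion, predictor, frame1, frame2, gt_mask2, opt_fusion, opt_pred):
    # the image segmentation model only provides feature maps and is not updated
    with torch.no_grad():
        feats1 = seg_net(frame1)
        feats2 = seg_net(frame2)
    first_feat = feats1["enc2"]   # first feature map (first frame, downsampling path)
    second_feat = feats2["dec1"]  # second feature map (second frame, upsampling path)
    third_feat = feats2["enc2"]   # third feature map (second frame, same layer as the first)

    # 1) adjacent frame predictor: the third feature map is used as annotation information
    pred_third, _ = predictor(first_feat)
    loss_pred = F.mse_loss(pred_third, third_feat)
    opt_pred.zero_grad(); loss_pred.backward(); opt_pred.step()

    # 2) inter-frame fusion: the annotated segmentation of the second frame supervises the fusion model
    logits = fusion(first_feat, second_feat, out_size=gt_mask2.shape[-2:])
    loss_fusion = F.cross_entropy(logits, gt_mask2)
    opt_fusion.zero_grad(); loss_fusion.backward(); opt_fusion.step()
    return loss_pred.item(), loss_fusion.item()

# usage with the toy modules defined earlier:
# predictor = NeighborFramePredictor()
# opt_fusion = torch.optim.Adam(fusion.parameters(), lr=1e-3)
# opt_pred = torch.optim.Adam(predictor.parameters(), lr=1e-3)
# gt_mask2 = torch.randint(0, 2, (1, 128, 128))  # annotated segmentation of the second frame
# train_step(seg_net, fusion, predictor, frame1, frame2, gt_mask2, opt_fusion, opt_pred)
```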
  • FIG. 12 shows a schematic structural diagram of a video call device.
  • an embodiment of the video call device 1200 of the present application may include:
  • the image collection module 1201 is used to collect the first local video frame
  • the communication module 1202 is configured to receive the opposite end video frame collected by the second terminal device through its image collection module;
  • the background switching module 1203 is configured to generate a second local video frame according to the first local video frame, where a first image area of the first local video frame displays a first image, a second image area of the second local video frame displays a second image, the first image and the second image are different, and a third image area of the first local video frame and a fourth image area of the second local video frame both display a third image;
  • the display module 1204 is configured to simultaneously display the opposite end video frame and the second local video frame through the display screen of the first terminal device.
  • the communication module 1202 is further configured to send the second local video frame to the second terminal device.
  • the background switching module 1203 is configured to: generate the second local video frame according to the first local video frame in response to a switching instruction of a user, where the switching instruction is used to instruct the first terminal device to switch the first image in the first local video frame to the second image.
  • the semantic types corresponding to the first image and the third image are different.
  • the background switching module 1203 is configured to: according to the method of any embodiment of the semantic segmentation method provided in the embodiments of this application, generate a semantic segmentation image of the second local video frame according to the first local video frame and a third local video frame, where the third local video frame and the first local video frame are different video frames in the same video frame sequence collected by the first terminal device; and generate the second local video frame according to the semantic segmentation image and the first local video frame (a sketch of this background-switching step is given below).
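A minimal sketch of the background-switching step performed by the background switching module 1203 is given below, assuming that a binary person/background mask of the first local video frame has already been obtained with the semantic segmentation method above; the use of NumPy and the function name switch_background are illustrative assumptions.

```python
# Replace the background (first image area) of a local video frame while keeping
# the person region (third image area) unchanged.
import numpy as np

def switch_background(local_frame: np.ndarray, person_mask: np.ndarray,
                      new_background: np.ndarray) -> np.ndarray:
    """local_frame, new_background: HxWx3 uint8 images; person_mask: HxW, 1 = person.

    Person pixels are kept; background pixels are replaced by the chosen second image.
    """
    mask3 = person_mask.astype(bool)[..., None]  # HxWx1, broadcast over the RGB channels
    return np.where(mask3, local_frame, new_background).astype(np.uint8)

# usage: second_local_frame = switch_background(first_local_frame, mask, chosen_background)
```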
  • the above modules may be application-specific integrated circuits (ASICs), processors and memories that execute one or more software or firmware programs, integrated logic circuits, and/or other devices that can provide the aforementioned functions. FIG. 13 is a schematic diagram of the hardware structure of the computer device 1300.
  • the semantic segmentation device 1000, the model training device 1100, and the video call device 1200 can adopt the form shown in FIG. 13.
  • the computer device 1300 includes at least one processor 1301 and a memory 1302.
  • the aforementioned processor 1301 may be a central processing unit (CPU), a network processor (NP) or a combination of a CPU and an NP, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in this application.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the methods disclosed in this application may be directly performed and completed by a hardware decoding processor, or performed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the apparatus may include multiple processors or the processors may include multiple processing units.
  • the processor may be a single-core processor, or a multi-core or many-core processor.
  • the processor may be an ARM architecture processor.
  • the memory 1302 is used to store computer instructions executed by the processor.
  • the memory 1302 may be a storage circuit or a memory.
  • the memory 1302 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • the memory 1302 may be independent of the processor 1301.
  • the processor 1301 and the memory 1302 may be connected to each other through a bus 1303.
  • the bus 1303 may be a peripheral component interconnect standard (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the memory 1302 may also be a storage unit in the processor 1301, and is directly attached to the processor 1301, which is not limited herein. Although only one memory 1302 is shown in the figure, the device may also include multiple memories 1302 or the memory 1302 may include multiple storage units.
  • the above-mentioned memory 1302 is used to store computer-executable instructions for executing the solution of the present application, and the above-mentioned processor 1301 controls the execution.
  • the processor 1301 is configured to execute computer-executable instructions stored in the memory 1302, so as to implement the semantic segmentation method and the model training method provided in the foregoing method embodiments of the present application.
  • the computer-executable instructions in the embodiments of the present application may also be referred to as application program codes, which are not specifically limited in the embodiments of the present application.
  • the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof; when software is used for implementation, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a server or a data center that integrates one or more available media.
  • the available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • words such as "exemplary" or "for example" are used to present an example, illustration, or explanation. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of the present application should not be construed as being preferred over or more advantageous than other embodiments or design solutions. To be precise, words such as "exemplary" or "for example" are used to present related concepts in a specific manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

本申请实施例公开了一种语义分割方法、模型训练方法及装置,应用于人工智能领域,用于提高对视频帧的分割结果的稳定性。语义分割方法包括:获取第一视频帧序列中的第一视频帧和第二视频帧,分别将第一视频帧和第二视频帧输入图像分割模型,图像分割模型用于对输入的图像进行语义分割,图像分割模型的中间网络层为多层网络层中输出的特征图的分辨率最小的一层网络层;获取第一图像分割网络层输出的第一视频帧的第一特征图;获取第二图像分割网络层输出的第二视频帧的第二特征图;将第一特征图和第二特征图输入第一帧间融合模型,生成第二视频帧的语义分割图像。

Description

一种语义分割方法、模型训练方法及装置
本申请要求于2019年11月26日提交中国专利局、申请号为201911177265.8、发明名称为“一种语义分割方法、模型训练方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种语义分割方法、模型训练方法及装置。
背景技术
当人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
语义分割是计算机视觉中的基本任务,在语义分割中我们需要将视觉输入分为不同的语义可解释类别,语义的可解释性即分类类别在真实世界中是有意义的。与图像分类或目标检测相比,图像语义分割使我们对图像有更加细致的了解。这种了解在诸如自动驾驶、机器人以及图像搜索引擎等许多领域都是非常重要的。
随着深度学习的发展,对图像的语义分割任务取得了很大的突破,然而对视频的语义分割仍然是一个十分具有挑战性的任务。视频记录的场景中物体的位置和/或姿态变化,会导致对视频的分割结果不稳定性,比如一个物体在前序的视频帧中属于类别A,而在后序的视频帧中属于类别B。但是,针对如何提高对视频的分割结果的稳定性,目前并没有太多解决方案。
发明内容
本申请实施例提供了一种语义分割方法、模型训练方法及装置,用于提高对视频帧的分割结果的稳定性。
本申请第一方面提供一种对视频帧的语义分割方法,包括:获取第一视频帧序列中的第一视频帧和第二视频帧,所述第一视频帧与所述第二视频帧不同;分别将所述第一视频帧和所述第二视频帧输入图像分割模型,所述图像分割模型用于对输入的图像进行语义分割,所述图像分割模型是一种卷积神经网络模型,所述卷积神经网络模型包括输入层、输出层以及位于所述输入层和所述输出层之间的多层网络层,所述多层网络层中的每一层用于对输入的数据进行特征提取,中间网络层为所述多层网络层中输出的特征图的分辨率最小的一层网络层;获取第一图像分割网络层输出的所述第一视频帧的第一特征图,所述第一图像分割网络层为所述图像分割模型的所述中间网络层或位于所述图像分割模型的所述输入层和所述中间 网络层之间的任意一层网络层;获取第二图像分割网络层输出的所述第二视频帧的第二特征图,所述第二图像分割网络层为位于所述图像分割模型的所述中间网络层和所述输出层之间的任意一层网络层;将所述第一特征图和所述第二特征图输入第一帧间融合模型,生成所述第二视频帧的语义分割图像,所述第一帧间融合模型是一种神经网络模型。
第一视频帧的第一特征图能够为第二视频帧提供其时序信息,和第一图像分割网络层输出的特征图相比,第二图像分割网络层输出的第二视频帧的第二特征图能够更大程度的提供第二视频帧的空间信息,获取第一特征图和第二特征图后,根据第一特征图和第二特征图生成第二视频帧的语义分割图像,有利于在保持单个视频帧的分割精度的前提下,利用时序信息提高对第二视频帧进行语义分割的稳定性。
在一种可能的实现方式中,所述将所述第一特征图和所述第二特征图输入融合网络模型,生成所述第二视频帧的语义分割图像,包括:将所述第一特征图输入第一邻帧预测模型,所述第一邻帧预测模型用于预测相邻视频帧的信息,所述相邻视频帧与输入所述第一邻帧预测模型的特征图所属的视频帧属于同一视频帧序列,所述第一邻帧预测模型是一种所述卷积神经网络模型;获取第一邻帧预测网络层输出的所述第一特征图的第一压缩特征图,所述第一邻帧预测网络层为所述第一邻帧预测模型的所述中间网络层或位于所述第一邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;将所述第一压缩特征图和所述第二特征图输入第二帧间融合模型,生成所述第二视频帧的语义分割图像。
在一种可能的实现方式中,所述第一邻帧预测模型为基于带有标注信息的第一样本集训练得到的,第一样本为所述第一样本集中的任意一个样本,所述第一样本为所述第一图像分割网络层输出的第三视频帧的特征图,所述第一样本的标注信息为所述第一图像分割网络层输出的第四视频帧的特征图,所述第三视频帧和所述第四视频帧为同一视频帧序列中的不同视频帧。
在一种可能的实现方式中,所述第一视频帧在所述第二视频帧的第一时序方向,所述第三视频帧在所述第四视频帧的所述第一时序方向。
在一种可能的实现方式中,将所述第二视频帧输入所述图像分割模型之后,所述方法还包括:获取所述第一图像分割网络层输出的所述第二视频帧的第三特征图;所述将所述第一压缩特征图和所述第二特征图输入第二帧间融合模型,生成所述第二视频帧的语义分割图像,包括:将所述第三特征图输入第二邻帧预测模型,所述第二邻帧预测模型用于预测相邻视频帧的信息,所述相邻视频帧与输入所述第二邻帧预测模型的特征图所属的视频帧属于同一视频帧序列,所述第二邻帧预测模型是一种所述卷积神经网络模型;获取第二邻帧预测网络层输出的所述第三特征图的第二压缩特征图,所述第二邻帧预测网络层为所述第二邻帧预测模型的所述中间网络层或位于所述第二邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;将所述第一压缩特征图、所述第二压缩特征图和所述第二特征图输入所述第二帧间融合模型,生成所述第二视频帧的语义分割图像。
在一种可能的实现方式中,所述第二邻帧预测模型为基于带有标注信息的第二样本集训练得到的,第二样本为所述第二样本集中的任意一个样本,所述第二样本为所述第一图像分割网络层输出的第五视频帧的特征图,所述第二样本的标注信息为所述第一图像分割网络层输出的第六视频帧的特征图,所述第五视频帧和所述第六视频帧为同一视频帧序列中的不同 视频帧。
在一种可能的实现方式中,所述第一视频帧在所述第二视频帧的第一时序方向,所述第六视频帧在所述第五视频帧的所述第一时序方向。
在一种可能的实现方式中,所述第二帧间融合模型为基于带有标注信息的第三样本集训练得到的,第三样本为所述第三样本集中的任意一个样本,所述第三样本包括所述第一邻帧预测网络层输出的第四特征图的压缩特征图、所述第二邻帧预测网络层输出的第五特征图的压缩特征图和所述第二图像分割网络层输出的第八视频帧的第六特征图,所述第四特征图为所述第一图像分割网络层输出的第七视频帧的特征图,所述第五特征图为所述第一图像分割网络层输出的所述第八视频帧的特征图,所述第七视频帧和所述第八视频帧为同一视频帧序列中的不同视频帧,所述第三样本的标注信息为所述第八视频帧的标注语义分割图像。
在一种可能的实现方式中,将所述第一视频帧输入所述图像分割模型之后,所述方法还包括:获取所述第一图像分割网络层输出的所述第一视频帧的第四特征图;所述将所述第一特征图和所述第二特征图输入第一帧间融合模型,生成所述第二视频帧的语义分割图像,包括:将所述第一特征图、所述第二特征图和所述第四特征图输入所述第一帧间融合模型,生成所述第二视频帧的语义分割图像。
本申请第二方面提供一种模型训练方法,包括:获取同一视频帧序列中的第一帧和第二帧、以及所述第二帧的语义分割图像;分别将所述第一帧和所述第二帧输入图像分割模型,所述图像分割模型用于对输入的图像进行语义分割,所述图像分割模型是一种卷积神经网络模型,所述卷积神经网络模型包括输入层、输出层以及位于所述输入层和所述输出层之间的多层网络层,所述多层网络层中的每一层用于对输入的数据进行特征提取,中间网络层为所述多层网络层中输出的特征图的分辨率最小的一层网络层;获取第一图像分割网络层输出的所述第一帧的第一特征图,所述第一图像分割网络层为所述图像分割模型的所述中间网络层或位于所述图像分割模型的所述输入层和所述中间网络层之间的任意一层网络层;获取第二图像分割网络层输出的所述第二帧的第二特征图,所述第二图像分割网络层为位于所述图像分割模型的所述中间网络层和所述输出层之间的任意一层网络层;以所述第二帧的语义分割图像作为标注信息,将所述第一特征图和所述第二特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,所述第一帧间融合模型是一种神经网络模型。
在一种可能的实现方式中,在将所述第二帧输入图像分割模型之后,所述方法还包括:获取所述第一图像分割网络层输出的所述第二帧的第三特征图;所述将所述第一特征图和所述第二特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,包括:将所述第一特征图、所述第二特征图和所述第三特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数。
在一种可能的实现方式中,所述将所述第一特征图、所述第二特征图和所述第三特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,包括:以所述第三特征图为标注信息,将所述第一特征图输入第一邻帧预测模型,更新所述第一邻帧预测模型的参数。
在一种可能的实现方式中,所述第一邻帧预测模型是一种所述卷积神经网络模型;基于所述第一邻帧预测模型满足第一约束条件,所述将所述第一特征图、所述第二特征图和所述第三特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,还包括:将所述第 一特征图输入所述第一邻帧预测模型;获取第一邻帧预测网络层输出的所述第一特征图的第一压缩特征图,所述第一邻帧预测网络层为所述第一邻帧预测模型的所述中间网络层或位于所述第一邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;以所述第二帧的语义分割图像作为标注信息,将所述第一压缩特征图和所述第二特征图输入第二帧间融合模型,更新所述第二帧间融合模型的参数。
在一种可能的实现方式中,所述将所述第一特征图、所述第二特征图和所述第三特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,还包括:以所述第一特征图为标注信息,将所述第三特征图输入第二邻帧预测模型,更新所述第二邻帧预测模型的参数。
在一种可能的实现方式中,所述第二邻帧预测模型是一种所述卷积神经网络模型;基于所述第二邻帧预测模型满足第二约束条件,所述将所述第一特征图、所述第二特征图和所述第三特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,还包括:将所述第三特征图输入所述第二邻帧预测模型;获取第二邻帧预测网络层输出的所述第三特征图的第二压缩特征图,所述第二邻帧预测网络层为所述第二邻帧预测模型的所述中间网络层或位于所述第二邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;以所述第二帧的语义分割图像作为标注信息,将所述第一压缩特征图、所述第二压缩特征图和所述第二特征图输入所述第二帧间融合模型,更新所述第二帧间融合模型的参数。
本申请第三方面提供一种语义分割装置,包括:视频帧获取模块,用于获取第一视频帧序列中的第一视频帧和第二视频帧,所述第一视频帧与所述第二视频帧不同;特征图获取模块,用于分别将所述第一视频帧和所述第二视频帧输入图像分割模型,所述图像分割模型用于对输入的图像进行语义分割,所述图像分割模型是一种卷积神经网络模型,所述卷积神经网络模型包括输入层、输出层以及位于所述输入层和所述输出层之间的多层网络层,所述多层网络层中的每一层用于对输入的数据进行特征提取,中间网络层为所述多层网络层中输出的特征图的分辨率最小的一层网络层;所述特征图获取模块,还用于获取第一图像分割网络层输出的所述第一视频帧的第一特征图,所述第一图像分割网络层为所述图像分割模型的所述中间网络层或位于所述图像分割模型的所述输入层和所述中间网络层之间的任意一层网络层;所述特征图获取模块,还用于获取第二图像分割网络层输出的所述第二视频帧的第二特征图,所述第二图像分割网络层为位于所述图像分割模型的所述中间网络层和所述输出层之间的任意一层网络层;融合模块,用于将所述第一特征图和所述第二特征图输入第一帧间融合模型,生成所述第二视频帧的语义分割图像,所述第一帧间融合模型是一种神经网络模型。
在一种可能的实现方式中,所述融合模块用于:将所述第一特征图输入第一邻帧预测模型,所述第一邻帧预测模型用于预测相邻视频帧的信息,所述相邻视频帧与输入所述第一邻帧预测模型的特征图所属的视频帧属于同一视频帧序列,所述第一邻帧预测模型是一种所述卷积神经网络模型;获取第一邻帧预测网络层输出的所述第一特征图的第一压缩特征图,所述第一邻帧预测网络层为所述第一邻帧预测模型的所述中间网络层或位于所述第一邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;将所述第一压缩特征图和所述第二特征图输入第二帧间融合模型,生成所述第二视频帧的语义分割图像。
在一种可能的实现方式中,所述第一邻帧预测模型为基于带有标注信息的第一样本集训练得到的,第一样本为所述第一样本集中的任意一个样本,所述第一样本为所述第一图像分 割网络层输出的第三视频帧的特征图,所述第一样本的标注信息为所述第一图像分割网络层输出的第四视频帧的特征图,所述第三视频帧和所述第四视频帧为同一视频帧序列中的不同视频帧。
在一种可能的实现方式中,所述第一视频帧在所述第二视频帧的第一时序方向,所述第三视频帧在所述第四视频帧的所述第一时序方向。
在一种可能的实现方式中,所述特征图获取模块还用于:在将所述第二视频帧输入图像分割模型之后,获取所述第一图像分割网络层输出的所述第二视频帧的第三特征图;所述融合模块还用于:将所述第三特征图输入第二邻帧预测模型,所述第二邻帧预测模型用于预测相邻视频帧的信息,所述相邻视频帧与输入所述第二邻帧预测模型的特征图所属的视频帧属于同一视频帧序列,所述第二邻帧预测模型是一种所述卷积神经网络模型;获取第二邻帧预测网络层输出的所述第三特征图的第二压缩特征图,所述第二邻帧预测网络层为所述第二邻帧预测模型的所述中间网络层或位于所述第二邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;将所述第一压缩特征图、所述第二压缩特征图和所述第二特征图输入所述第二帧间融合模型,生成所述第二视频帧的语义分割图像。
在一种可能的实现方式中,所述第二邻帧预测模型为基于带有标注信息的第二样本集训练得到的,第二样本为所述第二样本集中的任意一个样本,所述第二样本为所述第一图像分割网络层输出的第五视频帧的特征图,所述第二样本的标注信息为所述第一图像分割网络层输出的第六视频帧的特征图,所述第五视频帧和所述第六视频帧为同一视频帧序列中的不同视频帧。
在一种可能的实现方式中,所述第一视频帧在所述第二视频帧的第一时序方向,所述第六视频帧在所述第五视频帧的所述第一时序方向。
在一种可能的实现方式中,所述第二帧间融合模型为基于带有标注信息的第三样本集训练得到的,第三样本为所述第三样本集中的任意一个样本,所述第三样本包括所述第一邻帧预测网络层输出的第四特征图的压缩特征图、所述第二邻帧预测网络层输出的第五特征图的压缩特征图和所述第二图像分割网络层输出的第八视频帧的第六特征图,所述第四特征图为所述第一图像分割网络层输出的第七视频帧的特征图,所述第五特征图为所述第一图像分割网络层输出的所述第八视频帧的特征图,所述第七视频帧和所述第八视频帧为同一视频帧序列中的不同视频帧,所述第三样本的标注信息为所述第八视频帧的标注语义分割图像。
在一种可能的实现方式中,所述特征图获取模块在将所述第一视频帧输入所述图像分割模型之后,还用于:获取所述第一图像分割网络层输出的所述第一视频帧的第四特征图;所述将所述第一特征图和所述第二特征图输入第一帧间融合模型,生成所述第二视频帧的语义分割图像,包括:将所述第一特征图、所述第二特征图和所述第四特征图输入所述第一帧间融合模型,生成所述第二视频帧的语义分割图像。
本申请第四方面提供一种模型训练装置,包括:样本获取模块,用于获取同一视频帧序列中的第一帧和第二帧、以及所述第二帧的语义分割图像;特征图获取模块,用于分别将所述第一帧和所述第二帧输入图像分割模型,所述图像分割模型用于对输入的图像进行语义分割,所述图像分割模型是一种卷积神经网络模型,所述卷积神经网络模型包括输入层、输出层以及位于所述输入层和所述输出层之间的多层网络层,所述多层网络层中的每一层用于对 输入的数据进行特征提取,中间网络层为所述多层网络层中输出的特征图的分辨率最小的一层网络层;所述特征图获取模块,还用于获取第一图像分割网络层输出的所述第一帧的第一特征图,所述第一图像分割网络层为所述图像分割模型的所述中间网络层或位于所述图像分割模型的所述输入层和所述中间网络层之间的任意一层网络层;所述特征图获取模块,还用于获取第二图像分割网络层输出的所述第二帧的第二特征图,所述第二图像分割网络层为位于所述图像分割模型的所述中间网络层和所述输出层之间的任意一层网络层;训练模块,用于以所述第二帧的语义分割图像作为标注信息,将所述第一特征图和所述第二特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,所述第一帧间融合模型是一种神经网络模型。
在一种可能的实现方式中,所述特征图获取模块在将所述第二帧输入图像分割模型之后,还用于:获取所述第一图像分割网络层输出的所述第二帧的第三特征图;所述训练模块用于:将所述第一特征图、所述第二特征图和所述第三特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数。
在一种可能的实现方式中,所述训练模块用于:以所述第三特征图为标注信息,将所述第一特征图输入第一邻帧预测模型,更新所述第一邻帧预测模型的参数。
在一种可能的实现方式中,所述第一邻帧预测模型是一种所述卷积神经网络模型;所述训练模块基于所述第一邻帧预测模型满足第一约束条件,还用于:将所述第一特征图输入所述第一邻帧预测模型;获取第一邻帧预测网络层输出的所述第一特征图的第一压缩特征图,所述第一邻帧预测网络层为所述第一邻帧预测模型的所述中间网络层或位于所述第一邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;以所述第二帧的语义分割图像作为标注信息,将所述第一压缩特征图和所述第二特征图输入第二帧间融合模型,更新所述第二帧间融合模型的参数。
在一种可能的实现方式中,所述所述训练模块还用于:以所述第一特征图为标注信息,将所述第三特征图输入第二邻帧预测模型,更新所述第二邻帧预测模型的参数。
在一种可能的实现方式中,所述第二邻帧预测模型是一种所述卷积神经网络模型;所述训练模块基于所述第二邻帧预测模型满足第二约束条件,还用于:将所述第三特征图输入所述第二邻帧预测模型;获取第二邻帧预测网络层输出的所述第三特征图的第二压缩特征图,所述第二邻帧预测网络层为所述第二邻帧预测模型的所述中间网络层或位于所述第二邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;以所述第二帧的语义分割图像作为标注信息,将所述第一压缩特征图、所述第二压缩特征图和所述第二特征图输入所述第二帧间融合模型,更新所述第二帧间融合模型的参数。
本申请第五方面提供一种视频通话方法,包括:第一终端设备通过其图像采集模块采集第一本端视频帧;所述第一终端设备接收第二终端设备通过其图像采集模块采集的对端视频帧;所述第一终端设备根据所述第一本端视频帧生成第二本端视频帧,所述第一本端视频帧的第一图像区域显示第一图像,所述第二本端视频帧的第二图像区域显示第二图像,所述第一图像和所述第二图像不同,所述第一本端视频帧的第三图像区域和所述第二本端视频帧的第四图像区域均显示第三图像;所述第一终端设备通过其显示屏同时显示所述对端视频帧和所述第二本端视频帧。
在一种可能的实现方式中,所述第一终端设备根据所述第一本端视频帧生成第二本端视频帧之后,所述方法还包括:所述第一终端设备将所述第二本端视频帧发送给所述第二终端设备。
在一种可能的实现方式中,所述第一终端设备根据所述第一本端视频帧生成第二本端视频帧,包括:所述第一终端设备根据用户的切换指令,根据所述第一本端视频帧生成第二本端视频帧,所述切换指令用于指示所述第一终端设备将所述第一本端视频帧中的所述第一图像切换为所述第二图像。
在一种可能的实现方式中,所述第一图像与所述第三图像对应的语义类型不同。
在一种可能的实现方式中,所述第一终端设备根据所述第一本端视频帧生成第二本端视频帧,包括:所述第一终端设备根据第一方面或第一方面的任一可能的实现方式所述的方法,根据所述第一本端视频帧和第三本端视频帧生成所述第二本端视频帧的语义分割图像,所述第三本端视频帧与所述第一本端视频帧为所述第一终端设备采集的同一视频帧序列中的不同视频帧;所述第一终端设备根据所述语义分割图像和所述第一本端视频帧生成第二本端视频帧。
本申请实施例第六方面提供一种视频通话装置,包括:图像采集模块,用于采集第一本端视频帧;通信模块,用于接收第二终端设备通过其图像采集模块采集的对端视频帧;背景切换模块,用于根据所述第一本端视频帧生成第二本端视频帧,所述第一本端视频帧的第一图像区域显示第一图像,所述第二本端视频帧的第二图像区域显示第二图像,所述第一图像和所述第二图像不同,所述第一本端视频帧的第三图像区域和所述第二本端视频帧的第四图像区域均显示第三图像;显示模块用于所述第一终端设备通过其显示屏同时显示所述对端视频帧和所述第二本端视频帧。
在一种可能的实现方式中,所述背景切换模块根据所述第一本端视频帧生成第二本端视频帧之后,所述通信模块还用于将所述第二本端视频帧发送给所述第二终端设备。
在一种可能的实现方式中,所述背景切换模块用于:所述第一终端设备根据用户的切换指令,根据所述第一本端视频帧生成第二本端视频帧,所述切换指令用于指示所述第一终端设备将所述第一本端视频帧中的所述第一图像切换为所述第二图像。
在一种可能的实现方式中,所述第一图像与所述第三图像对应的语义类型不同。
在一种可能的实现方式中,所述背景切换模块用于,根据第一方面或第一方面的任一可能的实现方式所述的方法,根据所述第一本端视频帧和第三本端视频帧生成所述第二本端视频帧的语义分割图像,所述第三本端视频帧与所述第一本端视频帧为所述第一终端设备采集的同一视频帧序列中的不同视频帧;根据所述语义分割图像和所述第一本端视频帧生成第二本端视频帧。
本申请实施例第七方面提供一种计算机设备,包括处理器和存储器,所述处理器在运行所述存储器存储的计算机指令时,执行如第一方面或第二方面或第五方面任一可能的实现方式所述的方法。
本申请实施例第八方面提供一种计算机可读存储介质,包括指令,当所述指令在计算机上运行时,使得计算机执行如第一方面或第二方面或第五方面任一可能的实现方式所述的方法。
本申请实施例第九方面提供一种计算机程序产品,包括指令,当所述指令在计算机上运行时,使得计算机执行如第一方面或第二方面或第五方面任一可能的实现方式所述的方法。
本申请实施例第十方面提供一种计算机系统,包括终端设备和服务器,所述终端设备用于将采集的视频发送给所述服务器,所述服务器用于执行本申请实施例第一方面提供的语义分割方法,并将生成的语义分割结果返回给所述终端设备。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种人工智能主体框架示意图;
图2A为本申请实施例提供的一种应用环境示意图;
图2B为本申请实施例提供的一种端云系统的结构示意图;
图3A为本申请实施例提供的一种语义分割方法一个实施例示意图;
图3B为图3A对应的实施例中步骤305的一种细化步骤示意图;
图3C为图3A对应的实施例中步骤305的另一种细化步骤示意图;
图4A为本申请实施例提供的一种卷积神经网络结构示意图;
图4B为本申请实施例提供的另一种卷积神经网络结构示意图;
图5A为本申请实施例提供的模型训练方法一个实施例示意图;
图5B为图5A提供的方法实施例中步骤505的一个细化步骤示意图;
图5C为图5A提供的方法实施例中步骤505的另一个细化步骤示意图;
图6为本申请实施例提供的视频通话方法一个实施例示意图;
图7A-图7D为本申请实施例提供的视频通话方法的一种应用场景示意图;
图8A-图8C为本申请实施例提供的语义分割方法的一种应用场景示意图;
图9为本申请实施例提供的一种神经网络处理器的结构示意图;
图10为本申请实施例提供的语义分割装置一个实施例示意图;
图11为本申请实施例提供的模型训练装置一个实施例示意图;
图12为本申请实施例提供的视频通话装置一个实施例示意图;
图13为本申请计算机设备一个实施例示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
图1示出一种人工智能主体框架示意图,该主体框架描述了人工智能系统总体工作流程,适用于通用的人工智能领域需求。
下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。
“智能信息链”反映从数据的获取到处理的一系列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。
“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施:
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,平安城市,智能终端等。
语义分割是指,调用语义分割模型对图像或视频帧进行语义识别,根据识别结果对图像中各物体对象的类别进行分割预测。视频都是由静止的画面组成的,这些静止的画面被称为帧或视频帧。在调用语义分割模型对图像或视频帧进行语义分割时,语义分割模型可以对图像或视频帧中的各个像素进行语义识别,根据语义识别结果对图像或视频帧中的各个像素进行类别预测,生成语义分割图像。语义分割图像用于对视频帧中的各个像素点进行分类,实现视频帧的语义标注。或者说,语义分割图像包括经过语义识别分割出的一个或多个目标区域, 同一目标区域对应于预测出的同一类别的像素,一般采用相同标识(例如颜色)对同一目标区域进行标注,不同目标区域对应于预测出的不同类别的像素,一般采用不同标识(例如颜色)对不同目标区域进行标注。需要说明的是,本申请实施例中不对分类单位进行限定,语义分类可以是逐像素分类,也可以是按图像块分类,一个图像块包括多个像素。
本申请实施例提供一种对视频帧的语义分割方法,该语义分割方法基于人工智能模型(称作语义分割模型),由图1中的基础设施提供计算能力支持,用于对输入的视频帧进行数据处理,生成对视频帧的语义分割结果,例如,得到视频帧的语义分割图像,实现诸如视频帧的背景虚化或背景替换、直播制作、电影或动画制作、对视频帧进行分区优化、对视频帧中的物体进行识别等功能,可以应用于智能终端、自动驾驶、智能医疗等领域。
下面对本申请实施例的语义分割方法的实施场景进行说明,图2A为本申请实施例提供的语义分割方法的实施场景的示意图,参见图2A,本发明实施例提供了一种系统架构200。
数据采集设备260用于采集视频帧序列并存入数据库230,训练设备220基于数据库230中维护的视频帧序列生成语义分割模型201。训练设备220得到的语义分割模型201可以应用不同的系统或设备中。在图2A中,执行设备210配置有I/O接口212,与外部设备进行数据交互,“用户”可以通过客户设备240向I/O接口212输入数据。
执行设备210可以调用数据存储系统250中的数据、代码等,也可以将数据、指令等存入数据存储系统250中。计算模块211使用语义分割模型201对输入视频帧序列进行语义分割,得到语义分割图像序列。最后,I/O接口212将处理结果(即得到的语义分割图像序列)返回给客户设备240,提供给用户。
在图2A中所示情况下,用户可以手动指定输入执行设备210中的数据,例如,在I/O接口212提供的界面中操作。另一种情况下,客户设备240可以自动地向I/O接口212输入数据并获得结果,如果客户设备240自动输入数据需要获得用户的授权,用户可以在客户设备240中设置相应权限。用户可以在客户设备240查看执行设备210输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。
值得注意的,图2A仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图2A中,数据存储系统250相对执行设备210是外部存储器,在其它情况下,也可以将数据存储系统250置于执行设备210中。
在一种可能的实现方式中,执行设备210可以设置在服务器中。参见附图2B,本发明实施例提供了一种系统架构300。执行设备210由一个或多个服务器实现,可选的,与其它计算设备配合,例如:数据存储、路由器、负载均衡器等设备;执行设备210可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备210可以使用数据存储系统250中的数据,或者调用数据存储系统250中的程序代码实现本申请实施例方法。
用户可以操作各自的客户设备(例如本地设备2401和本地设备2402)与执行设备210进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备210进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在另一种实现中,执行设备210的一个方面或多个方面可以由每个本地设备实现,例如,本地设备2401可以为执行设备210提供本地数据或反馈计算结果。
或者,在一种可能的实现方式中,客户设备240和执行设备210均可以设置在终端设备中。
以执行设备210设置在服务器中为例,客户设备240可以向执行设备210发送语义分割请求,该语义分割请求可以包括待进行语义分割的一段视频(或称视频帧序列)。执行设备210用于通过执行本申请实施例提供的语义分割方法,依次生成视频帧序列的语义分割图像序列,并将得到的语义分割图像序列返回给客户设备240。
执行设备210的所有功能也可以由本地设备实现。例如,本地设备2401实现执行设备210的的功能并为自己的用户提供服务,或者为本地设备2402的用户提供服务。
以客户设备240执行本申请实施例的语义分割方法为例,客户设备240可以获取待进行语义分割的视频帧序列,通过执行本申请实施例提供的语义分割方法,依次生成视频帧序列的语义分割图像序列。客户设备240可以基于得到语义分割图像序列执行相应的场景或领域的应用,例如智能终端、无人驾驶和医疗领域等。下面举例介绍智能终端领域的应用:
智能终端(比如手机)上可以设置有图像采集装置(比如摄像头)及处理芯片。用户在智能终端开启视频通话功能,在视频通话过程中或录像过程中,图像采集装置能够采集包括场景视频帧序列。
在一种具体的应用中,场景视频帧序列包括人像,处理芯片基于语义分割模型,对视频帧序列实时进行人像分割,得到语义分割图像序列。选取语义分割图像中人像对应的目标区域作为前景,其他区域作为背景,将视频帧序列中的背景替换为其他背景,从而实现时空变幻、背景切换的效果。
在另一种具体的应用中,处理芯片基于语义分割模型,对场景视频帧序列进行实时的语义分割,生成的语义分割图像序列可以为视频帧提供实时的语义掩模,之后可以对不同的目标区域分别进行优化。
在另一种具体的应用中,用户可以开启智能终端的多物体识物功能,处理芯片基于语义分割模型,对场景视频帧序列进行实时的语义分割,生成的语义分割图像包括多个目标区域。基于语义分割图像序列,智能终端可以识别各个目标区域对应的物体的类别,比如属于杯子或椅子等,给用户带来强劲的视觉体验。
本申请实施例提供的语义分割方法并不限于上述提到的场景或领域。
下面对本申请实施例提供的语义分割方法进行介绍,以该方法应用于计算机设备为例,该计算机设备中设置有执行设备210。该计算机设备可以是终端设备,或服务器,或,终端设备和服务器组成的系统。
以对某个视频(称作第一视频或第一视频帧序列)中的某个视频帧(称作第二视频帧)进行语义分割为例,介绍对第二视频帧的语义分割过程,第一视频帧序列包括多个连续的视频帧。参考图3A,本申请语义分割方法一个实施例可以包括如下步骤:
301、获取第一视频帧序列中的第一视频帧和第二视频帧;
第一视频帧序列可以是通过计算机设备的图像采集装置实时采集的视频,或者,可以是从网络上获取的视频。第一视频帧和第二视频帧为不同的视频帧,具体的,可以为第一视频 帧序列中相邻的两个视频帧,或者,可以为第一视频帧序列中相隔一个或多个视频帧的两个视频帧。
302、分别将第一视频帧和第二视频帧输入图像分割模型;
303、获取第一图像分割网络层输出的第一视频帧的第一特征图;
304、获取第二图像分割网络层输出的第二视频帧的第二特征图;
获取第一视频帧序列中的第一视频帧和第二视频帧之后,可以分别将第一视频帧和第二视频帧输入预训练的图像分割模型。
该图像分割模型为对图像的语义分割模型,用于对输入的图像进行语义分割,输出该图像的语义分割图像。示例性的,可以基于多个带有标注信息的图像训练图像分割模型,图像的标注信息可以为该图像的标注的语义分割图像(简称标注语义分割图像)。
语义分割模型可以是一种卷积神经网络模型,语义分割模型可以包括输入层、多层网络层和输出层。语义分割模型的多层网络层中的每层网络层用于对输入的图像或特征图进行特征提取,输出特征图。一般的语义分割模型的多层网络层可以被认为是一个编码器-解码器结构,也就是说,多层网络层中靠近输入层的网络层相当于编码器,可以对输入的视频帧或特征图进行下采样,使得输出的特征图的分辨率小于输入的视频帧或特征图的分辨率;多层网络层中靠近输出层的网络层相当于解码器,可以对输入的特征图进行上采样,使得输出的特征图的分辨率大于输入的视频帧的分辨率。在本申请实施例中,将多层网络层中输出的特征图的分辨率最小的网络层称作中间网络层。对于处理图像信息的神经网络模型来说,其中间网络层和位于其输入层与中间网络层之间的网络层用于对输入的图像或特征图进行下采样,中间网络层与输出层之间的网络层用于对输入的特征图进行上采样;对于图像的语义分割模型来说,中间网络层和位于其输入层与中间网络层之间的网络层还可以提取图像中的语义特征,但是会丢失图像的空间信息,因此,可以认为中间网络层输出的特征图能够提供输入图像最多的语义特征;中间网络层与输出层之间的网络层还可以在输出的特征图中丢失的空间信息,因此,可以认为越靠近输出层的网络层,其输出的特征图能够提供最多的空间信息。
将第一视频帧输入图像分割模型后,图像分割模型可以对第一视频帧进行处理,输出第一视频帧的语义分割图像。具体的,图像分割模型的输入层可以对第一视频帧进行预处理,例如对视频帧进行标准化,视频帧的红色绿色蓝色RGB分量被映射成符合正态分布。图像分割模型的第一层网络层可以提取第一视频帧的特征,输出特征图;图像分割模型的第二层网络层可以对第一层网络层输出的特征图进行特征提取,输出特征图;依次类推,图像分割模型的最后一层(或称底层)网络层可以对前一层网络层输出的特征图进行特征提取,输出特征图;图像分割模型的输出层可以对底层网络层输出的特征图进行处理,输出第一视频帧的语义分割图像。计算机设备可以在将第一视频帧输入图像分割模型后,获取某一网络层(称作第一图像分割网络层)输出的特征图(称作第一特征图),为了根据尽量多的语义特征提取时序信息,在一种可能的实现方式中,第一图像分割网络层可以为图像分割模型的中间网络层或位于图像分割模型的输入层和中间网络层之间的任意一层网络层。
将第二视频帧输入图像分割模型后,图像分割模型可以对第二视频帧进行处理,输出第二视频帧的语义分割图像。具体过程可以参考上述对第一视频帧的处理过程。计算机设备可以在将第二视频帧输入图像分割模型后,获取某一网络层(称作第二图像分割网络层)输出 的特征图(称作第二特征图),为了使得第二特征图提供第二视频帧尽量多的空间信息,在一种可能的实现方式中,第二图像分割网络层可以为图像分割模型的中间网络层和输出层之间的任意一层网络层。
在本申请实施例中,不限定获取步骤303和步骤304的先后执行顺序。
305、将第一特征图和第二特征图输入第一帧间融合模型,生成第二视频帧的语义分割图像。
计算机设备可以将第一特征图和第二特征图输入第一帧间融合模型,生成第二视频帧的语义分割图像。第一帧间融合模型可以是一种训练好的神经网络模型,该第一帧间融合模型可以通过端到端的方式训练,或者,可以先对神经网络模型中的部分网络层进行训练,训练好之后,再对其他网络层进行训练。
第一视频帧的第一特征图能够为第二视频帧提供其时序信息,和第一图像分割网络层输出的特征图相比,第二图像分割网络层输出的第二视频帧的第二特征图能够更大程度的提供第二视频帧的空间信息,获取第一特征图和第二特征图后,根据第一特征图和第二特征图生成第二视频帧的语义分割图像,有利于在保持单个视频帧的分割精度的前提下,利用时序信息提高对第二视频帧进行语义分割的稳定性。
本申请实施例不限定图像分割模型的结构,在一种可能的实现方式中,图像分割模型可以为卷积神经网络(convolutional neuron network,CNN),CNN是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元对输入其中的图像中的重叠区域作出响应。
如图4A所示,卷积神经网络(CNN)400可以包括输入层410、多层网络层420和输出层430,其中,多层网络层420可以包括卷积层和隐含层,可选的还可以包括池化层。一般来说第一层是输入层410,最后一层是输出层430,中间的层数都是网络层420。
下面对网络层420进行介绍。
关于卷积层:
如图4A所示网络层420可以包括如示例421-426层,在一种实现中,421层为卷积层,422层为池化层,423层为卷积层,424层为池化层,425为卷积层,426为池化层;在另一种实现方式中,421、422为卷积层,423为池化层,424、425为卷积层,426为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层421为例,卷积层421可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的 整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图(feature map)维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络400进行正确的预测。
当卷积神经网络400有多个卷积层的时候,初始的卷积层(例如421)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络400深度的加深,越往后的卷积层(例如426)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。在一种可能的实现方式中,第一图像分割网络层可以为一层卷积层,例如最后一层卷积层。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图4A中420所示例的421-426各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。在一种可能的实现方式中,第一图像分割网络层可以为一层池化层,例如最后一层池化层。
关于隐含层:
在经过卷积层和池化层的处理后,卷积神经网络400还不足以输出所需要的输出信息,例如语义分割图像。因为如前,卷积层和池化层只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络400需要利用隐含层来生成一个或者一组所需要的类的数量的输出。因此,网络层可以包括多层隐含层(如图4A所示的427、428至429),该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类等。
在隐含层之后,也就是整个卷积神经网络400的最后层为输出层430,该输出层430具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络400的前向传播(如图4A由410至430的传播为前向传播)完成,反向传播(如图4A由430至410的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络400的损失及卷积神经网络400通过输出层输出的结果和理想结果之间的误差。在一种可能 的实现方式中,图3A对应的实施例中的底层网络层可以指最后一层隐含层,或者指输出层之前的最后一层网络层。
需要说明的是,如图4A所示的卷积神经网络400仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图4B所示的多个卷积层或池化层并行,将分别提取的特征图均输入给隐含层进行处理。
在一种可能的实现方式中,第一图像分割网络层用于对输入的视频帧或特征图进行下采样,第二图像分割网络层用于对输入的特征图进行上采样。
为了降低计算机设备的中间变量存储压力,可以对用于提供时序信息的第一特征图进行下采样,在一种可能的实现方式中,参考图3B,步骤305可以具体包括如下步骤:
3051A、将第一特征图输入第一邻帧预测模型;
第一邻帧预测模型用于预测相邻视频帧的信息,相邻视频帧与输入第一邻帧预测模型的特征图所属的视频帧属于同一视频帧序列。
第一邻帧预测模型可以是一种卷积神经网络模型,第一邻帧预测模型可以包括输入层、多层网络层和输出层。第一邻帧预测模型的多层网络层中的每层网络层用于对输入的图像或特征图进行特征提取,输出特征图。第一邻帧预测模型的多层网络层可以被认为是一个编码器-解码器结构,也就是说,多层网络层中靠近输入层的网络层相当于编码器,可以对输入的视频帧或特征图进行下采样,使得输出的特征图的分辨率小于输入的视频帧或特征图的分辨率;多层网络层中靠近输出层的网络层相当于解码器,可以对输入的特征图进行上采样,使得输出的特征图的分辨率大于输入的视频帧的分辨率。在本申请实施例中,将多层网络层中输出的特征图的分辨率最小的网络层称作中间网络层。对于处理图像信息的神经网络模型来说,其中间网络层和位于其输入层与中间网络层之间的网络层用于对输入的图像或特征图进行下采样,中间网络层与输出层之间的网络层用于对输入的特征图进行上采样。
3052A、获取第一邻帧预测网络层输出的第一特征图的第一压缩特征图;
将第一视频帧输入第一邻帧预测模型后,可以获取第一邻帧预测网络层输出的第一特征图的第一压缩特征图。在一种可能的实现方式中,第一邻帧预测网络层可以为第一邻帧预测模型的中间网络层或位于第一邻帧预测模型的输入层和中间网络层之间的任意一层网络层。
3053A、将第一压缩特征图和第二特征图输入第二帧间融合模型,生成第二视频帧的语义分割图像;
在一种可能的实现方式中,第二帧间融合模型可以是一种神经网络模型。
由于第一邻帧预测模型用于预测相邻视频帧的信息,因此,第一邻帧预测网络层输出的特征图有利于体现第一视频帧序列中的时序变化信息,因此,无需第二图像分割网络层输出的第二视频帧的特征图作为对比,第一特征图便可以独自提供时序变化信息。
并且,通过进一步减小用于提供时序信息的特征,有利于去除噪声等冗余信息,使后续第二帧间融合模型更容易学习,并且有利于进一步降低计算机设备的中间变量存储压力。
在一种可能的实现方式中,第一邻帧预测模型可以为基于带有标注信息的第一样本集训练得到的,第一样本为第一样本集中的任意一个样本,第一样本为第一图像分割网络层输出的第三视频帧的特征图,第一样本的标注信息为第一图像分割网络层输出的第四视频帧的特征图,第三视频帧和第四视频帧为同一视频帧序列中的不同视频帧。
假设第一视频帧在第二视频帧的第一时序方向,那么,在一种可能的实现方式中,用于训练第一邻帧预测模型的样本对应的视频帧在其标准信息对应的视频帧的第一时序方向,即第三视频帧在第四视频帧的第一时序方向。也就是说,若将第一视频帧的第一特征图输入第一邻帧预测模型,得到的特征图将接近第二视频帧的第一特征图,那么第一邻帧预测网络层输出的第一特征图的第一压缩特征图能够体现第一视频帧至第二视频帧的时序变化信息。第一时序方向可以指前序,即第一视频帧被拍摄的时刻先于第二视频帧被拍摄的时刻;或者,第一时序方向可以指后序,即第一视频帧被拍摄的时刻在第二视频帧被拍摄的时刻之后。
在实时视频帧的语义分割应用中,为了减少延时,第一时序方向可以为前序,在一种可能的实现方式中,第一视频帧可以为第二视频帧的前一帧。
在一种可能的实现方式中,为了丰富时序信息,提高时序信息的精度,将第一视频帧输入图像分割模型后,可以获取图像分割模型中的两层或更多层网络层输出的两个或更多个特征图,示例性的,可以获取第一图像分割网络层输出的第一特征图和第三图像分割网络层输出的第四特征图,第三图像分割网络层可以为图像分割模型中的任意一层网络层。例如,第二图像分割网络层可以为第一层网络层,或者倒数第二层网络层。
之后,可以将获取到的各个特征图输入第一帧间融合模型,例如,将第一特征图、第二特征图和第四特征图输入第一帧间融合模型,生成第二视频帧的语义分割图像,有利于使得最终的语义分割结果更加稳定和准确。
在一种可能的实现方式中,可以参考图3B对应的方法获取第一视频帧的各个特征图的压缩特征图,之后将第二特征图和第一视频帧的各压缩特征图输入第二帧间融合模型。获取不同网络层输出的特征图的压缩特征图时,可以采用不同的邻帧预测模型,例如,对第一层网络层输出的特征图的时序信息进行提取时,利用的邻帧预测模型可以是基于第一层网络层输出的样本视频帧的特征图来进行训练。
在一种可能的实现方式中,步骤304之后,步骤305之前,本申请实施例提供的语义分割方法还可以包括:获取第一图像分割网络层输出的第二视频帧的第三特征图。第三特征图可以用来提供时序信息。此时,参考图3C,步骤305的一种可能的细化步骤可以包括:
3051B、将第三特征图输入第二邻帧预测模型;
第二邻帧预测模型用于预测相邻视频帧的信息,相邻视频帧与输入第二邻帧预测模型的特征图所属的视频帧属于同一视频帧序列。
第二邻帧预测模型是一种卷积神经网络模型。第二邻帧预测模型可以是一种卷积神经网络模型,第二邻帧预测模型可以包括输入层、多层网络层和输出层。第二邻帧预测模型的多层网络层中的每层网络层用于对输入的图像或特征图进行特征提取,输出特征图。第二邻帧预测模型的多层网络层可以被认为是一个编码器-解码器结构,也就是说,多层网络层中靠近输入层的网络层相当于编码器,可以对输入的视频帧或特征图进行下采样,使得输出的特征图的分辨率小于输入的视频帧或特征图的分辨率;多层网络层中靠近输出层的网络层相当于解码器,可以对输入的特征图进行上采样,使得输出的特征图的分辨率大于输入的视频帧的分辨率。在本申请实施例中,将多层网络层中输出的特征图的分辨率最小的网络层称作中间网络层。对于处理图像信息的神经网络模型来说,其中间网络层和位于其输入层与中间网络层之间的网络层用于对输入的图像或特征图进行下采样,中间网络层与输出层之间的网络层 用于对输入的特征图进行上采样。
3052B、获取第二邻帧预测网络层输出的第三特征图的第二压缩特征图;
第二邻帧预测网络层为第二邻帧预测模型的中间网络层或位于第二邻帧预测模型的输入层和中间网络层之间的任意一层网络层。
3053B、将第一压缩特征图、第二压缩特征图和第二特征图输入第二帧间融合模型,生成第二视频帧的语义分割图像;
在一种可能的实现方式中,第二邻帧预测模型为基于带有标注信息的第二样本集训练得到的,第二样本为第二样本集中的任意一个样本,第二样本为第一图像分割网络层输出的第五视频帧的特征图,第二样本的标注信息为第一图像分割网络层输出的第六视频帧的特征图,第五视频帧和第六视频帧为同一视频帧序列中的不同视频帧。
在一种可能的实现方式中,第一视频帧在第二视频帧的第一时序方向,那么第六视频帧在第五视频帧的第一时序方向。也就是说,若将第一视频帧的第三特征图输入第二邻帧预测模型,得到的特征图将接近第一视频帧的第一特征图,那么第二邻帧预测网络层输出的第三特征图的第二压缩特征图能够体现第二视频帧至第一视频帧的时序变化信息。
根据第一压缩特征图和第二压缩特征图(时序信息)对第二特征图(空间信息)进行调整,进而生成第二视频帧的语义分割图像,有利于提高时序信息的信息量,提高语义分割的稳定性。
在一种可能的实现方式中,第二帧间融合模型可以为基于带有标注信息的第三样本集训练得到的,第三样本为第三样本集中的任意一个样本,第三样本包括第一邻帧预测网络层输出的第四特征图的压缩特征图、第二邻帧预测网络层输出的第五特征图的压缩特征图和第二图像分割网络层输出的第八视频帧的第六特征图,第四特征图为第一图像分割网络层输出的第七视频帧的特征图,第五特征图为第一图像分割网络层输出的第八视频帧的特征图,第七视频帧和第八视频帧为同一视频帧序列中的不同视频帧,第三样本的标注信息为第八视频帧的标注语义分割图像。
通过预先训练好的神经网络模型对第二视频帧的时序信息和空间信息进行融合,有利于提高对第二视频帧进行语义分割的精确度和稳定性。
上面已对本申请实施例提供的对视频帧的语义分割方法进行介绍,示例性的,本申请实施例还提供一种模型训练方法,用于对上述第一帧间融合模型进行训练。继续参考图2A提供的系统架构200,介绍训练设备220如何训练第一帧间融合模型。
对神经网络模型的训练过程一般需要利用大量样本进行训练,且每次训练过程至少需要同一视频中的两帧视频帧(称作第一帧和第二帧),以利用第一帧和第二帧训练语义分割模型的过程为例,介绍本申请实施例提供的模型训练方法。
参考图5A,本申请实施例提供的模型训练方法可以包括如下步骤:
501、获取同一视频帧序列中的第一帧和第二帧、以及第二帧的语义分割图像;
训练设备220可以针对不同的目标,基于不同的数据生成相应的语义分割模型201,以给用户提供更佳的结果。例如在人像的语义分割应用中,第一帧和第二帧可以均包括人像。
在一种可能的实现方式中,客户设备240也可以作为数据采集端将采集到的视频帧序列 (包括第一帧和第二帧)存入数据库230。
502、分别将第一帧和第二帧输入图像分割模型;
图像分割模型为训练好的用于对输入的图像进行语义分割的模型,在一种可能的实现方式中,图像分割模型是一种卷积神经网络模型,卷积神经网络模型包括输入层、输出层以及位于输入层和输出层之间的多层网络层,多层网络层中的每一层用于对输入的数据进行特征提取,中间网络层为多层网络层中输出的特征图的分辨率最小的一层网络层。关于卷积神经网络模型的介绍可以参考前述相关描述,此处不再赘述。
503、获取第一图像分割网络层输出的第一帧的第一特征图;
将第一帧输入图像分割模型之后,可以获取第一图像分割网络层输出的第一帧的第一特征图,第一图像分割网络层可以为图像分割模型的中间网络层或位于图像分割模型的输入层和中间网络层之间的任意一层网络层。
504、获取第二图像分割网络层输出的第二帧的第二特征图;
将第二帧输入图像分割模型之后,可以获取第二图像分割网络层输出的第二帧的第二特征图,第二图像分割网络层为位于图像分割模型的中间网络层和输出层之间的任意一层网络层。
505、以第二帧的语义分割图像作为标注信息,将第一特征图和第二特征图输入第一帧间融合模型,更新第一帧间融合模型的参数。
第一帧间融合模型可以是一种神经网络模型。
由于计算机设备可以通过图像分割模型和第一帧间融合模型对视频帧进行语义分割,因此,可以认为本申请实施例提供了一种对视频帧的语义分割模型,包括上述训练好的图像分割模型和第一帧间融合模型。
本申请实施例提供了一种第一帧间融合模型的训练方法,利用图像分割模型和第一帧间融合模型对视频帧进行语义分割,有利于提高对视频帧的语义分割结果的准确性和稳定性。
对语义分割模型的训练过程可以为对第一帧间融合模型的训练过程,或者可以包括对图像分割模型的训练过程和对第一帧间融合模型的训练过程。
第一帧间融合模型可以包括一个或多个深度神经网络,深度神经网络中的每一层的工作可以用数学表达式y=a(Wgx+b)来描述:从物理层面深度神经网络中的每一层的工作可以理解为通过五种对输入空间(输入向量的集合)的操作,完成输入空间到输出空间的变换(即矩阵的行空间到列空间),这五种操作包括:1、升维/降维;2、放大/缩小;3、旋转;4、平移;5、“弯曲”。其中1、2、3的操作由“Wgx”完成,4的操作由“+b”完成,5的操作则由“a()”来实现。这里之所以用“空间”二字来表述是因为被分类的对象并不是单个事物,而是一类事物,空间是指这类事物所有个体的集合。其中,W是权重向量,该向量中的每一个值表示该层神经网络中的一个神经元的权重值。该向量W决定着上文的输入空间到输出空间的空间变换,即每一层的权重W控制着如何变换空间。训练深度神经网络的目的,也就是最终得到训练好的神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。因此,神经网络的训练过程本质上就是学习控制空间变换的方式,更具体的就是学习权重矩阵。
因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权 重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到神经网络能够预测出真正想要的目标值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
一般来说,神经网络满足预设的约束条件时,可以认为神经网络完成训练,可以将此时的神经网络称作神经网络模型。约束条件可以是达到预设的迭代次数,或者调整参数后的神经网络的性能达到预设指标等。
在一种可能的实现方式中,在将第二帧输入图像分割模型之后,方法还包括:获取第一图像分割网络层输出的第二帧的第三特征图。步骤505的一种细化步骤可以包括:将第一特征图、第二特征图和第三特征图输入第一帧间融合模型,更新第一帧间融合模型的参数。
在一种可能的实现方式中,步骤505的一种细化步骤可以包括:以第三特征图为标注信息,将第一特征图输入第一邻帧预测模型,更新第一邻帧预测模型的参数。
在一种可能的实现方式中,第一邻帧预测模型是一种卷积神经网络模型;基于第一邻帧预测模型满足第一约束条件,比如对第一邻帧预测模型的训练次数达到预设迭代次数,此时,参考图5B,步骤505的一种细化步骤可以包括:
5051A、将第一特征图输入第一邻帧预测模型;
5052A、获取第一邻帧预测网络层输出的第一特征图的第一压缩特征图;
第一邻帧预测网络层为第一邻帧预测模型的中间网络层或位于第一邻帧预测模型的输入层和中间网络层之间的任意一层网络层。
5053A、以第二帧的语义分割图像作为标注信息,将第一压缩特征图和第二特征图输入第二帧间融合模型,更新第二帧间融合模型的参数。
由于第一邻帧预测模型的输入数据和输出数据为同一视频中的不同视频帧的特征图,因此,训练好的第一邻帧预测模型中的第一邻帧预测网络层输出的压缩特征图能够体现同一视频中不同视频帧之间随时间发生变化的信息,即时序信息,因此,无需第二特征图,根据第一特征图便可以提供第二帧与第一帧之间的时序信息。通过进一步减小用于提供时序信息的特征,有利于去除噪声等冗余信息,使后续融合模块更容易学习,并且有利于进一步降低计算机设备的中间变量存储压力。
在一种可能的实现方式中,步骤505的一种细化步骤还可以包括:以第一特征图为标注信息,将第三特征图输入第二邻帧预测模型,更新第二邻帧预测模型的参数。在一种可能的实现方式中,第二邻帧预测模型可以是一种卷积神经网络模型。
基于第二邻帧预测模型满足第二约束条件,参考图5C,步骤505的一种细化步骤还可以包括:
5051B、将第三特征图输入第二邻帧预测模型;
5052B、获取第二邻帧预测网络层输出的第三特征图的第二压缩特征图;
第二邻帧预测网络层为第二邻帧预测模型的中间网络层或位于第二邻帧预测模型的输入 层和中间网络层之间的任意一层网络层。
5053B、以第二帧的语义分割图像作为标注信息,将第一压缩特征图、第二压缩特征图和第二特征图输入第二帧间融合模型,更新第二帧间融合模型的参数。
由于第二邻帧预测模型的输入数据和输出数据为同一视频中的不同视频帧的特征图,因此,训练好的第二邻帧预测模型中的第二邻帧预测网络层输出的压缩特征图能够体现同一视频中不同视频帧之间随时间发生变化的信息,即时序信息,通过进一步减小用于提供时序信息的特征,有利于去除噪声等冗余信息,使后续第二帧间融合模型更容易学习,并且有利于进一步降低计算机设备的中间变量存储压力。
需要说明的是,“第一视频帧”和“第一帧”不是限定该帧为视频中的第一个视频帧,“第二视频帧”和“第二帧”不是限定该帧为视频中的第二个视频帧,并且,并未限定“第一视频帧”和“第一帧”为同一帧,并未限定“第二视频帧”和“第二帧”为同一帧。
本申请实施例提供的模型训练方法中的图像分割模型可以参考本申请实施例提供的语义分割方法中的图像分割模型进行理解,此处不再赘述。
在一种可能的实现方式中,可以按照本申请实施例提供的模型训练方法中的第一帧间融合模型的训练方法获取本申请实施例提供的语义分割方法中的第一帧间融合模型。具体的,在一种可能的实现方式中,可以按照本申请实施例提供的模型训练方法中的第一邻帧预测模型的训练方法获取前述语义分割方法实施例中的第一邻帧预测模型;在一种可能的实现方式中,可以按照本申请实施例提供的模型训练方法中的第二邻帧预测模型的训练方法获取本申请实施例提供的语义分割方法中的第二邻帧预测模型;在一种可能的实现方式中,可以按照本申请实施例提供的模型训练方法中的第二帧间融合模型的训练方法获取本申请实施例提供的语义分割方法中的第二帧间融合模型。
本申请实施例还提供一种视频通话方法,参考图6,本申请视频通话方法一个实施例可以包括如下步骤:
601、第一终端设备通过其图像采集模块采集第一本端视频帧;
第一终端设备可以通过其图像采集模块(比如摄像头)实时采集图像,得到图像序列,由于图像序列中的图像之间存在时序关联,可以将采集的每个图像称作一个视频帧(称作第一本端视频帧)。
602、第一终端设备接收第二终端设备通过其图像采集模块采集的对端视频帧;
第二终端设备可以实时采集视频帧,并将采集到的视频帧发送给第一终端设备。
603、第一终端设备根据第一本端视频帧生成第二本端视频帧;
步骤601之后,第一终端设备可以根据第一本端视频帧生成第二本端视频帧,第一本端视频帧的第一图像区域显示第一图像,第二本端视频帧的第二图像区域显示第二图像,第一图像和第二图像不同,第一本端视频帧的第三图像区域和第二本端视频帧的第四图像区域均显示第三图像。可以将第三图像称作前景,将第一图像和第二图像称作第三图像的背景;第三图像区域和第四图像区域可以被称作前景区域,第一图像区域和第二图像区域可以被称作背景区域。需要说明的是,不限定第一图像区域和第二图像区域的尺寸或相对于所在视频帧的位置相同,不限定第三图像区域和第四图像区域的尺寸或相对于所在视频帧的位置相同。
604、第一终端设备通过其显示屏同时显示对端视频帧和第二本端视频帧。
第一终端设备可以在不同图层同时显示对端视频帧和第二本端视频帧,可选的,两个视频帧在显示屏上对应的区域的尺寸不同。或者,第一终端设备可以根据对端视频帧和第二本端视频帧生成一帧融合图像,该融合图像的一部分区域显示对端视频帧的全部或部分图像区域,该融合图像的另一部分区域显示第二本端视频帧的全部或部分图像区域,或者,将第一本端视频帧中第一图像区域显示的第一图像替换为第二图像。
在视频通话过程中,对终端设备采集的视频帧进行背景切换,有利于提高视频通话过程的趣味性,增加用户粘性。
在一种可能的实现方式中,步骤603之后,本申请视频通话方法还可以包括:第一终端设备将第二本端视频帧发送给第二终端设备。这样,第二终端设备可以在视频通话界面显示第二本端视频帧,使通话对象看到背景切换后的视频图像。
在一种可能的实现方式中,步骤603可以包括:第一终端设备根据用户的切换指令,根据第一本端视频帧生成第二本端视频帧,切换指令用于指示第一终端设备将第一本端视频帧中的第一图像切换为第二图像。
在一种可能的实现方式中,该切换指令可以为响应于用户对某个选项的选择操作生成的,该选项用于提示将视频帧的背景切换为第二图像。在一种可能的实现方式中,终端设备可以向用户提供多个用于下达切换指令的选项,不同选项用于将视频帧的背景切换为不同图像。
在一种可能的实现方式中,该切换指令可以为响应于终端设备检测到自身姿态(例如摄像头的朝向)的改变生成的,终端设备的姿态变化会引起采集的视频帧中人像的背景发生变化。示例性的,用户可以选择将人像的背景切换至一组图像,该组图像可以用于体现三维空间中的环境图像,例如,该组图像为利用拍摄设备在同一场景或环境中拍摄的,不同图像对应于拍摄设备的摄像头的不同朝向,可以关联保存各图像对应的摄像头的朝向信息,在录制视频的过程中或进行视频通话的过程中,终端设备可以根据自身的朝向自动选择相应的图像,并将当前采集到的视频帧中人像的背景切换为选择的该图像,从而可以随着终端设备的姿态变化,相应改变背景区域的图像,实现动态切换切换后的背景图像,提高切换背景后的视频的真实性,仿佛视频中的人真正实现了所处环境的切换。
在一种可能的实现方式中,第一图像与第三图像对应的语义类型不同。
在一种可能的实现方式中,第二本端视频帧可以为根据第一本端视频帧的语义分割结果生成的。在一种可能的实现方式中,第一终端设备根据前述本申请提供的语义分割方法任一实施例,根据第一本端视频帧和第三本端视频帧生成第二本端视频帧的语义分割图像,第三本端视频帧与第一本端视频帧为第一终端设备采集的同一视频帧序列中的不同视频帧,例如第一本端视频帧的前一帧。之后,第一终端设备根据语义分割图像和第一本端视频帧生成第二本端视频帧。
本申请实施例还提供一种视频帧的背景切换方法,该方法一个实施例可以包括如下步骤:第一终端设备通过其图像采集模块采集第一视频帧;响应于第一终端设备检测到自身姿态的变化,第一终端设备根据第一视频帧生成第二视频帧,第一视频帧的第一图像区域显示第一图像,第二视频帧的第二图像区域显示第二图像,第一图像和第二图像不同,第一视频帧的 第三图像区域和第二视频帧的第四图像区域均显示第三图像;第一终端设备通过其显示屏显示第二视频帧。
在一种可能的实现方式中,终端设备可以关联存储图像和终端设备的姿态信息(或姿态变化信息),第一终端设备检测到自身姿态的变化时,可以选择与当前姿态对应的图像,或选择与姿态的变化信息(比如变化方向或变化速度)对应的图像,在第二视频帧的第二图像区域(或称第三图像的背景区域)显示选择的图像。
示例性的,用户可以选择将人像的背景切换至一组图像,该组图像可以用于体现三维空间中的环境图像,例如,该组图像为利用拍摄设备在同一场景或环境中拍摄的,不同图像对应于拍摄设备的摄像头的不同朝向,可以关联保存各图像对应的摄像头的朝向信息,在录制视频的过程中或进行视频通话的过程中,终端设备可以根据自身的朝向自动选择相应的图像,并将当前采集到的视频帧中人像的背景切换为选择的该图像,从而可以随着终端设备的姿态变化,相应改变背景区域的图像,实现动态切换切换后的背景图像,提高切换背景后的视频的真实性,仿佛视频中的人真正实现了所处环境的切换。
为了便于理解,下面结合具体应用场景,分别示例性的介绍本申请实施例方法的具体实现过程。
首先结合具体应用场景,示例性的介绍本申请实施例提供的视频通话方法的具体实现过程。
用户1和用户2可以通过各自的终端设备(比如手机)进行视频通话,为了便于描述,将用户1的手机记为手机1,将用户2的手机记为手机2。该视频通过过程可以由手机中的系统应用程序(比如电话应用程序)支持,也可以由第三方应用程序(比如社交应用程序)支持。在用户1和用户2进行视频通过的过程中,手机1和手机2可以分别采集各自的视频帧序列,并通过互联网相互发送各自采集的视频帧序列,以在各自的显示屏上同时显示手机1和手机2采集的视频帧序列,实现面对面沟通的效果。
下面以手机1和手机2采集的视频帧序列中的一个视频帧为例,介绍手机1执行本申请提供的视频通话方法时在其显示屏上的显示内容。
假设手机1当前采集的视频帧1如图7A所示,手机2当前采集的视频帧2如图7B所示。手机1可以通过互联网向手机2发送视频帧1,手机2可以通过互联网向手机1发送视频帧1。手机1可以为用户1提供人像背景切换的多个选项(选项1、选项2、选项3和选项4),当用户1选择选项1时(图7C中的箭头),手机1不执行本申请实施例提供的视频通话方法,即不对视频帧1进行人像背景切换,手机1在其显示屏上同时显示视频帧1和视频帧2,如图7C所示;当用户1选择选项2时(图7D中的箭头),手机1执行本申请实施例提供的视频通话方法,根据将视频帧1中人像的背景区域(即第一图像区域)替换为选项2对应的图像,得到人像背景切换后的视频帧1’,之后,可以在显示屏上同时显示视频帧1’和视频帧2,如图7D所示。
下面结合具体应用场景,示例性的介绍本申请实施例提供的语义分割方法的具体实现过程。
例如,应用场景为:在用户利用智能终端录制视频或利用智能终端与他人进行视频通话 的过程中,智能终端以人像对应的区域为目标区域,以视频帧中目标区域以外的其他区域为背景,切换背景对应的图像,实现人像的背景切换。
假设图像分割模型包括7层网络层,第一层最靠近图像分割模型的输入层,第七层最靠近图像分割模型的输出层,第四层为图像分割模型的中间网络层。参考图8A,本申请利用语义分割模型800实现的语义分割方法一个实施例方法可以包括如下步骤:
步骤1、通过摄像头依次获取帧1、帧2和帧3;
步骤2、将帧1输入图像分割模型801,获取网络层1输出的特征图1_1、网络层4输出的特征图1_4、网络层6输出的特征图1_6和网络层7输出的特征图1_7;
步骤3、将特征图1_7输入第二帧间融合模型802,得到帧1的语义分割图像1(在本申请实施例中为蒙版1),例如生成图7B所利用的蒙版,参考图8B;
步骤4、根据帧1的人像分割图像将帧1中人像对应的目标区域以外的区域(即背景)替换为指定图像,得到切换背景后的帧1,称作帧1’;
步骤5、分别将特征图1_1、特征图1_4和特征图1_6输入第一邻帧预测模型(图8A中以白色填充的圆形表示),得到压缩特征1_1a、压缩特征图1_4a和压缩特征图1_6a,并缓存;
不限定步骤3和步骤5之间的时序关系。
步骤6、将帧2输入图像分割模型801,获取网络层1输出的特征图2_1、网络层4输出的特征图2_4、网络层6输出的特征图2_6和网络层7输出的特征图2_7;
步骤7、分别将特征图2_1、特征图2_4和特征图2_6输入第一邻帧预测模型(图8A中以白色填充的圆形表示),得到压缩特征2_1a、压缩特征图2_4a和压缩特征图2_6a,并缓存;
步骤8、分别将特征图2_1、特征图2_4和特征图2_6输入第二邻帧预测模型(图8A中以黑色填充的圆形表示),得到压缩特征2_1b、压缩特征图2_4b和压缩特征图2_6b;
步骤9、将压缩特征图1_1a、压缩特征图1_4a、压缩特征图1_6a、压缩特征图2_1b、压缩特征图2_4b、压缩特征图2_6b和特征图2_7输入第二帧间融合模型802,得到帧2的人像分割图像2(在本申请实施例中为蒙版2);
步骤10、帧2的人像分割图像将帧2中人像对应的目标区域以外的区域替换为指定背景,得到切换背景后的帧2,称作帧2’;
不限定步骤7和步骤8之间的时序关系。
步骤11、将帧3输入图像分割模型801,获取网络层1输出的特征图3_1、网络层4输出的特征图3_4、网络层6输出的特征图3_6和网络层7输出的特征图3_7;
步骤12、分别将特征图3_1、特征图3_4和特征图3_6输入第一邻帧预测模型(图8A中以白色填充的圆形表示),得到压缩特征3_1a、压缩特征图3_4a和压缩特征图3_6a,并缓存;
步骤13、分别将特征图3_1、特征图3_4和特征图3_6输入第二邻帧预测模型(图8A中以黑色填充的圆形表示),得到压缩特征3_1b、压缩特征图3_4b和压缩特征图3_6b;
步骤14、将压缩特征2_1a、压缩特征图2_4a、压缩特征图2_6a、压缩特征3_1b、压缩特征图3_4b、压缩特征图3_6b和特征图3_7输入第二帧间融合模型802,得到帧3的人像分割图像3(在本申请实施例中为蒙版3);
步骤15、帧3的人像分割图像将帧3中人像对应的目标区域以外的区域替换为指定背景,得到切换背景后的帧3,称作帧3’。
不限定步骤12和步骤13之间的时序关系。不同网络层输出的特征图对应的第一邻帧预测模型可以不同。
第二帧间融合模型802可以对输入的多个数据进行融合,下面继续基于图8A对应的实施例,以步骤9为例,介绍第二帧间融合模型802的数据处理过程。图8C中以四角星代表连接操作,例如逐像素相加操作或合并(concat)操作,concat操作用于连接两个或多个数组,该方法不会改变现有的数组,而仅仅会返回被连接数组的一个副本。图8C中以菱形代表卷积操作,标识的数字不同的菱形可以代表不同类型的卷积操作,示例性的,标识有“1”的菱形可以代表一次或多次空洞卷积;标识有“2”的菱形可以代表一次或多次分离式卷积,或者代表先进行一次或多次分离式卷积,再进行一次或多次普通卷积;标识有“3”的菱形可以代表一次或多次普通卷积。
图8C进行用于示例性的介绍第二帧间融合模型802的内部结构,第二帧间融合模型802还可以包括其他运算,例如,可以对最后一个卷积操作输出的数据进行后处理,例如归一化处理,例如归一化指数函数(或称softmax函数)。
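As a rough illustration of the fusion head described above (connection by concatenation, followed by dilated convolution, depthwise-separable convolution, ordinary convolution and a softmax), a PyTorch sketch is given below. The channel sizes, dilation rate, number of inputs and the assumption that all inputs have already been resampled to a common resolution are illustrative only and do not reproduce the concrete second inter-frame fusion model 802.

```python
# A toy fusion head: concatenate the compressed temporal feature maps with the
# spatial feature map, then apply dilated, depthwise-separable and ordinary
# convolutions, and normalize the per-pixel class scores with a softmax.
import torch
import torch.nn as nn

class ToyFusionHead(nn.Module):
    def __init__(self, in_channels, num_classes=2):
        super().__init__()
        # "1": dilated (atrous) convolution
        self.dilated = nn.Conv2d(in_channels, 32, 3, padding=2, dilation=2)
        # "2": depthwise-separable convolution (depthwise followed by pointwise)
        self.depthwise = nn.Conv2d(32, 32, 3, padding=1, groups=32)
        self.pointwise = nn.Conv2d(32, 32, 1)
        # "3": ordinary convolution producing per-class scores
        self.classifier = nn.Conv2d(32, num_classes, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, feature_maps):
        # connection operation: concatenate all inputs along the channel axis
        x = torch.cat(feature_maps, dim=1)
        x = self.act(self.dilated(x))
        x = self.act(self.pointwise(self.depthwise(x)))
        logits = self.classifier(x)
        # post-processing: normalized per-pixel class probabilities
        return torch.softmax(logits, dim=1)

# usage with three same-resolution inputs of 8, 8 and 16 channels:
head = ToyFusionHead(in_channels=8 + 8 + 16)
out = head([torch.rand(1, 8, 64, 64), torch.rand(1, 8, 64, 64), torch.rand(1, 16, 64, 64)])
```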
后续视频帧的语义分割方法可以参考前述步骤,此处不再赘述。
下面对本申请实施例提供的语义分割方法的性能进行介绍。
如表1所示,在视频帧的多类别语义分割的仿真数据上,本申请提供的语义分割方法的评测指标优于利用单个视频帧进行语义分割的低时延网络(称作单帧低时延网络模型),本申请提供的语义分割方法的评测指标与利用单个视频帧进行语义分割的大型网络(称作单帧大型网络模型)的测试结果相当。从可视化结果来看,本申请实施例提供的语义分割方法还优化了单帧分割的碎片化现象。相对于现有语义分割模型,本申请实施例提供的融合网络模型增加的时延较小,每秒执行的定点乘累加操作次数(giga multiply accumulate per second,Macc)<90M。表1中,IOU为交并比(Intersection over Union)的缩写,IOU是一种测量在特定数据集中检测相应物体准确度的一个标准。
表1
Figure PCTCN2020113206-appb-000001
如表2所示,本申请提供的语义分割方法的评测指标优于利用光流法进行语义分割的网络模型,表2中的FPS为每秒钟画面更新的数量(frame per second)的缩写,GPU为图形处理器(graphics processing unit)的缩写。
表2
方法 时间(GPU)
光流法模型 2FPS
本申请语义分割模型 50FPS
如表3所示为在人像视频帧上进行语义分割的结果对比,本申请提供的语义分割方法的评测指标优于利用视频对象分割(video object segmentation,VOS)进行语义分割的网络模型。
表3
方法 边界IOU
VOS 92.1%
本发明 93.8%
图9是本发明实施例提供的一种芯片硬件结构图。
神经网络处理器970作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。神经网络处理器970的核心部分为运算电路903,通过控制器904控制运算电路903提取存储器中的矩阵数据并进行乘法运算。本申请实施例方法所需的计算能力可以由图9所示的神经网络处理器970或神经网络处理器970和主CPU提供。
在一些实现中,运算电路903内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路903是二维脉动阵列。运算电路903还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路903是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路903从权重存储器902中取矩阵B相应的数据,并缓存在运算电路903中每一个PE上。运算电路903从输入存储器901中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器908 accumulator中。
统一存储器906用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器905 Direct Memory Access Controller,DMAC被搬运到权重存储器902中。输入数据也通过DMAC被搬运到统一存储器906中。
BIU为Bus Interface Unit即,总线接口单元910,用于AXI总线与DMAC和取指存储器909 Instruction Fetch Buffer的交互。
总线接口单元910(Bus Interface Unit,简称BIU),用于取指存储器909从外部存储器获取指令,还用于存储单元访问控制器905从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器906或将权重数据搬运到权重存储器902中或将输入数据数据搬运到输入存储器901中。
向量计算单元907包括多个运算处理单元,在需要的情况下,对运算电路903的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如Pooling(池化),Batch Normalization(批归一化),Local Response Normalization(局部响应归一化)等。
在一些实现种,向量计算单元能907将经处理的输出的向量存储到统一缓存器906。例如,向量计算单元907可以将非线性函数应用到运算电路903的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元907生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路903的激活输入,例如用于在神经网络中的后续层中的使用。
控制器904连接的取指存储器(instruction fetch buffer)909,用于存储控制器904使用的指令;
统一存储器906,输入存储器901,权重存储器902以及取指存储器909均为On-Chip存储器。外部存储器私有于该神经网络处理器硬件架构。
其中,本申请实施例中各神经网络模型中各层的运算可以由向量计算单元907执行。
从功能模块的角度,本申请可以根据上述方法实施例对执行语义分割方法的装置和执行模型训练方法的装置进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个功能模块中。上述集成的功能模块既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
比如,以采用集成的方式划分各个功能单元的情况下,图10示出了一种语义分割装置的结构示意图。如图10所示,本申请语义分割装置1000的一个实施例可以包括:
视频帧获取模块1001,用于获取第一视频帧序列中的第一视频帧和第二视频帧,所述第一视频帧与所述第二视频帧不同;
特征图获取模块1002,用于分别将所述第一视频帧和所述第二视频帧输入图像分割模型,所述图像分割模型用于对输入的图像进行语义分割,所述图像分割模型是一种卷积神经网络模型,所述卷积神经网络模型包括输入层、输出层以及位于所述输入层和所述输出层之间的多层网络层,所述多层网络层中的每一层用于对输入的数据进行特征提取,中间网络层为所述多层网络层中输出的特征图的分辨率最小的一层网络层;所述特征图获取模块1002,还用于获取第一图像分割网络层输出的所述第一视频帧的第一特征图,所述第一图像分割网络层为所述图像分割模型的所述中间网络层或位于所述图像分割模型的所述输入层和所述中间网络层之间的任意一层网络层;所述特征图获取模块1002,还用于获取第二图像分割网络层输出的所述第二视频帧的第二特征图,所述第二图像分割网络层为位于所述图像分割模型的所述中间网络层和所述输出层之间的任意一层网络层;
融合模块1003,用于将所述第一特征图和所述第二特征图输入第一帧间融合模型,生成所述第二视频帧的语义分割图像,所述第一帧间融合模型是一种神经网络模型。
在一种可能的实现方式中,所述融合模块1003用于:将所述第一特征图输入第一邻帧预测模型,所述第一邻帧预测模型用于预测相邻视频帧的信息,所述相邻视频帧与输入所述第一邻帧预测模型的特征图所属的视频帧属于同一视频帧序列,所述第一邻帧预测模型是一种 所述卷积神经网络模型;获取第一邻帧预测网络层输出的所述第一特征图的第一压缩特征图,所述第一邻帧预测网络层为所述第一邻帧预测模型的所述中间网络层或位于所述第一邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;将所述第一压缩特征图和所述第二特征图输入第二帧间融合模型,生成所述第二视频帧的语义分割图像。
在一种可能的实现方式中,所述第一邻帧预测模型为基于带有标注信息的第一样本集训练得到的,第一样本为所述第一样本集中的任意一个样本,所述第一样本为所述第一图像分割网络层输出的第三视频帧的特征图,所述第一样本的标注信息为所述第一图像分割网络层输出的第四视频帧的特征图,所述第三视频帧和所述第四视频帧为同一视频帧序列中的不同视频帧。
在一种可能的实现方式中,所述第一视频帧在所述第二视频帧的第一时序方向,所述第三视频帧在所述第四视频帧的所述第一时序方向。
在一种可能的实现方式中,所述特征图获取模块1002还用于:在将所述第二视频帧输入图像分割模型之后,获取所述第一图像分割网络层输出的所述第二视频帧的第三特征图;所述融合模块还用于:将所述第三特征图输入第二邻帧预测模型,所述第二邻帧预测模型用于预测相邻视频帧的信息,所述相邻视频帧与输入所述第二邻帧预测模型的特征图所属的视频帧属于同一视频帧序列,所述第二邻帧预测模型是一种所述卷积神经网络模型;获取第二邻帧预测网络层输出的所述第三特征图的第二压缩特征图,所述第二邻帧预测网络层为所述第二邻帧预测模型的所述中间网络层或位于所述第二邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;将所述第一压缩特征图、所述第二压缩特征图和所述第二特征图输入所述第二帧间融合模型,生成所述第二视频帧的语义分割图像。
在一种可能的实现方式中,所述第二邻帧预测模型为基于带有标注信息的第二样本集训练得到的,第二样本为所述第二样本集中的任意一个样本,所述第二样本为所述第一图像分割网络层输出的第五视频帧的特征图,所述第二样本的标注信息为所述第一图像分割网络层输出的第六视频帧的特征图,所述第五视频帧和所述第六视频帧为同一视频帧序列中的不同视频帧。
在一种可能的实现方式中,所述第一视频帧在所述第二视频帧的第一时序方向,所述第六视频帧在所述第五视频帧的所述第一时序方向。
在一种可能的实现方式中,所述第二帧间融合模型1003为基于带有标注信息的第三样本集训练得到的,第三样本为所述第三样本集中的任意一个样本,所述第三样本包括所述第一邻帧预测网络层输出的第四特征图的压缩特征图、所述第二邻帧预测网络层输出的第五特征图的压缩特征图和所述第二图像分割网络层输出的第八视频帧的第六特征图,所述第四特征图为所述第一图像分割网络层输出的第七视频帧的特征图,所述第五特征图为所述第一图像分割网络层输出的所述第八视频帧的特征图,所述第七视频帧和所述第八视频帧为同一视频帧序列中的不同视频帧,所述第三样本的标注信息为所述第八视频帧的标注语义分割图像。
在一种可能的实现方式中,所述特征图获取模块1002在将所述第一视频帧输入所述图像分割模型之后,还用于:获取所述第一图像分割网络层输出的所述第一视频帧的第四特征图;所述将所述第一特征图和所述第二特征图输入第一帧间融合模型,生成所述第二视频帧的语义分割图像,包括:将所述第一特征图、所述第二特征图和所述第四特征图输入所述第一帧 间融合模型,生成所述第二视频帧的语义分割图像。
图11示出了一种模型训练装置的结构示意图。如图11所示,本申请模型训练装置1100的一个实施例可以包括:
样本获取模块1101,用于获取同一视频帧序列中的第一帧和第二帧、以及所述第二帧的语义分割图像;
特征图获取模块1102,用于分别将所述第一帧和所述第二帧输入图像分割模型,所述图像分割模型用于对输入的图像进行语义分割,所述图像分割模型是一种卷积神经网络模型,所述卷积神经网络模型包括输入层、输出层以及位于所述输入层和所述输出层之间的多层网络层,所述多层网络层中的每一层用于对输入的数据进行特征提取,中间网络层为所述多层网络层中输出的特征图的分辨率最小的一层网络层;所述特征图获取模块1102,还用于获取第一图像分割网络层输出的所述第一帧的第一特征图,所述第一图像分割网络层为所述图像分割模型的所述中间网络层或位于所述图像分割模型的所述输入层和所述中间网络层之间的任意一层网络层;所述特征图获取模块1102,还用于获取第二图像分割网络层输出的所述第二帧的第二特征图,所述第二图像分割网络层为位于所述图像分割模型的所述中间网络层和所述输出层之间的任意一层网络层;
训练模块1103,用于以所述第二帧的语义分割图像作为标注信息,将所述第一特征图和所述第二特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数,所述第一帧间融合模型是一种神经网络模型。
在一种可能的实现方式中,所述特征图获取模块1102在将所述第二帧输入图像分割模型之后,还用于:获取所述第一图像分割网络层输出的所述第二帧的第三特征图;所述训练模块1103用于:将所述第一特征图、所述第二特征图和所述第三特征图输入第一帧间融合模型,更新所述第一帧间融合模型的参数。
在一种可能的实现方式中,所述训练模块1103用于:以所述第三特征图为标注信息,将所述第一特征图输入第一邻帧预测模型,更新所述第一邻帧预测模型的参数。
在一种可能的实现方式中,所述第一邻帧预测模型是一种所述卷积神经网络模型;所述训练模块1103基于所述第一邻帧预测模型满足第一约束条件,还用于:将所述第一特征图输入所述第一邻帧预测模型;获取第一邻帧预测网络层输出的所述第一特征图的第一压缩特征图,所述第一邻帧预测网络层为所述第一邻帧预测模型的所述中间网络层或位于所述第一邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;以所述第二帧的语义分割图像作为标注信息,将所述第一压缩特征图和所述第二特征图输入第二帧间融合模型,更新所述第二帧间融合模型的参数。
在一种可能的实现方式中,所述所述训练模块1103还用于:以所述第一特征图为标注信息,将所述第三特征图输入第二邻帧预测模型,更新所述第二邻帧预测模型的参数。
在一种可能的实现方式中,所述第二邻帧预测模型是一种所述卷积神经网络模型;所述训练模块1103基于所述第二邻帧预测模型满足第二约束条件,还用于:将所述第三特征图输入所述第二邻帧预测模型;获取第二邻帧预测网络层输出的所述第三特征图的第二压缩特征图,所述第二邻帧预测网络层为所述第二邻帧预测模型的所述中间网络层或位于所述第二邻帧预测模型的所述输入层和所述中间网络层之间的任意一层网络层;以所述第二帧的语义分 割图像作为标注信息,将所述第一压缩特征图、所述第二压缩特征图和所述第二特征图输入所述第二帧间融合模型,更新所述第二帧间融合模型的参数。
图12示出了一种视频通话装置的结构示意图。如图12所示,本申请视频通话装置1200的一个实施例可以包括:
图像采集模块1201,用于采集第一本端视频帧;
通信模块1202,用于接收第二终端设备通过其图像采集模块采集的对端视频帧;
背景切换模块1203,用于根据所述第一本端视频帧生成第二本端视频帧,所述第一本端视频帧的第一图像区域显示第一图像,所述第二本端视频帧的第二图像区域显示第二图像,所述第一图像和所述第二图像不同,所述第一本端视频帧的第三图像区域和所述第二本端视频帧的第四图像区域均显示第三图像;
显示模块1204,用于所述第一终端设备通过其显示屏同时显示所述对端视频帧和所述第二本端视频帧。
在一种可能的实现方式中,所述背景切换模块1203根据所述第一本端视频帧生成第二本端视频帧之后,所述通信模块1202还用于将所述第二本端视频帧发送给所述第二终端设备。
在一种可能的实现方式中,所述背景切换模块1203用于:根据用户的切换指令,根据所述第一本端视频帧生成第二本端视频帧,所述切换指令用于指示所述第一终端设备将所述第一本端视频帧中的所述第一图像切换为所述第二图像。
在一种可能的实现方式中,所述第一图像与所述第三图像对应的语义类型不同。
在一种可能的实现方式中,所述背景切换模块1203用于,根据本申请实施例提供的语义分割方法任一实施例方法,根据所述第一本端视频帧和第三本端视频帧生成所述第二本端视频帧的语义分割图像,所述第三本端视频帧与所述第一本端视频帧为所述第一终端设备采集的同一视频帧序列中的不同视频帧;根据所述语义分割图像和所述第一本端视频帧生成第二本端视频帧。
The apparatus embodiments corresponding to FIG. 10 to FIG. 12 can be understood with reference to the relevant parts of the foregoing method embodiments, and details are not repeated here.
Each of the above modules may be an application-specific integrated circuit (ASIC), a processor and a memory that execute one or more software or firmware programs, an integrated logic circuit, and/or another device capable of providing the above functions. FIG. 13 is a schematic diagram of the hardware structure of a computer device 1300. In a simple embodiment, those skilled in the art will appreciate that the semantic segmentation apparatus 1000, the model training apparatus 1100, and the video call apparatus 1200 may take the form shown in FIG. 13.
The computer device 1300 includes at least one processor 1301 and a memory 1302.
The processor 1301 may be a central processing unit (CPU), a network processor (NP), a combination of a CPU and an NP, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and it can implement or execute the methods, steps, and logical block diagrams disclosed in this application. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in this application may be performed directly by a hardware decoding processor, or by a combination of the hardware and software modules in a decoding processor. The software module may reside in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware. Although only one processor is shown in the figure, the apparatus may include multiple processors, or a processor may include multiple processing units. Specifically, the processor may be a single-core processor, or a multi-core or many-core processor, and may be an ARM-architecture processor.
The memory 1302 is configured to store the computer instructions executed by the processor. The memory 1302 may be a storage circuit or a memory. The memory 1302 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. The memory 1302 may be independent of the processor 1301; in one possible implementation, the processor 1301 and the memory 1302 may be connected to each other through a bus 1303. The bus 1303 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. Alternatively, the memory 1302 may be a storage unit in the processor 1301 that is directly attached to the processor 1301, which is not limited here. Although only one memory 1302 is shown in the figure, the apparatus may include multiple memories 1302, or the memory 1302 may include multiple storage units.
The memory 1302 is configured to store computer-executable instructions for carrying out the solutions of this application, and their execution is controlled by the processor 1301. The processor 1301 executes the computer-executable instructions stored in the memory 1302 to implement the semantic segmentation method and the model training method provided by the above method embodiments of this application.
In a possible implementation, the computer-executable instructions in the embodiments of this application may also be referred to as application program code, which is not specifically limited in the embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer-executable instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way objects with the same attributes are distinguished when describing the embodiments of this application. In addition, the terms "include" and "have" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device. In the embodiments of this application, "a plurality of" means two or more.
In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as preferred over or more advantageous than other embodiments or designs. Rather, such words are intended to present the relevant concepts in a concrete manner.
In the embodiments of this application, various examples are given for ease of understanding. However, these examples are merely examples and are not meant to be the best way of implementing this application.
The technical solutions provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the above description of the embodiments is only intended to help understand the method of this application and its core idea. At the same time, a person of ordinary skill in the art may make changes to the specific implementations and the application scope according to the idea of this application. In summary, the content of this specification should not be construed as a limitation on this application.

Claims (33)

  1. A semantic segmentation method, characterized by comprising:
    obtaining a first video frame and a second video frame in a first video frame sequence, the first video frame being different from the second video frame;
    inputting the first video frame and the second video frame separately into an image segmentation model, wherein the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model, the convolutional neural network model comprises an input layer, an output layer, and a plurality of network layers located between the input layer and the output layer, each of the plurality of network layers is used to perform feature extraction on input data, and an intermediate network layer is the network layer, among the plurality of network layers, whose output feature map has the smallest resolution;
    obtaining a first feature map of the first video frame output by a first image segmentation network layer, wherein the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model;
    obtaining a second feature map of the second video frame output by a second image segmentation network layer, wherein the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model; and
    inputting the first feature map and the second feature map into a first inter-frame fusion model to generate a semantic segmentation image of the second video frame, wherein the first inter-frame fusion model is a neural network model.
  2. The method according to claim 1, characterized in that the inputting the first feature map and the second feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame comprises:
    inputting the first feature map into a first neighbor-frame prediction model, wherein the first neighbor-frame prediction model is used to predict information of an adjacent video frame, the adjacent video frame and the video frame to which the feature map input into the first neighbor-frame prediction model belongs are in a same video frame sequence, and the first neighbor-frame prediction model is a convolutional neural network model as described above;
    obtaining a first compressed feature map of the first feature map output by a first neighbor-frame prediction network layer, wherein the first neighbor-frame prediction network layer is the intermediate network layer of the first neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighbor-frame prediction model; and
    inputting the first compressed feature map and the second feature map into a second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  3. The method according to claim 2, characterized in that the first neighbor-frame prediction model is obtained by training on a first sample set carrying annotation information, a first sample is any sample in the first sample set, the first sample is a feature map of a third video frame output by the first image segmentation network layer, the annotation information of the first sample is a feature map of a fourth video frame output by the first image segmentation network layer, and the third video frame and the fourth video frame are different video frames in a same video frame sequence.
  4. The method according to claim 3, characterized in that the first video frame is in a first temporal direction of the second video frame, and the third video frame is in the first temporal direction of the fourth video frame.
  5. The method according to any one of claims 2 to 4, characterized in that, after the second video frame is input into the image segmentation model, the method further comprises:
    obtaining a third feature map of the second video frame output by the first image segmentation network layer;
    wherein the inputting the first compressed feature map and the second feature map into the second inter-frame fusion model to generate the semantic segmentation image of the second video frame comprises:
    inputting the third feature map into a second neighbor-frame prediction model, wherein the second neighbor-frame prediction model is used to predict information of an adjacent video frame, the adjacent video frame and the video frame to which the feature map input into the second neighbor-frame prediction model belongs are in a same video frame sequence, and the second neighbor-frame prediction model is a convolutional neural network model as described above;
    obtaining a second compressed feature map of the third feature map output by a second neighbor-frame prediction network layer, wherein the second neighbor-frame prediction network layer is the intermediate network layer of the second neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighbor-frame prediction model; and
    inputting the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  6. The method according to claim 5, characterized in that the second neighbor-frame prediction model is obtained by training on a second sample set carrying annotation information, a second sample is any sample in the second sample set, the second sample is a feature map of a fifth video frame output by the first image segmentation network layer, the annotation information of the second sample is a feature map of a sixth video frame output by the first image segmentation network layer, and the fifth video frame and the sixth video frame are different video frames in a same video frame sequence.
  7. The method according to claim 6, characterized in that the first video frame is in a first temporal direction of the second video frame, and the sixth video frame is in the first temporal direction of the fifth video frame.
  8. The method according to any one of claims 5 to 7, characterized in that the second inter-frame fusion model is obtained by training on a third sample set carrying annotation information, a third sample is any sample in the third sample set, the third sample comprises a compressed feature map of a fourth feature map output by the first neighbor-frame prediction network layer, a compressed feature map of a fifth feature map output by the second neighbor-frame prediction network layer, and a sixth feature map of an eighth video frame output by the second image segmentation network layer, the fourth feature map is a feature map of a seventh video frame output by the first image segmentation network layer, the fifth feature map is a feature map of the eighth video frame output by the first image segmentation network layer, the seventh video frame and the eighth video frame are different video frames in a same video frame sequence, and the annotation information of the third sample is an annotated semantic segmentation image of the eighth video frame.
  9. The method according to any one of claims 1 to 8, characterized in that, after the first video frame is input into the image segmentation model, the method further comprises:
    obtaining a fourth feature map of the first video frame output by the first image segmentation network layer;
    wherein the inputting the first feature map and the second feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame comprises:
    inputting the first feature map, the second feature map, and the fourth feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  10. A model training method, characterized by comprising:
    obtaining a first frame and a second frame in a same video frame sequence, as well as a semantic segmentation image of the second frame;
    inputting the first frame and the second frame separately into an image segmentation model, wherein the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model, the convolutional neural network model comprises an input layer, an output layer, and a plurality of network layers located between the input layer and the output layer, each of the plurality of network layers is used to perform feature extraction on input data, and an intermediate network layer is the network layer, among the plurality of network layers, whose output feature map has the smallest resolution;
    obtaining a first feature map of the first frame output by a first image segmentation network layer, wherein the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model;
    obtaining a second feature map of the second frame output by a second image segmentation network layer, wherein the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model; and
    using the semantic segmentation image of the second frame as annotation information, inputting the first feature map and the second feature map into a first inter-frame fusion model, and updating parameters of the first inter-frame fusion model, wherein the first inter-frame fusion model is a neural network model.
  11. The method according to claim 10, characterized in that, after the second frame is input into the image segmentation model, the method further comprises:
    obtaining a third feature map of the second frame output by the first image segmentation network layer;
    wherein the inputting the first feature map and the second feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model comprises:
    inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model.
  12. The method according to claim 11, characterized in that the inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model comprises:
    using the third feature map as annotation information, inputting the first feature map into a first neighbor-frame prediction model, and updating parameters of the first neighbor-frame prediction model.
  13. The method according to claim 12, characterized in that the first neighbor-frame prediction model is a convolutional neural network model as described above;
    on the basis that the first neighbor-frame prediction model satisfies a first constraint condition, the inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model further comprises:
    inputting the first feature map into the first neighbor-frame prediction model;
    obtaining a first compressed feature map of the first feature map output by a first neighbor-frame prediction network layer, wherein the first neighbor-frame prediction network layer is the intermediate network layer of the first neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighbor-frame prediction model; and
    using the semantic segmentation image of the second frame as annotation information, inputting the first compressed feature map and the second feature map into a second inter-frame fusion model, and updating parameters of the second inter-frame fusion model.
  14. The method according to claim 13, characterized in that the inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model further comprises:
    using the first feature map as annotation information, inputting the third feature map into a second neighbor-frame prediction model, and updating parameters of the second neighbor-frame prediction model.
  15. The method according to claim 14, characterized in that the second neighbor-frame prediction model is a convolutional neural network model as described above;
    on the basis that the second neighbor-frame prediction model satisfies a second constraint condition, the inputting the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and updating the parameters of the first inter-frame fusion model further comprises:
    inputting the third feature map into the second neighbor-frame prediction model;
    obtaining a second compressed feature map of the third feature map output by a second neighbor-frame prediction network layer, wherein the second neighbor-frame prediction network layer is the intermediate network layer of the second neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighbor-frame prediction model; and
    using the semantic segmentation image of the second frame as annotation information, inputting the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model, and updating the parameters of the second inter-frame fusion model.
  16. A semantic segmentation apparatus, characterized by comprising:
    a video frame acquisition module, configured to obtain a first video frame and a second video frame in a first video frame sequence, the first video frame being different from the second video frame;
    a feature map acquisition module, configured to input the first video frame and the second video frame separately into an image segmentation model, wherein the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model, the convolutional neural network model comprises an input layer, an output layer, and a plurality of network layers located between the input layer and the output layer, each of the plurality of network layers is used to perform feature extraction on input data, and an intermediate network layer is the network layer, among the plurality of network layers, whose output feature map has the smallest resolution;
    the feature map acquisition module being further configured to obtain a first feature map of the first video frame output by a first image segmentation network layer, wherein the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model;
    the feature map acquisition module being further configured to obtain a second feature map of the second video frame output by a second image segmentation network layer, wherein the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model; and
    a fusion module, configured to input the first feature map and the second feature map into a first inter-frame fusion model to generate a semantic segmentation image of the second video frame, wherein the first inter-frame fusion model is a neural network model.
  17. The apparatus according to claim 16, characterized in that the fusion module is configured to:
    input the first feature map into a first neighbor-frame prediction model, wherein the first neighbor-frame prediction model is used to predict information of an adjacent video frame, the adjacent video frame and the video frame to which the feature map input into the first neighbor-frame prediction model belongs are in a same video frame sequence, and the first neighbor-frame prediction model is a convolutional neural network model as described above;
    obtain a first compressed feature map of the first feature map output by a first neighbor-frame prediction network layer, wherein the first neighbor-frame prediction network layer is the intermediate network layer of the first neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighbor-frame prediction model; and
    input the first compressed feature map and the second feature map into a second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  18. The apparatus according to claim 17, characterized in that the first neighbor-frame prediction model is obtained by training on a first sample set carrying annotation information, a first sample is any sample in the first sample set, the first sample is a feature map of a third video frame output by the first image segmentation network layer, the annotation information of the first sample is a feature map of a fourth video frame output by the first image segmentation network layer, and the third video frame and the fourth video frame are different video frames in a same video frame sequence.
  19. The apparatus according to claim 18, characterized in that the first video frame is in a first temporal direction of the second video frame, and the third video frame is in the first temporal direction of the fourth video frame.
  20. The apparatus according to any one of claims 17 to 19, characterized in that the feature map acquisition module is further configured to:
    obtain, after the second video frame is input into the image segmentation model, a third feature map of the second video frame output by the first image segmentation network layer;
    and the fusion module is further configured to:
    input the third feature map into a second neighbor-frame prediction model, wherein the second neighbor-frame prediction model is used to predict information of an adjacent video frame, the adjacent video frame and the video frame to which the feature map input into the second neighbor-frame prediction model belongs are in a same video frame sequence, and the second neighbor-frame prediction model is a convolutional neural network model as described above;
    obtain a second compressed feature map of the third feature map output by a second neighbor-frame prediction network layer, wherein the second neighbor-frame prediction network layer is the intermediate network layer of the second neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighbor-frame prediction model; and
    input the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  21. The apparatus according to claim 20, characterized in that the second neighbor-frame prediction model is obtained by training on a second sample set carrying annotation information, a second sample is any sample in the second sample set, the second sample is a feature map of a fifth video frame output by the first image segmentation network layer, the annotation information of the second sample is a feature map of a sixth video frame output by the first image segmentation network layer, and the fifth video frame and the sixth video frame are different video frames in a same video frame sequence.
  22. The apparatus according to claim 21, characterized in that the first video frame is in a first temporal direction of the second video frame, and the sixth video frame is in the first temporal direction of the fifth video frame.
  23. The apparatus according to any one of claims 20 to 22, characterized in that the second inter-frame fusion model is obtained by training on a third sample set carrying annotation information, a third sample is any sample in the third sample set, the third sample comprises a compressed feature map of a fourth feature map output by the first neighbor-frame prediction network layer, a compressed feature map of a fifth feature map output by the second neighbor-frame prediction network layer, and a sixth feature map of an eighth video frame output by the second image segmentation network layer, the fourth feature map is a feature map of a seventh video frame output by the first image segmentation network layer, the fifth feature map is a feature map of the eighth video frame output by the first image segmentation network layer, the seventh video frame and the eighth video frame are different video frames in a same video frame sequence, and the annotation information of the third sample is an annotated semantic segmentation image of the eighth video frame.
  24. The apparatus according to any one of claims 16 to 23, characterized in that, after the first video frame is input into the image segmentation model, the feature map acquisition module is further configured to:
    obtain a fourth feature map of the first video frame output by the first image segmentation network layer;
    and the inputting the first feature map and the second feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame comprises:
    inputting the first feature map, the second feature map, and the fourth feature map into the first inter-frame fusion model to generate the semantic segmentation image of the second video frame.
  25. A model training apparatus, characterized by comprising:
    a sample acquisition module, configured to obtain a first frame and a second frame in a same video frame sequence, as well as a semantic segmentation image of the second frame;
    a feature map acquisition module, configured to input the first frame and the second frame separately into an image segmentation model, wherein the image segmentation model is used to perform semantic segmentation on an input image and is a convolutional neural network model, the convolutional neural network model comprises an input layer, an output layer, and a plurality of network layers located between the input layer and the output layer, each of the plurality of network layers is used to perform feature extraction on input data, and an intermediate network layer is the network layer, among the plurality of network layers, whose output feature map has the smallest resolution;
    the feature map acquisition module being further configured to obtain a first feature map of the first frame output by a first image segmentation network layer, wherein the first image segmentation network layer is the intermediate network layer of the image segmentation model or any network layer located between the input layer and the intermediate network layer of the image segmentation model;
    the feature map acquisition module being further configured to obtain a second feature map of the second frame output by a second image segmentation network layer, wherein the second image segmentation network layer is any network layer located between the intermediate network layer and the output layer of the image segmentation model; and
    a training module, configured to use the semantic segmentation image of the second frame as annotation information, input the first feature map and the second feature map into a first inter-frame fusion model, and update parameters of the first inter-frame fusion model, wherein the first inter-frame fusion model is a neural network model.
  26. The apparatus according to claim 25, characterized in that, after the second frame is input into the image segmentation model, the feature map acquisition module is further configured to:
    obtain a third feature map of the second frame output by the first image segmentation network layer;
    and the training module is configured to:
    input the first feature map, the second feature map, and the third feature map into the first inter-frame fusion model and update the parameters of the first inter-frame fusion model.
  27. The apparatus according to claim 26, characterized in that the training module is configured to:
    use the third feature map as annotation information, input the first feature map into a first neighbor-frame prediction model, and update parameters of the first neighbor-frame prediction model.
  28. The apparatus according to claim 27, characterized in that the first neighbor-frame prediction model is a convolutional neural network model as described above;
    on the basis that the first neighbor-frame prediction model satisfies a first constraint condition, the training module is further configured to:
    input the first feature map into the first neighbor-frame prediction model;
    obtain a first compressed feature map of the first feature map output by a first neighbor-frame prediction network layer, wherein the first neighbor-frame prediction network layer is the intermediate network layer of the first neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the first neighbor-frame prediction model; and
    use the semantic segmentation image of the second frame as annotation information, input the first compressed feature map and the second feature map into a second inter-frame fusion model, and update parameters of the second inter-frame fusion model.
  29. The apparatus according to claim 28, characterized in that the training module is further configured to:
    use the first feature map as annotation information, input the third feature map into a second neighbor-frame prediction model, and update parameters of the second neighbor-frame prediction model.
  30. The apparatus according to claim 29, characterized in that the second neighbor-frame prediction model is a convolutional neural network model as described above;
    on the basis that the second neighbor-frame prediction model satisfies a second constraint condition, the training module is further configured to:
    input the third feature map into the second neighbor-frame prediction model;
    obtain a second compressed feature map of the third feature map output by a second neighbor-frame prediction network layer, wherein the second neighbor-frame prediction network layer is the intermediate network layer of the second neighbor-frame prediction model or any network layer located between the input layer and the intermediate network layer of the second neighbor-frame prediction model; and
    use the semantic segmentation image of the second frame as annotation information, input the first compressed feature map, the second compressed feature map, and the second feature map into the second inter-frame fusion model, and update the parameters of the second inter-frame fusion model.
  31. A computer device, characterized by comprising a processor and a memory, wherein the processor, when executing the computer instructions stored in the memory, performs the method according to any one of claims 1 to 15.
  32. A computer-readable storage medium, characterized by comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 15.
  33. A computer program product, characterized by comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 15.
PCT/CN2020/113206 2019-11-26 2020-09-03 Semantic segmentation method, model training method and apparatus WO2021103731A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911177265.8 2019-11-26
CN201911177265.8A CN112862828B (zh) 2019-11-26 2019-11-26 一种语义分割方法、模型训练方法及装置

Publications (1)

Publication Number Publication Date
WO2021103731A1 true WO2021103731A1 (zh) 2021-06-03

Family

ID=75985054

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113206 WO2021103731A1 (zh) 2019-11-26 2020-09-03 一种语义分割方法、模型训练方法及装置

Country Status (2)

Country Link
CN (1) CN112862828B (zh)
WO (1) WO2021103731A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2622238A (en) * 2022-09-07 2024-03-13 Samsung Electronics Co Ltd A method and device for personalised image segmentation and processing

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113570610B (zh) * 2021-07-26 2022-05-13 北京百度网讯科技有限公司 采用语义分割模型对视频进行目标分割的方法、装置
CN113554640A (zh) * 2021-07-30 2021-10-26 四川大学华西医院 Ai模型的训练方法、使用方法、计算机设备及存储介质
CN113822147B (zh) * 2021-08-04 2023-12-15 北京交通大学 一种协同机器语义任务的深度压缩方法
CN114972422B (zh) * 2022-05-07 2024-06-07 安徽工业大学科技园有限公司 图像序列运动遮挡检测方法、装置、存储器和处理器
WO2024077463A1 (en) * 2022-10-11 2024-04-18 Intel Corporation Sequential modeling with memory including multi-range arrays

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114071A1 (en) * 2016-10-21 2018-04-26 Nokia Technologies Oy Method for analysing media content
CN108229336A (zh) * 2017-12-13 2018-06-29 北京市商汤科技开发有限公司 视频识别及训练方法和装置、电子设备、程序和介质
CN109377494A (zh) * 2018-09-14 2019-02-22 阿里巴巴集团控股有限公司 一种用于图像的语义分割方法和装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784654B (zh) * 2016-08-26 2020-09-25 杭州海康威视数字技术股份有限公司 图像分割方法、装置及全卷积网络系统
CN108875900B (zh) * 2017-11-02 2022-05-24 北京旷视科技有限公司 视频图像处理方法和装置、神经网络训练方法、存储介质
US10318842B1 (en) * 2018-09-05 2019-06-11 StradVision, Inc. Learning method, learning device for optimizing parameters of CNN by using multiple video frames and testing method, testing device using the same
CN110009598B (zh) * 2018-11-26 2023-09-05 腾讯科技(深圳)有限公司 用于图像分割的方法和图像分割设备


Also Published As

Publication number Publication date
CN112862828B (zh) 2022-11-18
CN112862828A (zh) 2021-05-28

Similar Documents

Publication Publication Date Title
WO2021103731A1 (zh) 一种语义分割方法、模型训练方法及装置
WO2022083536A1 (zh) 一种神经网络构建方法以及装置
WO2019228358A1 (zh) 深度神经网络的训练方法和装置
WO2021043168A1 (zh) 行人再识别网络的训练方法、行人再识别方法和装置
WO2021120719A1 (zh) 神经网络模型更新方法、图像处理方法及装置
WO2021018163A1 (zh) 神经网络的搜索方法及装置
WO2021227726A1 (zh) 面部检测、图像检测神经网络训练方法、装置和设备
WO2021238366A1 (zh) 一种神经网络构建方法以及装置
WO2021057056A1 (zh) 神经网络架构搜索方法、图像处理方法、装置和存储介质
WO2020192736A1 (zh) 物体识别方法及装置
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
CN112990211B (zh) 一种神经网络的训练方法、图像处理方法以及装置
WO2021063341A1 (zh) 图像增强方法以及装置
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
WO2022179581A1 (zh) 一种图像处理方法及相关设备
CN110222718B (zh) 图像处理的方法及装置
CN111832592B (zh) Rgbd显著性检测方法以及相关装置
WO2021018245A1 (zh) 图像分类方法及装置
CN111402130A (zh) 数据处理方法和数据处理装置
WO2021227787A1 (zh) 训练神经网络预测器的方法、图像处理方法及装置
CN111368672A (zh) 一种用于遗传病面部识别模型的构建方法及装置
WO2021047587A1 (zh) 手势识别方法、电子设备、计算机可读存储介质和芯片
WO2023174098A1 (zh) 一种实时手势检测方法及装置
WO2023083030A1 (zh) 一种姿态识别方法及其相关设备
CN113066018A (zh) 一种图像增强方法及相关装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20891810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20891810

Country of ref document: EP

Kind code of ref document: A1