WO2023051343A1 - Video semantic segmentation method and apparatus, electronic device, storage medium, and computer program product - Google Patents

Video semantic segmentation method and apparatus, electronic device, storage medium, and computer program product

Info

Publication number
WO2023051343A1
WO2023051343A1 (PCT application No. PCT/CN2022/120176)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
data
feature point
point
similarity
Prior art date
Application number
PCT/CN2022/120176
Other languages
English (en)
French (fr)
Inventor
李江彤
牛力
四建楼
钱晨
张丽清
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023051343A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/08 - Learning methods

Definitions

  • This application is filed based on, and claims priority to, Chinese patent application No. 202111165458.9, filed on September 30, 2021 and entitled "Video Semantic Segmentation Method, Apparatus, Electronic Device, and Storage Medium"; the entire content of that Chinese patent application is hereby incorporated into the present disclosure by reference.
  • The present disclosure relates to the field of deep learning, and in particular to a video semantic segmentation method and apparatus, an electronic device, a storage medium, and a computer program product.
  • Video semantic segmentation aims to assign a semantic label to each pixel in a video frame, so as to segment the video frame according to semantics. For example, different semantic objects such as pedestrians, bicycles, and animals in the video frame can be segmented to obtain a semantic segmentation result.
  • In general, when performing semantic segmentation on video data, each video frame in the video data can be semantically segmented to determine the semantic segmentation result of each video frame; the semantic segmentation results of the video frames are then aligned, that is, the same object in different video frames of the video data is associated, to obtain the semantic segmentation result corresponding to the video data.
  • However, the above process of semantically segmenting video data is relatively cumbersome, which makes the efficiency of semantic segmentation low.
  • In view of this, the embodiments of the present disclosure provide at least a video semantic segmentation method and apparatus, an electronic device, a storage medium, and a computer program product.
  • an embodiment of the present disclosure provides a video semantic segmentation method, including:
  • acquiring first feature data corresponding to a video frame to be detected in video data, and historical feature data corresponding to historical video frames whose acquisition time in the video data is before the video frame to be detected; determining, from a plurality of feature points corresponding to the first feature data, a first feature point that matches a position point of a complex image region in the video frame to be detected, where the complex image region is a region including at least some pixels of a plurality of target objects with different semantics; generating, based on the historical feature data and the feature data of the first feature point, feature data of a semantically enhanced feature point corresponding to the first feature point; and determining, based on the feature data of the enhanced feature point and the feature data of the feature points other than the first feature point among the plurality of feature points corresponding to the first feature data, target semantic information corresponding to each pixel in the video frame to be detected.
  • an apparatus for video semantic segmentation including:
  • an obtaining module, configured to obtain first feature data corresponding to a video frame to be detected in video data, and historical feature data corresponding to historical video frames whose acquisition time in the video data is before the video frame to be detected;
  • a first determining module, configured to determine, from a plurality of feature points corresponding to the first feature data, a first feature point that matches a position point of a complex image region in the video frame to be detected; where the complex image region is a region including at least some pixels of multiple target objects with different semantics;
  • a processing module configured to generate, based on the historical feature data and the feature data of the first feature point, feature data of semantically enhanced feature points corresponding to the first feature point;
  • a second determining module, configured to determine, based on the feature data of the enhanced feature point and the feature data of the feature points other than the first feature point among the plurality of feature points corresponding to the first feature data, target semantic information corresponding to each pixel in the video frame to be detected.
  • an embodiment of the present disclosure provides an electronic device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the above video semantic segmentation method are executed.
  • an embodiment of the present disclosure provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the steps of the above video semantic segmentation method are executed.
  • an embodiment of the present disclosure provides a computer program product, including a computer-readable storage medium storing program code; when the instructions included in the program code are executed by a processor of a computer device, the steps of the above video semantic segmentation method are implemented.
  • In the above method, after the first feature data of the video frame to be detected and the historical feature data of the historical video frames are acquired, the feature data of the semantically enhanced feature point corresponding to the first feature point is generated based on the feature data of the first feature point in the first feature data and the historical feature data, so that the feature data of the enhanced feature point includes both the feature information of the video frame to be detected and the feature information of the historical video frames; the target semantic information corresponding to each pixel in the video frame to be detected is then determined based on the feature data of the enhanced feature point and the feature data of the other feature points. By using the historical feature data corresponding to the historical video frames in the video data, semantic segmentation across different video frames of the video data is realized, which improves the efficiency of semantic segmentation.
  • Meanwhile, the first feature point determined from among the plurality of feature points corresponding to the first feature data is a feature point that matches a position point of the complex image region. Since the complex image region includes multiple target objects with different semantics, it is relatively difficult to determine the semantic information of the position point corresponding to the first feature point; therefore, the semantics of the first feature point can be strengthened based on the historical feature data and the feature data of the first feature point to generate the feature data of the semantically enhanced feature point corresponding to the first feature point.
  • Subsequently, based on the feature data of the enhanced feature point and the feature data of the other feature points, the target semantic information of each pixel in the video frame to be detected can be determined more accurately, which improves the accuracy of semantic segmentation of the video frame to be detected.
  • FIG. 1 shows a schematic flow diagram of a video semantic segmentation method provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic flowchart of a method for determining a first feature point in a video semantic segmentation method provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of first feature data in a video semantic segmentation method provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of first feature data and historical feature data in a video semantic segmentation method provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic structural diagram of a semantic segmentation neural network in a video semantic segmentation method provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic flowchart of another video semantic segmentation method provided by an embodiment of the present disclosure
  • FIG. 7 shows a schematic diagram of a video semantic segmentation device provided by an embodiment of the present disclosure
  • FIG. 8 shows a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • In general, each video frame in the video data can be semantically segmented to determine the semantic segmentation result of each video frame; the semantic segmentation results of the video frames are then aligned, that is, the same object in different video frames of the video data is associated, to obtain the semantic segmentation result corresponding to the video data and realize temporally consistent semantic segmentation of the different video frames in the video data.
  • However, the above process of correlating the semantic segmentation results of the video frames to obtain the semantic segmentation result of the video data is relatively cumbersome, and the efficiency of semantic segmentation is low.
  • In some schemes, the semantic segmentation results of the video frames in the video data can also be determined by estimating motion warping, such as optical flow, between different video frames; that is, the semantic segmentation result of one video frame is propagated according to the estimated optical flow to obtain the semantic segmentation results of other video frames.
  • However, the structure of the optical flow neural network tends to be complicated, which reduces its inference efficiency and in turn results in low efficiency of semantic segmentation of the video data.
  • In addition, multiple neural networks are used to perform semantic segmentation on the video frames in the video data, which makes the semantic segmentation process of the video data more cumbersome.
  • An embodiment of the present disclosure provides a video semantic segmentation method. After the first feature data of the video frame to be detected and the historical feature data of the historical video frames are acquired, the feature data of the semantically enhanced feature point corresponding to the first feature point is generated based on the feature data of the first feature point in the first feature data and the historical feature data, so that the feature data of the enhanced feature point includes both the feature information of the video frame to be detected and the feature information of the historical video frames; the target semantic information corresponding to each pixel in the video frame to be detected is then determined based on the feature data of the enhanced feature point and the feature data of the other feature points. By using the historical feature data corresponding to the historical video frames in the video data, temporally consistent semantic segmentation across different video frames of the video data is achieved, which improves the efficiency of semantic segmentation.
  • Meanwhile, the first feature point determined from among the plurality of feature points corresponding to the first feature data is a feature point that matches a position point of the complex image region. Since the complex image region includes multiple target objects with different semantics, it is relatively difficult to determine the semantic information of the position point corresponding to the first feature point; therefore, the semantics of the first feature point can be strengthened based on the historical feature data and the feature data of the first feature point to generate the feature data of the semantically enhanced feature point corresponding to the first feature point.
  • Subsequently, based on the feature data of the enhanced feature point and the feature data of the other feature points, the target semantic information of each pixel in the video frame to be detected can be determined more accurately, which improves the accuracy of semantic segmentation of the video frame to be detected.
  • The execution subject of the video semantic segmentation method provided by the embodiments of the present disclosure may be a terminal device or a server, where the server may be, for example, a local server or a cloud server, and the terminal device may be, for example, a mobile device, a personal digital assistant (PDA), a computing device, an in-vehicle device, a wearable device, or the like.
  • the video semantic segmentation method may be implemented in a manner in which a processor invokes computer-readable instructions stored in a memory.
  • As shown in FIG. 1, it is a schematic flowchart of a video semantic segmentation method provided by an embodiment of the present disclosure.
  • The method includes S101 to S104, wherein:
  • S101: acquire first feature data corresponding to a video frame to be detected in video data, and historical feature data corresponding to historical video frames whose acquisition time in the video data is before the video frame to be detected.
  • S102: determine, from a plurality of feature points corresponding to the first feature data, a first feature point that matches a position point of a complex image region in the video frame to be detected, where the complex image region is a region including at least some pixels of multiple target objects with different semantics.
  • S103: generate, based on the historical feature data and the feature data of the first feature point, feature data of a semantically enhanced feature point corresponding to the first feature point.
  • S104: determine, based on the feature data of the enhanced feature point and the feature data of the feature points other than the first feature point among the plurality of feature points corresponding to the first feature data, target semantic information corresponding to each pixel in the video frame to be detected.
  • the historical video frame is a video frame whose acquisition time is before the video frame to be detected in the video data, and the number of historical video frames can be one or more frames.
  • For example, the target frame number corresponding to the video frame to be detected can be T+1 and the number of historical video frames can be T; that is, the first historical video frame (with target frame number 1), the second historical video frame (with target frame number 2), and so on up to the T-th historical video frame (with target frame number T) can be obtained.
  • the value of T can be set as required.
  • In some embodiments, the video frame to be detected and at least one historical video frame may be obtained, and feature extraction is then performed on the video frame to be detected to obtain the first feature data corresponding to the video frame to be detected. Since feature extraction was already performed on each historical video frame when it was itself processed as a video frame to be detected, the historical feature data corresponding to the historical video frame already exists and can be obtained directly, without performing feature extraction on the historical video frame again, thereby avoiding the waste of resources caused by repeatedly extracting the feature data of historical video frames.
  • historical feature data corresponding to each historical video frame may be acquired.
  • For example, if the size of the first feature data is 56 × 56 × 128 (128 is the number of channels, and 56 × 56 is the corresponding length and width), then there are 56 × 56 corresponding feature points.
  • For each feature point, a feature value at the matching position is obtained from each channel to form the feature vector corresponding to that feature point.
  • For example, the feature values located at the first row and first column are obtained from each channel to form the feature vector corresponding to the feature point at the first row and first column.
  • the first feature point may be determined from the multiple feature points corresponding to the first feature data, and other feature points other than the first feature point among the multiple feature points corresponding to the first feature data may be obtained. Wherein, the first feature point matches the position point on the complex image area in the video frame to be detected.
  • The complex image area is an area that includes at least some pixels of multiple target objects with different semantics. It can be seen that a complex image region may contain multiple semantic objects, or may contain boundaries between objects with different semantics.
  • In some embodiments, determining the first feature point matching the position point of the complex image region in the video frame to be detected from the plurality of feature points corresponding to the first feature data may include: determining the adjacent similarity of each feature point corresponding to the first feature data; and then, based on the adjacent similarity, determining the first feature point matching the position point of the complex image region in the video frame to be detected from the plurality of feature points corresponding to the first feature data.
  • In some embodiments, determining the adjacent similarity of each feature point corresponding to the first feature data may include step S2011 and step S2012, wherein:
  • Step S2011: take each feature point corresponding to the first feature data as the current feature point, and determine the neighborhood feature matrix of the current feature point based on the position data of the current feature point and a preset neighborhood radius; where the neighborhood feature matrix includes the feature vectors of the feature points located in the neighborhood of the current feature point;
  • Step S2012: determine the adjacent similarity corresponding to the current feature point based on the neighborhood feature matrix and the feature vector of the current feature point.
  • the neighborhood radius may be determined based on multiple experiments.
  • the neighborhood radius r can be 1, 2, and so on.
  • In this way, the neighborhood feature matrix of the current feature point is determined. For example, if the position data of the current feature point in the first feature data is (u, v) and the neighborhood radius is r, then the neighborhood corresponding to the current feature point is Q[u-r:u+r][v-r:v+r]; based on the feature vectors of the feature points of the first feature data located in this neighborhood, a neighborhood feature matrix Q_n corresponding to the current feature point can be generated.
  • FIG. 3 shows the current feature point 31 in the first feature data 30.
  • Each feature point in the rectangular frame 32 is a feature point located in the neighborhood of the current feature point, and each feature point corresponds to a feature vector.
  • Since the first feature data includes 128 channels, the feature vector corresponding to each feature point includes 128 element values.
  • A neighborhood feature matrix corresponding to the current feature point 31 can then be generated from the feature vectors of these feature points; as can be seen from FIG. 3, the neighborhood feature matrix is a 9 × 128 matrix.
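  • As a rough illustration of the neighborhood feature matrix described above, the following Python sketch (assuming a NumPy feature map of shape H × W × C and clamping the window at the image border, which the embodiment does not specify) gathers the feature vectors of the (2r+1) × (2r+1) feature points around a current feature point:

```python
import numpy as np

def neighborhood_feature_matrix(features, u, v, r=1):
    """Collect the feature vectors of all feature points in the (2r+1) x (2r+1)
    neighborhood of the current feature point (u, v).

    features: H x W x C feature map (e.g. 56 x 56 x 128 first feature data).
    Returns an (n_b x C) matrix Q_n, e.g. 9 x 128 for r = 1.
    """
    h, w, _ = features.shape
    # Clamp the window at the border (one possible convention; the disclosure
    # does not state how border feature points are handled).
    u0, u1 = max(u - r, 0), min(u + r, h - 1)
    v0, v1 = max(v - r, 0), min(v + r, w - 1)
    window = features[u0:u1 + 1, v0:v1 + 1, :]        # (2r+1) x (2r+1) x C
    return window.reshape(-1, features.shape[-1])     # n_b x C

# Example: 56 x 56 x 128 first feature data, current feature point at (10, 20)
first_feature_data = np.random.rand(56, 56, 128).astype(np.float32)
Q_n = neighborhood_feature_matrix(first_feature_data, u=10, v=20, r=1)
print(Q_n.shape)  # (9, 128)
```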
  • the neighborhood similarity corresponding to the current feature point can be determined by using the neighborhood feature matrix and the feature vector of the current feature point.
  • the adjacent similarity can be used to characterize the feature similarity distribution between the current feature point and multiple surrounding feature points (that is, other feature points in the neighborhood except the current feature point).
  • In this way, the neighborhood feature matrix includes the feature information of the other feature points around the current feature point, so that the adjacent similarity corresponding to the current feature point can be determined more accurately based on the neighborhood feature matrix and the feature vector of the current feature point, providing data support for the subsequent determination of the first feature point.
  • determining the adjacent similarity corresponding to the current feature point may include:
  • Step S212a: determine at least one target similarity corresponding to the current feature point based on the neighborhood feature matrix and the feature vector of the current feature point; where the target similarity includes at least one of the following: a first target similarity, used to characterize the degree of similarity between the uniform distribution and the feature similarity distribution between each feature point in the neighborhood of the current feature point and the current feature point; and a second target similarity, used to characterize the average feature similarity between each feature point in the neighborhood of the current feature point and the current feature point;
  • Step S212b: determine the adjacent similarity of the current feature point based on the at least one target similarity.
  • At least one target similarity corresponding to the current feature point may be determined by using the neighborhood feature matrix and the feature vector of the current feature point.
  • the target similarity may include at least one of the following: a first target similarity; a second target similarity.
  • the first target similarity is used to characterize the degree of similarity between the feature similarity distribution and the uniform distribution
  • The feature similarity distribution is the distribution of the feature similarities between each feature point in the neighborhood of the current feature point and the current feature point. For example, if feature similarity distribution a is [0.1, 0.1, 0.7, 0.1], feature similarity distribution b is [0.2, 0.3, 0.25, 0.25], and the uniform distribution is [0.25, 0.25, 0.25, 0.25], then the similarity between distribution a and the uniform distribution is low and the value of the first target similarity is large, while the similarity between distribution b and the uniform distribution is high and the value of the first target similarity is small.
  • the second target similarity is used to represent the average feature similarity between each feature point in the neighborhood of the current feature point and the current feature point.
  • When the at least one target similarity includes only the first target similarity, the first target similarity can be used as the adjacent similarity of the current feature point; when it includes only the second target similarity, the second target similarity can be used as the adjacent similarity of the current feature point; and when it includes both the first target similarity and the second target similarity, the sum of the first target similarity and the second target similarity can be used as the adjacent similarity of the current feature point.
  • In this way, a neighboring similarity matrix (NSM) matching the first feature data can be generated.
  • the size of the matrix is consistent with the first feature data.
  • the adjacent similarity of the current feature point can be determined more flexibly and accurately.
  • the following describes the process of determining the first target similarity.
  • In some embodiments, determining the first target similarity corresponding to the current feature point may include steps S2121 to S2123, wherein:
  • Step S2121: determine the feature similarity between the feature vector of each feature point in the neighborhood of the current feature point and the feature vector of the current feature point;
  • Step S2122: obtain the similarity distribution vector corresponding to the current feature point based on the feature similarities;
  • Step S2123: determine the first target similarity corresponding to the current feature point based on the similarity distribution vector and the determined uniform distribution vector.
  • a feature similarity between each feature vector included in the neighborhood feature matrix and the feature vector of the current feature point may be determined.
  • the obtained feature similarities are used as element values to form a similarity distribution vector corresponding to the current feature point.
  • In some embodiments, the similarity distribution vector P_sim can be determined according to formula (1), where q is the feature vector of the current feature point, Q_n is the neighborhood feature matrix, P_u is the uniform distribution vector, P_sim is the similarity distribution vector, and n_b is the number of elements included in the similarity distribution vector.
  • In some embodiments, the quotient between the element value of each first element in the similarity distribution vector and the element value of the second element at the matching position in the uniform distribution vector can first be determined; the logarithm of the quotient corresponding to each first element is then multiplied by the element value of the corresponding second element to obtain the product value corresponding to that first element; finally, the product values corresponding to the first elements in the similarity distribution vector are added together to obtain the first target similarity corresponding to the current feature point.
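  • Formulas (1) and (2) are not reproduced in this excerpt. A minimal NumPy sketch of one plausible reading — the similarity distribution obtained as a softmax over the dot products between the current feature vector q and the rows of Q_n (from the previous sketch), and the first target similarity as a KL-style divergence of that distribution from the uniform distribution, so that it is large when the distribution is far from uniform, consistent with the behaviour described above — is given below; the exact formulas may differ in the original disclosure:

```python
import numpy as np

def similarity_distribution(q, Q_n):
    """P_sim: softmax over the dot products between q and each row of Q_n.
    (One plausible form of formula (1); the original formula is not shown here.)"""
    scores = Q_n @ q                      # n_b raw similarities
    scores = scores - scores.max()        # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def first_target_similarity(q, Q_n):
    """Element-wise quotient of P_sim against the uniform vector P_u, log of the
    quotient, weighted and summed -- a KL-style divergence that grows as P_sim
    departs from the uniform distribution (assumed reading of formula (2))."""
    p_sim = similarity_distribution(q, Q_n)
    n_b = p_sim.shape[0]
    p_u = np.full(n_b, 1.0 / n_b)         # uniform distribution vector
    return float(np.sum(p_sim * np.log(p_sim / p_u + 1e-12)))
```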
  • the following describes the process of determining the similarity of the second target.
  • In some embodiments, determining the second target similarity corresponding to the current feature point based on the neighborhood feature matrix and the feature vector of the current feature point may include: determining the cosine of the angle between each feature vector in the neighborhood feature matrix and the feature vector of the current feature point; and determining the second target similarity corresponding to the current feature point based on the cosines of the angles corresponding to the feature vectors in the neighborhood feature matrix.
  • In some embodiments, the second target similarity D_cos can be determined according to formula (3), where n_b is the number of elements included in the similarity distribution vector and also the number of feature vectors included in the neighborhood feature matrix Q_n.
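  • Continuing the same sketch, the second target similarity can be computed as the average cosine of the angle between each row of the neighborhood feature matrix and the current feature vector, and the adjacent similarity taken as the sum of the two target similarities, as one of the combinations described above; this is an illustrative reading of formula (3), not a verbatim reproduction:

```python
import numpy as np

def second_target_similarity(q, Q_n, eps=1e-12):
    """Average cosine of the angle between q and every feature vector in Q_n
    (an illustrative reading of formula (3))."""
    cos = (Q_n @ q) / (np.linalg.norm(Q_n, axis=1) * np.linalg.norm(q) + eps)
    return float(cos.mean())

def adjacent_similarity(first_target_sim, second_target_sim):
    """Adjacent similarity of the current feature point when both target
    similarities are available: their sum, as described above."""
    return first_target_sim + second_target_sim
```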
  • the first feature point may be determined from the plurality of feature points corresponding to the first feature data according to the adjacent similarity corresponding to each feature point.
  • determining a selected number of first feature points from a plurality of feature points corresponding to the first feature data may include the following two methods:
  • Method 1: determine the number of first feature points to be selected based on the number of feature points corresponding to the first feature data and a set selection ratio; then, in descending order of adjacent similarity, determine the selected number of first feature points from the plurality of feature points.
  • Method 2: determine the first feature points from the plurality of feature points corresponding to the first feature data based on the adjacent similarities and a set similarity threshold.
  • In Method 1, the selection ratio can be set as needed, for example, 40% or 50%. If the number of feature points corresponding to the first feature data is 16 × 16 and the selection ratio is 50%, the number of first feature points to be selected is determined to be 128. Then, 128 first feature points may be determined from the plurality of feature points corresponding to the first feature data in descending order of adjacent similarity; that is, a plurality of target position points can be determined from the neighboring similarity matrix (NSM) matching the first feature data in descending order of adjacent similarity, and the feature points in the first feature data matching the target position points are taken as the first feature points.
  • In Method 2, the similarity threshold can be set as required; from the plurality of feature points corresponding to the first feature data, the feature points whose adjacent similarity is greater than or equal to the similarity threshold are selected as the first feature points.
  • In this way, the first feature points can be determined more flexibly.
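  • Given a neighboring similarity matrix (NSM) of the same spatial size as the first feature data, the two selection methods above might be sketched as follows (hypothetical helper, assuming the NSM has already been computed per feature point):

```python
import numpy as np

def select_first_feature_points(nsm, selection_ratio=None, similarity_threshold=None):
    """Return the (row, col) positions of the first feature points.

    nsm: H x W neighboring similarity matrix, one adjacent similarity per feature point.
    Method 1: keep the top `selection_ratio` fraction in descending order of similarity.
    Method 2: keep every feature point whose similarity >= similarity_threshold.
    """
    flat = nsm.ravel()
    if selection_ratio is not None:
        k = int(flat.size * selection_ratio)      # e.g. 16*16 * 50% = 128 points
        idx = np.argsort(flat)[::-1][:k]          # descending adjacent similarity
    elif similarity_threshold is not None:
        idx = np.flatnonzero(flat >= similarity_threshold)
    else:
        raise ValueError("set either selection_ratio or similarity_threshold")
    rows, cols = np.unravel_index(idx, nsm.shape)
    return list(zip(rows.tolist(), cols.tolist()))

# Method 1 example from the text: 16 x 16 feature points, 50% ratio -> 128 points
nsm = np.random.rand(16, 16)
points = select_first_feature_points(nsm, selection_ratio=0.5)
print(len(points))  # 128
```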
  • In some embodiments, the feature data of all feature points included in the historical feature data may be used to semantically enhance the feature data of the first feature point, so as to generate the feature data of the semantically enhanced feature point corresponding to the first feature point.
  • the feature data of the enhanced feature point includes the feature information in the historical feature data and the feature information of the first feature point, and the semantic information of the enhanced feature point is relatively rich.
  • In some embodiments, the historical feature data and the feature data of the first feature point can be input into a temporal transformer to semantically enhance the feature data of the first feature point, so that the first feature point in the video frame to be detected captures the temporal information and semantic information in the historical video frames and feature data of the enhanced feature point with richer information is generated; based on the feature data of the enhanced feature point, a temporally consistent semantic segmentation result corresponding to the video data can then be obtained.
  • Alternatively, a second feature point that matches the position data of the first feature point can be selected from the historical feature data, and the feature data of the second feature point can be used to strengthen the semantics of the feature data of the first feature point, so as to generate the feature data of the semantically enhanced feature point corresponding to the first feature point. For example, the feature data of the second feature point and the feature data of the first feature point can be input into the temporal transformer to semantically enhance the feature data of the first feature point and generate the feature data of the enhanced feature point.
  • In some embodiments, generating the feature data of the semantically enhanced feature point corresponding to the first feature point includes: determining the second feature point from the plurality of feature points corresponding to the historical feature data based on the position data of the first feature point and the area radius corresponding to the historical feature data; and generating the feature data of the semantically enhanced feature point corresponding to the first feature point based on the feature data of the second feature point and the feature data of the first feature point.
  • In this way, using the position data of the first feature point and the area radius corresponding to the historical feature data, the second feature point can be determined conveniently and efficiently from the plurality of feature points corresponding to the historical feature data.
  • The feature data of the second feature point and the feature data of the first feature point can then be used to semantically enhance the first feature point more accurately.
  • Compared with using the feature data of all feature points in the historical feature data to enhance the semantics of the first feature point, this reduces the time complexity of semantic segmentation and improves its efficiency while maintaining the accuracy of semantic segmentation.
  • Each historical video frame corresponds to one piece of historical feature data.
  • Each piece of historical feature data corresponds to an area radius.
  • Different pieces of historical feature data may correspond to different area radii.
  • For example, the area radius corresponding to the historical feature data of the T-th historical video frame may be l_T, and the area radius corresponding to the historical feature data of the (T-1)-th historical video frame may be l_(T-1).
  • In some embodiments, the second feature point on a piece of historical feature data can be determined from the plurality of feature points corresponding to that historical feature data according to the area radius corresponding to that historical feature data and the position data of the first feature point.
  • the area radius corresponding to the historical feature data can be determined according to the following steps:
  • Step 301: determine the candidate radius corresponding to the historical feature data based on the target frame number corresponding to the historical feature data and the set radius starting value, frame number threshold, and expansion coefficient;
  • Step 302: when the candidate radius is smaller than the set radius cut-off value, determine the candidate radius as the area radius corresponding to the historical feature data;
  • Step 303: when the candidate radius is greater than or equal to the radius cut-off value, determine the radius cut-off value as the area radius corresponding to the historical feature data.
  • In this way, the corresponding area radius can be determined for each piece of historical feature data, and the second feature point of each piece of historical feature data can then be determined more accurately based on the area radius.
  • In some embodiments, the area radius l_t corresponding to the historical feature data of the t-th frame can be determined according to formula (4), where s is the radius starting value, t is the target frame number, T is the frame number threshold, e is the radius cut-off value, and the expansion coefficient controls how the candidate radius grows with the frame number.
  • s, the expansion coefficient, and e can be set according to the actual situation.
  • The frame number threshold T is the number of historical video frames.
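  • Formula (4) is not reproduced in this excerpt. A simple sketch that is consistent with steps 301 to 303 and with the example in FIG. 4 (radius 1 for the most recent historical frame T, radius 2 for frame T-1) is a linearly expanding candidate radius clipped at the cut-off value; the exact form of the candidate radius is an assumption:

```python
def area_radius(t, T, s=1, expansion=1, e=4):
    """Area radius l_t for the historical feature data of frame t.

    t: target frame number of the historical frame (1..T, T being the most recent).
    s: radius starting value; expansion: expansion coefficient; e: radius cut-off value.
    Assumed candidate radius: s + expansion * (T - t), i.e. older frames get larger radii.
    """
    candidate = s + expansion * (T - t)
    return candidate if candidate < e else e   # steps 302/303: clip at the cut-off value

# With T = 5: frame 5 -> radius 1, frame 4 -> radius 2, frames 1 and 2 -> clipped to 4
print([area_radius(t, T=5) for t in range(1, 6)])  # [4, 4, 3, 2, 1]
```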
  • In some embodiments, determining the second feature point from the plurality of feature points corresponding to the historical feature data may include: determining the intermediate feature point in the historical feature data that matches the position data of the first feature point; determining the target area in the historical feature data based on the area radius, with the intermediate feature point as the center; and determining each feature point in the historical feature data located in the target area as a second feature point.
  • FIG. 4 includes the first feature data 41, the first historical feature data 42, and the second historical feature data 43. The first feature data 41 includes the first feature point 411. The first historical feature data 42 includes the intermediate feature point 421 that matches the position data of the first feature point 411; when the area radius corresponding to the first historical feature data 42 is 1, the target area in the first historical feature data, that is, the area within the first rectangular frame 422, can be obtained, and each feature point of the first historical feature data located in that target area can be determined as a second feature point corresponding to the first historical feature data. The second historical feature data 43 includes the intermediate feature point 431 that matches the position data of the first feature point 411; when the area radius corresponding to the second historical feature data 43 is 2, the target area in the second historical feature data, that is, the area within the second rectangular frame 432, can be obtained, and each feature point of the second historical feature data located in that target area can be determined as a second feature point corresponding to the second historical feature data.
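  • As a sketch of the target-area selection just described, the second feature points for one first feature point could be gathered from every piece of historical feature data, each with its own area radius (window clamping at the border is an assumption):

```python
import numpy as np

def gather_second_feature_points(historical_features_list, u, v, radii):
    """Gather, for one first feature point at (u, v), the second feature points from
    every piece of historical feature data, each with its own area radius.

    historical_features_list: list of H x W x C historical feature maps (frame 1..T).
    radii: area radius per historical feature map (e.g. from area_radius above).
    Returns an (m x C) matrix that can later serve as the key/value input of the
    temporal transformer.
    """
    gathered = []
    for features, r in zip(historical_features_list, radii):
        h, w, c = features.shape
        u0, u1 = max(u - r, 0), min(u + r, h - 1)   # clamp at the border (assumption)
        v0, v1 = max(v - r, 0), min(v + r, w - 1)
        gathered.append(features[u0:u1 + 1, v0:v1 + 1, :].reshape(-1, c))
    return np.concatenate(gathered, axis=0)

# FIG. 4 style example: radius 2 for the older frame, radius 1 for the most recent frame.
hist = [np.random.rand(56, 56, 128).astype(np.float32) for _ in range(2)]
print(gather_second_feature_points(hist, 10, 20, radii=[2, 1]).shape)  # (34, 128)
```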
  • In some embodiments, generating the feature data of the semantically enhanced feature point corresponding to the first feature point includes: generating fusion feature data based on the historical feature data and the feature data of the first feature point; performing feature extraction on the fusion feature data to generate intermediate feature data; and generating, based on the intermediate feature data and the fusion feature data, the feature data of the semantically enhanced feature point corresponding to the first feature point.
  • Alternatively, the fusion feature data may be generated based on the feature data of the second feature point and the feature data of the first feature point; feature extraction is performed on the fusion feature data to generate intermediate feature data; and the feature data of the semantically enhanced feature point corresponding to the first feature point is then generated based on the intermediate feature data and the fusion feature data.
  • In some embodiments, feature extraction can be performed on the feature data of the enhanced feature point and the feature data of the feature points other than the first feature point among the plurality of feature points corresponding to the first feature data, so as to determine the target semantic information corresponding to each pixel in the video frame to be detected and obtain the semantic segmentation result corresponding to the video frame to be detected.
  • the semantic segmentation result may include a semantic segmentation map, each pixel in the semantic segmentation map corresponds to a semantic label, and different semantic labels may be marked with different colors.
  • the target semantic information corresponding to each pixel in the video frame to be detected is obtained by using the trained semantic segmentation neural network;
  • The semantic segmentation neural network includes: a shared encoder, a feature point selection module, a temporal transformer, and a segmentation decoder.
  • The shared encoder is used to perform feature extraction on the video frame to be detected and the historical video frames respectively, to obtain the first feature data corresponding to the video frame to be detected and the historical feature data corresponding to the historical video frames.
  • The feature point selection module is used to determine the first feature point from the plurality of feature points corresponding to the first feature data.
  • The temporal transformer is used to perform semantic enhancement processing on the feature data of the first feature point based on the historical feature data corresponding to the historical video frames, and generate the feature data of the enhanced feature point corresponding to the first feature point.
  • The segmentation decoder is used to determine the target semantic information corresponding to each pixel in the video frame to be detected based on the feature data of the enhanced feature point and the feature data of the feature points other than the first feature point among the plurality of feature points corresponding to the first feature data.
  • In this way, the semantic segmentation neural network is used to realize temporally consistent semantic segmentation of different video frames in the video data, which improves the efficiency of semantic segmentation while ensuring its accuracy.
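  • The following PyTorch-style skeleton (module internals, layer sizes, and the number of classes are illustrative placeholders, not the disclosed architecture) shows how the four components listed above could be wired together for one video frame to be detected:

```python
import torch
import torch.nn as nn

class VideoSemanticSegmentationNet(nn.Module):
    """Skeleton of the four components described above; only the data flow
    follows the text, each component here is a placeholder."""
    def __init__(self, channels=128, num_classes=19):
        super().__init__()
        self.shared_encoder = nn.Conv2d(3, channels, kernel_size=3, padding=1)      # placeholder backbone
        self.temporal_transformer = nn.MultiheadAttention(channels, num_heads=8, batch_first=True)
        self.segmentation_decoder = nn.Conv2d(channels, num_classes, kernel_size=1)  # placeholder head

    def forward(self, frame, first_point_feats, second_point_feats):
        # Shared encoder: first feature data of the frame to be detected.
        feats = self.shared_encoder(frame)                      # B x C x H x W
        # Feature point selection (NSM-based) is assumed to have produced
        # first_point_feats (B x N1 x C); second_point_feats (B x N2 x C) come
        # from the historical feature data and serve as key and value.
        enhanced, _ = self.temporal_transformer(first_point_feats,
                                                second_point_feats,
                                                second_point_feats)
        # In the full method the enhanced vectors are scattered back into `feats`
        # at the positions of the first feature points before decoding (omitted here).
        return self.segmentation_decoder(feats)                 # B x num_classes x H x W

net = VideoSemanticSegmentationNet()
logits = net(torch.randn(1, 3, 224, 224), torch.randn(1, 128, 128), torch.randn(1, 340, 128))
print(logits.shape)  # torch.Size([1, 19, 224, 224])
```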
  • the video semantic segmentation method may include:
  • The multiple historical video frames include historical video frame F_T, historical video frame F_(T-1), historical video frame F_(T-2), historical video frame F_(T-3), ..., and historical video frame F_1.
  • The acquired historical feature data includes: the historical feature data corresponding to the historical video frame F_T, the historical feature data corresponding to the historical video frame F_(T-1), ..., and the historical feature data corresponding to the historical video frame F_1.
  • the adjacent similarity matrix NSM corresponding to the first feature data may be determined, and the above description may be referred to for the determination process of the NSM.
  • the first feature point can be determined from a plurality of feature points corresponding to the first feature data according to the NSM.
  • a selection ratio (such as 50%) may be set, and the first feature points are selected in descending order of the adjacent similarities corresponding to each feature point in the first feature data indicated by the NSM.
  • the feature point corresponding to the gray box is the first feature point 51 .
  • In some embodiments, the feature data of the first feature point can be used as the query input of the temporal transformer, the feature data of the second feature point can be used as the key input of the temporal transformer, and the feature data of the second feature point can also be used as the value input of the temporal transformer; it can be seen that the key input is the same as the value input.
  • The multi-head attention module in the temporal transformer performs feature fusion on the input data to generate the first fusion feature data; see formula (5) for the output of a single attention head and formula (6) for the output of the multi-head attention module, where MH is the first fusion feature data. The first fusion feature data and the feature data of the first feature point are then input to the feature processing layer Add&Norm of the temporal transformer for feature fusion to generate the second fusion feature data; see formula (7), where X is the second fusion feature data. The second fusion feature data is then input to the feed-forward layer (Feed Forward Layer) for feature extraction to generate the third fusion feature data; see formula (8), where FFN(X) is the third fusion feature data. Finally, the third fusion feature data and the second fusion feature data are input to the feature processing layer Add&Norm 2 for feature fusion to generate the feature data of the semantically enhanced feature point corresponding to the first feature point; see formula (9), where TFE is the feature data of the enhanced feature point.
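  • A minimal PyTorch sketch of a transformer block matching this description (multi-head attention, an Add&Norm layer, a feed-forward layer, and a second Add&Norm layer, with query = first-feature-point data and key = value = second-feature-point data) is shown below; layer sizes and the number of heads are assumptions, and formulas (5) to (9) are not reproduced verbatim:

```python
import torch
import torch.nn as nn

class TemporalFeatureEnhancer(nn.Module):
    """Query = feature data of first feature points (N1 x C),
    key = value = feature data of second feature points (N2 x C)."""
    def __init__(self, channels=128, num_heads=8, ffn_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(nn.Linear(channels, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, channels))
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, first_point_feats, second_point_feats):
        # Multi-head attention -> first fusion feature data MH (formulas (5)/(6)).
        mh, _ = self.attn(first_point_feats, second_point_feats, second_point_feats)
        # Add&Norm with the query -> second fusion feature data X (formula (7)).
        x = self.norm1(first_point_feats + mh)
        # Feed-forward layer -> third fusion feature data FFN(X) (formula (8)).
        ffn_out = self.ffn(x)
        # Add&Norm 2 -> feature data TFE of the enhanced feature points (formula (9)).
        return self.norm2(x + ffn_out)

enhancer = TemporalFeatureEnhancer()
q = torch.randn(1, 128, 128)    # batch of 128 first feature points, 128 channels
kv = torch.randn(1, 340, 128)   # second feature points gathered from historical frames
print(enhancer(q, kv).shape)    # torch.Size([1, 128, 128])
```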
  • the video semantic segmentation method proposed in the embodiments of the present disclosure can be applied to scenes requiring video semantic segmentation, such as autonomous driving, live broadcast, and augmented reality (Augmented Reality, AR).
  • For example, when the video data is road video data collected by a driving device during driving, each road video frame in the road video data can be semantically segmented based on the above video semantic segmentation method to generate the semantic segmentation result corresponding to each road video frame, and the driving device is then controlled based on the semantic segmentation result corresponding to each road video frame.
  • The driving device may be a self-driving vehicle, a vehicle equipped with an advanced driving assistance system (ADAS), a robot, or the like.
  • When controlling the driving device, the driving device can be controlled to accelerate, decelerate, turn, brake, and the like, or voice prompt information can be played to prompt the driver to control the driving device to accelerate, decelerate, turn, brake, and the like.
  • In this way, the semantic segmentation result corresponding to each road video frame is generated with improved accuracy and efficiency, and the driving device can then be controlled more accurately and efficiently based on the semantic segmentation result corresponding to each road video frame.
  • For another example, the video data may be scene video data of the real-time scene of an AR device. The video semantic segmentation method proposed in the embodiments of the present disclosure is used to perform semantic segmentation on each scene video frame in the scene video data and generate the semantic segmentation result corresponding to each scene video frame; a matching target virtual object is then determined according to the semantic information of the target object indicated by the semantic segmentation result corresponding to each scene video frame and a preset matching relationship between semantics and virtual objects; and the AR device is controlled to display the scene video containing the target virtual object.
  • For example, if the semantic information of the target object corresponds to a person, the matched target virtual object can be a preset virtual character; if the semantic information of the target object is a building, the matched target virtual object can be a preset virtual building; and so on.
  • the video data can be live video data.
  • Each live video frame in the live video data is semantically segmented to generate the semantic segmentation result corresponding to each live video frame; background replacement is then performed on the live video frames according to the semantic segmentation result corresponding to each live video frame.
  • For example, the pixel information of the pixels whose semantics indicated by the semantic segmentation result are other than "person" can be replaced with preset values to generate the live video frame after background replacement.
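  • As a small illustration of the background-replacement application just described (the class index and the preset background value are hypothetical), non-person pixels indicated by the semantic segmentation map can be overwritten as follows:

```python
import numpy as np

PERSON_CLASS = 11  # hypothetical label index for "person" in the segmentation map

def replace_background(live_frame, seg_map, background_value=(0, 255, 0)):
    """Replace every pixel whose semantic label is not 'person' with a preset value.

    live_frame: H x W x 3 live video frame.
    seg_map:    H x W semantic segmentation map (one label per pixel).
    """
    out = live_frame.copy()
    out[seg_map != PERSON_CLASS] = background_value   # overwrite non-person pixels
    return out

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
seg = np.random.randint(0, 20, size=(720, 1280))
replaced = replace_background(frame, seg)
```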
  • It should be understood that the writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
  • An embodiment of the present disclosure also provides a video semantic segmentation apparatus. As shown in FIG. 7, the apparatus includes: an acquiring module 701, a first determination module 702, a processing module 703, and a second determination module 704, wherein:
  • the acquiring module 701 is configured to acquire the first feature data corresponding to the video frame to be detected in the video data, and the historical feature data corresponding to the historical video frame whose acquisition time is located before the video frame to be detected in the video data;
  • the first determination module 702 is configured to determine, from the plurality of feature points corresponding to the first feature data, a first feature point that matches the position point of the complex image area in the video frame to be detected; where the complex image area is an area including at least some pixels of multiple target objects with different semantics;
  • the processing module 703 is configured to generate, based on the historical feature data and the feature data of the first feature point, feature data of enhanced feature points corresponding to the first feature point after semantic enhancement;
  • the second determining module 704 is configured to determine, based on the feature data of the enhanced feature point and the feature data of the feature points other than the first feature point among the plurality of feature points corresponding to the first feature data, the target semantic information corresponding to each pixel in the video frame to be detected.
  • In a possible implementation, the first determination module 702, when determining the first feature point that matches the position point of the complex image area in the video frame to be detected from among the plurality of feature points corresponding to the first feature data, is configured to: determine the adjacent similarity of each feature point corresponding to the first feature data; and determine, based on the adjacent similarity, the first feature point from the plurality of feature points corresponding to the first feature data.
  • the first determination module 702 when determining the adjacent similarity of each feature point corresponding to the first feature data, is configured to:
  • the neighborhood feature matrix includes feature vectors of each feature point located in the neighborhood of the current feature point;
  • In a possible implementation, the first determination module 702, when determining the adjacent similarity corresponding to the current feature point based on the neighborhood feature matrix and the feature vector of the current feature point, is configured to: determine at least one target similarity corresponding to the current feature point based on the neighborhood feature matrix and the feature vector of the current feature point, where the target similarity includes at least one of the following: a first target similarity, used to characterize the degree of similarity between the uniform distribution and the feature similarity distribution between each feature point in the neighborhood of the current feature point and the current feature point; and a second target similarity, used to characterize the average feature similarity between each feature point in the neighborhood of the current feature point and the current feature point; and determine the adjacent similarity of the current feature point based on the at least one target similarity.
  • In a possible implementation, the first determination module 702, when determining the first target similarity corresponding to the current feature point based on the neighborhood feature matrix and the feature vector of the current feature point, is configured to: determine the feature similarity between the feature vector of each feature point in the neighborhood of the current feature point and the feature vector of the current feature point; obtain the similarity distribution vector corresponding to the current feature point based on the feature similarities; and determine the first target similarity corresponding to the current feature point based on the similarity distribution vector and the determined uniform distribution vector.
  • In a possible implementation, the first determination module 702, when determining the second target similarity corresponding to the current feature point based on the neighborhood feature matrix and the feature vector of the current feature point, is configured to: determine the cosine of the angle between each feature vector in the neighborhood feature matrix and the feature vector of the current feature point; and determine the second target similarity corresponding to the current feature point based on the cosines of the angles corresponding to the feature vectors in the neighborhood feature matrix.
  • In a possible implementation, the first determination module 702, when determining the first feature point from the plurality of feature points corresponding to the first feature data based on the adjacent similarity, is configured to determine a selected number of first feature points from the plurality of feature points corresponding to the first feature data, where the manner of determining the selected number of first feature points includes at least one of the following: determining the number of first feature points to be selected based on the number of feature points corresponding to the first feature data and a set selection ratio, and determining the selected number of first feature points from the plurality of feature points in descending order of adjacent similarity; or determining the first feature points from the plurality of feature points corresponding to the first feature data based on the adjacent similarities and a set similarity threshold.
  • In a possible implementation, the processing module 703, when generating the feature data of the semantically enhanced feature point corresponding to the first feature point based on the historical feature data and the feature data of the first feature point, is configured to: determine the second feature point from the plurality of feature points corresponding to the historical feature data based on the position data of the first feature point and the area radius corresponding to the historical feature data; and generate the feature data of the semantically enhanced feature point corresponding to the first feature point based on the feature data of the second feature point and the feature data of the first feature point.
  • In a possible implementation, the processing module 703, when determining the second feature point from the plurality of feature points corresponding to the historical feature data based on the position data of the first feature point and the area radius corresponding to the historical feature data, is configured to: determine the intermediate feature point in the historical feature data that matches the position data of the first feature point; determine the target area in the historical feature data based on the area radius, with the intermediate feature point as the center; and determine each feature point in the historical feature data located in the target area as the second feature point.
  • In a possible implementation, the processing module 703 is configured to determine the area radius corresponding to the historical feature data according to the following steps: determining the candidate radius corresponding to the historical feature data based on the target frame number corresponding to the historical feature data and the set radius starting value, frame number threshold, and expansion coefficient; when the candidate radius is smaller than the set radius cut-off value, determining the candidate radius as the area radius corresponding to the historical feature data; and when the candidate radius is greater than or equal to the radius cut-off value, determining the radius cut-off value as the area radius corresponding to the historical feature data.
  • In a possible implementation, the processing module 703, when generating the feature data of the semantically enhanced feature point corresponding to the first feature point based on the historical feature data and the feature data of the first feature point, is configured to: generate fusion feature data based on the historical feature data and the feature data of the first feature point; perform feature extraction on the fusion feature data to generate intermediate feature data; and generate, based on the intermediate feature data and the fusion feature data, the feature data of the semantically enhanced feature point corresponding to the first feature point.
  • the target semantic information corresponding to each pixel in the video frame to be detected is obtained by using a trained semantic segmentation neural network;
  • The semantic segmentation neural network includes: a shared encoder, a feature point selection module, a temporal transformer, and a segmentation decoder.
  • The shared encoder is used to perform feature extraction on the video frame to be detected and the historical video frames respectively, to obtain the first feature data corresponding to the video frame to be detected and the historical feature data corresponding to the historical video frames; the feature point selection module is used to determine the first feature point from the plurality of feature points corresponding to the first feature data.
  • The temporal transformer is used to perform semantic enhancement processing on the feature data of the first feature point based on the historical feature data corresponding to the historical video frames, and generate the feature data of the enhanced feature point corresponding to the first feature point.
  • The segmentation decoder is used to determine the target semantic information corresponding to each pixel in the video frame to be detected based on the feature data of the enhanced feature point and the feature data of the feature points other than the first feature point among the plurality of feature points corresponding to the first feature data.
  • The functions or modules included in the apparatus provided by the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments; for implementation details, reference can be made to the descriptions of the above method embodiments.
  • an embodiment of the present disclosure also provides an electronic device.
  • As shown in FIG. 8, it is a schematic structural diagram of an electronic device 800 provided by an embodiment of the present disclosure, including a processor 801, a memory 802, and a bus 803.
  • the memory 802 is used to store execution instructions, including a memory 8021 and an external memory 8022; the memory 8021 here is also called an internal memory, and is used to temporarily store calculation data in the processor 801 and exchange data with an external memory 8022 such as a hard disk.
  • the processor 801 exchanges data with the external memory 8022 through the memory 8021.
  • the processor 801 communicates with the memory 802 through the bus 803, so that the processor 801 executes the following instructions:
  • where the complex image area is an area including at least some pixels of a plurality of target objects with different semantics.
  • An embodiment of the present disclosure also provides a computer-readable storage medium, on which a computer program is stored; when the computer program is run by a processor, the steps of the video semantic segmentation method described in the above method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides a computer program product; the computer program product carries program code, and the instructions included in the program code can be used to execute the steps of the video semantic segmentation method described in the above method embodiments; for details, reference can be made to the above method embodiments.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • This disclosure relates to the field of augmented reality.
  • By acquiring image information of a target object in the real environment, and then using various vision-related algorithms to detect or identify the relevant features, states, and attributes of the target object, an AR effect combining the virtual and the real that matches the application scene can be obtained.
  • the target object may involve faces, limbs, gestures, actions, etc. related to the human body, or markers and markers related to objects, or sand tables, display areas or display items related to venues or places.
  • Vision-related algorithms can involve visual positioning, Simultaneous Localization And Mapping (SLAM), 3D reconstruction, image registration, background segmentation, object key point extraction and tracking, object pose or depth detection, etc.
  • Application scenarios can involve not only interactive scenarios such as tours, navigation, explanation, reconstruction, and virtual effect superimposition and display related to real scenes or objects, but also special effects processing related to people, such as interactive scenarios of makeup beautification, body beautification, special effect display, and virtual model display.
  • the relevant features, states and attributes of the target object can be detected or identified through the convolutional neural network.
  • the above-mentioned convolutional neural network is a network model obtained by performing model training based on a deep learning framework.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions are realized in the form of software function units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in various embodiments of the present disclosure.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

一种视频语义分割方法、装置、电子设备、存储介质及计算机程序产品,该视频语义分割方法包括:获取视频数据中待检测视频帧对应的第一特征数据,以及视频数据中采集时间位于待检测视频帧之前的历史视频帧对应的历史特征数据(S101);从第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点(S102);基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据(S103);基于所述加强特征点的特征数据,和所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息(S104)。

Description

视频语义分割方法、装置、电子设备、存储介质及计算机程序产品
相关申请的交叉引用
本公开实施例基于申请号为202111165458.9、申请日为2021年09月30日、申请名称为“视频语义分割方法、装置、电子设备及存储介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此以引入方式并入本公开。
技术领域
本公开涉及深度学习技术领域，尤其涉及一种视频语义分割方法、装置、电子设备、存储介质及计算机程序产品。
背景技术
视频语义分割旨在为视频帧中的每个像素点分配一个语义标签,实现将视频帧按照语义进行分割,比如,可以将视频帧中的行人、自行车、动物等不同语义对象进行分割,得到语义分割结果。
一般的,在对视频数据进行语义分割时,可以对视频数据中的每个视频帧进行语义分割,确定各个视频帧的语义分割结果;再可以将各个视频帧的语义分割结果进行对齐,即将视频数据的不同视频帧中同一对象进行关联,得到视频数据对应的语义分割结果。但是,上述对视频数据进行语义分割的过程较为繁琐,使得语义分割的效率较低。
发明内容
有鉴于此,本公开实施例至少提供一种视频语义分割方法、装置、电子设备、存储介质及计算机程序产品。
第一方面,本公开实施例提供了一种视频语义分割方法,包括:
获取视频数据中待检测视频帧对应的第一特征数据,以及所述视频数据中采集时间位于所述待检测视频帧之前的历史视频帧对应的历史特征数据;
从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点;其中,所述复杂图像区域为包括多个不同语义的目标对象的至少部分像素点的区域;
基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据;
基于所述加强特征点的特征数据,和所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
第二方面,本公开实施例提供了一种视频语义分割装置,包括:
获取模块,配置为获取视频数据中待检测视频帧对应的第一特征数据,以及所述视频数据中采集时间位于所述待检测视频帧之前的历史视频帧对应的历史特征数据;
第一确定模块,配置为从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点;其中,所述复杂图像区域为包括多个不同语义的目标对象的至少部分像素点的区域;
处理模块,配置为基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据;
第二确定模块,配置为基于所述加强特征点的特征数据,和所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
第三方面，本公开实施例提供一种电子设备，包括：处理器、存储器和总线，所述存储器存储有所述处理器可执行的机器可读指令，当电子设备运行时，所述处理器与所述存储器之间通过总线通信，所述机器可读指令被所述处理器执行时执行上述视频语义分割方法的步骤。
第四方面,本公开实施例提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行上述视频语义分割方法的步骤。
第五方面,本公开实施例提供一种计算机程序产品,包括存储了程序代码的计算机可读存储介质,所述程序代码包括的指令被计算机设备的处理器运行时,实现上述视频语义分割方法的步骤。
上述方法中,在获取待检测视频帧的第一特征数据和历史视频帧的历史特征数据之后,基于第一特征数据中第一特征点的特征数据、和历史特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据,使得加强特征点的特征数据中包括待检测视频帧的特征信息和历史视频帧的特征信息;再基于加强特征点的特征数据和其它特征点的特征数据,确定待检测视频帧中每个像素点对应的目标语义信息,在使用视频数据中历史视频帧对应的历史特征数据的基础上,实现了视频数据中不同视频帧之间时序一致的语义分割,提高了语义分割的效率。
同时,通过从第一特征数据对应的多个特征点中确定第一特征点,第一特征点为与复杂图像区域的位置点匹配的特征点,由于复杂图像区域中包括多个不同语义的目标对象,使得第一特征点对应的位置点的语义信息的确定较为困难,故可以基于历史特征数据和第一特征点的特征数据,对第一特征点进行语义加强,生成第一特征点对应的语义加强后的加强特征点的特征数据,后续基于加强特征点的特征数据和其他特征点的特征数据,能够较准确的确定待检测视频帧中每个像素点的目标语义信息,提高了待检测视频帧语义分割的精准度。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开实施例。
附图说明
为了更清楚地说明本公开实施例的技术方案，下面将对实施例中所需要使用的附图作简单地介绍，此处的附图被并入说明书中并构成本说明书中的一部分，这些附图示出了符合本公开的实施例，并与说明书一起用于说明本公开的技术方案。应当理解，以下附图仅示出了本公开的某些实施例，因此不应被看作是对范围的限定，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他相关的附图。
图1示出了本公开实施例所提供的一种视频语义分割方法的流程示意图;
图2示出了本公开实施例所提供的一种视频语义分割方法中,确定第一特征点的方式的流程示意图;
图3示出了本公开实施例所提供的一种视频语义分割方法中,第一特征数据的示意图;
图4示出了本公开实施例所提供的一种视频语义分割方法中,第一特征数据和历史特征数据的示意图;
图5示出了本公开实施例所提供的一种视频语义分割方法中,语义分割神经网络的结构示意图;
图6示出了本公开实施例所提供的另一种视频语义分割方法的流程示意图;
图7示出了本公开实施例所提供的一种视频语义分割装置的架构示意图;
图8示出了本公开实施例所提供的一种电子设备的结构示意图。
具体实施方式
为使本公开实施例的目的、技术方案和优点更加清楚，下面将结合本公开实施例中的附图，对本公开实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例是本公开一部分实施例，而不是全部的实施例。通常在此处附图中描述和示出的本公开实施例的组件可以以各种不同的配置来布置和设计。因此，以下对在附图中提供的本公开的实施例的描述并非旨在限制要求保护的本公开的范围，而是表示本公开的选定实施例。基于本公开的实施例，本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例，都属于本公开保护的范围。
在对视频数据进行语义分割时,可以对视频数据中的每个视频帧进行语义分割,确定各个视频帧的语义分割结果;再可以将各个视频帧的语义分割结果进行对齐,即将视频数据的不同视频帧中同一对象进行关联,得到视频数据对应的语义分割结果,实现了对视频数据中不同视频帧执行时序一致的语义分割。但是,上述通过将各个视频帧的语义分割结果进行关联,得到视频数据的语义分割结果的过程较为繁琐,语义分割的效率较低。
在一些实施例中,为了实现视频数据中各个视频帧之间的时序一致的语义分割,可以通过估计不同视频帧之间的运动扭曲比如光流,以确定视频数据中各个视频帧的语义分割结果。比如,可以从视频数据中采样关键视频帧,使用语义分割神经网络预测关键视频帧的语义分割结果,再使用光流神经网络根据关键视频帧的语义分割结果,确定视频数据中除关键视频帧之外的其他视频帧的语义分割结果。但是,为了保证其他视频帧的语义分割的精准度,光流神经网络的结构趋向于复杂化,使得光流神经网络的推理效率降低,进而造成视频数据的语义分割的效率较低。同时,使用多个神经网络对视频数据中的各个视频帧进行语义分割,造成视频数据的语义分割过程较为繁琐。
本公开实施例提供了一种视频语义分割方法,在获取待检测视频帧的第一特征数据和历史视频帧的历史特征数据之后,基于第一特征数据中第一特征点的特征数据、和历史特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据,使得加强特征点的特征数据中包括待检测视频帧的特征信息和历史视频帧的特征信息;再基于加强特征点的特征数据和其它特征点的特征数据,确定待检测视频帧中每个像素点对应的目标语义信息,在使用视频数据中历史视频帧对应的历史特征数据的基础上,实现了视频数据中不同视频帧之间时序一致的语义分割,提高了语义分割的效率。
同时,通过从第一特征数据对应的多个特征点中确定第一特征点,第一特征点为与复杂图像区域的位置点匹配的特征点,由于复杂图像区域中包括多个不同语义的目标对象,使得第一特征点对应的位置点的语义信息的确定较为困难,故可以基于历史特征数据和第一特征点的特征数据,对第一特征点进行语义加强,生成第一特征点对应的语义加强后的加强特征点的特征数据,后续基于加强特征点的特征数据和其他特征点的特征数据,能够较准确的确定待检测视频帧中每个像素点的目标语义信息,提高了待检测视频帧语义分割的精准度。
针对以上方案,均是发明人在经过实践并仔细研究后得出的结果,因此,上述问题的发现过程以及下文中本公开针对上述问题所提出的解决方案,都应该是发明人在本公开过程中对本公开做出的贡献。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
为便于对本公开实施例进行理解,首先对本公开实施例所公开的一种视频语义分割方法进行介绍。本公开实施例所提供的视频语义分割方法的执行主体可以为终端设备或服务器,其中,服务器比如可以为本地服务器、云端服务器;终端设备比如可以为移动设备、个人数字助理(Personal Digital Assistant,PDA)、计算设备、车载设备、可穿戴设备等。在一些可能的实现方式中,该视频语义分割方法可以通过处理器调用存储器中 存储的计算机可读指令的方式来实现。
参见图1所示,为本公开实施例所提供的视频语义分割方法的流程示意图,所述方法包括S101至S104,其中:
S101,获取视频数据中待检测视频帧对应的第一特征数据,以及视频数据中采集时间位于待检测视频帧之前的历史视频帧对应的历史特征数据;
S102,从第一特征数据对应的多个特征点中,确定与待检测视频帧中复杂图像区域的位置点匹配的第一特征点;其中,复杂图像区域为包括多个不同语义的目标对象的至少部分像素点的区域;
S103,基于历史特征数据和第一特征点的特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据;
S104,基于加强特征点的特征数据,和第一特征数据对应的多个特征点中除第一特征点外的其它特征点的特征数据,确定待检测视频帧中每个像素点对应的目标语义信息。
下述对S101至S104进行说明。
针对S101:
历史视频帧为视频数据中采集时间位于待检测视频帧之前的视频帧,历史视频帧的数量可以为一帧或多帧。
实施时,待检测视频帧对应的目标帧数可以为T+1,历史视频帧的数量可以为T,即可以获取第1帧历史视频帧(对应的目标帧数为1)、第2帧历史视频帧、…、第T帧历史视频帧(对应的目标帧数为T)。其中,T的值可以根据需要进行设置。
可以获取待检测视频帧和至少一帧历史视频帧,再对待检测视频帧进行特征提取,得到待检测视频帧对应的第一特征数据。由于在将历史视频帧作为待检测视频帧时,对历史视频帧进行了特征提取,故历史视频帧存在对应的历史特征数据,可以直接获取历史视频帧对应的历史特征数据,无需再次对历史视频帧进行特征提取,避免重复提取历史视频帧的特征数据造成的资源浪费。
在历史视频帧为多帧时,可以获取每帧历史视频帧对应的历史特征数据。
针对S102:
第一特征数据中对应有多个特征点,比如,若第一特征数据的尺寸为56×56×128(128为通道数,56为对应的长和宽),则该第一特征数据中对应有56×56个特征点。针对第一特征数据中的每个特征点,根据该特征点的特征位置,从各个通道上获取与该特征位置匹配的特征值,构成了该特征点对应的特征向量。比如,针对位于第1行第1列上的特征点,从各个通道上获取位于第1行第1列上的特征值,得到第1行第1列上的特征点对应的特征向量。
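下面给出一段示意性的Python代码（基于NumPy，其中的数组尺寸与变量命名均为假设值，仅用于说明特征点与特征向量的对应关系，并非本公开限定的实现方式）：

```python
import numpy as np

# 假设第一特征数据的尺寸为 128x56x56（通道数 C=128，长和宽均为 56）
C, H, W = 128, 56, 56
first_feature = np.random.randn(C, H, W).astype(np.float32)

# 针对位于第 1 行第 1 列上的特征点（索引从 0 开始记为 (0, 0)），
# 从各个通道上取出该位置的特征值，构成该特征点对应的特征向量
u, v = 0, 0
feature_vector = first_feature[:, u, v]  # 形状为 (128,)

# 该第一特征数据中对应有 H*W = 56*56 个特征点
print(feature_vector.shape, H * W)
```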
可以从第一特征数据对应的多个特征点中,确定第一特征点,以及还可以得到第一特征数据对应的多个特征点中,除第一特征点之外的其他特征点。其中,第一特征点与待检测视频帧中复杂图像区域上的位置点相匹配。复杂图像区域上包括有多个不同语义的目标对象的至少部分像素点。可知复杂图像区域中可以包含多个语义对象,或者,包含不同语义对象之间的边界。
一种可选实施方式中,参见图2所示,从第一特征数据对应的多个特征点中,确定与待检测视频帧中复杂图像区域的位置点匹配的第一特征点,可以包括:
S201,确定第一特征数据对应的每个特征点的相邻相似度;其中,相邻相似度用于表征特征点与多个周围特征点之间的特征相似度分布;
S202,基于相邻相似度,从第一特征数据对应的多个特征点中,确定第一特征点。
一般的，与包括单一语义的目标对象的简单图像区域相比，复杂图像区域对语义分割结果的精度和效率具有较大的贡献，因此，为了在语义分割结果的准确度与效率之间进行均衡，可以从第一特征数据对应的多个特征点中，确定与待检测视频帧中复杂图像区域的位置点匹配的第一特征点。同时，考虑到不同语义的目标对象的像素信息之间会存在较大的差异，即复杂图像区域内像素点的像素特征相似度存在差异。基于此，本公开实施方式中，通过确定第一特征数据对应的每个特征点的相邻相似度，并根据相邻相似度，从第一特征数据对应的多个特征点中，确定第一特征点。
针对S201:
一种可选实施方式中,确定第一特征数据对应的每个特征点的相邻相似度,可以包括步骤S2011和步骤S2012,其中:
步骤S2011,将第一特征数据对应的每个特征点分别作为当前特征点,基于当前特征点的位置数据和预先设置的邻域半径,确定当前特征点的邻域特征矩阵;其中,邻域特征矩阵包括位于当前特征点的邻域内的各个特征点的特征向量;
步骤S2012,基于邻域特征矩阵和当前特征点的特征向量,确定当前特征点对应的相邻相似度。
在步骤S2011中，邻域半径可以根据多次试验进行确定。比如，邻域半径 $r$ 可以为1、2等。基于当前特征点的位置数据和预先设置的邻域半径，确定当前特征点的邻域特征矩阵。比如，若当前特征点在第一特征数据中的位置数据为 $(u,v)$、邻域半径为 $r$ 时，则当前特征点对应的邻域 $Q_n$ 为 $Q[u-r:u+r][v-r:v+r]$，再可以基于第一特征数据中位于邻域 $Q_n$ 内的各个特征点的特征向量，生成当前特征点对应的邻域特征矩阵 $Q_n$。
参见图3所示,图3中包括第一特征数据30中的当前特征点31,在预先设置的邻域半径为1时,矩形框32内的各个特征点即为位于邻域内的各个特征点。其中,每个特征点对应一个特征向量,比如若第一特征数据中包括128个通道时,则每个特征点对应的特征向量中包括128个元素值。再可以根据各个特征点对应的特征向量,生成当前特征点31对应的邻域特征矩阵,由图3可知,该邻域特征矩阵为9×128矩阵。
在步骤S2012中,可以利用邻域特征矩阵和当前特征点的特征向量,确定当前特征点对应的相邻相似度。其中,相邻相似度可以用于表征当前特征点与多个周围特征点(即邻域内除当前特征点之外的其他特征点)之间的特征相似度分布。
本公开实施例中,通过确定当前特征点的邻域特征矩阵,该邻域特征矩阵中包括位于当前特征点周围的其他特征点的特征信息;使得基于邻域特征矩阵和当前特征点的特征向量,较准确的确定当前特征点对应的相邻相似度,为后续确定第一特征点提供数据支持。
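作为示意，下面给出邻域特征矩阵 $Q_n$ 的一种可能构造方式（基于NumPy的示意性代码，其中的边界截断处理与变量命名均为假设，仅作说明）：

```python
import numpy as np

def build_neighbor_matrix(feature, u, v, r=1):
    """从形状为 (C, H, W) 的特征数据中, 以 (u, v) 为中心、r 为邻域半径,
    构造邻域特征矩阵, 形状为 ((2r+1)^2, C)。"""
    C, H, W = feature.shape
    rows = []
    for du in range(-r, r + 1):
        for dv in range(-r, r + 1):
            # 这里假设越界位置按边界截断处理, 实际实现也可以采用填充等方式
            uu = min(max(u + du, 0), H - 1)
            vv = min(max(v + dv, 0), W - 1)
            rows.append(feature[:, uu, vv])
    return np.stack(rows, axis=0)  # 例如 r=1、C=128 时为 9x128 矩阵

feature = np.random.randn(128, 56, 56).astype(np.float32)
Q_n = build_neighbor_matrix(feature, u=10, v=20, r=1)
q = feature[:, 10, 20]  # 当前特征点的特征向量
print(Q_n.shape, q.shape)  # (9, 128) (128,)
```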
一种可选实施方式中,基于邻域特征矩阵和当前特征点的特征向量,确定当前特征点对应的相邻相似度,可以包括:
步骤S212a,基于邻域特征矩阵和当前特征点的特征向量,确定当前特征点对应的至少一种目标相似度;其中,目标相似度包括以下至少之一:用于表征当前特征点的邻域内的各个特征点与当前特征点之间的特征相似度分布、和均匀分布之间的相似程度的第一目标相似度;用于表征当前特征点的邻域内各个特征点与当前特征点之间的平均特征相似度的第二目标相似度;
步骤S212b,基于至少一种目标相似度,确定当前特征点的所述相邻相似度。
实施时,可以利用邻域特征矩阵和当前特征点的特征向量,确定当前特征点对应的至少一种目标相似度。其中,目标相似度可以包括以下至少之一:第一目标相似度;第二目标相似度。
第一目标相似度用于表征特征相似度分布与均匀分布之间的相似程度，特征相似度分布为当前特征点的邻域内的各个特征点与当前特征点之间的特征相似度的分布。比如，若特征相似度分布a为[0.1,0.1,0.7,0.1]，特征相似度分布b为[0.2,0.3,0.25,0.25]，均匀分布为[0.25,0.25,0.25,0.25]，则可知特征相似度分布a与均匀分布之间的相似程度较低，第一目标相似度的值较大；特征相似度分布b与均匀分布之间的相似程度较高，第一目标相似度的值较小。
第二目标相似度用于表征当前特征点的邻域内各个特征点与当前特征点之间的平均特征相似度。
在至少一种目标相似度中包括第一目标相似度时,可以将第一目标相似度作为当前特征点的相邻相似度;在至少一种目标相似度中包括第二目标相似度时,可以将第二目标相似度作为当前特征点的相邻相似度;在至少一种目标相似度中包括第一目标相似度和第二目标相似度时,可以将第一目标相似度与第二目标相似度的和,作为当前特征点的相邻相似度。
进而可以按照各个特征点在第一特征数据中的位置数据、以及该特征点对应的相邻相似度,生成与第一特征数据匹配的相邻相似矩阵(Neighboring Similarity Matrix,NSM),相邻相似矩阵的尺寸与第一特征数据一致。
这里,通过设置至少一种目标相似度,能够较灵活、较准确的确定当前特征点的相邻相似度。
下述对确定第一目标相似度的过程进行说明。
一种可选实施方式中,在目标相似度包括第一目标相似度的情况下,基于邻域特征矩阵和当前特征点的特征向量,确定当前特征点对应的目标相似度,可以包括步骤S2121至步骤S2123,其中:
步骤S2121,确定当前特征点的邻域内每个特征点的特征向量与当前特征点的特征向量之间的特征相似度;
步骤S2122,基于特征相似度,得到当前特征点对应的相似度分布向量;
步骤S2123,基于相似度分布向量和确定的均匀分布向量,确定当前特征点对应的第一目标相似度。
可以确定邻域特征矩阵中包括的每个特征向量与当前特征点的特征向量之间的特征相似度。将得到的各个特征相似度作为元素值,构成了当前特征点对应的相似度分布向量。
实施时，可以根据下述公式（1）确定相似度分布向量 $P_{sim}$：

$$P_{sim} = \mathrm{SoftMax}\bigl(Q_n \cdot q^{\top}\bigr) \qquad (1)$$

其中，$q$ 为当前特征点的特征向量；$Q_n$ 为邻域特征矩阵。

以及可以根据下述公式（2）确定第一目标相似度 $D_{KL}$：

$$D_{KL} = \sum_{i=1}^{n_b} P_u(i)\,\log\frac{P_u(i)}{P_{sim}(i)} \qquad (2)$$

其中，$P_u$ 为均匀分布，$P_{sim}$ 为相似度分布向量，$n_b$ 为相似度分布向量中包括的元素数量。这里，在 $P_u$ 中包括 $n_b$ 个元素时，均匀分布 $P_u = [1/n_b, 1/n_b, \ldots, 1/n_b]$。
在一些实施例中,可以确定相似度分布向量中每个第一元素的元素值、和均匀分布向量中与第一元素的位置匹配的第二元素的元素值之间的商值;再将第一元素对应的商值的对数与第二元素的元素值相乘,得到第一元素对应的乘积值;最后,将相似度分布向量中各个第一元素分别对应的乘积值相加,得到目标特征点对应的第一目标相似度。
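下面给出计算相似度分布向量与第一目标相似度的一段示意性Python代码（基于NumPy；其中softmax的数值稳定处理、KL散度的具体方向均为示例性假设，仅作说明）：

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # 数值稳定
    e = np.exp(x)
    return e / np.sum(e)

def first_target_similarity(Q_n, q):
    """Q_n: 邻域特征矩阵 (n_b, C); q: 当前特征点的特征向量 (C,)。
    返回第一目标相似度, 即均匀分布与相似度分布之间的散度。"""
    p_sim = softmax(Q_n @ q)            # 公式(1): 相似度分布向量
    n_b = p_sim.shape[0]
    p_u = np.full(n_b, 1.0 / n_b)       # 均匀分布向量
    eps = 1e-12                          # 防止 log(0)
    d_kl = np.sum(p_u * np.log((p_u + eps) / (p_sim + eps)))  # 公式(2)
    return float(d_kl)

Q_n = np.random.randn(9, 128).astype(np.float32)
q = np.random.randn(128).astype(np.float32)
print(first_target_similarity(Q_n, q))
```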
下述对确定第二目标相似度的过程进行说明。
一种可选实施方式中,在目标相似度包括第二目标相似度的情况下,基于邻域特征矩阵和当前特征点的特征向量,确定当前特征点对应的第二目标相似度,可以包括:确定邻域特征矩阵中的每个特征向量与当前特征点的特征向量之间的夹角余弦值;基于邻域特征矩阵中的各个特征向量分别对应的夹角余弦值,确定目标特征点对应的第二目标 相似度。
实施时，可以根据下述公式（3）确定第二目标相似度 $D_{cos}$：

$$D_{cos} = \frac{1}{n_b}\sum_{i=1}^{n_b}\frac{q \cdot Q_n^{i}}{\lVert q\rVert\,\lVert Q_n^{i}\rVert} \qquad (3)$$

其中，$Q_n^{i}$ 为邻域特征矩阵 $Q_n$ 中的第 $i$ 个特征向量；$n_b$ 为相似度分布向量中包括的元素数量，也为邻域特征矩阵 $Q_n$ 中包括的特征向量的数量。
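对应地，第二目标相似度可以按如下示意性代码计算（基于NumPy，取邻域内各特征向量与当前特征向量夹角余弦值的平均，仅作说明）：

```python
import numpy as np

def second_target_similarity(Q_n, q):
    """Q_n: 邻域特征矩阵 (n_b, C); q: 当前特征点的特征向量 (C,)。
    返回邻域内各特征向量与 q 的平均夹角余弦值, 即第二目标相似度。"""
    eps = 1e-12
    cos = (Q_n @ q) / (np.linalg.norm(Q_n, axis=1) * np.linalg.norm(q) + eps)
    return float(np.mean(cos))  # 公式(3)

Q_n = np.random.randn(9, 128).astype(np.float32)
q = np.random.randn(128).astype(np.float32)
print(second_target_similarity(Q_n, q))
```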
针对S202:
在得到第一特征数据对应的每个特征点的相邻相似度之后,可以根据各个特征点分别对应的相邻相似度,从第一特征数据对应的多个特征点中确定第一特征点。
一种可选实施方式中,在S202中,基于相邻相似度,从第一特征数据对应的多个特征点中,确定选取数量的第一特征点,可以包括下述两种方式:
方式一,基于第一特征数据对应的特征点的数量和设置的选取比例,确定第一特征点的选取数量;按照相邻相似度从大到小的顺序,从第一特征数据对应的多个特征点中,确定选取数量的第一特征点。
方式二,基于相邻相似度和设置的相似度阈值,从第一特征数据对应的多个特征点中,确定选取数量的第一特征点。
在方式一中,选取比例可以根据需要进行设置,比如,选取比例可以为40%、50%等。若第一特征数据对应的特征点的数量为16×16、选取比例为50%,则确定第一特征点的选取数量为128。再可以按照相邻相似度从大到小的顺序,从第一特征数据对应的多个特征点中,确定128个第一特征点。即可以从与第一特征数据匹配的相邻相似矩阵NSM中,按照相邻相似度从大到小的顺序,确定多个目标位置点,将第一特征数据中与该目标位置点匹配的特征点,作为第一特征点。
在方式二中,相似度阈值可以根据需要进行设置。从第一特征数据对应的多个特征点中,选取相邻相似度大于或等于相似度阈值的特征点,作为第一特征点。
这里,通过设置多种选取方式,能够较为灵活的确定第一特征点。
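下面用一段示意性Python代码说明上述两种选取方式（基于NumPy；其中nsm为按前述方式得到的相邻相似矩阵，选取比例与相似度阈值均为假设值）：

```python
import numpy as np

def select_by_ratio(nsm, ratio=0.5):
    """方式一: 按相邻相似度从大到小的顺序选取前 ratio 比例的特征点, 返回其 (行, 列) 索引。"""
    H, W = nsm.shape
    k = int(H * W * ratio)                        # 第一特征点的选取数量
    flat_idx = np.argsort(nsm.ravel())[::-1][:k]  # 相邻相似度从大到小
    return np.stack(np.unravel_index(flat_idx, (H, W)), axis=1)

def select_by_threshold(nsm, thresh=0.8):
    """方式二: 选取相邻相似度大于或等于阈值的特征点, 返回其 (行, 列) 索引。"""
    return np.argwhere(nsm >= thresh)

nsm = np.random.rand(16, 16).astype(np.float32)  # 与第一特征数据尺寸一致的相邻相似矩阵
print(select_by_ratio(nsm, 0.5).shape)           # (128, 2)
print(select_by_threshold(nsm, 0.8).shape)
```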
针对S103和S104:
这里,可以利用历史特征数据中包括的全部特征点的特征数据,对第一特征点的特征数据进行语义加强,生成第一特征点对应的语义加强后的加强特征点的特征数据。其中,加强特征点的特征数据中包括有历史特征数据中的特征信息和第一特征点的特征信息,加强特征点的语义信息较为丰富。比如,可以将历史特征数据和第一特征点的特征数据输入至时序转换器(Temporal Transformer)中,对第一特征点的特征数据进行语义加强,使得待检测视频帧中的第一特征点能够捕获历史视频帧中的时序信息和语义信息,生成信息较为丰富的加强特征点的特征数据,以便基于加强特征点的特征数据,能够得到视频数据对应的时序一致的语义分割结果。
或者,也可以从历史特征数据中选取与第一特征点的位置数据匹配的第二特征点,利用第二特征点的特征数据,对第一特征点的特征数据进行语义加强,生成第一特征点对应的语义加强后的加强特征点的特征数据。比如,可以将第二特征点的特征数据和第一特征点的特征数据输入至时序转换器中,对第一特征点的特征数据进行语义加强,生成加强特征点的特征数据。
一种可选实施方式中,基于历史特征数据和第一特征点的特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据,包括:基于第一特征点的位置数据、和历史特征数据对应的区域半径,从历史特征数据对应的多个特征点中,确定第二特征点;基于第二特征点的特征数据和第一特征点的特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据。
由于历史视频帧与待检测视频帧之间存在时序关系，且处于移动状态中的目标对象在不同视频帧中的尺寸会发生改变，因此，基于第一特征点的位置数据、和历史特征数据对应的区域半径，能够较为方便和高效地从历史特征数据对应的多个特征点中确定第二特征点。
同时,由于该第二特征点具有的语义信息与第一特征点具有的语义信息一致的可能性较高,再利用第二特征点的特征数据和第一特征点的特征数据,能够较准确的对第一特征点进行语义加强。并且本公开实施方式中,与使用历史特征数据中全部特征点的特征数据对第一特征点进行语义加强相比,在提高了语义分割精准度的同时,能够减少语义分割的时间复杂度,提升了语义分割的效率。
本公开实施方式中，在历史视频帧为多帧时，每个历史视频帧对应一个历史特征数据，每个历史特征数据对应一个区域半径，不同的历史特征数据对应不同的区域半径，比如，第T帧历史视频帧的历史特征数据对应的区域半径可以为 $l_T$、第T-1帧历史视频帧的历史特征数据对应的区域半径可以为 $l_{T-1}$。
针对每帧历史特征数据，可以根据该历史特征数据对应的区域半径和第一特征点的位置数据，从该历史特征数据对应的多个特征点中，确定该历史特征数据上的第二特征点。
一种可选实施方式中,可以根据下述步骤确定历史特征数据对应的区域半径:
步骤301,基于历史特征数据对应的目标帧数、和设置的半径起始值、帧数阈值、扩展系数,确定历史特征数据对应的候选半径;
步骤302,在候选半径小于设置的半径截止值的情况下,将候选半径确定为历史特征数据对应的区域半径;
步骤303,在候选半径大于或等于半径截止值的情况下,将半径截止值确定为历史特征数据对应的区域半径。
首先确定历史特征数据对应的候选半径,在候选半径小于半径截止值时,将该候选半径确定为历史特征数据对应的区域半径;在候选半径大于或等于半径截止值时,将半径截止值确定为历史特征数据对应的区域半径。
考虑到目标对象在视频数据的不同视频帧中的尺寸会发生变化,因此,可以为每个历史特征数据确定对应的区域半径,进而能够基于区域半径,较准确的确定每个历史特征数据的第二特征点。
实施时，可以根据下述公式（4）确定第 $t$ 帧历史特征数据对应的区域半径 $l_t$：

$$l_t = \min\bigl(s + \epsilon\,(T - t),\; e\bigr) \qquad (4)$$

其中，$s$ 为半径起始值，$\epsilon$ 为扩展系数，$t$ 为目标帧数，$T$ 为帧数阈值，$e$ 为半径截止值。$s$、$\epsilon$、$e$ 可以根据实际情况进行设置。帧数阈值 $T$ 为历史视频帧的数量。$t$ 为历史视频帧的目标帧数，比如，第 $T$ 帧历史视频帧的目标帧数为 $T$（即 $t=T$），第 $T-1$ 帧历史视频帧的目标帧数为 $T-1$（即 $t=T-1$）。
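对应上述步骤301至步骤303，区域半径的确定可以写成如下示意性代码（其中"候选半径随帧间距线性扩展"只是一种假设的实现方式，参数取值亦为示例）：

```python
def region_radius(t, T, s=1, eps=1, e=4):
    """t: 历史特征数据对应的目标帧数; T: 帧数阈值;
    s: 半径起始值; eps: 扩展系数; e: 半径截止值。"""
    candidate = s + eps * (T - t)             # 步骤301: 候选半径(假设按帧间距线性扩展)
    return candidate if candidate < e else e  # 步骤302/303: 与半径截止值比较

# 距离待检测帧越远(t 越小), 区域半径越大, 直至截止值
print([region_radius(t, T=5) for t in range(1, 6)])  # [4, 4, 3, 2, 1]
```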
一种可选实施方式中,基于第一特征点的位置数据、和历史特征数据对应的区域半径,从历史特征数据对应的多个特征点中,确定第二特征点,可以包括:从历史特征数据中确定与第一特征点的位置数据匹配的中间特征点;基于区域半径,以中间特征点为中心,确定历史特征数据中的目标区域;将历史特征数据中位于目标区域内的各个特征点,确定为第二特征点。
参见图4所示,图4中包括第一特征数据41、第一历史特征数据42和第二历史特征数据43,第一特征数据41中包括第一特征点411,第一历史特征数据42中包括与第一特征点411的位置数据匹配的中间特征点421,在第一历史特征数据42对应的区域半径为1时,可以得到第一历史特征数据中的目标区域,即第一矩形框422中的区域为目标区域,进而可以将第一历史特征数据中位于目标区域内的各个特征点,确定为第一历 史特征数据对应的第二特征点;第二历史特征数据43中包括与第一特征点411的位置数据匹配的中间特征点431,在第二历史特征数据43对应的区域半径为2时,可以得到第二历史特征数据中的目标区域,即第二矩形框432中的区域为目标区域,进而可以将第二历史特征数据中位于目标区域内的各个特征点,确定为第二历史特征数据对应的第二特征点。
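以下为从单帧历史特征数据中确定第二特征点的一段示意性代码（基于NumPy，边界按截断处理，变量命名均为假设，仅作说明）：

```python
import numpy as np

def second_feature_points(hist_feature, u, v, radius):
    """hist_feature: 历史特征数据 (C, H, W); (u, v): 与第一特征点位置数据匹配的中间特征点位置;
    radius: 该历史特征数据对应的区域半径。返回目标区域内各第二特征点的特征矩阵 (N, C)。"""
    C, H, W = hist_feature.shape
    u0, u1 = max(u - radius, 0), min(u + radius, H - 1)
    v0, v1 = max(v - radius, 0), min(v + radius, W - 1)
    region = hist_feature[:, u0:u1 + 1, v0:v1 + 1]  # 以中间特征点为中心的目标区域
    return region.reshape(C, -1).T                  # 每行对应一个第二特征点的特征向量

hist = np.random.randn(128, 56, 56).astype(np.float32)
print(second_feature_points(hist, u=10, v=20, radius=2).shape)  # (25, 128)
```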
一种可能的实施方式中,基于历史特征数据和第一特征点的特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据,包括:基于历史特征数据和第一特征点的特征数据,生成融合特征数据;对融合特征数据进行特征提取,生成中间特征数据;基于中间特征数据和融合特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据。
或者,也可以基于第二特征点的特征数据和第一特征点的特征数据,生成融合特征数据;对融合特征数据进行特征提取,生成中间特征数据;基于中间特征数据和融合特征数据,生成第一特征点对应的语义加强后的加强特征点的特征数据。
在得到加强处理特征点的特征数据之后,可以对加强处理特征点的特征数据、和第一特征数据对应的多个特征点中除第一特征点外的其他特征点的特征数据进行特征提取,确定待检测视频帧中每个像素点对应的目标语义信息,得到待检测视频帧对应的语义分割结果。其中,语义分割结果中可以包括语义分割图,语义分割图中每个像素点对应一个语义标签,不同的语义标签可以使用不同的颜色进行标注。
一种可能的实施方式中,待检测视频帧中每个像素点对应的目标语义信息为利用训练后的语义分割神经网络得到的;语义分割神经网络包括:共享编码器、特征点选择模块、时序转换器、和分割解码器;
共享编码器用于分别对待检测视频帧和历史视频帧进行特征提取,获取待检测视频帧对应的第一特征数据和历史视频帧。特征点选择模块用于从第一特征数据对应的多个特征点中确定第一特征点。时序转换器用于基于历史视频帧对应的历史特征数据,对第一特征点的特征数据进行语义加强处理,生成第一特征点对应的加强特征点的特征数据。分割解码器用于基于加强特征点的特征数据、以及第一特征数据对应的多个特征点中除第一特征点外的其它特征点的特征数据,确定待检测视频帧中每个像素点对应的目标语义信息。
上述实施方式中,基于待检测视频帧和历史视频帧,使用语义分割神经网络,实现对视频数据中不同视频帧的时序一致的语义分割,在保障语义分割精准度的同时,提高了语义分割的效率。
参见图5所示的语义分割神经网络的结构示意图,结合图5对视频语义分割方法的过程进行说明。参见图6所示,该视频语义分割方法可以包括:
S601、获取视频数据中的待检测视频帧 $F_{T+1}$、和多帧历史视频帧。其中，多帧历史视频帧包括历史视频帧 $F_T$、历史视频帧 $F_{T-1}$、历史视频帧 $F_{T-2}$、历史视频帧 $F_{T-3}$、…、历史视频帧 $F_1$。
S602、通过语义分割神经网络中的共享编码器对待检测视频帧 $F_{T+1}$ 进行特征提取，得到第一特征数据；以及获取共享编码器对每个历史视频帧进行特征提取后生成的历史特征数据。即获取到的历史特征数据包括：历史视频帧 $F_T$ 对应的历史特征数据、历史视频帧 $F_{T-1}$ 对应的历史特征数据、…、历史视频帧 $F_1$ 对应的历史特征数据。
S603、通过语义分割神经网络中的特征点选择模块,从第一特征数据对应的多个特征点中,确定第一特征点。
实施时，可以确定第一特征数据对应的相邻相似矩阵NSM，其中，NSM的确定过程，可以参考上述说明。再可以根据NSM，从第一特征数据对应的多个特征点中，确定第一特征点。示例性的，可以设置选取比例（比如50%），按照NSM指示的第一特征数据中每个特征点对应的相邻相似度从大到小的顺序，选取第一特征点。比如，灰色方框对应的特征点为第一特征点51。
S604、通过语义分割神经网络中的特征点选择模块,基于第一特征点的位置数据、和历史特征数据对应的区域半径,从历史特征数据对应的多个特征点中,确定第二特征点。
S605、将第一特征点的特征数据和第二特征点的特征数据输入至时序转换器中,生成第一特征点对应的语义加强后的加强特征点的特征数据。
实施时，可以将第一特征点的特征数据，作为时序转换器的查询query输入；将第二特征点的特征数据，作为时序转换器的关键key输入；将第二特征点的特征数据，作为时序转换器的价值value输入；可知key输入与value输入相同。时序转换器中的多注意力机制模块对输入数据进行特征融合，生成第一融合特征数据，参见公式（5）（单头注意力机制模块的输出结果）和公式（6）（多注意力机制模块的输出结果），$\mathrm{MH}(Q,K,V)$ 即为第一融合特征数据；再将第一融合特征数据和第一特征点的特征数据，输入至时序转换器中的特征处理层Add&Norm一进行特征融合，生成第二融合特征数据，参见公式（7），$X$ 即为第二融合特征数据；再将第二融合特征数据输入至前馈处理层Feed Forward Layer进行特征提取，生成第三融合特征数据，参见公式（8），$\mathrm{FFN}(X)$ 即为第三融合特征数据；再将第三融合特征数据和第二融合特征数据输入至特征处理层Add&Norm二进行特征融合，生成第一特征点对应的语义加强后的加强特征点的特征数据，参见公式（9），$\mathrm{TFE}$ 即为加强特征点的特征数据。

$$\mathrm{head}_j = \mathrm{SoftMax}\!\left(\frac{(QW_j^{Q})(KW_j^{K})^{\top}}{\sqrt{d_k}}\right)(VW_j^{V}) \qquad (5)$$

$$\mathrm{MH}(Q,K,V) = [\mathrm{head}_1, \ldots, \mathrm{head}_h]\,W^{O} \qquad (6)$$

$$X = \mathrm{LN}\bigl(\mathrm{MH}(Q,K,V) + Q\bigr) \qquad (7)$$

$$\mathrm{FFN}(X) = \max(0,\, XW_1 + b_1)\,W_2 + b_2 \qquad (8)$$

$$\mathrm{TFE} = \mathrm{LN}\bigl(\mathrm{FFN}(X) + X\bigr) \qquad (9)$$

其中，$W_j^{Q}$、$W_j^{K}$、$W_j^{V}$ 为第 $j$ 个注意力头的投影矩阵，$d_k$ 为单个注意力头的特征维度，$W^{O}$ 为多头输出的投影矩阵，$[\,\cdot,\ldots,\cdot\,]$ 表示串联，$\mathrm{MH}(\cdot)$ 是多头注意力的缩写，$Q$ 是query（第一特征点的特征数据），$K$ 是key（第二特征点的特征数据），$V$ 是value（与key相同），$\mathrm{LN}$ 是Layer Normalization的缩写，$W_1$、$W_2$ 是权重，$b_1$、$b_2$ 是偏置。
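下面给出与公式（5）至公式（9）对应的一段示意性PyTorch实现（单层结构，特征维度与头数均为假设值，仅用于说明时序转换器对第一特征点进行语义加强的计算过程，并非本公开限定的实现方式）：

```python
import torch
import torch.nn as nn

class TemporalFeatureEnhance(nn.Module):
    """示意性实现: Q 取第一特征点的特征, K=V 取第二特征点的特征, 输出加强特征点的特征。"""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # 公式(5)(6)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(),
                                 nn.Linear(dim * 4, dim))                 # 公式(8)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, q_feat, kv_feat):
        # q_feat: (B, N1, C) 第一特征点的特征; kv_feat: (B, N2, C) 第二特征点的特征
        mh, _ = self.attn(q_feat, kv_feat, kv_feat)  # 第一融合特征数据
        x = self.norm1(mh + q_feat)                  # 公式(7): Add&Norm一
        y = self.ffn(x)                              # 第三融合特征数据
        return self.norm2(y + x)                     # 公式(9): Add&Norm二, 即加强特征点的特征

tfe = TemporalFeatureEnhance(dim=128, heads=4)
q = torch.randn(1, 128, 128)      # 128 个第一特征点, 每个 128 维
kv = torch.randn(1, 9 * 128, 128) # 各历史帧目标区域内的第二特征点
print(tfe(q, kv).shape)           # torch.Size([1, 128, 128])
```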
S606、利用语义分割神经网络中的分割解码器，对加强特征点的特征数据，和第一特征数据对应的多个特征点中除第一特征点外的其他特征点的特征数据进行处理，确定待检测视频帧中每个像素点对应的目标语义信息，从而得到语义分割结果。
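结合S601至S606，整体推理流程可以概括为如下示意性伪实现（其中 encoder、point_selector、temporal_transformer、decoder 为假设的可调用对象，分别对应共享编码器、特征点选择模块、时序转换器和分割解码器，仅体现各模块的组合关系）：

```python
import numpy as np

def video_semantic_segmentation(frame, history_features,
                                encoder, point_selector,
                                temporal_transformer, decoder):
    """示意性整体流程(对应S601至S606), 各模块均为假设的可调用对象。"""
    first_feature = encoder(frame)                      # S602: 第一特征数据 (C, H, W)
    idx = point_selector(first_feature)                 # S603: 第一特征点的 (行, 列) 索引, 形状 (N, 2)
    enhanced = temporal_transformer(first_feature, idx,
                                    history_features)   # S604/S605: 加强特征点的特征 (C, N)
    feature = first_feature.copy()
    feature[:, idx[:, 0], idx[:, 1]] = enhanced         # 用加强后的特征替换第一特征点处的特征
    semantic_map = decoder(feature)                     # S606: 每个像素点的目标语义信息
    history_features.append(first_feature)              # 缓存特征数据, 供后续帧复用, 避免重复提取
    return semantic_map
```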
本公开实施方式提出的视频语义分割方法可以应用于自动驾驶、直播、增强现实(Augmented Reality,AR)等需要进行视频语义分割的场景中。
示例性的，在视频语义分割方法应用于自动驾驶领域时，视频数据可以为行驶装置在行驶过程中采集的道路视频数据。此时，可以基于上述的视频语义分割方法，对道路视频数据中的各个道路视频帧进行语义分割，生成每个道路视频帧对应的语义分割结果；再基于每个道路视频帧对应的语义分割结果，控制行驶装置。
示例性的,行驶装置可以为自动驾驶车辆、装有高级驾驶辅助系统(Advanced Driving Assistance System,ADAS)的车辆、或者机器人等。其中,在控制行驶装置时,可以控制行驶装置加速、减速、转向、制动等,或者可以播放语音提示信息,以提示驾驶员控制行驶装置加速、减速、转向、制动等。
通过利用视频语义分割方法对道路视频数据中的各个道路视频帧进行处理，生成每个道路视频帧对应的语义分割结果，提高了语义分割结果的准确度和确定效率，进而基于每个道路视频帧对应的语义分割结果，能够较精准和较高效的控制行驶装置。
在视频语义分割方法应用于AR场景时,视频数据可以为AR设备实时场景的场景视频数据,利用本公开实施方式提出的视频语义分割方法,对场景视频数据中的各个场景视频帧进行语义分割,生成每个场景视频帧对应的语义分割结果;再根据每个场景视频帧对应的语义分割结果指示的目标对象的语义信息、以及预先设置的语义与虚拟对象之间的匹配关系,确定匹配的目标虚拟对象;并控制AR设备展示包含目标虚拟对象的场景视频。比如,目标对象的语义信息为行人,则匹配的目标虚拟对象可以为预先设置好的虚拟人物;目标对象的语义信息为建筑物,则匹配的目标虚拟对象可以为预先设置好的虚拟建筑物等。
在视频语义分割方法应用于直播场景时,视频数据可以为直播视频数据,利用本公开实施方式提出的视频语义分割方法,对直播视频数据中的各个直播视频帧进行语义分割,生成每个直播视频帧对应的语义分割结果;再根据每个直播视频帧对应的语义分割结果,对直播视频帧进行背景替换。比如,可以将直播视频帧中,语义分割结果指示的除了人类之外的其他语义的像素点的像素信息替换为预设值,生成背景替换后的直播视频帧。
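以直播场景中的背景替换为例，得到语义分割结果后，其一种示意性的实现方式如下（基于NumPy，其中的类别编号与预设像素值均为假设，仅作说明）：

```python
import numpy as np

def replace_background(frame, semantic_map, person_label=1, bg_value=(0, 255, 0)):
    """frame: 直播视频帧 (H, W, 3); semantic_map: 每个像素点的语义标签 (H, W)。
    将除"人"(person_label)之外的其他语义的像素点的像素信息替换为预设值 bg_value。"""
    out = frame.copy()
    out[semantic_map != person_label] = bg_value
    return out

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
semantic_map = np.random.randint(0, 3, (480, 640))
result = replace_background(frame, semantic_map)
print(result.shape)  # (480, 640, 3)
```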
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
基于相同的构思,本公开实施例还提供了一种视频语义分割装置,参见图7所示,为本公开实施例提供的视频语义分割装置的架构示意图,包括获取模块701、第一确定模块702、处理模块703、第二确定模块704,其中:
获取模块701,配置为获取视频数据中待检测视频帧对应的第一特征数据,以及所述视频数据中采集时间位于所述待检测视频帧之前的历史视频帧对应的历史特征数据;
第一确定模块702,配置为从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点;其中,所述复杂图像区域为包括多个不同语义的目标对象的至少部分像素点的区域;
处理模块703,配置为基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据;
第二确定模块704,配置为基于所述加强特征点的特征数据,和所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
一种可能的实施方式中,所述第一确定模块702,在从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点的情况下,配置为:
确定所述第一特征数据对应的每个特征点的相邻相似度;其中,所述相邻相似度用于表征所述特征点与多个周围特征点之间的特征相似度分布;
基于所述相邻相似度,从所述第一特征数据对应的多个特征点中,确定所述第一特征点。
一种可能的实施方式中,所述第一确定模块702,在确定所述第一特征数据对应的每个特征点的相邻相似度的情况下,配置为:
将所述第一特征数据对应的每个特征点分别作为当前特征点,基于所述当前特征点的位置数据和预先设置的邻域半径,确定所述当前特征点的邻域特征矩阵;其中,所述邻域特征矩阵包括位于所述当前特征点的邻域内的各个特征点的特征向量;
基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的 所述相邻相似度。
一种可能的实施方式中,所述第一确定模块702,在基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的所述相邻相似度的情况下,配置为:
基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的至少一种目标相似度;其中,所述目标相似度包括以下至少之一:用于表征所述当前特征点的邻域内的各个特征点与所述当前特征点之间的特征相似度分布、和均匀分布之间的相似程度的第一目标相似度;用于表征所述当前特征点的邻域内各个特征点与所述当前特征点之间的平均特征相似度的第二目标相似度;
基于所述至少一种目标相似度,确定所述当前特征点的所述相邻相似度。
一种可能的实施方式中,在所述目标相似度包括第一目标相似度的情况下,所述第一确定模块702,在基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的目标相似度的情况下,配置为:
确定所述当前特征点的邻域内每个特征点的特征向量与所述当前特征点的特征向量之间的特征相似度;
基于所述特征相似度,得到所述当前特征点对应的相似度分布向量;
基于所述相似度分布向量和确定的均匀分布向量,确定所述当前特征点对应的第一目标相似度。
一种可能的实施方式中,在所述目标相似度包括所述第二目标相似度的情况下,所述第一确定模块702,在基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的第二目标相似度的情况下,配置为:
确定所述邻域特征矩阵中的每个特征向量与所述当前特征点的特征向量之间的夹角余弦值;
基于所述邻域特征矩阵中的各个特征向量分别对应的夹角余弦值,确定所述目标特征点对应的第二目标相似度。
一种可能的实施方式中,所述第一确定模块702,在基于所述相邻相似度,从所述第一特征数据对应的多个特征点中,确定所述第一特征点的情况下,配置为:
基于所述第一特征数据对应的特征点的数量和预先设置的选取比例,确定第一特征点的选取数量;
确定所述选取数量的所述第一特征点的方法包括以下至少之一:
按照所述相邻相似度从大到小的顺序,从所述第一特征数据对应的多个特征点中,确定所述选取数量的所述第一特征点;
基于所述相邻相似度和设置的相似度阈值,从所述第一特征数据对应的多个特征点中,确定所述选取数量的所述第一特征点。
一种可能的实施方式中,所述处理模块703,在基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据的情况下,配置为:
基于所述第一特征点的位置数据、和所述历史特征数据对应的区域半径,从所述历史特征数据对应的多个特征点中,确定第二特征点;
基于所述第二特征点的特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据。
一种可能的实施方式中,所述处理模块703,在基于所述第一特征点的位置数据、和所述历史特征数据对应的区域半径,从所述历史特征数据对应的多个特征点中,确定第二特征点的情况下,配置为:
从所述历史特征数据中确定与所述第一特征点的位置数据匹配的中间特征点;
基于所述区域半径,以所述中间特征点为中心,确定所述历史特征数据中的目标区域;
将所述历史特征数据中位于所述目标区域内的各个特征点,确定为所述第二特征点。
一种可能的实施方式中,所述处理模块703,配置为:根据下述步骤确定所述历史特征数据对应的区域半径:
基于所述历史特征数据对应的目标帧数、和设置的半径起始值、帧数阈值、扩展系数,确定所述历史特征数据对应的候选半径;
在所述候选半径小于设置的半径截止值的情况下,将所述候选半径确定为所述历史特征数据对应的区域半径;
在所述候选半径大于或等于所述半径截止值的情况下,将所述半径截止值确定为所述历史特征数据对应的区域半径。
一种可能的实施方式中,所述处理模块703,在基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据的情况下,配置为:
基于所述历史特征数据和所述第一特征点的特征数据,生成融合特征数据;
对所述融合特征数据进行特征提取,生成中间特征数据;
基于所述中间特征数据和所述融合特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据。
一种可能的实施方式中,所述待检测视频帧中每个像素点对应的目标语义信息为利用训练后的语义分割神经网络得到的;所述语义分割神经网络包括:共享编码器、特征点选择模块、时序转换器、和分割解码器;
所述共享编码器用于分别对所述待检测视频帧和所述历史视频帧进行特征提取,获取所述待检测视频帧对应的第一特征数据和所述历史视频帧;所述特征点选择模块用于从所述第一特征数据对应的多个特征点中确定所述第一特征点;
所述时序转换器用于基于所述历史视频帧对应的历史特征数据,对所述第一特征点的特征数据进行语义加强处理,生成所述第一特征点对应的加强特征点的特征数据;
所述分割解码器用于基于所述加强特征点的特征数据、以及所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
在一些实施例中,本公开实施例提供的装置具有的功能或包含的模板可以用于执行上文方法实施例描述的方法,其实现方式可以参照上文方法实施例的描述。
基于同一技术构思,本公开实施例还提供了一种电子设备。参照图8所示,为本公开实施例提供的电子设备800的结构示意图,包括处理器801、存储器802和总线803。其中,存储器802用于存储执行指令,包括内存8021和外部存储器8022;这里的内存8021也称内存储器,用于暂时存放处理器801中的运算数据,以及与硬盘等外部存储器8022交换的数据,处理器801通过内存8021与外部存储器8022进行数据交换,当电子设备800运行时,处理器801与存储器802之间通过总线803通信,使得处理器801在执行以下指令:
获取视频数据中待检测视频帧对应的第一特征数据,以及所述视频数据中采集时间位于所述待检测视频帧之前的历史视频帧对应的历史特征数据;
从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点;其中,所述复杂图像区域为包括多个不同语义的目标对象的至少部分像素点的区域;
基于所述历史特征数据和所述第一特征点的特征数据，生成所述第一特征点对应的语义加强后的加强特征点的特征数据；
基于所述加强特征点的特征数据,和所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
其中,处理器801的处理流程可以参照上述方法实施例的记载。
此外,本公开实施例还提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行上述方法实施例中所述的视频语义分割方法的步骤。其中,该存储介质可以是易失性或非易失的计算机可读取存储介质。
本公开实施例还提供一种计算机程序产品,该计算机程序产品承载有程序代码,所述程序代码包括的指令可用于执行上述方法实施例中所述的视频语义分割方法的步骤,可参见上述方法实施例。
其中,上述计算机程序产品可以具体通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品体现为计算机存储介质,在另一个可选实施例中,计算机程序产品体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。
本公开涉及增强现实领域,通过获取现实环境中的目标对象的图像信息,进而借助各类视觉相关算法实现对目标对象的相关特征、状态及属性进行检测或识别处理,从而得到与应用场景匹配的虚拟与现实相结合的AR效果。
示例性的,目标对象可涉及与人体相关的脸部、肢体、手势、动作等,或者与物体相关的标识物、标志物,或者与场馆或场所相关的沙盘、展示区域或展示物品等。视觉相关算法可涉及视觉定位、即时定位与地图构建(Simultaneous Localization And Mapping,SLAM)、三维重建、图像注册、背景分割、对象的关键点提取及跟踪、对象的位姿或深度检测等。应用场景不仅可以涉及跟真实场景或物品相关的导览、导航、讲解、重建、虚拟效果叠加展示等交互场景,还可以涉及与人相关的特效处理,比如妆容美化、肢体美化、特效展示、虚拟模型展示等交互场景。可通过卷积神经网络,实现对目标对象的相关特征、状态及属性进行检测或识别处理。上述卷积神经网络是基于深度学习框架进行模型训练而得到的网络模型。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统和装置的工作过程,可以参考前述方法实施例中的对应过程。在本公开所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。以上所描述的装置实施例是示意性的,例如,所述单元的划分,为一种逻辑功能划分,实际实现时可以有另外的划分方式,又例如,多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些通信接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个处理器可执行的非易失的计算机可读取存储介质中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软 件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上为本公开的具体实施方式,但本公开的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本公开揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本公开的保护范围之内。因此,本公开的保护范围应以权利要求的保护范围为准。

Claims (27)

  1. 一种视频语义分割方法,包括:
    获取视频数据中待检测视频帧对应的第一特征数据,以及所述视频数据中采集时间位于所述待检测视频帧之前的历史视频帧对应的历史特征数据;
    从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点;其中,所述复杂图像区域为包括多个不同语义的目标对象的至少部分像素点的区域;
    基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据;
    基于所述加强特征点的特征数据,和所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
  2. 根据权利要求1所述的方法,其中,所述从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点,包括:
    确定所述第一特征数据对应的每个特征点的相邻相似度;其中,所述相邻相似度用于表征所述特征点与多个周围特征点之间的特征相似度分布;
    基于所述相邻相似度,从所述第一特征数据对应的多个特征点中,确定所述第一特征点。
  3. 根据权利要求2所述的方法,其中,所述确定所述第一特征数据对应的每个特征点的相邻相似度,包括:
    将所述第一特征数据对应的每个特征点分别作为当前特征点,基于所述当前特征点的位置数据和预先设置的邻域半径,确定所述当前特征点的邻域特征矩阵;其中,所述邻域特征矩阵包括位于所述当前特征点的邻域内的各个特征点的特征向量;
    基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的所述相邻相似度。
  4. 根据权利要求3所述的方法,其中,所述基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的所述相邻相似度,包括:
    基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的至少一种目标相似度;其中,所述目标相似度包括以下至少之一:用于表征所述当前特征点的邻域内的各个特征点与所述当前特征点之间的特征相似度分布、和均匀分布之间的相似程度的第一目标相似度;用于表征所述当前特征点的邻域内各个特征点与所述当前特征点之间的平均特征相似度的第二目标相似度;
    基于所述至少一种目标相似度,确定所述当前特征点的所述相邻相似度。
  5. 根据权利要求4所述的方法,其中,在所述目标相似度包括第一目标相似度的情况下,所述基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的目标相似度,包括:
    确定所述当前特征点的邻域内每个特征点的特征向量与所述当前特征点的特征向量之间的特征相似度;
    基于所述特征相似度,得到所述当前特征点对应的相似度分布向量;
    基于所述相似度分布向量和确定的均匀分布向量,确定所述当前特征点对应的第一目标相似度。
  6. 根据权利要求4或5所述的方法,其中,在所述目标相似度包括所述第二目标相似度的情况下,基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的第二目标相似度,包括:
    确定所述邻域特征矩阵中的每个特征向量与所述当前特征点的特征向量之间的夹角余弦值;
    基于所述邻域特征矩阵中的各个特征向量分别对应的夹角余弦值,确定所述目标特征点对应的第二目标相似度。
  7. 根据权利要求2至6任一项所述的方法,其中,所述基于所述相邻相似度,从所述第一特征数据对应的多个特征点中,确定所述第一特征点,包括:
    基于所述第一特征数据对应的特征点的数量和预先设置的选取比例,确定第一特征点的选取数量;
    确定所述选取数量的所述第一特征点的方法包括以下至少之一:按照所述相邻相似度从大到小的顺序,从所述第一特征数据对应的多个特征点中,确定所述选取数量的所述第一特征点;基于所述相邻相似度和设置的相似度阈值,从所述第一特征数据对应的多个特征点中,确定所述选取数量的所述第一特征点。
  8. 根据权利要求1至7任一项所述的方法,其中,所述基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据,包括:
    基于所述第一特征点的位置数据、和所述历史特征数据对应的区域半径,从所述历史特征数据对应的多个特征点中,确定第二特征点;
    基于所述第二特征点的特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据。
  9. 根据权利要求8所述的方法,其中,所述基于所述第一特征点的位置数据、和所述历史特征数据对应的区域半径,从所述历史特征数据对应的多个特征点中,确定第二特征点,包括:
    从所述历史特征数据中确定与所述第一特征点的位置数据匹配的中间特征点;
    基于所述区域半径,以所述中间特征点为中心,确定所述历史特征数据中的目标区域;
    将所述历史特征数据中位于所述目标区域内的各个特征点,确定为所述第二特征点。
  10. 根据权利要求8或9所述的方法,其中,根据下述步骤确定所述历史特征数据对应的区域半径:
    基于所述历史特征数据对应的目标帧数、和设置的半径起始值、帧数阈值、扩展系数,确定所述历史特征数据对应的候选半径;
    在所述候选半径小于设置的半径截止值的情况下,将所述候选半径确定为所述历史特征数据对应的区域半径;
    在所述候选半径大于或等于所述半径截止值的情况下,将所述半径截止值确定为所述历史特征数据对应的区域半径。
  11. 根据权利要求1至10任一项所述的方法,其中,所述基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据,包括:
    基于所述历史特征数据和所述第一特征点的特征数据,生成融合特征数据;
    对所述融合特征数据进行特征提取,生成中间特征数据;
    基于所述中间特征数据和所述融合特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据。
  12. 根据权利要求1至11任一项所述的方法,其中,所述待检测视频帧中每个像素点对应的目标语义信息为利用训练后的语义分割神经网络得到的;所述语义分割神经网络包括:共享编码器、特征点选择模块、时序转换器、和分割解码器;
    所述共享编码器用于分别对所述待检测视频帧和所述历史视频帧进行特征提取,获取所述待检测视频帧对应的第一特征数据和所述历史视频帧;所述特征点选择模块用于从所述第一特征数据对应的多个特征点中确定所述第一特征点;
    所述时序转换器用于基于所述历史视频帧对应的历史特征数据,对所述第一特征点的特征数据进行语义加强处理,生成所述第一特征点对应的加强特征点的特征数据;
    所述分割解码器用于基于所述加强特征点的特征数据、以及所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
  13. 一种视频语义分割装置,包括:
    获取模块,配置为获取视频数据中待检测视频帧对应的第一特征数据,以及所述视频数据中采集时间位于所述待检测视频帧之前的历史视频帧对应的历史特征数据;
    第一确定模块,配置为从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点;其中,所述复杂图像区域为包括多个不同语义的目标对象的至少部分像素点的区域;
    处理模块,配置为基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据;
    第二确定模块,配置为基于所述加强特征点的特征数据,和所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
  14. 根据权利要求13所述的装置,其中,所述第一确定模块在从所述第一特征数据对应的多个特征点中,确定与所述待检测视频帧中复杂图像区域的位置点匹配的第一特征点的情况下,配置为:
    确定所述第一特征数据对应的每个特征点的相邻相似度;其中,所述相邻相似度用于表征所述特征点与多个周围特征点之间的特征相似度分布;
    基于所述相邻相似度,从所述第一特征数据对应的多个特征点中,确定所述第一特征点。
  15. 根据权利要求14所述的装置,其中,所述第一确定模块在确定所述第一特征数据对应的每个特征点的相邻相似度的情况下,配置为:
    将所述第一特征数据对应的每个特征点分别作为当前特征点,基于所述当前特征点的位置数据和预先设置的邻域半径,确定所述当前特征点的邻域特征矩阵;其中,所述邻域特征矩阵包括位于所述当前特征点的邻域内的各个特征点的特征向量;
    基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的所述相邻相似度。
  16. 根据权利要求15所述的装置,其中,所述第一确定模块在基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的所述相邻相似度的情况下,配置为:
    基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的至少一种目标相似度;其中,所述目标相似度包括以下至少之一:用于表征所述当前特征点的邻域内的各个特征点与所述当前特征点之间的特征相似度分布、和均匀分布之间的相似程度的第一目标相似度;用于表征所述当前特征点的邻域内各个特征点与所述当前特征点之间的平均特征相似度的第二目标相似度;
    基于所述至少一种目标相似度,确定所述当前特征点的所述相邻相似度。
  17. 根据权利要求16所述的装置,其中,在所述目标相似度包括第一目标相似度的情况下,所述第一确定模块在基于所述邻域特征矩阵和所述当前特征点的特征向量, 确定所述当前特征点对应的目标相似度的情况下,配置为:
    确定所述当前特征点的邻域内每个特征点的特征向量与所述当前特征点的特征向量之间的特征相似度;
    基于所述特征相似度,得到所述当前特征点对应的相似度分布向量;
    基于所述相似度分布向量和确定的均匀分布向量,确定所述当前特征点对应的第一目标相似度。
  18. 根据权利要求16或17所述的装置,其中,在所述目标相似度包括所述第二目标相似度的情况下,所述第一确定模块在基于所述邻域特征矩阵和所述当前特征点的特征向量,确定所述当前特征点对应的第二目标相似度的情况下,配置为:
    确定所述邻域特征矩阵中的每个特征向量与所述当前特征点的特征向量之间的夹角余弦值;
    基于所述邻域特征矩阵中的各个特征向量分别对应的夹角余弦值,确定所述目标特征点对应的第二目标相似度。
  19. 根据权利要求14至18任一项所述的装置,其中,所述第一确定模块在基于所述相邻相似度,从所述第一特征数据对应的多个特征点中,确定所述第一特征点的情况下,配置为:
    基于所述第一特征数据对应的特征点的数量和预先设置的选取比例,确定第一特征点的选取数量;
    确定所述选取数量的所述第一特征点的方法包括以下至少之一:按照所述相邻相似度从大到小的顺序,从所述第一特征数据对应的多个特征点中,确定所述选取数量的所述第一特征点;基于所述相邻相似度和设置的相似度阈值,从所述第一特征数据对应的多个特征点中,确定所述选取数量的所述第一特征点。
  20. 根据权利要求13至19任一项所述的装置,其中,所述处理模块在基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据的情况下,配置为:
    基于所述第一特征点的位置数据、和所述历史特征数据对应的区域半径,从所述历史特征数据对应的多个特征点中,确定第二特征点;
    基于所述第二特征点的特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据。
  21. 根据权利要求20所述的装置,其中,所述处理模块在基于所述第一特征点的位置数据、和所述历史特征数据对应的区域半径,从所述历史特征数据对应的多个特征点中,确定第二特征点的情况下,配置为:
    从所述历史特征数据中确定与所述第一特征点的位置数据匹配的中间特征点;
    基于所述区域半径,以所述中间特征点为中心,确定所述历史特征数据中的目标区域;
    将所述历史特征数据中位于所述目标区域内的各个特征点,确定为所述第二特征点。
  22. 根据权利要求20或21所述的装置,其中,所述处理模块,配置为:根据下述步骤确定所述历史特征数据对应的区域半径:
    基于所述历史特征数据对应的目标帧数、和设置的半径起始值、帧数阈值、扩展系数,确定所述历史特征数据对应的候选半径;
    在所述候选半径小于设置的半径截止值的情况下,将所述候选半径确定为所述历史特征数据对应的区域半径;
    在所述候选半径大于或等于所述半径截止值的情况下,将所述半径截止值确定为所述历史特征数据对应的区域半径。
  23. 根据权利要求13至22任一项所述的装置,其中,所述处理模块在基于所述历史特征数据和所述第一特征点的特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据的情况下,配置为:
    基于所述历史特征数据和所述第一特征点的特征数据,生成融合特征数据;
    对所述融合特征数据进行特征提取,生成中间特征数据;
    基于所述中间特征数据和所述融合特征数据,生成所述第一特征点对应的语义加强后的加强特征点的特征数据。
  24. 根据权利要求13至23任一项所述的装置,其中,所述待检测视频帧中每个像素点对应的目标语义信息为利用训练后的语义分割神经网络得到的;所述语义分割神经网络包括:共享编码器、特征点选择模块、时序转换器、和分割解码器;
    所述共享编码器用于分别对所述待检测视频帧和所述历史视频帧进行特征提取,获取所述待检测视频帧对应的第一特征数据和所述历史视频帧;所述特征点选择模块用于从所述第一特征数据对应的多个特征点中确定所述第一特征点;
    所述时序转换器用于基于所述历史视频帧对应的历史特征数据,对所述第一特征点的特征数据进行语义加强处理,生成所述第一特征点对应的加强特征点的特征数据;
    所述分割解码器用于基于所述加强特征点的特征数据、以及所述第一特征数据对应的多个特征点中除所述第一特征点外的其它特征点的特征数据,确定所述待检测视频帧中每个像素点对应的目标语义信息。
  25. 一种电子设备,包括:处理器、存储器和总线,所述存储器存储有所述处理器可执行的机器可读指令,当电子设备运行时,所述处理器与所述存储器之间通过总线通信,所述机器可读指令被所述处理器执行时执行如权利要求1至12任一所述的视频语义分割方法的步骤。
  26. 一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行如权利要求1至12任一所述的视频语义分割方法的步骤。
  27. 一种计算机程序产品,包括存储了程序代码的计算机可读存储介质,所述程序代码包括的指令被计算机设备的处理器运行时,实现权利要求1至12中任一项所述的视频语义分割方法的步骤。
PCT/CN2022/120176 2021-09-30 2022-09-21 视频语义分割方法、装置、电子设备、存储介质及计算机程序产品 WO2023051343A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111165458.9A CN114792106A (zh) 2021-09-30 2021-09-30 视频语义分割方法、装置、电子设备及存储介质
CN202111165458.9 2021-09-30

Publications (1)

Publication Number Publication Date
WO2023051343A1 true WO2023051343A1 (zh) 2023-04-06

Family

ID=82460396

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/120176 WO2023051343A1 (zh) 2021-09-30 2022-09-21 视频语义分割方法、装置、电子设备、存储介质及计算机程序产品

Country Status (2)

Country Link
CN (1) CN114792106A (zh)
WO (1) WO2023051343A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114792106A (zh) * 2021-09-30 2022-07-26 上海商汤智能科技有限公司 视频语义分割方法、装置、电子设备及存储介质
CN116030396B (zh) * 2023-02-27 2023-07-04 温州众成科技有限公司 一种用于视频结构化提取的精确分割方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919971A (zh) * 2017-12-13 2019-06-21 北京金山云网络技术有限公司 图像处理方法、装置、电子设备及计算机可读存储介质
CN110188754A (zh) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 图像分割方法和装置、模型训练方法和装置
US20200026928A1 (en) * 2019-09-26 2020-01-23 Intel Corporation Deep learning for dense semantic segmentation in video with automated interactivity and improved temporal coherence
US20210150727A1 (en) * 2019-11-19 2021-05-20 Samsung Electronics Co., Ltd. Method and apparatus with video segmentation
CN113191318A (zh) * 2021-05-21 2021-07-30 上海商汤智能科技有限公司 目标检测方法、装置、电子设备及存储介质
CN114792106A (zh) * 2021-09-30 2022-07-26 上海商汤智能科技有限公司 视频语义分割方法、装置、电子设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117014126A (zh) * 2023-09-26 2023-11-07 深圳市德航智能技术有限公司 基于信道拓展的数据传输方法
CN117014126B (zh) * 2023-09-26 2023-12-08 深圳市德航智能技术有限公司 基于信道拓展的数据传输方法

Also Published As

Publication number Publication date
CN114792106A (zh) 2022-07-26

Similar Documents

Publication Publication Date Title
WO2023051343A1 (zh) 视频语义分割方法、装置、电子设备、存储介质及计算机程序产品
Bukschat et al. EfficientPose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach
Liao et al. DR-GAN: Automatic radial distortion rectification using conditional GAN in real-time
Zang et al. Attention-based temporal weighted convolutional neural network for action recognition
US10691978B2 (en) Optimal and efficient machine learning method for deep semantic segmentation
US20190122329A1 (en) Face Replacement and Alignment
An et al. Semantic segmentation–aided visual odometry for urban autonomous driving
Vankadari et al. Unsupervised monocular depth estimation for night-time images using adversarial domain feature adaptation
Li et al. Camera localization for augmented reality and indoor positioning: a vision-based 3D feature database approach
Jaus et al. Panoramic panoptic segmentation: Towards complete surrounding understanding via unsupervised contrastive learning
Zhang et al. Lightweight and efficient asymmetric network design for real-time semantic segmentation
CN111382647B (zh) 一种图片处理方法、装置、设备及存储介质
GB2572025A (en) Urban environment labelling
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
Wang et al. (2+ 1) D-SLR: an efficient network for video sign language recognition
Nguyen et al. Multi-camera multi-object tracking on the move via single-stage global association approach
Cao et al. CMAN: Leaning global structure correlation for monocular 3D object detection
Yu et al. Stacked generative adversarial networks for image compositing
WO2024082602A1 (zh) 一种端到端视觉里程计方法及装置
CN113821652A (zh) 模型数据处理方法、装置、电子设备以及计算机可读介质
Hastürk et al. DUDMap: 3D RGB-D mapping for dense, unstructured, and dynamic environment
Song et al. TransBoNet: Learning camera localization with Transformer Bottleneck and Attention
Wang et al. Dvt-slam: Deep-learning based visible and thermal fusion slam
KR20230078134A (ko) 제로샷 시맨틱 분할 장치 및 방법
CN115439922A (zh) 对象行为识别方法、装置、设备及介质

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE