CN116385947B - Video target segmentation method, device, computer equipment and storage medium

Video target segmentation method, device, computer equipment and storage medium

Info

Publication number
CN116385947B
CN116385947B
Authority
CN
China
Prior art keywords
image
video
vector
target segmentation
image block
Prior art date
Legal status
Active
Application number
CN202310661219.5A
Other languages
Chinese (zh)
Other versions
CN116385947A (en)
Inventor
刘鹏
张真
秦恩泉
熊浪
Current Assignee
Nanjing Innovative Data Technologies Inc
Original Assignee
Nanjing Innovative Data Technologies Inc
Priority date
Filing date
Publication date
Application filed by Nanjing Innovative Data Technologies Inc filed Critical Nanjing Innovative Data Technologies Inc
Priority to CN202310661219.5A
Publication of CN116385947A
Application granted
Publication of CN116385947B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N 3/08 - Learning methods
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y02T 10/40 - Engine management systems

Abstract

The present application relates to the field of video image processing, and in particular to a video object segmentation method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring a video to be subjected to target segmentation; performing continuous frame slicing and merging processing on the video to obtain a plurality of image blocks; respectively extracting multi-level semantic features of the image blocks with a pre-trained image coding model to output image embedded vectors, and generating multiple types of prompt vectors of the image blocks with a pre-trained prompt coding model; inputting the image embedded vectors and the prompt vectors into a pre-trained decoding model for prompt decoding to obtain target segmentation masks of the image blocks; and restoring the target segmentation masks of the image blocks according to the frame order of the video to obtain the target segmentation result of the video. The method can rapidly segment targets in large-scale video images, improving target segmentation efficiency while also improving the target segmentation accuracy of large-scale video images.

Description

Video target segmentation method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of video image processing, and in particular to a video object segmentation method and apparatus, a computer device, and a storage medium.
Background
With the development of deep learning, neural network technology is being applied in more and more scenarios. Video object segmentation, as a popular research direction in the field of computer vision, has received increasing attention and has been widely applied in many fields, such as object detection and tracking, video retrieval, security monitoring, film and television post-production, and intelligent transportation. Existing video object segmentation methods fall into two main categories: fully automatic methods and interactive methods. The former automatically estimate the target objects in a video by means of optical-flow-based feature point/region tracking, target proposal regions, and the like, and then solve the problem through clustering, dynamic programming, graph-cut optimization, and so on, which makes it difficult to achieve both high efficiency and high accuracy. Interactive video object segmentation methods require the user to provide a suitable amount of interaction information and use it as a constraint to generate a video object segmentation result that conforms to the user interaction; this is currently the most common approach to video object segmentation.
However, most existing interactive video object segmentation methods are implemented with CNNs and RNNs, whose sequence-modeling ability for video images is poor and which cannot handle image sequences well. In particular, when processing large-scale video image data, they consume a large amount of computing resources, the processing time is too long, and the cost is too high, so the target segmentation efficiency is low and the accuracy is limited.
Disclosure of Invention
The application aims to provide a video target segmentation method, apparatus, computer device, and storage medium that can rapidly segment large-scale video images through the parallel image-sequence processing capability of a Transformer image coding model and image decoding model, improving the target segmentation accuracy of large-scale video images while also improving video target segmentation efficiency.
In order to achieve the above purpose, the present application proposes the following technical scheme:
in a first aspect, the present application proposes a video object segmentation method, including the steps of:
acquiring a video to be subjected to target segmentation;
performing continuous frame slicing and merging processing on the video to be subjected to target segmentation to obtain a plurality of image blocks;
respectively extracting multi-level semantic features of the image blocks by using a pre-trained image coding model, outputting image embedded vectors, and generating various types of prompt vectors of the image blocks by using a pre-trained prompt coding model;
inputting the image embedded vector and the prompt vector into a pre-trained decoding model for prompt decoding to obtain a target segmentation mask of the image block;
and restoring the target segmentation mask of the image block according to the image sequence of the video to obtain a target segmentation result of the video.
Further, the step of performing continuous frame slicing and merging processing on the video to be subjected to target segmentation to obtain a plurality of image blocks specifically includes:
setting a window of a video frame as p;
continuously segmenting the video into a plurality of sub-sequence video frames with the length of p by utilizing the window of the video frame;
and respectively carrying out image combination on the plurality of sub-sequence video frames according to the frame sequence to correspondingly obtain a plurality of image blocks.
Further, the step of extracting the multi-level semantic features of the image blocks by using the pre-trained image coding model and then outputting the image embedded vector specifically includes:
constructing the image coding model through a plurality of transformers and a bottleneck network and pre-training;
performing position coding on the image block, and extracting original position information of a video frame in the image block;
and inputting the image block and its original position information into the pre-trained image coding model for multi-level semantic feature extraction, so as to obtain an image embedded vector including the position information.
Further, the step of generating the plurality of types of hint vectors of the image block by using the pre-trained hint coding model specifically includes:
adopting a position coding mode, and correspondingly generating a first sparse prompt vector of the image block at least according to the prompt modes of points and rectangular frames;
vector embedding is carried out on a prompting mode of a text type through a CLIP model connected with the text and the image, so that a second sparse prompting vector of the image block is obtained;
and, for a mask-type prompt mode, performing dense vector embedding on the image block through a preset mask model to obtain a dense prompt vector of the image block.
Further, the step of inputting the image embedded vector and the hint vector into a pre-trained decoding model to perform hint decoding, and obtaining the target segmentation mask of the image block specifically includes:
constructing a decoding model from a Transformer decoder and an MLP network and pre-training it;
and adding the image embedded vector and the dense prompt vector of an image block, and inputting the sum together with the first sparse prompt vector and the second sparse prompt vector into the pre-trained decoding model to extract prompt-to-image and image-to-prompt bidirectional attention interaction features and generate a target segmentation mask corresponding to the image block.
Further, the preset mask model is composed of two 2×2 convolutional layers, one 1×1 convolutional layer, and one batch normalization layer.
In a second aspect, the present application also provides a video object segmentation apparatus, including:
the acquisition module is used for acquiring the video to be subjected to target segmentation;
the video slicing and merging module is used for carrying out continuous frame slicing and merging processing on the video to be subjected to target segmentation to obtain a plurality of image blocks;
the extraction and generation module is used for extracting multi-level semantic features of the image blocks by utilizing the pre-trained image coding model respectively, outputting image embedded vectors, and generating various types of prompt vectors of the image blocks by utilizing the pre-trained prompt coding model;
the decoding module is used for inputting the image embedded vector and the prompt vector into a pre-trained decoding model to carry out prompt decoding, so as to obtain a target segmentation mask of the image block;
and the restoring module is used for restoring the target segmentation mask of the image block according to the image sequence of the video to obtain a target segmentation result of the video.
Further, the video slicing and merging module includes:
a setting sub-module, configured to set a window of a video frame to p;
the sub-segmentation module is used for continuously segmenting the video into a plurality of sub-sequence video frames with the length of p by utilizing the window of the video frame;
and the merging sub-module is used for respectively merging the images of the plurality of sub-sequence video frames according to the frame sequence to correspondingly obtain a plurality of image blocks.
In a third aspect, the present application also provides a computer device, including a memory and a processor, where the memory stores computer readable instructions, and the processor implements the steps of the video object segmentation method when executing the computer readable instructions.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the video object segmentation method.
The beneficial effects are that:
As can be seen from the above technical solutions, the present application provides a video target segmentation method. The acquired video to be segmented is sliced into consecutive frames which are merged into a plurality of image blocks, which makes it convenient for the subsequent models to process images in batches and thus accelerates image processing. Multi-level semantic features of the image blocks are then extracted with a pre-trained Transformer image coding model to output image embedded vectors, and multiple types of prompt vectors of the image blocks are generated with a pre-trained prompt coding model. The image embedded vectors and prompt vectors are further input into a pre-trained Transformer decoding model for prompt decoding to obtain the target segmentation masks of the image blocks. Finally, the target segmentation masks of the image blocks are restored according to the frame order of the video to obtain the target segmentation result of the video. The Transformer image coding model and image decoding model perform sequence modeling of the images within an image block better and handle image sequences better, and their parallel image-sequence processing capability enables fast target segmentation of large-scale video images, so the target segmentation accuracy of large-scale video images is improved while the efficiency of video target segmentation is also improved.
It should be understood that all combinations of the foregoing concepts, as well as additional concepts described in more detail below, may be considered a part of the inventive subject matter of the present disclosure as long as such concepts are not mutually inconsistent.
The foregoing and other aspects, embodiments, and features of the present teachings will be more fully understood from the following description, taken together with the accompanying drawings. Other additional aspects of the application, such as features and/or advantages of the exemplary embodiments, will be apparent from the description which follows, or may be learned by practice of the embodiments according to the teachings of the application.
Drawings
The drawings are not necessarily drawn to scale. In the drawings, identical or nearly identical components illustrated in the various figures may be represented by like numerals. For clarity, not every component may be labeled in every drawing. Embodiments of various aspects of the application will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of one embodiment of a video object segmentation method provided in accordance with the present application;
FIG. 2 is a flow chart of one embodiment of step S104 of FIG. 1;
FIG. 3 is a flow chart of one embodiment of step S106 of FIG. 1;
FIG. 4 is a flow chart of another embodiment of step S106 of FIG. 1;
FIG. 5 is a flow chart of one embodiment of step S108 of FIG. 1;
FIG. 6 is an overall flowchart of the video object segmentation method provided according to the present application;
FIG. 7 is a schematic diagram illustrating the structure of one embodiment of a video object segmentation apparatus provided in accordance with the present application;
FIG. 8 is a schematic structural diagram of an embodiment of a computer device provided according to the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by a person skilled in the art without creative effort on the basis of the described embodiments fall within the protection scope of the present application. Unless defined otherwise, technical or scientific terms used herein have the ordinary meaning understood by one of ordinary skill in the art to which this application belongs.
The terms "first," "second," and the like in the description and the claims do not indicate any order, quantity, or importance, but only distinguish different elements. Likewise, unless the context clearly indicates otherwise, singular forms such as "a," "an," or "the" do not denote a limitation of quantity but rather the presence of at least one. The terms "comprises," "comprising," and the like mean that the stated features, integers, steps, operations, elements, and/or components are present, without excluding the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Terms such as "up," "down," "left," and "right" indicate only relative positional relationships, which may change accordingly when the absolute position of the described object changes.
As shown in fig. 1, fig. 1 shows a flowchart of an embodiment of a video object segmentation method according to the present application. The video object segmentation method comprises the following steps:
s102, acquiring a video to be subjected to target segmentation;
s104, carrying out continuous frame dicing and merging processing on the video to be subjected to target segmentation to obtain a plurality of image blocks.
In the above step, the video to be subjected to target segmentation contains one or more targets to be segmented; a target may be a foreground object or the background, among others. Before target segmentation, the video can be split into consecutive image frames which are then combined into larger image blocks, which makes it convenient to perform position coding and feature extraction on the image frames within an image block using the image coding model. Specifically, as shown in fig. 2, step S104 includes:
s1041, setting a window of a video frame as p;
s1042, continuously segmenting the video into a plurality of sub-sequence video frames with the length of p by utilizing the window of the video frame;
s1043, respectively carrying out image combination on the plurality of sub-sequence video frames according to a frame sequence to correspondingly obtain a plurality of image blocks.
In the above steps, a video-frame window of size p may be set for each video, and the video is continuously divided into a plurality of sub-sequence video frames of length p by sliding this window. For example, with the window p set to 10, a video lasting 1 minute at a frame rate of 24 frames per second can be continuously divided into 144 sub-sequences, each 10 frames long. The 10 frames of each of the 144 sub-sequences are then merged, in frame order, into one larger image block, yielding 144 image blocks. This makes it convenient for the subsequent models to process images in batches and helps accelerate image processing, thereby improving the efficiency of video target segmentation.
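As a concrete illustration of this step, a minimal Python sketch of the slicing and merging described above is given below; the function names (split_into_subsequences, merge_block) and the choice of stacking the p frames vertically into one block are illustrative assumptions, not details prescribed by the patent.

import numpy as np

def split_into_subsequences(frames, p):
    """Slice a list of frames into consecutive sub-sequences of length p."""
    return [frames[i:i + p] for i in range(0, len(frames), p)]

def merge_block(subsequence):
    """Merge one sub-sequence of frames into a single larger image block by
    stacking the frames vertically (one possible merging scheme)."""
    return np.concatenate(subsequence, axis=0)  # shape: (p*H, W, 3)

# Example: a 1-minute video at 24 frames per second with window p = 10
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(60 * 24)]
blocks = [merge_block(s) for s in split_into_subsequences(frames, p=10)]
print(len(blocks), blocks[0].shape)  # 144 (2240, 224, 3)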
S106, respectively extracting multi-level semantic features of the image blocks by using a pre-trained image coding model, outputting image embedded vectors, and generating various types of prompt vectors of the image blocks by using the pre-trained prompt coding model.
Further, as shown in fig. 3, in the step S106, the step of extracting the multi-level semantic features of the image block by using the pre-trained image coding model and outputting the image embedded vector specifically includes:
s1061, constructing an image coding model through a plurality of transformers and a bottleneck network and pre-training;
s1062, performing position coding on the image block, and extracting original position information of a video frame in the image block;
s1063, inputting the image block and the original position information thereof into a pre-trained image coding model for multi-level semantic feature extraction to obtain an image embedded vector comprising the position information.
In the above steps, the image coding model may be constructed by cascading a plurality of Transformer encoders and finally connecting a bottleneck network. Each Transformer encoder contains a multi-head self-attention (MSA) module, a feed-forward module composed of an MLP (multi-layer perceptron), and layer normalization (LN) modules. Let X_n denote the output of the n-th Transformer encoder and X'_n the output of its multi-head attention module; the Transformer encoder is then computed as X'_n = MSA(LN(X_{n-1})) + X_{n-1} and X_n = MLP(LN(X'_n)) + X'_n.
The subsequent bottleneck network consists of a 1×1 convolution, two batch normalization layers, and a 3×3 convolution, and performs dimension reduction on the output of the Transformer encoders.
Then, the image block obtained in step S104 is position-coded using sine-cosine positional encoding: the original position information of each video frame in the image block is extracted to generate a corresponding position vector, which is embedded into the image block. Finally, the image block is input into the pre-trained image encoder, composed of n Transformer encoders and one bottleneck network, for multi-level semantic feature extraction, and the output is an image embedded vector containing the position information. Compared with traditional models such as CNNs and RNNs, this encoder performs sequence modeling of the images within an image block better and can therefore handle the image sequence better.
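For illustration, a minimal PyTorch sketch of such an image encoder is given below, with n cascaded Transformer encoder blocks implementing the formula above and a bottleneck built from a 1×1 convolution, batch normalization, and a 3×3 convolution; the dimensions (dim=256, 8 heads, n=4), the layer ordering inside the bottleneck, and the class names are assumptions for the sketch rather than values fixed by the patent.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: X'_n = MSA(LN(X_{n-1})) + X_{n-1},
    X_n = MLP(LN(X'_n)) + X'_n."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))

class Bottleneck(nn.Module):
    """Bottleneck network: a 1x1 convolution, two batch normalization layers
    and a 3x3 convolution (layer ordering assumed) for dimension reduction."""
    def __init__(self, dim=256, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, out_dim, kernel_size=1), nn.BatchNorm2d(out_dim),
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_dim))

    def forward(self, x):  # x: (B, dim, H, W)
        return self.net(x)

class ImageEncoder(nn.Module):
    """n cascaded Transformer encoders followed by the bottleneck network."""
    def __init__(self, dim=256, n=4):
        super().__init__()
        self.blocks = nn.ModuleList(EncoderBlock(dim) for _ in range(n))
        self.bottleneck = Bottleneck(dim)

    def forward(self, tokens, hw):  # tokens: (B, H*W, dim) with position info
        for block in self.blocks:
            tokens = block(tokens)
        b, _, d = tokens.shape
        return self.bottleneck(tokens.transpose(1, 2).reshape(b, d, *hw))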
With continued reference to fig. 4, in the step S106, the step of generating the plurality of types of hint vectors of the image block by using the pre-trained hint coding model specifically includes:
s1064, correspondingly generating a first sparse prompt vector of the image block at least according to the prompt modes of points and rectangular frames by adopting a position coding mode;
s1065, carrying out vector embedding on a prompting mode of a text type by a CLIP model for connecting the text and the image to obtain a second sparse prompting vector of the image block;
s1066, implementing thickening on the image blocks through a preset mask model in a prompting mode of mask types
And embedding the dense vector to obtain a dense prompt vector of the image block.
In the embodiment of the application, the prompt modes for video target segmentation are classified into points in the image, rectangular boxes, texts, masks, and the like, and a corresponding prompt vector is generated for each prompt mode, so that a user can choose different prompt modes to interact with the model as required and segment the corresponding targets from the video, satisfying a variety of video target segmentation tasks. For point and rectangular-box prompts, position vectors are generated in the image block using sine and cosine functions of different frequencies (position coding) and then added to the image vectors at the corresponding positions to produce the first sparse prompt vector of the image block. For a text-type prompt mode, the second sparse prompt vector of the image block can be generated by vector-embedding the text through a CLIP model that links text and images. For a mask-type prompt mode, dense vector embedding is performed on the image block through a preset mask model to obtain the dense prompt vector of the image block, where the preset mask model consists of two 2×2 convolutional layers, one 1×1 convolutional layer, and one batch normalization layer.
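A minimal sketch of the three prompt-encoding paths described in this paragraph is given below: sine/cosine position coding for point and box prompts, a convolutional mask embedder matching the stated layer counts (two 2×2 convolutions, one 1×1 convolution, one batch normalization layer), and a placeholder for the CLIP text path, since the patent does not specify a particular CLIP implementation. All names and dimensions are illustrative.

import torch
import torch.nn as nn

def sincos_point_embedding(points, dim=256):
    """First sparse prompt vectors for point / box-corner prompts.
    points: (N, 2) normalized (x, y) coordinates in [0, 1]."""
    freqs = torch.pow(10000.0, -torch.arange(dim // 4) / (dim // 4))
    x = points[:, :1] * freqs  # (N, dim/4)
    y = points[:, 1:] * freqs
    return torch.cat([x.sin(), x.cos(), y.sin(), y.cos()], dim=-1)  # (N, dim)

def text_prompt_embedding(text):
    """Second sparse prompt vector; in the described method this would come
    from a pretrained CLIP text encoder that links text and images."""
    raise NotImplementedError("plug a CLIP text encoder in here")

class MaskPromptEmbedder(nn.Module):
    """Dense prompt vectors: two 2x2 convolutions, one 1x1 convolution and
    one batch normalization layer, as stated for the preset mask model."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=2, stride=2),
            nn.Conv2d(16, 64, kernel_size=2, stride=2),
            nn.Conv2d(64, out_dim, kernel_size=1),
            nn.BatchNorm2d(out_dim))

    def forward(self, mask):  # mask: (B, 1, H, W) binary or soft prompt mask
        return self.net(mask)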
S108, inputting the image embedded vector and the prompt vector into a pre-trained decoding model to carry out prompt decoding, and obtaining a target segmentation mask of the image block.
Further, as shown in fig. 5, the step S108 specifically includes:
s1081, constructing a decoding model from a Transformer decoder and an MLP network and pre-training it;
s1082, adding the image embedded vector and the dense prompt vector of an image block, and inputting the sum together with the first sparse prompt vector and the second sparse prompt vector into the pre-trained decoding model to extract prompt-to-image and image-to-prompt bidirectional attention interaction features and generate the target segmentation mask corresponding to the image block.
In the above steps, the decoding model may be constructed from several cascaded Transformer decoders followed by an MLP network. Inside each Transformer decoder, a mask module, a multi-head attention module, a feed-forward module composed of a feed-forward neural network layer, and a layer normalization module are connected in sequence. The cascaded Transformer decoders take the image embedded vector, the first sparse prompt vector, the second sparse prompt vector, and the dense prompt vector of an image block as inputs and extract bidirectional interaction features through cross-attention in two directions, prompt-to-image and image-to-prompt. Finally, the segmentation mask of the image (the target mask in the image) is predicted through upsampling and the fully connected MLP network and is output as the target segmentation mask of the image block. This improves target segmentation accuracy, and the parallel image-sequence processing capability of the Transformer encoders and decoders enables fast target segmentation of large-scale video images, so large-scale video image data can be segmented efficiently.
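The following simplified PyTorch sketch shows one way to realize the two attention directions (prompt-to-image and image-to-prompt) and the MLP mask head described above; the depth, dimensions, upsampling factor, and the use of a single output token are assumptions of the sketch, not details given in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayBlock(nn.Module):
    """One decoder block with prompt-to-image and image-to-prompt attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.prompt_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_prompt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln_p = nn.LayerNorm(dim)
        self.ln_i = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, prompts, image):
        # prompts: (B, Np, dim) sparse prompt tokens; image: (B, Hi*Wi, dim)
        prompts = self.ln_p(prompts + self.prompt_to_image(prompts, image, image)[0])
        image = self.ln_i(image + self.image_to_prompt(image, prompts, prompts)[0])
        return prompts + self.ffn(prompts), image

class MaskDecoder(nn.Module):
    """Cascaded two-way blocks, upsampling, and an MLP head producing the
    target segmentation mask logits for one image block."""
    def __init__(self, dim=256, depth=2):
        super().__init__()
        self.blocks = nn.ModuleList(TwoWayBlock(dim) for _ in range(depth))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, sparse_prompts, image_embed, dense_prompt, hw):
        # Dense prompt vectors are added element-wise to the image embedding.
        image = (image_embed + dense_prompt).flatten(2).transpose(1, 2)
        prompts = sparse_prompts
        for block in self.blocks:
            prompts, image = block(prompts, image)
        b, _, d = image.shape
        image = image.transpose(1, 2).reshape(b, d, *hw)
        image = F.interpolate(image, scale_factor=4, mode="bilinear")
        # Dot product between the MLP-projected prompt token and the upsampled
        # image features gives the mask logits.
        q = self.mlp(prompts[:, :1])  # (B, 1, dim)
        return torch.einsum("bqd,bdhw->bqhw", q, image)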
S110, restoring the target segmentation mask of the image block according to the image sequence of the video to obtain a target segmentation result of the video.
In step S110, the target segmentation masks of the image blocks obtained in step S108 can be restored to a video frame sequence according to the way the video frames were divided and ordered in step S104. That is, the multi-frame images in each image block and their target segmentation masks are restored, according to each image's position in the video, into a sequence of video image frames containing the target segmentation masks, so that the target segmentation results of the image blocks become the target segmentation result of the video. This facilitates subsequent retrieval, tracking, and other processing of the targets in the video through the target segmentation masks.
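A short sketch of this restoration step is given below, assuming the image blocks were built by stacking p frames vertically as in the earlier merging sketch; the function name and mask layout are illustrative.

import numpy as np

def restore_video_masks(block_masks, p, frame_h):
    """Split each image-block mask back into its p per-frame masks and return
    them in the original frame order of the video."""
    video_masks = []
    for block_mask in block_masks:  # blocks are already in video order
        for k in range(p):
            video_masks.append(block_mask[k * frame_h:(k + 1) * frame_h])
    return video_masks  # one segmentation mask per video frame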
The overall flow of the video target segmentation method provided by the application is shown in fig. 6. The acquired video to be segmented is sliced into consecutive frames which are merged into a plurality of image blocks, which makes it convenient for the subsequent models to process images in batches and accelerates image processing. Multi-level semantic features of the image blocks are then extracted with the pre-trained Transformer image coding model to output image embedded vectors, and multiple types of prompt vectors of the image blocks are generated with the pre-trained prompt coding model. The image embedded vectors and prompt vectors are further input into the pre-trained Transformer decoding model for prompt decoding to obtain the target segmentation masks of the image blocks. Finally, the target segmentation masks of the image blocks are restored according to the frame order of the video to obtain the target segmentation result of the video. The Transformer image coding model and image decoding model perform sequence modeling of the images within an image block better and handle image sequences better, and their parallel image-sequence processing capability enables fast target segmentation of large-scale video images, improving both the target segmentation accuracy of large-scale video images and the efficiency of video target segmentation.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least a part of the sub-steps or stages of other steps.
With further reference to fig. 7, as an implementation of the method shown in fig. 1, the present application provides an embodiment of a video object segmentation apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 7, the video object segmentation apparatus 600 according to the present embodiment includes:
an acquisition module 601, configured to acquire a video to be subject to object segmentation;
the video slicing and merging module 602 is configured to perform continuous frame slicing and merging processing on the video to be subject to target segmentation, so as to obtain a plurality of image blocks;
the extracting and generating module 603 is configured to extract multi-level semantic features of the image blocks by using a pre-trained image coding model, output an image embedded vector, and generate multiple types of hint vectors of the image blocks by using a pre-trained hint coding model;
the decoding module 604 is configured to input the image embedded vector and the hint vector into a pre-trained decoding model for hint decoding, so as to obtain a target segmentation mask of the image block;
and the restoration module 605 is configured to restore the target segmentation mask of the image block according to the image sequence of the video, so as to obtain a target segmentation result of the video.
Further, the video slicing and merging module 602 includes:
a setting sub-module, configured to set a window of a video frame to p;
the sub-segmentation module is used for continuously segmenting the video into a plurality of sub-sequence video frames with the length of p by utilizing the window of the video frame;
and the merging sub-module is used for respectively merging the images of the plurality of sub-sequence video frames according to the frame sequence to correspondingly obtain a plurality of image blocks.
The video object segmentation apparatus provided by the embodiment of the present application can implement each implementation manner in the method embodiments of fig. 1 to 5, and corresponding beneficial effects, and in order to avoid repetition, a detailed description is omitted here.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 8, fig. 8 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 7 comprises a memory 71, a processor 72, and a network interface 73 that are communicatively connected to each other via a system bus. It should be noted that only the computer device 7 with components 71-73 is shown in the figure, but it should be understood that not all of the illustrated components need to be implemented, and more or fewer components may be implemented instead. As will be appreciated by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, Application Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), embedded devices, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 71 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD card), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 71 may be an internal storage unit of the computer device 7, such as a hard disk or memory of the computer device 7. In other embodiments, the memory 71 may also be an external storage device of the computer device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 7. Of course, the memory 71 may also comprise both an internal storage unit of the computer device 7 and an external storage device. In this embodiment, the memory 71 is typically used to store the operating system and various application software installed on the computer device 7, such as the computer readable instructions of the video object segmentation method. Further, the memory 71 may be used to temporarily store various types of data that have been output or are to be output.
The processor 72 may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 72 is typically used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to execute the computer readable instructions stored in the memory 71 or to process data, for example to execute the computer readable instructions of the video object segmentation method.
The network interface 73 may comprise a wireless network interface or a wired network interface, which network interface 73 is typically used for establishing a communication connection between the computer device 7 and other electronic devices.
The present application also provides another embodiment, namely a computer readable storage medium storing computer readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the video object segmentation method as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disk) and including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
While the application has been described with reference to preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present application. Accordingly, the scope of the application is defined by the appended claims.

Claims (8)

1. A method for video object segmentation, comprising the steps of:
acquiring a video to be subjected to target segmentation;
performing continuous frame slicing and merging processing on the video to be subjected to target segmentation to obtain a plurality of image blocks;
respectively extracting multi-level semantic features of the image blocks by using a pre-trained image coding model and outputting image embedded vectors; generating, by position coding, a first sparse prompt vector of the image block at least for prompt modes of points and rectangular boxes; performing vector embedding on a text-type prompt mode through a CLIP model linking text and images to obtain a second sparse prompt vector of the image block; and, for a mask-type prompt mode, performing dense vector embedding on the image block through a preset mask model to obtain a dense prompt vector of the image block;
constructing a decoding model from a Transformer decoder and an MLP network and pre-training it; adding the image embedded vector and the dense prompt vector of an image block, and inputting the sum together with the first sparse prompt vector and the second sparse prompt vector into the pre-trained decoding model to extract prompt-to-image and image-to-prompt bidirectional attention interaction features, thereby generating a target segmentation mask corresponding to the image block;
and restoring the target segmentation mask of the image block according to the image sequence of the video to obtain a target segmentation result of the video.
2. The video object segmentation method according to claim 1, wherein the step of performing continuous frame slicing and merging processing on the video to be subjected to target segmentation to obtain a plurality of image blocks specifically comprises:
setting a window of a video frame as p;
continuously segmenting the video into a plurality of sub-sequence video frames with the length of p by utilizing the window of the video frame;
and respectively carrying out image combination on the plurality of sub-sequence video frames according to the frame sequence to correspondingly obtain a plurality of image blocks.
3. The video object segmentation method according to claim 1, wherein the step of extracting the multi-level semantic features of the image blocks by using the pre-trained image coding model and outputting the image embedded vectors comprises the following steps:
constructing the image coding model through a plurality of transformers and a bottleneck network and pre-training;
performing position coding on the image block, and extracting original position information of a video frame in the image block;
and inputting the image block and the original position information thereof into a pre-trained image coding model to extract multi-level semantic features, and obtaining an image embedded vector comprising the position information.
4. The video object segmentation method according to any one of claims 1-3, characterized in that the preset mask model consists of two 2×2 convolutional layers, one 1×1 convolutional layer, and one batch normalization layer.
5. A video object segmentation apparatus, comprising:
the acquisition module is used for acquiring the video to be subjected to target segmentation;
the video slicing and merging module is used for carrying out continuous frame slicing and merging processing on the video to be subjected to target segmentation to obtain a plurality of image blocks;
the extraction and generation module is used for respectively extracting multi-level semantic features of the image blocks by using a pre-trained image coding model and outputting image embedded vectors; for generating, by position coding, a first sparse prompt vector of the image block at least for prompt modes of points and rectangular boxes; for performing vector embedding on a text-type prompt mode through a CLIP model linking text and images to obtain a second sparse prompt vector of the image block; and, for a mask-type prompt mode, for performing dense vector embedding on the image block through a preset mask model to obtain a dense prompt vector of the image block;
the decoding module is used for constructing a decoding model from a Transformer decoder and an MLP network and pre-training it; and for adding the image embedded vector and the dense prompt vector of an image block and inputting the sum together with the first sparse prompt vector and the second sparse prompt vector into the pre-trained decoding model to extract prompt-to-image and image-to-prompt bidirectional attention interaction features, thereby generating a target segmentation mask corresponding to the image block;
and the restoring module is used for restoring the target segmentation mask of the image block according to the image sequence of the video to obtain a target segmentation result of the video.
6. The video object segmentation apparatus as set forth in claim 5, wherein the video slicing and merging module comprises:
a setting sub-module, configured to set a window of a video frame to p;
the sub-segmentation module is used for continuously segmenting the video into a plurality of sub-sequence video frames with the length of p by utilizing the window of the video frame;
and the merging sub-module is used for respectively merging the images of the plurality of sub-sequence video frames according to the frame sequence to correspondingly obtain a plurality of image blocks.
7. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the video object segmentation method of any one of claims 1 to 4.
8. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the video object segmentation method according to any one of claims 1 to 4.
CN202310661219.5A 2023-06-06 2023-06-06 Video target segmentation method, device, computer equipment and storage medium Active CN116385947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310661219.5A CN116385947B (en) 2023-06-06 2023-06-06 Video target segmentation method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310661219.5A CN116385947B (en) 2023-06-06 2023-06-06 Video target segmentation method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116385947A CN116385947A (en) 2023-07-04
CN116385947B (en) 2023-08-25

Family

ID=86979119

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310661219.5A Active CN116385947B (en) 2023-06-06 2023-06-06 Video target segmentation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116385947B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117219003B (en) * 2023-11-09 2024-03-12 深圳市东陆科技有限公司 Content display method and device of LED display module
CN117611821A (en) * 2023-11-30 2024-02-27 中科南京智能技术研究院 Instance segmentation method, device, system and storage medium
CN117788492B (en) * 2024-02-28 2024-04-26 苏州元脑智能科技有限公司 Video object segmentation method, system, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020096342A (en) * 2018-12-14 2020-06-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Video processing method and apparatus
CN114359556A (en) * 2021-12-09 2022-04-15 中国科学院深圳先进技术研究院 Breast ultrasonic video lesion segmentation method
CN115035093A (en) * 2022-07-01 2022-09-09 深圳市大数据研究院 Brain tumor self-supervision pre-training method and device based on attention symmetric self-coding
CN115393396A (en) * 2022-08-18 2022-11-25 西安电子科技大学 Unmanned aerial vehicle target tracking method based on mask pre-training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
VLT: Vision-Language Transformer and Query Generation for Referring Segmentation; Henghui Ding; IEEE Transactions on Pattern Analysis and Machine Intelligence; pp. 7900-7916 *

Also Published As

Publication number Publication date
CN116385947A (en) 2023-07-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant