CN111325093A - Video segmentation method and device and electronic equipment - Google Patents

Video segmentation method and device and electronic equipment

Info

Publication number
CN111325093A
Authority
CN
China
Prior art keywords
original video
video
original
characteristic information
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010040331.3A
Other languages
Chinese (zh)
Inventor
苏凯 (Su Kai)
王长虎 (Wang Changhu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010040331.3A
Publication of CN111325093A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

Embodiments of the present disclosure provide a video segmentation method and apparatus, and an electronic device, in the technical field of image processing. The method comprises the following steps: receiving an original video to be processed; inputting the original video into a preset video segmentation model; invoking the video segmentation model to perform downsampling operations on the original video and extract network feature information corresponding to different scales in the original video; and fusing the extracted network feature information of the different scales and outputting the fused network feature information as the feature information of the original video. With this scheme, downsampling operations at different scales are realized, so that large-scale context feature information can be captured while feature information at the original resolution is retained, improving video segmentation efficiency.

Description

Video segmentation method and device and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a video segmentation method and apparatus, and an electronic device.
Background
In the existing image processing field, there is an urgent need for technology that can automatically segment and track objects of interest in a video. Video object segmentation and tracking are two fundamental tasks in computer vision. Object segmentation divides the pixels of a video frame into two subsets, foreground objects and background regions, and generates object masks; it is a core problem for behavior recognition and video retrieval. Object tracking determines the exact location of a target in a video image and is used for intelligent monitoring, big-data video analysis, and the like.
Video segmentation methods can be divided into unsupervised VOS (Video Object Segmentation), semi-supervised VOS, interactive VOS, weakly supervised VOS, and segmentation-based tracking (Video Object Tracking). Unsupervised VOS produces coherent spatio-temporal regions through a bottom-up process without any user input, i.e., without any video-specific labels. Interactive VOS uses a strongly supervised interactive approach that requires a pixel-accurate segmentation of the first frame, which is very time-consuming to produce manually and requires a human-in-the-loop error-correction system. Semi-supervised VOS lies between the two: the foreground object is manually annotated in the first frame and then segmented automatically over the remaining frames.
In the existing semi-supervised video segmentation process, the target object may be occluded or may be very small, so adopting a fixed-scale downsampling scheme leads to the technical problem of poor segmentation efficiency for the target object.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a video segmentation method, an apparatus and an electronic device, which at least partially solve the problems in the prior art.
In a first aspect, an embodiment of the present disclosure provides a video segmentation method, including:
receiving an original video to be processed;
inputting the original video into a preset video segmentation model;
invoking the video segmentation model to perform downsampling operations on the original video, and extracting network feature information corresponding to different scales in the original video;
and fusing the extracted network feature information of the different scales, and outputting the fused network feature information as the feature information of the original video.
According to a specific implementation manner of the embodiment of the present disclosure, the step of invoking a downsampling operation in the video segmentation model to extract network feature information corresponding to different scales in the original video includes:
invoking pooling windows of different scales and a convolution kernel in the video segmentation model;
performing pooling processing on the original video with the pooling windows of different scales, respectively;
and performing convolution processing on the original video with the convolution kernel.
According to a specific implementation manner of the embodiment of the present disclosure, the step of performing pooling processing on the original video by using pooling windows of different scales includes:
performing pooling processing on the original video by using a pooling window corresponding to the original scale of the original video, and extracting feature information corresponding to the original scale of the original video;
performing pooling processing on the original video by using a pooling window corresponding to multiple scales of the original video, and extracting feature information smaller than the original scale of the original video;
and performing pooling processing on the original video by using a pooling window corresponding to the overall mean scale of the original video, and extracting context feature information of the original video over the global scope.
According to a specific implementation manner of the embodiment of the present disclosure, the step of performing pooling processing on the original video by using a pooling window corresponding to a multiple scale of the original video and extracting feature information smaller than the original scale of the original video includes:
performing pooling processing on the original video through a 2×2 pooling window, and extracting feature information at 1/2 scale of the original video;
performing pooling processing on the original video through a 4×4 pooling window, and extracting feature information at 1/4 scale of the original video;
and performing pooling processing on the original video through an 8×8 pooling window, and extracting feature information at 1/8 scale of the original video.
According to a specific implementation manner of the embodiment of the present disclosure, the step of fusing the extracted network feature information of different scales includes:
fusing the extracted network feature information of the different scales pairwise, in order of resolution from low to high.
According to a specific implementation manner of the embodiment of the present disclosure, the step of performing convolution processing on the original video with the convolution kernel includes:
performing convolution processing on the original video with a 3×3 convolution kernel.
According to a specific implementation manner of the embodiment of the present disclosure, before the step of inputting the original video into the preset video segmentation model, the method further includes:
providing a convolutional neural network;
configuring downsampling units with different scales for the convolutional neural network;
and repeatedly stacking the downsampling units with different scales to form the video segmentation model.
According to a specific implementation manner of the embodiment of the present disclosure, the step of configuring downsampling units with different scales for the convolutional neural network includes:
configuring pooling windows of different scales for the convolutional neural network;
and configuring a convolution kernel with preset parameters for the convolutional neural network.
According to a specific implementation manner of the embodiment of the present disclosure, the pooling windows of different scales configured for the convolutional neural network include: a pooling window of the original scale, pooling windows of multiple scales, and a pooling window of the overall mean scale.
According to a specific implementation manner of the embodiment of the present disclosure, the step of outputting the fused network feature information as the feature information of the original video includes:
acquiring the network feature information of the first frame image of the original video;
and performing video segmentation on all images of the original video based on the network feature information.
In a second aspect, an embodiment of the present disclosure further provides a video segmentation apparatus, including:
the receiving module is used for receiving an original video to be processed;
the input module is used for inputting the original video into a preset video segmentation model;
the extraction module is used for invoking the video segmentation model to perform downsampling operations on the original video and extracting network feature information corresponding to different scales in the original video;
and the fusion module is used for fusing the extracted network feature information of the different scales and outputting the fused network feature information as the feature information of the original video.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video segmentation method of the first aspect or any implementation manner of the first aspect.
In a fourth aspect, the disclosed embodiments also provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the video segmentation method in the first aspect or any implementation manner of the first aspect.
In a fifth aspect, the disclosed embodiments also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the video segmentation method in the first aspect or any implementation manner of the first aspect.
The video segmentation scheme in the embodiments of the present disclosure includes: receiving an original video to be processed; inputting the original video into a preset video segmentation model; invoking the video segmentation model to perform downsampling operations on the original video and extract network feature information corresponding to different scales in the original video; and fusing the extracted network feature information of the different scales and outputting the fused network feature information as the feature information of the original video. With this scheme, downsampling operations at different scales are realized, so that large-scale context feature information can be captured while feature information at the original resolution is retained, improving video segmentation efficiency.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly described below. It is apparent that the drawings in the following description cover only some embodiments of the present disclosure, and that those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a video segmentation method according to an embodiment of the present disclosure;
fig. 2 is a partial schematic flowchart of another video segmentation method provided in an embodiment of the present disclosure;
fig. 3 is a partial schematic flowchart of another video segmentation method provided in an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a video segmentation model according to a video segmentation method provided by an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a video segmentation apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic view of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
The embodiments of the present disclosure are described below with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure in this specification. It should be understood that the described embodiments are merely some, rather than all, of the embodiments of the present disclosure. The disclosure may be embodied or carried out in various other specific embodiments, and various modifications and changes may be made to the details of this description without departing from the spirit of the disclosure. It should be noted that the features in the following embodiments and examples may be combined with each other in the absence of conflict. All other embodiments obtained by a person of ordinary skill in the art based on the disclosed embodiments without creative effort shall fall within the protection scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present disclosure, and the drawings only show the components related to the present disclosure rather than the number, shape and size of the components in actual implementation, and the type, amount and ratio of the components in actual implementation may be changed arbitrarily, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
The embodiment of the disclosure provides a video segmentation method. The video segmentation method provided by the embodiment can be executed by a computing device, the computing device can be implemented as software, or implemented as a combination of software and hardware, and the computing device can be integrated in a server, a terminal device and the like.
Referring to fig. 1, a video segmentation method provided in an embodiment of the present disclosure includes:
s101, receiving an original video to be processed;
the video segmentation method provided by this embodiment is used for segmenting an original video to be processed, that is, segmenting a desired foreground object from a multi-frame image of the original video, where the most important process is extraction of feature information corresponding to the foreground object.
S102, inputting the original video into a preset video segmentation model;
the electronic device is pre-loaded with a video segmentation model for implementing all or part of processing operations in the video segmentation process, such as a down-sampling operation for extracting image feature information. And after determining an original video to be processed, inputting the original video into the video segmentation model.
S103, invoking the video segmentation model to perform downsampling operations on the original video, and extracting network feature information corresponding to different scales in the original video;
The video segmentation model loaded in the electronic device is configured with downsampling windows of different scales, and the downsampling window of each scale is used to extract network feature information at the corresponding scale. The original video is input into the video segmentation model, and the model is invoked to perform downsampling operations on the original video so as to extract the network feature information corresponding to different scales in the original video.
S104, fusing the extracted network feature information of different scales, and outputting the fused network feature information as the feature information of the original video.
After the network feature information of the original video has been extracted at the different scales according to the above steps, the network feature information of the different scales is fused, and the fused network feature information is output as the feature information of the original video.
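For illustration only, the following is a minimal Python (PyTorch) sketch of steps S101 to S104; the model object, the frame tensor shape, and the absence of preprocessing are assumptions of this sketch, not details taken from the disclosure.

    import torch
    import torch.nn as nn

    def extract_video_features(frames: torch.Tensor, model: nn.Module) -> torch.Tensor:
        # S101: frames is the received original video, shaped (T, C, H, W).
        # S102: the video is fed into the preset video segmentation model.
        # S103/S104: inside the model, downsampling at several scales is
        # performed and the multi-scale network feature information is fused.
        with torch.no_grad():
            return model(frames)  # fused feature information of the original video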
According to a specific implementation manner of the embodiment of the present disclosure, as shown in fig. 2, the step of invoking a downsampling operation in the video segmentation model to extract network feature information corresponding to different scales in the original video may specifically include:
S201, invoking pooling windows of different scales and a convolution kernel in the video segmentation model;
S202, performing pooling processing on the original video with the pooling windows of different scales, respectively;
and S203, performing convolution processing on the original video with the convolution kernel.
Further, the step of performing pooling processing on the original video by using pooling windows of different scales includes:
S301, performing pooling processing on the original video by using a pooling window corresponding to the original scale of the original video, and extracting feature information corresponding to the original scale of the original video;
S302, performing pooling processing on the original video by using pooling windows corresponding to multiple scales of the original video, and extracting feature information smaller than the original scale of the original video;
and S303, performing pooling processing on the original video by using a pooling window corresponding to the overall mean scale of the original video, and extracting context feature information of the original video over the global scope.
More specifically, as shown in fig. 4, the step of performing pooling processing on the original video by using a pooling window corresponding to a multiple scale of the original video to extract feature information smaller than the original scale of the original video includes:
performing pooling processing on the original video through a 2×2 pooling window, and extracting feature information at 1/2 scale of the original video;
performing pooling processing on the original video through a 4×4 pooling window, and extracting feature information at 1/4 scale of the original video;
and performing pooling processing on the original video through an 8×8 pooling window, and extracting feature information at 1/8 scale of the original video.
In addition, the step of fusing the extracted network feature information of different scales includes:
fusing the extracted network feature information of the different scales pairwise, in order of resolution from low to high.
Optionally, the step of performing convolution processing on the original video with the convolution kernel includes:
performing convolution processing on the original video with a 3×3 convolution kernel.
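As a concrete illustration of the pooling windows and the 3×3 convolution kernel described above, the following Python (PyTorch) sketch shows how 2×2, 4×4, and 8×8 windows reduce a feature map to 1/2, 1/4, and 1/8 of its original scale, and how adaptive average pooling produces the 1×1 overall-mean output; the use of average pooling and the tensor sizes are assumptions of this sketch, since the disclosure does not fix the pooling type.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 64, 64)        # (N, C, H, W) feature map at the original scale

    half    = nn.AvgPool2d(2)(x)          # 1/2 scale: (1, 64, 32, 32)
    quarter = nn.AvgPool2d(4)(x)          # 1/4 scale: (1, 64, 16, 16)
    eighth  = nn.AvgPool2d(8)(x)          # 1/8 scale: (1, 64, 8, 8)
    global_ = nn.AdaptiveAvgPool2d(1)(x)  # overall-mean scale: (1, 64, 1, 1)

    conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # the 3×3 convolution kernel
    y = conv3(half)                       # convolution applied to a pooled branch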
In a specific implementation, the video segmentation model is built from a unit module with scale invariance. The module captures network feature information at different scales through downsampling operations of various scales, and finally fuses the information of the different scales together in a pairwise manner to enhance the scale invariance of the whole network. As shown in fig. 4:
(1) Input denotes the input of the module, and Output denotes the output of the module.
(2) The module comprises five branches of different scales, B1-B5. The B1 branch retains information at the original Input scale, preserving information about tiny objects such as spectacle frames; the B2 branch captures 1/2-scale information of the Input through 2×2 downsampling; the B3 branch captures 1/4-scale information through 4×4 downsampling; the B4 branch captures 1/8-scale information through 8×8 downsampling; and the B5 branch downsamples the input to a size of 1×1 through average pooling to capture context information over the global scope.
(3) The module adopts a bottom-up approach: proceeding from low resolution to high resolution, the feature information extracted at the different scales is fused pairwise.
(4) The Output obtained by the module thus contains fused feature information of the Input at the different scales.
(5) Finally, the network structure is formed by repeatedly stacking such unit modules, giving it a scale-invariant feature extraction capability.
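The description above maps naturally onto a small neural-network module. The Python (PyTorch) sketch below is one possible reading under stated assumptions: average pooling for the downsampling branches, one 3×3 convolution per branch, and bilinear upsampling plus addition for the pairwise bottom-up fusion; none of these choices are fixed by the disclosure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScaleInvariantBlock(nn.Module):
        # Unit module of fig. 4: five branches B1-B5 fused pairwise bottom-up.
        # Assumes the input height and width are divisible by 8.
        def __init__(self, channels: int):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1)
                for _ in range(5)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b1 = self.convs[0](x)                            # B1: original scale
            b2 = self.convs[1](F.avg_pool2d(x, 2))           # B2: 1/2 scale
            b3 = self.convs[2](F.avg_pool2d(x, 4))           # B3: 1/4 scale
            b4 = self.convs[3](F.avg_pool2d(x, 8))           # B4: 1/8 scale
            b5 = self.convs[4](F.adaptive_avg_pool2d(x, 1))  # B5: 1x1 global context

            # Pairwise fusion from low resolution to high resolution:
            # upsample the coarser result and add it to the next finer branch.
            out = b5
            for branch in (b4, b3, b2, b1):
                out = branch + F.interpolate(
                    out, size=branch.shape[-2:],
                    mode="bilinear", align_corners=False,
                )
            return out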
In addition, according to another specific implementation manner of the embodiment of the present disclosure, before the step of inputting the original video into the preset video segmentation model, the method further includes:
providing a convolutional neural network;
configuring downsampling units with different scales for the convolutional neural network;
and repeatedly stacking the downsampling units with different scales to form the video segmentation model.
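As a sketch of this construction step, the video segmentation model can be formed by stacking the unit module from the previous sketch; the depth and channel width here are assumed hyperparameters, not values given in the disclosure.

    import torch.nn as nn

    def build_video_segmentation_model(channels: int = 64, depth: int = 4) -> nn.Module:
        # Downsampling units of different scales live inside each unit module;
        # the model is formed by repeatedly stacking those modules.
        return nn.Sequential(*[ScaleInvariantBlock(channels) for _ in range(depth)])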
Optionally, the step of configuring downsampling units with different scales for the convolutional neural network includes:
configuring pooling windows of different scales for the convolutional neural network;
and configuring a convolution kernel with preset parameters for the convolutional neural network.
Optionally, the pooling windows of different scales configured for the convolutional neural network include: a pooling window of the original scale, pooling windows of multiple scales, and a pooling window of the overall mean scale.
Optionally, the step of outputting the fused network feature information as the feature information of the original video includes:
acquiring the network feature information of the first frame image of the original video;
and performing video segmentation on all images of the original video based on the network feature information.
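The disclosure does not spell out how the first frame's features guide the segmentation of the remaining frames. One common semi-supervised approach, shown below purely as an assumed illustration, is to match every frame's features against a foreground prototype computed from the first frame.

    import torch
    import torch.nn.functional as F

    def segment_by_first_frame(feats: torch.Tensor, first_mask: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) per-frame feature maps produced by the model.
        # first_mask: (H, W) boolean foreground mask of the first frame (non-empty).
        # Returns a (T, H, W) soft foreground score for every frame.
        fg = feats[0][:, first_mask]                # (C, N) foreground feature vectors
        proto = F.normalize(fg.mean(dim=1), dim=0)  # (C,) mean foreground prototype
        frames = F.normalize(feats, dim=1)          # unit-normalize channel vectors
        # Cosine similarity of every pixel to the foreground prototype.
        return torch.einsum("tchw,c->thw", frames, proto)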
The video segmentation method in the embodiments of the present disclosure includes: receiving an original video to be processed; inputting the original video into a preset video segmentation model; invoking the video segmentation model to perform downsampling operations on the original video and extract network feature information corresponding to different scales in the original video; and fusing the extracted network feature information of the different scales and outputting the fused network feature information as the feature information of the original video. This scheme provides a unit module capable of extracting scale-invariant features, and by repeatedly stacking the unit modules the whole network acquires a scale-invariant feature extraction capability. Downsampling operations at different scales are realized, so that large-scale context feature information can be captured while feature information at the original resolution is retained, improving video segmentation efficiency.
Corresponding to the above method embodiment, referring to fig. 5, the disclosed embodiment further provides a video segmentation apparatus 50, including:
a receiving module 501, configured to receive an original video to be processed;
an input module 502, configured to input the original video into a preset video segmentation model;
an extraction module 503, configured to invoke the video segmentation model to perform downsampling operations on the original video and extract network feature information corresponding to different scales in the original video;
and a fusion module 504, configured to fuse the extracted network feature information of the different scales and output the fused network feature information as the feature information of the original video.
The apparatus shown in fig. 5 can correspondingly execute the content of the above method embodiment, which mainly includes: receiving an original video to be processed; inputting the original video into a preset video segmentation model; invoking the video segmentation model to perform downsampling operations on the original video and extract network feature information corresponding to different scales in the original video; and fusing the extracted network feature information of the different scales and outputting the fused network feature information as the feature information of the original video. According to this scheme, a unit module capable of extracting scale-invariant features is provided, and by repeatedly stacking the unit modules the whole network acquires a scale-invariant feature extraction capability. Downsampling operations at different scales are realized, so that large-scale context feature information can be captured while feature information at the original resolution is retained, improving video segmentation efficiency. For parts not described in detail in this embodiment, reference is made to the above method embodiments, which are not repeated here.
Referring to fig. 6, an embodiment of the present disclosure also provides an electronic device 60, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video segmentation method of the foregoing method embodiments.
The disclosed embodiments also provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the video segmentation method in the aforementioned method embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the video segmentation method in the aforementioned method embodiments.
Referring now to FIG. 6, a schematic diagram of an electronic device 60 suitable for use in implementing embodiments of the present disclosure is shown. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 60 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 602 or a program loaded from a storage means 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 60. The processing means 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 60 to communicate with other devices wirelessly or by wire to exchange data. While the figures illustrate an electronic device 60 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, enable the electronic device to implement the schemes provided by the method embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present disclosure should be covered within the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method for video segmentation, comprising:
receiving an original video to be processed;
inputting the original video into a preset video segmentation model;
invoking the video segmentation model to perform downsampling operations on the original video, and extracting network feature information corresponding to different scales in the original video;
and fusing the extracted network feature information of the different scales, and outputting the fused network feature information as the feature information of the original video.
2. The method according to claim 1, wherein the step of invoking a downsampling operation in the video segmentation model to extract network feature information corresponding to different scales in the original video comprises:
invoking pooling windows of different scales and a convolution kernel in the video segmentation model;
performing pooling processing on the original video with the pooling windows of different scales, respectively;
and performing convolution processing on the original video with the convolution kernel.
3. The method of claim 2, wherein the step of performing pooling processing on the original video with the pooling windows of different scales, respectively, comprises:
performing pooling processing on the original video by using a pooling window corresponding to the original scale of the original video, and extracting feature information corresponding to the original scale of the original video;
performing pooling processing on the original video by using a pooling window corresponding to multiple scales of the original video, and extracting feature information smaller than the original scale of the original video;
and performing pooling processing on the original video by using a pooling window corresponding to the overall mean scale of the original video, and extracting context feature information of the original video over the global scope.
4. The method according to claim 3, wherein the step of performing pooling processing on the original video by using a pooling window corresponding to a multiple scale of the original video to extract feature information smaller than the original scale of the original video comprises:
performing pooling processing on the original video through a 2×2 pooling window, and extracting feature information at 1/2 scale of the original video;
performing pooling processing on the original video through a 4×4 pooling window, and extracting feature information at 1/4 scale of the original video;
and performing pooling processing on the original video through an 8×8 pooling window, and extracting feature information at 1/8 scale of the original video.
5. The method according to claim 4, wherein the step of fusing the extracted network feature information of different scales comprises:
fusing the extracted network feature information of the different scales pairwise, in order of resolution from low to high.
6. The method of claim 5, wherein the step of performing convolution processing on the original video with the convolution kernel comprises:
performing convolution processing on the original video with a 3×3 convolution kernel.
7. The method according to any one of claims 1 to 6, wherein before the step of inputting the original video into the preset video segmentation model, the method further comprises:
providing a convolutional neural network;
configuring downsampling units with different scales for the convolutional neural network;
and repeatedly stacking the downsampling units with different scales to form the video segmentation model.
8. The method of claim 7, wherein the step of configuring different scale downsampling units for the convolutional neural network comprises:
configuring pooling windows of different scales for the convolutional neural network;
and configuring a convolution kernel with preset parameters for the convolutional neural network.
9. The method of claim 8, wherein the pooling windows of different scales configured for the convolutional neural network comprise: a pooling window of the original scale, pooling windows of multiple scales, and a pooling window of the overall mean scale.
10. The method according to claim 1, wherein the step of outputting the fused network feature information as the feature information of the original video comprises:
acquiring the network feature information of the first frame image of the original video;
and performing video segmentation on all images of the original video based on the network feature information.
11. A video segmentation apparatus, comprising:
the receiving module is used for receiving an original video to be processed;
the input module is used for inputting the original video into a preset video segmentation model;
the extraction module is used for invoking the video segmentation model to perform downsampling operations on the original video and extracting network feature information corresponding to different scales in the original video;
and the fusion module is used for fusing the extracted network feature information of the different scales and outputting the fused network feature information as the feature information of the original video.
12. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the video segmentation method of any one of the preceding claims 1-10.
13. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the video segmentation method of any of the preceding claims 1-10.
CN202010040331.3A 2020-01-15 2020-01-15 Video segmentation method and device and electronic equipment Pending CN111325093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010040331.3A CN111325093A (en) 2020-01-15 2020-01-15 Video segmentation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010040331.3A CN111325093A (en) 2020-01-15 2020-01-15 Video segmentation method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN111325093A true CN111325093A (en) 2020-06-23

Family

ID=71163251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010040331.3A Pending CN111325093A (en) 2020-01-15 2020-01-15 Video segmentation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111325093A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154192A (en) * 2018-01-12 2018-06-12 西安电子科技大学 High Resolution SAR terrain classification method based on multiple dimensioned convolution and Fusion Features
CN108268870A (en) * 2018-01-29 2018-07-10 重庆理工大学 Multi-scale feature fusion ultrasonoscopy semantic segmentation method based on confrontation study
CN109598269A (en) * 2018-11-14 2019-04-09 天津大学 A kind of semantic segmentation method based on multiresolution input with pyramid expansion convolution
CN110147763A (en) * 2019-05-20 2019-08-20 哈尔滨工业大学 Video semanteme dividing method based on convolutional neural networks
CN110263833A (en) * 2019-06-03 2019-09-20 韩慧慧 Based on coding-decoding structure image, semantic dividing method


Similar Documents

Publication Publication Date Title
CN112184738B (en) Image segmentation method, device, equipment and storage medium
CN111414879B (en) Face shielding degree identification method and device, electronic equipment and readable storage medium
CN110298851B (en) Training method and device for human body segmentation neural network
CN110287810B (en) Vehicle door motion detection method, device and computer readable storage medium
CN110399847B (en) Key frame extraction method and device and electronic equipment
CN110852258A (en) Object detection method, device, equipment and storage medium
CN110287816B (en) Vehicle door motion detection method, device and computer readable storage medium
CN111325704A (en) Image restoration method and device, electronic equipment and computer-readable storage medium
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN110287817B (en) Target recognition and target recognition model training method and device and electronic equipment
CN111738316A (en) Image classification method and device for zero sample learning and electronic equipment
CN112988032B (en) Control display method and device and electronic equipment
CN111783632A (en) Face detection method and device for video stream, electronic equipment and storage medium
CN111696041B (en) Image processing method and device and electronic equipment
US20220245920A1 (en) Object display method and apparatus, electronic device, and computer readable storage medium
CN113255812B (en) Video frame detection method and device and electronic equipment
CN111832354A (en) Target object age identification method and device and electronic equipment
CN111340813B (en) Image instance segmentation method and device, electronic equipment and storage medium
CN113033552B (en) Text recognition method and device and electronic equipment
CN111325093A (en) Video segmentation method and device and electronic equipment
CN111339367B (en) Video processing method and device, electronic equipment and computer readable storage medium
CN114399696A (en) Target detection method and device, storage medium and electronic equipment
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment
CN111738311A (en) Multitask-oriented feature extraction method and device and electronic equipment
CN111292329B (en) Training method and device of video segmentation network and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination