CN114565773A - Method and device for semantically segmenting image, electronic equipment and storage medium


Info

Publication number
CN114565773A
CN114565773A
Authority
CN
China
Prior art keywords
image
block
coding
blocks
map
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN202011367773.5A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (the listed assignee may be inaccurate)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (assumption; not a legal conclusion)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN202011367773.5A
Publication of CN114565773A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/11 - Region-based segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The disclosure relates to a device, a board card, a method and a readable storage medium for semantic segmentation of an image, wherein the processing device of the disclosure is included in an integrated circuit device that also comprises a universal interconnection interface and a computing device. The computing device interacts with the processing device to jointly complete computing operations specified by the user. The integrated circuit device may further comprise a storage device, connected to the computing device and the processing device respectively, for storing data of the computing device and the processing device.

Description

Method and device for semantically segmenting image, electronic equipment and storage medium
Technical Field
The present invention relates generally to the field of image processing. More particularly, the present invention relates to a method, an apparatus, an electronic device, and a storage medium for semantically segmenting an image.
Background
With the continuous development of the social economy and of science and technology in China, computers have come into widespread use. At the same time, as the work and tasks people face grow more complex, the demands placed on computer capabilities keep rising, and image semantic segmentation is one such widely applied and highly demanded technology. Image semantic segmentation is currently applied in many fields, including scene understanding, image analysis, robot perception, video surveillance, augmented reality, image compression and the like; it serves as a preprocessing step for tasks such as image recognition, scene analysis and object detection, and is one of the fundamental tasks in computer vision.
With the continuous development of computer technology and of the demands placed on it, image semantic segmentation has also shown corresponding shortcomings and faced new challenges. Traditional image semantic segmentation operates on the image pixel by pixel. In some scenarios, however, such as video encoding and decoding, the basic unit of processing is a macroblock. The results obtained by traditional semantic segmentation therefore cannot be used directly in such scenarios, which hampers subsequent work.
Therefore, a method for performing semantic segmentation of an image in units of blocks is urgently needed.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, an aspect of the present invention provides a method of semantically segmenting an image, an electronic device and a readable storage medium.
In one aspect, the present disclosure discloses a method of semantically segmenting an image displaying a plurality of objects, the method comprising: selecting one of the objects in turn as the object to be marked, and performing the following steps: marking the area where the object to be marked is located to form a label map; dividing the label map into a plurality of blocks to generate a coded block map; and superimposing all the generated coded block maps to obtain the semantic segmentation result, so as to identify the categories of the plurality of objects in the image.
In another aspect, the present invention discloses a computer readable storage medium having stored thereon computer program instructions, which, when executed by a processing device, perform the aforementioned method.
In another aspect, the present invention discloses an electronic device, which includes a processor and a memory, wherein the memory stores a computer readable program, and the processor executes the program in the memory to perform the foregoing method.
In another aspect, the present invention discloses a device for semantically segmenting an image, the device comprising a selection module, a marking module, a generation module and a superposition module, wherein the selection module is configured to select one of the plurality of objects in turn as the object to be marked; the marking module is configured to mark the area where the object to be marked is located to form a label map; the generation module is configured to divide the label map into a plurality of blocks to generate a coded block map; and the superposition module is configured to superimpose all the generated coded block maps to obtain the semantic segmentation result.
In the method, one of a plurality of objects in the image is selected in turn as the object to be marked, forming a label map containing a single object; the single-object label map is divided into a plurality of rectangular blocks, which are classified by judging the pixel ratio of that object within each rectangular block, until every object has generated a corresponding coded block map; all coded block maps are then superimposed to obtain the category of the object corresponding to each block in the whole image. The purpose of segmenting the image in units of coding blocks is thus achieved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a schematic structural diagram showing a board card according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 3 illustrates a flow diagram of a method of semantically segmenting an image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a mask flag of an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a mask flag of another embodiment of the present invention;
FIG. 6 is a block diagram illustrating the generation of coded blocks in an embodiment of the invention;
FIG. 7 is a diagram illustrating superposition of encoded blocks in an embodiment of the present invention;
FIG. 8 is a training flow diagram illustrating one embodiment of the present invention;
FIG. 9 is a flow chart illustrating another embodiment of the present invention;
FIG. 10 is a flow diagram illustrating a video codec framework according to an embodiment of the invention;
FIG. 11 illustrates a flow chart of a method of encoding and decoding video images according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an apparatus of another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third" and "fourth", etc. in the claims, the description and the drawings of the present invention are used for distinguishing different objects and are not used for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the invention. As used in the specification and claims of this application, the singular form of "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this specification refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in fig. 1, the board card 10 includes a chip 101, a System-on-Chip (SoC), i.e., a system-level chip, integrated with one or more combined processing devices. A combined processing device is an artificial intelligence arithmetic unit that supports various deep learning and machine learning algorithms, such as image segmentation and video encoding/decoding algorithms, and meets the intelligent processing requirements of complex scenes in fields such as computer vision, speech, natural language processing and data mining. Deep learning in particular is widely applied in the cloud intelligence field, one notable characteristic of which is the large size of the input data, placing high demands on the storage and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also comprises a memory device 104 for storing data such as video sequences and image data, which comprises one or more memory units 105. The memory device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 in the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a single-chip microcomputer (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input video image data from the processing device 203 via the interface device 202 and write it to a storage device on the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into the control cache on the computing device 201. Alternatively, the interface device 202 may also read data from a storage device of the computing device 201 and transmit it to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and the starting and/or stopping of the computing device 201. Depending on the implementation, the processing device 203 may be one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or another general-purpose and/or special-purpose processor, including but not limited to a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present invention on its own may be regarded as having a single-core structure or a homogeneous multi-core structure. However, considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The DRAM 204 is used for storing the data to be processed; it is a DDR memory, typically 16 GB or larger in size, and stores data of the computing device 201 and/or the processing device 203.
Image segmentation divides an image into a number of non-overlapping sub-regions, such that features within the same sub-region show a certain similarity while features in different sub-regions differ. Traditional image segmentation separates the objects in a picture by assigning a class label to each pixel. For example, if a picture contains objects of the three categories sky, grass and tree, the purpose of semantic segmentation is to identify which pixels are sky, which are grass and which are tree. In some scenarios, however, the basic unit of processing is a block of the image, and only the category corresponding to that block needs to be known; it is unnecessary to assign a class label to every pixel. In video encoding and decoding technology, for example, both the MPEG-series and the H.26x-series coding standards perform hybrid coding on a block basis: the video image is first divided into blocks of equal size, motion compensation is performed on those blocks, and the video image is encoded and decoded according to the motion compensation result.
Aiming at scenes processed in units of blocks or regions, the present invention provides an image semantic segmentation method that uses a region as the segmentation unit.
In the present invention, hardware, software and firmware are referred to, where hardware includes various devices, units and apparatuses, software includes various operating systems, machines, programs, tools and the like, and firmware includes functions and the like; when hardware, software and firmware are referred to collectively, they are described as components. This arrangement only serves to describe the technique of the present invention more clearly and does not limit it in any way.
The method of the embodiment of the present invention may be applied to the board card 10 and to single or multiple processor cores in the combined processing device 20. Fig. 3 shows a flowchart of a method for semantically segmenting an image in which a plurality of objects are displayed, according to an embodiment of the present invention. The method is as follows:
step 301, one of the objects is sequentially selected as an object to be marked.
The image comprises a plurality of objects, each object has a corresponding category, and when the image is segmented, the objects of each category are processed separately, and one of the objects is used as an object to be marked for subsequent processing.
Step 302, marking the area where the object to be marked is located to form a label map.
The image includes a plurality of objects, each with a corresponding category. In this embodiment, the category of each object in the image is marked one by one and a corresponding coded block map is generated. Objects of multiple categories may also be marked in parallel to generate multiple coded block maps, which this disclosure does not limit; however, each marked image contains only one category of object, and accordingly each generated coded block map also contains only one category of object. Therefore, before generating a coded block map, the other categories need to be masked so that only one object category is retained as the object to be marked.
In one possible implementation, a mask (MASK) is used to retain one category of objects as the object to be marked: the MASK marks the area where the object to be marked is located as 1 and the areas where other objects are located as 0.
Fig. 4 is a schematic diagram showing a mask flag of this embodiment.
One image contains objects of three categories (i), (ii) and (iii), representing sky, grass and buildings respectively. In an automatic driving scene, for example, a user needs to take the sky as the object to be marked to obtain the label map corresponding to the sky: the Mask marks the part corresponding to region (i) as 1 and the other regions as 0, and passing the image through the Mask yields a label map containing only (i). In the same way, masking the image yields label maps containing only (ii) or only (iii). This example does not limit the manner in which the label map is obtained.
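As a minimal sketch of this masking step (not part of the patent's claimed implementation; the per-pixel class annotation format and the class IDs are assumed for illustration), in Python with NumPy:

    import numpy as np

    def make_label_map(class_annotation, target_class):
        # Label map: 1 where the object to be marked is located, 0 elsewhere.
        return (class_annotation == target_class).astype(np.uint8)

    # Hypothetical per-pixel annotation: 0 = sky, 1 = grass, 2 = building.
    annotation = np.array([[0, 0, 0, 0],
                           [0, 0, 2, 2],
                           [1, 1, 2, 2],
                           [1, 1, 1, 1]])
    sky_label_map = make_label_map(annotation, target_class=0)

Masking each category in turn in this way produces one single-object label map per category.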
In addition, each type of object in the image may include a plurality of regions, and the corresponding resulting label map may also include a plurality of regions. Fig. 5 is a schematic diagram of a mask flag in this embodiment.
In the image in fig. 5 there are objects of two categories, representing grass and sky, where the grass is distributed over two discontinuous regions; after the image passes through the Mask for category (i), two regions bearing the (i) label are obtained.
Step 303, dividing the label map into a plurality of blocks to generate a coded block map.
In this embodiment, the basic unit of image segmentation is a block rather than a pixel, so the image must first be divided into blocks. Depending on actual requirements, the blocks may be rectangles, circles or other shapes. For example, in a video encoding and decoding scene the basic unit of processing is a macroblock, so when applied to that scene the label map is divided into the shape adapted to the macroblock, i.e., rectangular blocks. The invention does not limit the specific shape of the blocks.
Further, step 303 includes: extracting the sub-blocks, i.e., those blocks of the label map that contain at least part of the marked region; and judging in turn whether the ratio of marked pixels in each sub-block to the total pixels of that sub-block is greater than or equal to a threshold. In this embodiment the threshold is set to 80%; it can be chosen according to actual needs, and the invention does not limit its size.
If the ratio of marked pixels in the sub-block to the pixels of the sub-block is greater than or equal to the threshold, the block is marked in the coded block map; if the ratio is less than the threshold, the sub-block is not marked.
For example, in a video encoding and decoding scenario the minimum unit of processing is a macroblock, such as an 8 × 8 or 16 × 16 macroblock, or a smaller or larger one. Before encoding and decoding, the video image is therefore divided into rectangular blocks to generate coding blocks. To obtain coding blocks that carry labels, the video image must first undergo semantic segmentation. Specifically, the label map obtained by masking the video image is divided into a plurality of rectangular blocks. During semantic segmentation, the size of the rectangular blocks is determined according to the actual situation; the divided blocks may all have the same size or different sizes, for example relatively small blocks at the border of the marked region and relatively large blocks in its middle portion. The result of the semantic segmentation can then be used directly as coding blocks in video encoding and decoding.
Since the label map contains only one category of object, namely the marked region, the categories of the other objects are ignored: the objects in the other regions differ in category from the object in the marked region, and which categories they belong to does not need to be handled at present. As described above, the marked region is marked 1 and the other regions 0. Only the rectangular blocks involved in the region marked 1 therefore need to be judged: among the rectangular blocks containing the marked region, the ratio of marked pixels in each block to the total pixels of the block is computed; when the ratio is greater than or equal to the threshold, the block is marked with the object's category, and when the ratio is less than the threshold, the block is judged not to belong to that category.
Fig. 6 shows the generation of the coded block map in this embodiment. The label map is divided, by way of example, into a 4 × 4 grid of equally sized rectangular blocks. According to the divided grid, five rectangular blocks (those numbered 0, 1, 2, 3 and 4 in the figure) contain the marked region (say, a tree), and the ratio of marked pixels to block pixels is judged for each of the five blocks in turn.
The specific determination method will be described below.
Let the height and width of the label map (i.e., the video image) be h_i and w_i, and the height and width of the output image be h_f and w_f. The label map is divided into n × n (i.e., the aforementioned 4 × 4) rectangular boxes, and the height and width of each rectangular box are set to h_dk and w_dk, which satisfy:

    h_dk = h_i / n
    w_dk = w_i / n

That is, the height (or width) of the label map is divided equally among the rectangular boxes; if it is not evenly divisible, the height and width of the last rectangular box are rounded down.
Assume the coordinates of the irregular boundary in the label map are (x, y), the corresponding coordinates in the coded block map are (x_f, y_f), and f(x_i, y_i) denotes the pixel value at position (x_i, y_i), which is 1 inside the marked region and 0 elsewhere. Each rectangular box contains h_dk × w_dk pixels, and the number of pixels in the region enclosed by the rectangular box and the boundary is

    Σ f(x_i, y_i), summed over the pixels (x_i, y_i) inside the box.

The condition for judging whether the ratio of marked pixels in the extracted rectangular box to the pixels of the box is greater than or equal to the threshold is therefore:

    Σ f(x_i, y_i) / (h_dk × w_dk) ≥ 80%

If the above formula is satisfied, the region enclosed by the box belongs to the marked category.
If the rectangular box in which the tree is located meets this generation condition of the coded block map, the block is marked as a tree in the coded block map; otherwise, the box is not marked in the coded block map.
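The division and judgment described above can be sketched as follows, assuming the binary label map produced by the masking step; the equal-size grid, the handling of the remainder and the 80% threshold follow the text, while the function name is illustrative:

    import numpy as np

    def generate_coded_block_map(label_map, n, threshold=0.8):
        # Divide a binary label map into an n x n grid of rectangular blocks and
        # mark each block whose ratio of marked pixels is >= threshold.
        h_i, w_i = label_map.shape
        h_dk, w_dk = h_i // n, w_i // n  # per-block height/width, rounded down
        coded = np.zeros((n, n), dtype=np.uint8)
        for r in range(n):
            for c in range(n):
                # The last row/column absorbs the remainder when not evenly divisible.
                y1 = h_i if r == n - 1 else (r + 1) * h_dk
                x1 = w_i if c == n - 1 else (c + 1) * w_dk
                block = label_map[r * h_dk:y1, c * w_dk:x1]
                if block.mean() >= threshold:  # ratio of marked pixels to block pixels
                    coded[r, c] = 1
        return coded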
Step 304, superimposing all the generated coded block maps to obtain the semantic segmentation result, so as to identify the categories of the plurality of objects in the image.
Each coded block map generated in steps 301-303 contains only one category of object in the image, and together the generated coded block maps cover every block of the input image. Superimposing all of them yields the category of each object in the image: every block then has a corresponding category label. The labels include the categories of the objects in the image and/or an "other" label, where a block labelled "other" contains objects of multiple categories, none of whose pixel ratios meets the preset threshold; such a block therefore belongs to none of the categories and is labelled "other". Optionally, blocks without a label may also be labelled according to the actual situation, for example by comparing the pixel counts of the several categories within an "other" block and taking the category with the most pixels as its label.
Fig. 7 shows a schematic diagram of the superposition of coded block maps in an embodiment of the invention. Assume an image contains objects of two categories, with region A1 representing grass and region A2 representing sky; each category generates its own coded block map, giving two coded block maps in total, which respectively identify the blocks in which the two categories of object are located. Superimposing the two yields the final image segmentation result, in which each block has a corresponding label: in this embodiment, blocks that do not meet the preset threshold receive no label, while blocks that do meet it carry the label of the corresponding object category.
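A sketch of this superposition, assuming one binary coded block map per category and reserving label 0 for blocks that met no category's threshold (the category IDs are illustrative):

    import numpy as np

    def superimpose(coded_maps):
        # coded_maps: {category_id: binary n x n coded block map}, ids starting at 1.
        result = np.zeros_like(next(iter(coded_maps.values())))
        for category_id, coded in coded_maps.items():
            result[coded == 1] = category_id  # stamp the category onto its blocks
        return result

    # e.g. segmentation = superimpose({1: grass_blocks, 2: sky_blocks})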
Further, the process of generating the coded block map in steps 301-303 is implemented with a neural network model. Whereas traditional image segmentation algorithms depend on hand-crafted descriptors, run inefficiently and are limited by prior knowledge, a neural network can automatically compensate for missing prior knowledge through its expressive power and feature extraction capability, improving both operating efficiency and segmentation accuracy.
When the generation of the coded block map in steps 301-303 is implemented with a neural network model, the image data is first input into the model, which extracts the image features and generates the coded block map. In general, the model comprises a Backbone network, a Neck network and a Head network, and the method further includes: extracting image features at different scales from the video image in the Backbone network; and fusing those features in the Neck network to obtain the fused features of the image. Steps 301-303 then generate the coded block map from the fused features in the Head network.
The Backbone network is a classification network for extracting image features, generally comprising a series of convolutional, normalization and activation layers. Its main function is to extract useful information from the picture for the subsequent neural network computation. Mature classification networks are generally chosen as the Backbone, since their feature extraction ability has been proven on classification problems; commonly used ones are VGG16, VGG19, GoogLeNet, ResNet50, ResNet101, MixNets, HarDNet and the like. When such a network is used as the Backbone, the officially released pre-trained model parameters are usually loaded directly, the original network structure is retained, and a further network is designed to work on the extracted image features, namely the Head network in the figure; this amounts to migrating a classification Backbone trained to convergence on ImageNet into the segmentation task. Since the loaded Backbone model already has the ability to extract features, it can be fine-tuned during training to make it better suited to the task.
The Neck network fuses the features extracted by the Backbone at different scales to obtain a richer expression of the image information. A Feature Pyramid Network (FPN) is commonly used as the Neck. The FPN is an efficient CNN feature extraction method that can extract the features of each dimension of the image output by the Backbone network.
The Head network is designed in the present invention as the network that computes the coded block map: steps 301-303 are implemented in the Head network to generate the coded block map from the fused features.
By passing the image through these three networks in sequence, the image features can be extracted quickly and effectively, the categories of the objects in the image obtained, and the coded block map generated.
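A hedged PyTorch-style skeleton of such a Backbone-Neck-Head pipeline follows; the ResNet50 Backbone matches one of the networks named above, but the single 1 x 1 fusion convolution standing in for an FPN Neck and the pooling Head are assumed minimal designs, not the patent's specific architecture:

    import torch
    import torch.nn as nn
    import torchvision

    class BlockSegmentationNet(nn.Module):
        def __init__(self, num_classes, n_blocks):
            super().__init__()
            # Backbone: pre-trained classification network used as feature extractor.
            resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])
            # Neck: a single 1x1 convolution as a stand-in for multi-scale FPN fusion.
            self.neck = nn.Conv2d(2048, 256, kernel_size=1)
            # Head: one class-score vector per coding block (n_blocks x n_blocks grid).
            self.head = nn.Sequential(
                nn.AdaptiveAvgPool2d(n_blocks),
                nn.Conv2d(256, num_classes, kernel_size=1),
            )

        def forward(self, x):
            return self.head(self.neck(self.backbone(x)))  # (B, num_classes, n, n)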
After the features are extracted through the above network structure and the coded block map is generated, the network must further be trained on the generated coded block maps, observing the training through the trend of the loss-function curve (Loss curve). Neural network training is a process of continuously adjusting parameters to optimize the network, bringing the result computed by the network as close as possible to the true result by adjusting the parameters of the hidden and output layers; it mainly comprises forward propagation and back propagation. Forward propagation computes, layer by layer, the classification result of the input image from the weights and activation functions of the trained model; back propagation computes the loss function from the forward result and the ground truth, and generally uses gradient descent to update the weights and biases. The loss function measures the degree of difference between the training result and the true result and is a non-negative real-valued function. Training parameters are adjusted continuously according to the loss function until it no longer changes, which indicates the network is well trained.
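One training step of the forward/back propagation just described might look as follows, with cross-entropy assumed as the loss (the patent does not name one) and the targets being the reduced pixel point maps introduced in the next paragraph:

    import torch.nn.functional as F

    def train_step(model, optimizer, images, pixel_point_targets):
        # Forward propagation: layer-by-layer computation of class scores.
        logits = model(images)  # (B, num_classes, n, n)
        # Loss: degree of difference between training result and ground truth.
        loss = F.cross_entropy(logits, pixel_point_targets)  # targets: (B, n, n) long
        # Back propagation with gradient descent on weights and biases.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()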
During training, the feature map of the coded block map is large and consumes great computing power, while the category of the object is the same within each rectangular box of the coded block map. To reduce the computation and accelerate training, this embodiment therefore maps the coded block map to a pixel point map for training and for computing the loss function value: each pixel point stands in for one rectangular box, reducing the size of the feature map required for training and improving training efficiency.
Fig. 8 shows a flowchart of training a coded block diagram in the present embodiment, where the specific method includes:
step 801, mapping the coding block map into a pixel point map;
step 802, calculating a loss function according to the pixel point diagram; and
step 803, training the coded block map generation according to the loss function.
Specifically, in step 801, each rectangular block of the coded block map whose ratio of marked pixels to block pixels is greater than or equal to the threshold is mapped to a pixel of value 1, and the other parts are mapped to pixels of value 0. That is, let (x_f, y_f) be the coordinates of the pixel point corresponding to a coding block, and g(x_f, y_f) be the pixel value at output coordinate (x_f, y_f); then, for the coding block corresponding to that coordinate,

    g(x_f, y_f) = 1  if Σ f(x_i, y_i) / (h_dk × w_dk) ≥ threshold (the block meets the threshold),
    g(x_f, y_f) = 0  otherwise.
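Under the assumption that the coded block map is stored at full image resolution (one value per pixel), the mapping of step 801 shrinks each rectangular block to one pixel point; a sketch:

    import numpy as np

    def to_pixel_point_map(coded_block_map, n):
        # Map an image-resolution coded block map to an n x n pixel point map:
        # one pixel per rectangular block, value 1 if the block is marked, else 0.
        h, w = coded_block_map.shape
        h_dk, w_dk = h // n, w // n
        points = np.zeros((n, n), dtype=np.uint8)
        for r in range(n):
            for c in range(n):
                # A marked block is uniformly 1, so sampling one pixel suffices.
                points[r, c] = coded_block_map[min(r * h_dk, h - 1),
                                               min(c * w_dk, w - 1)]
        return points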
In order to analyze whether the object type corresponding to each rectangular block in the generated coding block map is accurate, it is often necessary to perform precision analysis on the generated coding block map. Fig. 9 shows a flowchart of the precision analysis in the present embodiment, wherein the method further includes:
step 901, performing inverse mapping on the pixel point diagram to obtain an inverse mapping diagram of the size of the video image;
and 902, carrying out precision analysis on the result according to the inverse mapping map.
Specifically, precision analysis of the result must be performed on a map of the original image size. Since the mapped pixel point map is a reduced version of the coded block map, the pixel point map is inversely mapped to obtain an inverse map of the video image size, and the precision analysis of the result is carried out on that inverse map.
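Conversely, the inverse mapping of step 901 expands each pixel point back to its block; a minimal sketch using np.repeat, padding the remainder by edge replication when the image size is not evenly divisible:

    import numpy as np

    def inverse_map(pixel_point_map, image_h, image_w):
        # Expand an n x n pixel point map back to image size for precision analysis.
        n = pixel_point_map.shape[0]
        up = np.repeat(np.repeat(pixel_point_map, image_h // n, axis=0),
                       image_w // n, axis=1)
        pad_h, pad_w = image_h - up.shape[0], image_w - up.shape[1]
        return np.pad(up, ((0, pad_h), (0, pad_w)), mode="edge")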
The semantic segmentation result of the image in units of blocks obtained by the invention can be applied to a scene processed in units of areas. The application scenario of the present disclosure is described by taking a video encoding/decoding scenario as an example.
Encoding is performed coding block by coding block within each frame, and the coding blocks change from frame to frame. In a sequence of consecutive frames, however, the background is constant, so the previous and following frames share much repeated content, and a constant background need not be re-encoded. Yet disturbances from the external environment, which ought to be negligible in the given application scene, still register as changes and get encoded. Since such useless encoding of repeated content wastes a great deal of encoding effort, the target objects in the image are detected before encoding, i.e., attention is paid only to the objects that matter and that cause changes between frames.
Video images are strongly correlated and contain a large amount of redundant information, including spatial redundancy and temporal redundancy. The compression technique removes the redundant information therein.
In video encoding and decoding, the parameters describing motion are obtained by predictive coding, which comprises two prediction methods, intra-frame prediction and inter-frame prediction, corresponding respectively to the spatial and temporal redundancy in video images. Predictive coding yields motion parameters, i.e., residual blocks. After transformation and quantization, these are rearranged, combined and passed to entropy coding, finally producing the bitstream information. Conversely, applying a series of inverse operations, such as inverse quantization and inverse transformation, to the quantized image data decodes the encoded image data into a reconstructed frame.
Specifically, the video image to be encoded is first partitioned into non-overlapping blocks. The basic macroblock unit in the embodiment of the present invention is 16 × 16, though it may be smaller or larger, and the macroblock is used as the basic coding unit. Each macroblock may in turn be divided into small sub-blocks.
Intra-frame prediction predicts the uncoded area from the already-coded area of the same frame: the macroblock most similar to the macroblock to be predicted is found among the already-coded macroblocks of the current frame, giving a residual block, from which the spatial redundancy within the frame is eliminated. Inter-frame prediction judges the correlation between frames from the changes between preceding and following frames, judges the similarity of their spatial structure accordingly, and encodes the frames so as to reduce the repeated content between them. In practice, inter prediction finds similar macroblocks through motion estimation and motion compensation, and compression is performed on the residual macroblocks between the similar macroblocks.
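For concreteness, motion estimation by block matching can be sketched as an exhaustive sum-of-absolute-differences (SAD) search; this is a textbook illustration, not the patent's specific predictor:

    import numpy as np

    def motion_search(cur_block, ref_frame, y, x, radius=8):
        # Find the motion vector (dy, dx) minimizing the SAD against the reference.
        bh, bw = cur_block.shape
        best, best_mv = np.inf, (0, 0)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                yy, xx = y + dy, x + dx
                if 0 <= yy and yy + bh <= ref_frame.shape[0] and \
                   0 <= xx and xx + bw <= ref_frame.shape[1]:
                    cand = ref_frame[yy:yy + bh, xx:xx + bw].astype(int)
                    sad = np.abs(cur_block.astype(int) - cand).sum()
                    if sad < best:
                        best, best_mv = sad, (dy, dx)
        return best_mv  # the residual block is cur_block minus the matched block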
In order to reduce the encoding effort, the video encoding and decoding process does not transmit all the data of a given frame, but rather the relative positions between frames. Fig. 10 shows a flow chart of video encoding and decoding according to the present invention. The video sequence serves as the input signal, and the required compressed bitstream is finally output through the following steps:
in step 1001, predictive coding is performed. The prediction comprises intra-frame prediction and inter-frame prediction, wherein the intra-frame prediction is a method for predicting an uncoded area according to a certain rule by means of the video content of a current image frame and a coded area adjacent to the intra-frame, and the method can eliminate spatial redundancy; inter-frame prediction is a method of predicting an uncoded current frame using video content of a coded frame to remove temporal redundancy. The video sequence is used as an input signal, and parameters representing the motion relation between adjacent frames are obtained after predictive coding.
In step 1002, transform coding is performed. The resulting parameters are transformed, i.e., the time-domain representation of the signal is changed. Specifically, the spatial-domain signal is converted into a frequency-domain signal by an orthogonal transform, effectively removing the frequency-domain correlation.
In step 1003, quantization coding is performed. The frequency-domain signal obtained in the previous step is quantized; in video image coding, scalar quantization is usually applied to the frequency-domain coefficients obtained by transforming the prediction residual.
In step 1004, entropy coding is performed. Entropy coding encodes without losing any information, following the principle of entropy; information entropy is the average amount of information of a source (a measure of uncertainty). Common entropy coding methods are Shannon coding, Huffman coding and arithmetic coding. The processed data is finally sent to the entropy coding stage to eliminate the statistical redundancy of the source, and the required compressed bitstream is output.
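Steps 1002 and 1003 above can be made concrete with a tiny sketch using an orthogonal DCT from SciPy; the quantization step size is an illustrative assumption, and entropy coding (step 1004) is omitted:

    import numpy as np
    from scipy.fft import dctn, idctn

    def encode_block(residual, q_step=8.0):
        # Step 1002: orthogonal transform, spatial domain -> frequency domain.
        coefficients = dctn(residual, norm="ortho")
        # Step 1003: scalar quantization of the frequency-domain coefficients.
        return np.round(coefficients / q_step)

    def decode_block(levels, q_step=8.0):
        # Inverse quantization and inverse transform recover the reconstructed residual.
        return idctn(levels * q_step, norm="ortho")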
The image semantic segmentation result of this embodiment is mainly applied in the predictive coding of step 1001 above. After the video image to be processed enters the coding system, the system divides the image frame into different coding blocks according to the image information, and subsequent coding is carried out in units of coding blocks. After the coding units are divided, intra-frame or inter-frame prediction is selected, and different prediction modes perform different operations on the coding blocks: either the correlation of adjacent intra-frame coding blocks is analysed with a target detection algorithm, judging whether the position of a coding block has changed and hence whether it is background or an object to be coded, and the objects to be coded are encoded; or a coding-block prediction algorithm infers the pixel values within a coding block from different reference frames. In some scenes the coding block containing the background may also change abruptly in an adjacent frame owing to external factors. In an automatic driving scene, for example, sky, grass, trees and vehicles are present simultaneously; ideally the sky, grass and trees are still, and only moving vehicles need to be detected and their frames encoded, but wind in the real environment may cause the trees and grass to sway, so that during target detection the background is easily misrecognized as a target object and encoded, consuming coding effort.
Specifically, background images or other static images of no interest should in theory show no abrupt change from frame to frame. Once the algorithm identifies that the position of a coding block has not changed, that block is not repeatedly encoded in subsequent coding. But potential external disturbance can move an originally static object; for example, wind may shift the position of an originally still tree, and because of that change the tree is mistaken for a target object and encoded, so the algorithm errs when judging whether the frame has changed abruptly, wasting computing power.
In the present invention, objects of no interest, such as the background in the video image, are identified in advance at the stage where coding blocks are generated, and labelled coding blocks are produced; that is, the image semantic segmentation result of the invention can be applied directly in video encoding and decoding to generate the corresponding coding blocks. When video is encoded and decoded, labelled coding blocks are generated first, rather than partitioning the video image arbitrarily. The size of the blocks is determined from the irregular shapes and sizes of the objects in the video image, and each coding block is labelled with the semantic category of its object, for example category 1 for tree, category 2 for sky, and so on. When a coding block is to be encoded because of positional movement between frames, if its label shows it to be background or another object the user does not care about, it is simply not encoded, so even if a static object shifts position under environmental disturbance, the accuracy and efficiency of coding are unaffected. The labels on the coding blocks thus avoid the wasted computation caused by environmental disturbance, greatly improving coding efficiency and accuracy in the video encoding and decoding process. By encoding the labelled coded block map intra-frame or inter-frame, the invention reduces the influence of the non-target object group on the coding.
The method of processing encoded and decoded video images according to the embodiment of the present invention can be applied to the board card 10 and to single or multiple processor cores in the combined processing device 20. Fig. 11 is a flowchart of the method applied to a video encoding and decoding scene according to an embodiment of the present invention. The method specifically includes:
step 1101, the image object in the video coding and decoding scene is distinguished into a non-target object group and a target object group by the result of the image semantic segmentation.
The non-target object group refers to an object portion that does not need to be repeatedly coded and decoded during coding and decoding, and the target object group refers to a portion that needs to be coded and decoded during coding and decoding.
In a possible embodiment, since video encoding and decoding are based on the changes of objects between two adjacent frames, objects whose position shifts are encoded. Thus the non-target object group may consist of objects that remain stationary, while the target object group consists of objects whose position may change.
In another possible embodiment, the objects that need no encoding may be the parts the user does not focus on in the actual scene, and the parts to encode are those of interest in that scene. For example, in an automatic driving scene, the background sky, grass and trees never change position and thus belong to the non-target object group, while vehicles, pedestrians and the like may change position between frames and thus belong to the target object group. The invention does not limit the notions of target object and non-target object in any way.
Step 1102, calculating the position change values of each object in the target object group between the preceding and following frames.
Based on the objects in the target object group obtained in step 1101, the position change of each object between the preceding and following frames is calculated. The calculation method is common in the art and is not described in detail here.
Step 1103, performing encoding and decoding operations on the image according to the position change value.
When the position change values of an object between the preceding and following frames meet a certain condition, the object has moved, and encoding and decoding operations are performed on it. The condition is a threshold set by the technician according to the actual situation, or a change value recognized in the field; the invention does not limit it in any way.
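A sketch of the decision logic of steps 1101 to 1103, assuming per-block category labels taken from the segmentation result; the non-target label set and the change threshold are illustrative placeholders:

    import numpy as np

    NON_TARGET = {1, 2, 3}  # hypothetical labels for e.g. sky, grass, trees

    def blocks_to_encode(prev_frame, cur_frame, block_labels, block=16, threshold=5.0):
        # Return (row, col) indices of the coding blocks that should be encoded.
        to_encode = []
        rows, cols = block_labels.shape
        for r in range(rows):
            for c in range(cols):
                if block_labels[r, c] in NON_TARGET:
                    continue  # labelled background: skip even if wind perturbs it
                prev = prev_frame[r*block:(r+1)*block, c*block:(c+1)*block].astype(int)
                cur = cur_frame[r*block:(r+1)*block, c*block:(c+1)*block].astype(int)
                if np.abs(cur - prev).mean() >= threshold:  # position/content change
                    to_encode.append((r, c))
        return to_encode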
Fig. 12 shows an apparatus 1200 for semantically segmenting an image according to another embodiment of the present invention, used to perform the above method. The apparatus 1200 comprises a selection module 1201, a marking module 1202, a generation module 1203 and a superposition module 1204: the selection module 1201 selects one of the plurality of objects in turn as the object to be marked; the marking module 1202 marks the area where the object to be marked is located to form a label map; the generation module 1203 divides the label map into a plurality of rectangular blocks to generate a coded block map; and the superposition module 1204 superimposes all the generated coded block maps to obtain the semantic segmentation result.
The generation module 1203 includes an extraction module 1213 and a judgment module 1223. The extraction module 1213 extracts the rectangular blocks of the label map that contain the marked region, and the judgment module 1223 judges in turn whether the ratio of marked pixels in each such rectangular block to the pixels of the block is greater than or equal to the threshold: if the ratio is greater than or equal to the threshold, the rectangular block is marked in the coded block map; if the ratio is less than the threshold, it is not marked in the coded block map.
Another embodiment of the invention is a computer-readable storage medium on which computer program instructions for semantically segmenting an image are stored; when the instructions are executed by a server comprising a processor and a memory, the processor executes the computer program instructions stored in the memory. In some implementation scenarios, the integrated units may be implemented in the form of software program modules. If implemented as software program modules and sold or used as an independent product, the integrated units may be stored in a computer-readable memory. In this regard, when aspects of the present invention are embodied in a software product (e.g., a computer-readable storage medium), the software product may be stored in a memory and may include instructions causing a computer device (e.g., a personal computer, a server or a network device) to perform some or all of the steps of the methods described in the embodiments of the present invention. When the solution of the invention is embodied as an electronic device comprising a processor and a memory for storing processor-executable instructions, the processor is configured to invoke the instructions stored in the memory to perform the above method of encoding and decoding video images. The memory may include, but is not limited to, a USB disk, a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk or an optical disk.
The invention identifies the object categories in the video image in advance and generates a labelled coded block map; according to that map, only the coding blocks containing the object categories to be encoded undergo subsequent encoding. This reduces the time required for encoding and decoding, avoids the repeated encoding and decoding caused by external interference, saves the computing power consumed in encoding, and improves encoding and decoding efficiency.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series and combination of acts, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one of ordinary skill in the art will appreciate that the several embodiments disclosed herein can also be realized in ways not described here. For example, the units in the foregoing embodiments of the electronic device or apparatus are split according to logical function, and other splitting manners are possible in an actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connectivity between different units or components, the connections discussed above in connection with the figures may be direct or indirect couplings. In some scenarios, such a direct or indirect coupling involves a communication connection through an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, some or all of the units may be selected to achieve the purposes of the solutions of the embodiments of the invention. Furthermore, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit, or each unit may exist physically separately.
In other implementation scenarios, an integrated unit may also be implemented in hardware, that is, as a specific hardware circuit, which may include digital circuits and/or analog circuits. The physical realization of such a circuit's hardware structure may include, but is not limited to, physical devices such as transistors or memristors. In this regard, the various devices described herein (e.g., the computing device or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic or magneto-optical storage media), and may be, for example, a Resistive Random Access Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, or a RAM.
The foregoing may be better understood in light of the following clauses:
Clause A1, a method of semantically segmenting an image, the image displaying a plurality of objects, the method comprising: sequentially selecting one of the objects as an object to be marked, and performing the following steps: marking the area where the object to be marked is located to form a label map; dividing the label map into a plurality of blocks to generate a coding block map; and superimposing all the generated coding block maps to obtain a semantic segmentation result, so as to identify the categories of the plurality of objects in the image.
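As a rough illustration of clause A1 only, the following Python sketch builds one label map per object, divides each into coding blocks, and superimposes the per-object coding block maps. The block size, the overwrite rule used for superposition, and all names are assumptions, not details fixed by the patent.

```python
import numpy as np

BLOCK = 16  # assumed coding-block size in pixels; the patent does not fix it

def label_map_for(mask, obj_id):
    """Mark the area where one object is located: 1 inside it, 0 elsewhere."""
    return (mask == obj_id).astype(np.uint8)

def divide_into_blocks(label_map, block=BLOCK, threshold=0.8):
    """Divide a label map into blocks and mark each block whose ratio of
    marked pixels reaches the threshold (cf. clauses A2/A3)."""
    h, w = label_map.shape
    grid = label_map[:h - h % block, :w - w % block]
    cells = grid.reshape(h // block, block, w // block, block)
    ratio = cells.mean(axis=(1, 3))        # fraction of marked pixels per block
    return (ratio >= threshold).astype(np.uint8)

def segment(mask, object_ids):
    """Superimpose all per-object coding block maps into one result."""
    result = None
    for obj_id in object_ids:
        bm = divide_into_blocks(label_map_for(mask, obj_id))
        if result is None:
            result = np.zeros_like(bm, dtype=np.int64)
        result[bm == 1] = obj_id           # this object's blocks take its label
    return result

# A 64x64 mask with object 1 in the top-left quadrant, object 2 at the bottom.
mask = np.zeros((64, 64), dtype=np.int64)
mask[:32, :32] = 1
mask[48:, :] = 2
print(segment(mask, object_ids=[1, 2]))   # a 4x4 coding block map of labels
```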
Clause A2, the method of clause A1, wherein the blocks in the label map that contain the marked area are sub-blocks, the dividing step comprising: extracting the sub-blocks; sequentially judging whether the ratio of marked-area pixels in a sub-block to the pixels of that sub-block is greater than or equal to a threshold; and, if so, marking the sub-block in the coding block map.
Clause A3, the method of clause A2, wherein the threshold is 80%.
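The ratio test of clauses A2 and A3 can be sketched for a single sub-block as follows. The 4x4 sub-blocks and the helper name are illustrative; the 80% figure is the threshold stated in clause A3.

```python
import numpy as np

THRESHOLD = 0.8  # the 80% ratio of clause A3

def should_mark(sub_block):
    """Judge one sub-block: it is marked in the coding block map when the
    ratio of marked-area pixels to all of its pixels reaches the threshold."""
    return np.count_nonzero(sub_block) / sub_block.size >= THRESHOLD

# 13 of 16 pixels marked (ratio 0.8125): the sub-block is kept...
dense = np.ones((4, 4), dtype=np.uint8)
dense[0, :3] = 0
# ...while 8 of 16 pixels (ratio 0.5) is not enough.
sparse = np.zeros((4, 4), dtype=np.uint8)
sparse[:2, :] = 1
print(should_mark(dense), should_mark(sparse))  # True False
```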
Clause A4, the method of clause A2, wherein the method is implemented based on a neural network model comprising a Backbone network, a Neck network, and a Head network, the method further comprising: extracting, in the Backbone network, image features at different scales from the video image; and fusing the image features in the Neck network to obtain fused features of the video image; wherein the dividing step generates the coding block map in the Head network according to the fused features.
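A shape-level PyTorch sketch of the Backbone/Neck/Head structure named in clause A4 is given below. The patent does not specify the layers, so every module here is an illustrative stand-in: a two-scale backbone, an additive fusion neck, and a head whose output stride equals an assumed coding-block size of 16 pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSegNet(nn.Module):
    def __init__(self, num_classes, block=16):
        super().__init__()
        # Backbone: features at two scales (stride 4 and stride 8).
        self.stem = nn.Conv2d(3, 32, 3, stride=4, padding=1)
        self.down = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        # Neck: project the deeper scale and fuse it with the shallower one.
        self.lateral = nn.Conv2d(64, 32, 1)
        # Head: one class-logit vector per coding block (total stride = block).
        self.head = nn.Conv2d(32, num_classes, 3, stride=block // 4, padding=1)

    def forward(self, x):
        c1 = F.relu(self.stem(x))                      # stride-4 features
        c2 = F.relu(self.down(c1))                     # stride-8 features
        fused = c1 + F.interpolate(self.lateral(c2),   # Neck: upsample + add
                                   size=c1.shape[-2:], mode="nearest")
        return self.head(fused)                        # (N, C, H/block, W/block)

logits = BlockSegNet(num_classes=3)(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 3, 8, 8]): one prediction per 16x16 block
```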
Clause A5, the method of clause A4, wherein the method further comprises: mapping the coding block map into a pixel point map; calculating a loss function from the pixel point map; and training the coding block map according to the loss function.
Clause A6, the method of clause A5, wherein the method further comprises: inversely mapping the pixel point map to obtain an inverse map of the video-image size; and performing precision analysis on the result according to the inverse map.
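One plausible reading of clauses A5 and A6 is a nearest-neighbour expansion from blocks to pixels, a pixel-level loss on the expanded map, and an accuracy check on the image-sized inverse map. The sketch below assumes that reading; the block size and all function names are hypothetical.

```python
import numpy as np

BLOCK = 16  # assumed coding-block size

def block_to_pixel(coding_block_map, block=BLOCK):
    """Map each coding-block label onto every pixel of that block
    (nearest-neighbour expansion), producing a pixel point map."""
    ones = np.ones((block, block), dtype=coding_block_map.dtype)
    return np.kron(coding_block_map, ones)

def pixel_accuracy(pixel_map, ground_truth):
    """Precision analysis: fraction of pixels whose label matches."""
    return float((pixel_map == ground_truth).mean())

block_map = np.array([[0, 1],
                      [1, 1]])
pixel_map = block_to_pixel(block_map)       # shape (32, 32), i.e. image-sized
gt = np.ones_like(pixel_map)
gt[:16, :16] = 0                            # ground truth matching exactly
print(pixel_map.shape, pixel_accuracy(pixel_map, gt))  # (32, 32) 1.0
```

A pixel-level loss such as cross-entropy between such a pixel point map and the ground-truth labels could then drive training, as clause A5 describes.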
Clause A7, the method of any of clauses A1-A4, wherein, when the blocks are rectangular block maps, the result is used in a video coding and decoding scenario.
Clause A8, the method of clause A7, wherein the method comprises: distinguishing, according to the result, the image objects in the video coding and decoding scenario into a non-target object group and a target object group; calculating position change values of the objects in the target object group between preceding and succeeding frames; and performing coding and decoding operations on the image according to the position change values.
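For clause A8, the sketch below measures an object's position change between consecutive frames as the displacement of its bounding-box centre and re-encodes only the targets that moved. The centre-displacement metric, the threshold, and the bounding boxes are all assumptions for illustration.

```python
def centre(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def position_change(prev_box, cur_box):
    """Position change value of one target object between two frames,
    taken here as the Euclidean displacement of its box centre."""
    (px, py), (cx, cy) = centre(prev_box), centre(cur_box)
    return ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5

MOVE_EPS = 1.0  # pixels; illustrative threshold for "has moved"

# Boxes (x0, y0, x1, y1) of the target group in two consecutive frames.
prev = {"person": (10, 10, 50, 90), "car": (100, 40, 180, 80)}
cur  = {"person": (12, 10, 52, 90), "car": (100, 40, 180, 80)}

to_encode = [name for name in cur
             if position_change(prev[name], cur[name]) > MOVE_EPS]
print(to_encode)  # ['person']: the static car is not re-encoded
```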
Clause A9, an electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of clauses A1-A8.
Clause A10, a computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method of any one of clauses A1-A8.
Clause A11, a device for semantically segmenting an image, the device comprising a selection module, a marking module, a dividing module, and a superimposing module, wherein the selection module is configured to sequentially select one of the plurality of objects as an object to be marked; the marking module is configured to mark the area where the object to be marked is located to form a label map; the dividing module is configured to divide the label map into a plurality of blocks to generate a coding block map; and the superimposing module is configured to superimpose all the generated coding block maps to obtain the semantic segmentation result.
Clause A12, the device according to clause A11, wherein the dividing module comprises an extracting module and a judging module; the extracting module is configured to extract the sub-blocks in the label map that contain the marked area, and the judging module is configured to sequentially judge whether the ratio of marked-area pixels in a sub-block to the pixels of that sub-block is greater than or equal to a threshold; if so, the sub-block is marked in the coding block map.
The above embodiments of the present invention have been described in detail, and specific examples have been used herein to explain the principles and implementations of the invention. The above description of the embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, a person skilled in the art may, according to the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (11)

1. A method of semantically segmenting an image, the image displaying a plurality of objects, the method comprising:
sequentially selecting one of the objects as an object to be marked, and performing the following steps:
marking the area where the object to be marked is located to form a label map;
dividing the label map into a plurality of blocks to generate a coding block map; and
superimposing all the generated coding block maps to obtain a semantic segmentation result, so as to identify the categories of the plurality of objects in the image.
2. The method of claim 1, wherein at least some of the blocks in the label map that contain the marked area are sub-blocks, and the dividing step further comprises:
extracting the sub-blocks;
sequentially judging whether the ratio of marked-area pixels in a sub-block to the pixels of that sub-block is greater than or equal to a threshold; and
if so, marking the sub-block in the coding block map.
3. The method of claim 2, wherein the threshold is 80%.
4. The method of claim 2, wherein the method is implemented based on a neural network model, the neural network model comprising a Backbone network, a Neck network, and a Head network, the method further comprising:
extracting, in the Backbone network, image features at different scales from the image; and
fusing the image features in the Neck network to obtain fused features of the image;
wherein the dividing step generates the coding block map in the Head network according to the fused features.
5. The method of claim 1, further comprising:
mapping the coding block map into a pixel point map;
calculating a loss function from the pixel point map; and
training the coding block map according to the loss function.
6. The method of claim 5, further comprising:
inversely mapping the pixel point map to obtain an inverse map of the image size; and
performing precision analysis on the result according to the inverse map.
7. The method according to any of claims 1-4, wherein, when the blocks are rectangular block maps, the result is used in a video coding and decoding scenario.
8. The method of claim 7, wherein the method comprises:
distinguishing, according to the result, the image objects in the video coding and decoding scenario into a non-target object group and a target object group;
calculating position change values of the objects in the target object group between preceding and succeeding frames; and
performing coding and decoding operations on the image according to the position change values.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any one of claims 1 to 8.
10. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 8.
11. An apparatus for semantically segmenting an image, the apparatus comprising a selection module, a marking module, a dividing module, and a superimposing module, wherein,
the selection module is configured to sequentially select one of the plurality of objects as an object to be marked;
the marking module is configured to mark the area where the object to be marked is located to form a label map;
the dividing module is configured to divide the label map into a plurality of blocks to generate a coding block map; and
the superimposing module is configured to superimpose all the generated coding block maps to obtain the semantic segmentation result.
CN202011367773.5A 2020-11-27 2020-11-27 Method and device for semantically segmenting image, electronic equipment and storage medium Pending CN114565773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011367773.5A CN114565773A (en) 2020-11-27 2020-11-27 Method and device for semantically segmenting image, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011367773.5A CN114565773A (en) 2020-11-27 2020-11-27 Method and device for semantically segmenting image, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114565773A true CN114565773A (en) 2022-05-31

Family

ID=81712771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011367773.5A Pending CN114565773A (en) 2020-11-27 2020-11-27 Method and device for semantically segmenting image, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114565773A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134526A (en) * 2022-06-28 2022-09-30 润博全景文旅科技有限公司 Image coding method, device and equipment based on cloud control

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination