CN112818955B - Image segmentation method, device, computer equipment and storage medium - Google Patents

Image segmentation method, device, computer equipment and storage medium

Info

Publication number
CN112818955B
Authority
CN
China
Prior art keywords
image
features
feature
text
target
Prior art date
Legal status
Active
Application number
CN202110294414.XA
Other languages
Chinese (zh)
Other versions
CN112818955A (en)
Inventor
黄少飞
王飞
钱晨
刘偲
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110294414.XA priority Critical patent/CN112818955B/en
Publication of CN112818955A publication Critical patent/CN112818955A/en
Application granted granted Critical
Publication of CN112818955B publication Critical patent/CN112818955B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Abstract

The present disclosure provides an image segmentation method, apparatus, computer device, and storage medium. The method comprises: acquiring a target video segment containing an image to be processed and a description text corresponding to the image to be processed; respectively extracting target image features corresponding to the image to be processed, target video features corresponding to the target video segment, and first text features corresponding to the description text; respectively fusing the target image features and the target video features with the first text features to obtain fused image features and fused video features; and segmenting the image to be processed according to the fused image features and the fused video features to obtain an image segmentation result matched with the description text. Because the image segmentation result is determined by combining the target image features of the image to be processed with the target video features of the target video segment, the method and apparatus avoid the loss of accuracy that occurs when the features of the target frame in a video segment are confused by the features of other frames.

Description

Image segmentation method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the technical field of image processing, and in particular, to an image segmentation method, an image segmentation apparatus, a computer device, and a storage medium.
Background
Language-query-based video object segmentation retrieves, within a video, the target that conforms to a language description of its actions and appearance attributes, and produces a complete segmentation mask for that target.
Because the content of different frames differs slightly in space, and video-clip-based temporal modeling methods do not account for these slight differences, the visual features of the target frame in a video clip become confused with those of other frames, which disturbs the segmentation network and yields inaccurate image segmentation results.
Disclosure of Invention
The embodiment of the disclosure at least provides an image segmentation method, an image segmentation device, computer equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides an image segmentation method, including: acquiring a target video segment containing an image to be processed and a description text corresponding to the image to be processed; respectively extracting target image features corresponding to the image to be processed, target video features corresponding to the target video segment, and first text features corresponding to the description text; respectively fusing the target image features and the target video features with the first text features to obtain fused image features and fused video features; and segmenting the image to be processed according to the fused image features and the fused video features to obtain an image segmentation result matched with the description text.
As can be seen from the foregoing description, in the embodiments of the present disclosure, by determining the image segmentation result of the image to be processed by combining the target image feature of the image to be processed and the target video feature of the target video segment including the image to be processed, the image segmentation of the image to be processed by combining the spatial information and the time sequence information of the video can be implemented.
In an alternative embodiment, the descriptive text includes a plurality of descriptive character segments; the fusing the target image feature and the first text feature to obtain a fused image feature comprises the following steps: according to the target image characteristics and the first text characteristics, determining the matching degree between each description character segment and the image to be processed, and obtaining a plurality of target matching degrees; determining language feature information of the descriptive text according to the target matching degrees and the first text features; the language characteristic information is used for describing appearance attribute characteristics of the image to be processed; and fusing the language characteristic information and the target image characteristic to obtain the fused image characteristic.
As can be seen from the above description, language feature information may be understood as a feature associated with an appearance attribute of an image to be processed in a description text, and therefore, in the embodiment of the present disclosure, effective information in the description text may be captured quickly and effectively, so as to adaptively extract a portion associated with the appearance of the image to be processed in the description text. When the language characteristic information and the target image characteristic are fused to obtain the fused image characteristic, the target to be segmented can be retrieved from the image to be processed more accurately according to the fused image characteristic, so that the accuracy of image segmentation is improved.
In an optional implementation manner, the determining the matching degree between each description character segment and the image to be processed according to the target image feature and the first text feature to obtain a plurality of target matching degrees includes: determining cross-modal attention information between the target image feature and the first text feature; the cross-modal attention information is used for representing the matching degree between each description character segment and each image position in the image to be processed; and calculating the matching degree between each description character segment and the image to be processed according to the cross-modal attention information to obtain a plurality of target matching degrees.
In an alternative embodiment, the first text feature includes: feature information of each description character segment in a plurality of character segments of the description text; the determining language feature information of the descriptive text according to the target matching degrees and the first text features comprises the following steps: and carrying out weighted summation on the target matching degrees and the characteristic information of each description character segment to obtain the language characteristic information.
According to the description, the attention value between each description character segment and each image position is calculated, and then the attention value between the description character segment and the image to be processed is determined according to the attention value, so that language characteristic information is determined according to the attention value, the attention mechanism can be utilized to automatically extract the characteristics associated with the appearance attribute of the image to be processed in the description text, effective information in the description text can be effectively captured, and result prediction is guided, so that the retrieval accuracy is improved.
In an optional implementation manner, the fusing the language feature information and the target image feature to obtain the fused image feature includes: filtering the target image features according to the language feature information to obtain image features matched with the language feature information; and merging the determined matched image features and the target image features to obtain the fusion image features.
According to the description, the image features matched with the language feature information can be obtained by filtering the target image features, so that the target to be segmented can be accurately segmented, and then the determined matched image features and the target image features are subjected to summation operation, so that the target image features of the image to be processed can be reserved in the fused image features, and the algorithm optimization of the image segmentation method provided by the disclosure is facilitated.
In an optional implementation manner, the target image features include a plurality of levels of image features obtained by processing the image to be processed by a plurality of network layers of a first neural network; the target video features comprise a plurality of levels of video features obtained by processing the target video segments by a plurality of network layers of a second neural network; the fusing the target image feature and the target video feature with the first text feature to obtain a fused image feature and a fused video feature, including: fusing the image features of each level in the image features of the multiple levels with the first text features to obtain the fused image features; and fusing the video features of each level in the video features of the multiple levels with the first text features to obtain the fused video features.
In the embodiment of the disclosure, the target image features and the first text features are fused layer by layer in a hierarchical manner, so that more comprehensive image features can be obtained, and the accuracy of image segmentation is further improved. The target video features and the first text features of the target video segments are fused layer by layer in a hierarchical mode, so that more comprehensive video features can be obtained, and the accuracy of image segmentation is further improved.
In an optional implementation manner, the segmentation of the image to be processed according to the fused image feature and the fused video feature to obtain an image segmentation result matched with the description text includes: determining fusion image features and fusion video features corresponding to the same level in the multi-level fusion image features and the multi-level fusion video features to obtain a plurality of fusion feature groups; the multi-level fusion image features comprise image features of various levels obtained by processing the image to be processed through a plurality of network layers of a first neural network and fusion image features of a plurality of levels obtained by fusing the first text features; the multi-level fusion video features comprise a plurality of levels of fusion image features obtained by fusing video features of each level obtained by processing the target video segment through a plurality of network layers of a second neural network and the first text features; fusing the fusion features of each fusion feature group with the second text features to obtain a target fusion result of each level; the second text feature is used for characterizing all the description character segments in the description text; and dividing the image to be processed according to the target fusion result of each level in the multiple levels to obtain the image division result.
As can be seen from the above description, by fusing the fused image feature, the fused video feature, and the second text feature in a hierarchical manner, a target fusion result including more comprehensive features can be obtained, so as to obtain an image segmentation result including a complete segmentation mask.
In an optional implementation manner, the segmenting the image to be processed according to the target fusion result of each level in the multiple levels to obtain the image segmentation result includes: performing up-sampling processing on the target fusion result of each level to obtain a target sampling result; and dividing the image to be processed through the target sampling result to obtain the image dividing result.
As can be seen from the above description, by performing up-sampling processing on the target fusion result, a target sampling result having the same size as the image to be processed can be obtained, and meanwhile, the target sampling result includes features of each level, so that the features included in the target sampling result are more comprehensive, and when determining the image segmentation result according to the target sampling result, a complete segmentation result describing the target to be segmented can be obtained.
In an optional implementation manner, the segmenting the image to be processed according to the fused image feature and the fused video feature to obtain an image segmentation result matched with the description text includes: respectively determining text features matched with the fused image features and the fused video features according to the first text features to obtain third text features matched with the fused image features and fourth text features matched with the fused video features; performing bit multiplication operation on the fusion image features and the third text features to obtain a first operation result; performing bit multiplication operation on the fusion video feature and the fourth text feature to obtain a second operation result; and summing the first operation result and the second operation result, and determining the image segmentation result according to the summation operation result.
In an alternative embodiment, the determining text features matching the fused image features and the fused video features according to the first text features respectively, to obtain a third text feature matching the fused image features and a fourth text feature matching the fused video features includes: calculating the average value of the feature information of each description character segment contained in the first text feature to obtain a target feature average value; respectively determining the fusion image characteristics and the full-connection layers corresponding to the fusion video characteristics to obtain a first full-connection layer and a second full-connection layer; and processing the target feature mean value through the first full-connection layer and the second full-connection layer in sequence to obtain the third text feature matched with the fusion image feature and the fourth text feature matched with the fusion video feature.
As can be seen from the foregoing description, in the embodiment of the present disclosure, since the target video feature and the target image feature are feature data with different modalities, when the target video feature (or the target image feature) and the first text feature are fused, the first text feature needs to be converted into a text feature with a different structure, and by this processing manner, the accuracy of the image segmentation result can be improved, so as to obtain the image segmentation result including the complete mask of the object to be segmented.
In a second aspect, an embodiment of the present disclosure provides an image segmentation apparatus including: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a target video segment containing an image to be processed and a description text corresponding to the image to be processed; the extraction unit is used for respectively extracting target image features corresponding to the image to be processed, target video features corresponding to the target video segments and first text features corresponding to the description text; the fusion unit is used for respectively fusing the target image features and the target video features with the first text features to obtain fusion image features and fusion video features; and the determining unit is used for dividing the image to be processed according to the fusion image characteristics and the fusion video characteristics to obtain an image division result matched with the description text.
In a third aspect, embodiments of the present disclosure further provide a computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions, when executed by the processor, performing the steps of the first aspect, or any of the possible implementations of the first aspect.
In a fourth aspect, the presently disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect, or any of the possible implementations of the first aspect.
The foregoing objects, features and advantages of the disclosure will be more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below; these drawings are incorporated in and constitute a part of the specification, illustrate embodiments consistent with the present disclosure, and together with the description serve to explain the technical solutions of the present disclosure. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; a person of ordinary skill in the art may derive other related drawings from these drawings without inventive effort.
FIG. 1 illustrates a flow chart of an image segmentation method provided by an embodiment of the present disclosure;
Fig. 2 is a schematic diagram illustrating an image to be processed and an image segmentation result thereof according to an embodiment of the disclosure;
FIG. 3 illustrates a flow frame diagram of an image segmentation method provided by an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of an image segmentation apparatus provided by an embodiment of the present disclosure;
fig. 5 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" is used herein to describe only one relationship, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
According to research, a convolutional neural network for processing video clips in the prior art fuses information of multi-frame images. Because the content between different frame images has slight difference in space, the technical scheme can lead the visual characteristics of target frames in video fragments to be confused, thereby disturbing a segmentation network and generating inaccurate image segmentation results.
Based on the above-mentioned study, an embodiment of the present disclosure provides an image segmentation method, in which a target video segment including an image to be processed is first obtained, and a description text corresponding to the image to be processed is obtained, and then, a target image feature corresponding to the image to be processed, a target video feature corresponding to the target video segment, and a first text feature corresponding to the description text are respectively extracted; and finally, segmenting the image to be processed according to the fused image features and the fused video features to obtain an image segmentation result matched with the descriptive text.
As can be seen from the above description, in the embodiments of the present disclosure, by determining the image segmentation result of the image to be processed by combining the target image feature of the image to be processed and the target video feature of the target video segment including the image to be processed, it is possible to achieve image segmentation of the image to be processed by combining the spatial information and the timing information of the video. By the processing mode, the image features of the image to be processed are not confused by the features of other video frames in the target video segment, so that the accuracy of an image segmentation result is improved, and the problem of poor accuracy of the image segmentation result caused by the fact that the features of the target frame in the video segment are confused by the features of other video frames is solved.
For the sake of understanding the present embodiment, first, an image segmentation method disclosed in the embodiments of the present disclosure will be described in detail, and an execution subject of the image segmentation method provided in the embodiments of the present disclosure is generally a computer device with a certain computing capability, where an image capturing device capable of capturing video is preset in the computer device.
Referring to fig. 1, a flowchart of an image segmentation method according to an embodiment of the disclosure may be applied to the above-described computer device, where the method includes the following steps:
S101: and acquiring a target video segment containing an image to be processed and a description text corresponding to the image to be processed.
In the embodiment of the disclosure, an image capturing device may be provided in the computer device, and a video clip is collected by the image capturing device. The target video segment may be a part of the video clip collected by the image capturing device.
For example, if the image to be processed is the n-th frame image in the video clip acquired by the image capturing device, the target video segment may include the n-th frame image together with a number of neighbouring frame images in the video clip (for example, the N frame images preceding it and the N-1 frame images following it), where N may take a value of 4 to 10; this is not specifically limited in the present disclosure.
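As a minimal sketch of how such a frame window could be assembled, the following snippet (a hypothetical helper; the window size and the clamping to the video boundaries are assumptions) slices the neighbouring frames out of a decoded frame sequence:

```python
from typing import List, Sequence

def build_target_clip(frames: Sequence, n: int, window: int = 8) -> List:
    """Collect the n-th frame together with the `window` frames before it and the
    `window - 1` frames after it, clamped to the video boundaries. The window size
    of 8 is only an example value in the 4-10 range mentioned above."""
    start = max(0, n - window)           # preceding frames
    end = min(len(frames), n + window)   # the n-th frame plus the following window-1 frames
    return list(frames[start:end])
```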
The description text may be a text representation of voice information that is input by a user and matched with the image to be processed, or may be subtitle information in the image to be processed; the specific form of the description text is not limited in the present disclosure.
S103: and respectively extracting target image characteristics corresponding to the image to be processed, target video characteristics corresponding to the target video segment and first text characteristics corresponding to the descriptive text.
S105: and respectively fusing the target image features and the target video features with the first text features to obtain fused image features and fused video features.
Specifically, the target image feature may be fused with the first text feature to obtain a fused image feature, and the target video feature may be fused with the first text feature to obtain a fused video feature.
S107: and dividing the image to be processed according to the fusion image features and the fusion video features to obtain an image division result matched with the description text.
For example, as shown in fig. 2, an image to be processed and a description text thereof, and a target video clip containing the image to be processed are acquired, wherein the description text of the image to be processed may be "a white brown cat jumps backward". Then, an image segmentation result of the image to be processed may be determined according to the image to be processed, the description text and the target video clip, for example, the image segmentation result may be a segmentation result as shown in fig. 2, and the image segmentation result may be a segmentation result including a segmentation mask of a target to be segmented, where the target to be segmented is a segmentation target indicated in the description text, for example: a white brown cat jumping backward.
As can be seen from the foregoing description, in the embodiments of the present disclosure, by determining the image segmentation result of the image to be processed by combining the target image feature of the image to be processed and the target video feature of the target video segment including the image to be processed, the image segmentation of the image to be processed by combining the spatial information and the time sequence information of the video can be implemented.
As can be seen from the above description, in the embodiment of the present disclosure, a target video clip including an image to be processed and a description text corresponding to the image to be processed are first acquired. Then, the target image features corresponding to the image to be processed, the target video features corresponding to the target video segments and the first text features corresponding to the descriptive text can be extracted respectively.
In the embodiment of the disclosure, the image to be processed and the target video segment can be processed through convolutional neural networks to obtain the target image features and the target video features, respectively. Specifically, the features of the image to be processed can be extracted through a 2D convolutional neural network (for example, Inception-V3) to obtain the target image features; the features of the target video segment can be extracted through a 3D convolutional neural network (for example, I3D) to obtain the target video features; and the description text can be processed through a gated recurrent unit (Gated Recurrent Unit, GRU), which is a kind of gated recurrent neural network, to obtain the first text features. Besides the gated recurrent unit, other types of recurrent neural networks can also be used to process the description text to obtain the first text features.
It is to be appreciated that the first text feature may be each description character segment in the description text or the feature information of each description character segment, where a description character segment may be understood as a word or a phrase in the description text, which is not specifically limited in this disclosure.
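A minimal PyTorch sketch of the three feature extractors is given below; the backbone networks are replaced by toy stand-ins and all dimensions are illustrative assumptions, so it only shows the structure of the extraction step:

```python
import torch.nn as nn

class FeatureExtractors(nn.Module):
    """Stand-in encoders: a 2D CNN for the image to be processed, a 3D CNN for the
    target video segment, and a GRU for the description text. A real system would
    use Inception-V3 / I3D style backbones instead of these toy convolutions."""

    def __init__(self, word_dim=300, text_dim=256, img_dim=256, vid_dim=256):
        super().__init__()
        self.image_cnn = nn.Sequential(nn.Conv2d(3, img_dim, 3, stride=2, padding=1), nn.ReLU())
        self.video_cnn = nn.Sequential(nn.Conv3d(3, vid_dim, 3, stride=(1, 2, 2), padding=1), nn.ReLU())
        self.text_gru = nn.GRU(word_dim, text_dim, batch_first=True)

    def forward(self, image, clip, word_embeddings):
        # image: (B, 3, H, W); clip: (B, 3, T, H, W); word_embeddings: (B, S, word_dim)
        v_s = self.image_cnn(image)            # target image features
        v_t = self.video_cnn(clip)             # target video features
        l, _ = self.text_gru(word_embeddings)  # first text features, one vector per description character segment
        return v_s, v_t, l
```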
After the target image feature, the target video feature and the first text feature are extracted in the above-described manner, the target image feature and the target video feature can be respectively fused with the first text feature to obtain a fused image feature and a fused video feature.
In the embodiment of the present disclosure, in the case that the description text includes a plurality of description character segments, step S105 fuses the target image feature with the first text feature to obtain a fused image feature, and specifically includes the following steps:
and S11, determining the matching degree between each description character segment and the image to be processed according to the target image characteristics and the first text characteristics, and obtaining a plurality of target matching degrees.
In this step, an attention value between each description character segment and the image to be processed may be determined, and then a matching degree between each description character segment and the image to be processed is determined by the attention value, thereby obtaining a plurality of target matching degrees.
It should be appreciated that, before determining the first text feature, word segmentation may be performed on the description text to obtain a plurality of word-segmented phrases; the phrases are then screened to filter out useless phrases, such as phrases containing exclamations and personal pronouns, to obtain a plurality of description character segments, where each description character segment can be a single character or a phrase consisting of a plurality of characters.
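A minimal sketch of this preprocessing, assuming a whitespace tokenizer and a placeholder stop-word list:

```python
def extract_description_segments(description: str,
                                 stop_words=frozenset({"a", "an", "the", "oh", "it", "he", "she"})) -> list:
    """Split the description text into description character segments (here simply
    whitespace tokens) and filter out useless phrases such as exclamations and
    personal pronouns. Both the tokenizer and the stop-word list are placeholders."""
    return [token for token in description.lower().split() if token not in stop_words]

# e.g. extract_description_segments("A white brown cat jumps backward")
# -> ['white', 'brown', 'cat', 'jumps', 'backward']
```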
In the embodiment of the disclosure, the attention value between each description character segment and the image to be processed can be determined through the following process, specifically including:
firstly, determining cross-modal attention information between a target image feature and a first text feature; the cross-modal attention information is used to characterize the degree of matching between each descriptive character segment and each image location in the image to be processed. Wherein each image position may be the position of each pixel in the image to be processed.
Specifically, the cross-modal attention information can be computed by the formula A = V_S ⊗ L, where ⊗ represents matrix multiplication, A is the above-described cross-modal attention information, and each element in A is used to characterize the attention value (i.e., degree of matching) between one description character segment and one image position; V_S represents the target image feature, L represents the first text feature, and the subscript "S" in V_S indicates the image to be processed.
After the cross-modal attention information A is determined, the matching degree between each description character segment and the image to be processed can be calculated according to the cross-modal attention information, so that a plurality of target matching degrees are obtained.
Specifically, all attention values of each description character segment in a may be subjected to a summation operation, and a softmax normalization process may be performed on the summation operation result, so as to obtain an attention value w between each description character segment and an image to be processed (i.e., a matching degree between each description character segment and the image to be processed), so as to obtain a plurality of attention values w (i.e., a plurality of target matching degrees).
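The computation of the cross-modal attention information and the target matching degrees can be sketched as follows, assuming the target image feature has been flattened to shape (HW, C) and the first text feature has shape (S, C) with a shared channel dimension:

```python
import torch

def target_matching_degrees(v_s: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    """v_s: (HW, C) flattened target image features; l: (S, C) first text features,
    one row per description character segment.

    A = V_S ⊗ L gives an attention value between every image position and every
    description character segment; summing over the image positions and applying a
    softmax normalization yields the matching degree w between each description
    character segment and the image to be processed."""
    a = v_s @ l.t()                         # cross-modal attention information A, shape (HW, S)
    w = torch.softmax(a.sum(dim=0), dim=0)  # target matching degrees, shape (S,)
    return w
```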
Step S12, determining language feature information describing the text according to a plurality of target matching degrees and the first text features; the language characteristic information is used for describing appearance attribute characteristics of the image to be processed.
In the embodiment of the disclosure, after determining the plurality of target matching degrees, the plurality of target matching degrees and the first text feature may be weighted and summed to obtain language feature information describing the text. Language characteristic information may be understood as characteristic information in descriptive text describing appearance attributes of a corresponding image (e.g., an image to be processed).
As can be seen from the above description, the first text feature includes: the feature information of each descriptive character segment, based on which the weighted summation of the plurality of target matches and the first text feature may be described as the following process:
Specifically, the language feature information can be obtained through the formula l_S = Σ w·L, that is, by performing a weighted summation of the multiple target matching degrees and the feature information of each description character segment.
In the above formula, l_S represents the features in the description text that are associated with the appearance attributes of the image to be processed, w represents the attention value between each description character segment and the image to be processed, and L represents the first text feature.
Step S13, fusing the language feature information and the target image feature to obtain a fused image feature, wherein the fusion process is described as follows:
after the language feature information is obtained according to the method, the target image feature can be filtered according to the language feature information, and the image feature matched with the language feature information in the target image feature can be obtained.
Specifically, the target image feature and the language feature information may be subjected to bit-wise multiplication, so that the target image feature is subjected to filtering processing according to a processing manner of the bit-wise multiplication. The filtering processing is performed on the target image features to filter out features, which are not matched with the language feature information, in the target image features, so as to obtain image features matched with the language feature information, for example, the image features matched with the appearance attributes corresponding to the language feature information can be filtered out from the target image features. By extracting the appearance attribute characteristics of the image to be processed through the language characteristic information, the object to be segmented can be accurately retrieved from the image to be processed, and an image segmentation result which does not contain useless information is obtained, so that the segmentation accuracy of image segmentation is improved, wherein the useless information is information irrelevant to the object to be segmented.
After obtaining the image features of the target image features that match the language feature information, the determined matching image features and the target image features may be combined, for example, a summation operation may be performed. After merging, a fused image feature may be obtained, wherein the target image feature may be understood as a residual feature in the fused image feature.
Because the determined matched image features are part of the target image features, in order to improve the robustness and stability of the technical scheme of the disclosure, the determined matched image features and the target image features need to be combined, so that the processing performance of the technical scheme of the disclosure is improved by setting residual features.
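Steps S12 and S13 together can be sketched as follows, under the same shape assumptions as the attention sketch above (the weighted sum yields the language feature information, the bit-wise multiplication performs the filtering, and the addition keeps the target image feature as a residual):

```python
import torch

def fuse_image_with_text(v_s: torch.Tensor, l: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """v_s: (HW, C) target image features; l: (S, C) first text features;
    w: (S,) target matching degrees from the attention step.

    l_S = Σ w·L extracts the language feature information tied to the appearance
    attributes; multiplying it bit-wise with V_S filters out the image features
    that do not match it, and adding V_S back keeps the target image feature as a
    residual feature."""
    l_s = (w.unsqueeze(1) * l).sum(dim=0)   # language feature information, shape (C,)
    matched = v_s * l_s.unsqueeze(0)        # filtering by bit-wise multiplication
    return matched + v_s                    # fused image feature with residual connection
```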
From the above description, language feature information may be understood as feature information describing appearance attributes of a corresponding image (for example, an image to be processed) in a text. Therefore, the method for obtaining language characteristic information by carrying out weighted summation on the matching degree of the plurality of targets and the characteristic information of each description character segment can quickly and effectively capture the effective information associated with the appearance attribute in the description text, so that the part associated with the appearance attribute of the image to be processed in the description text is adaptively extracted. When the language characteristic information and the target image characteristic are fused to obtain the fused image characteristic, the target to be segmented can be more accurately searched in the image to be processed according to the fused image characteristic, so that the accuracy of image segmentation is improved.
In the embodiment of the present disclosure, in the case that the description text includes a plurality of description character segments, step S105 fuses the target video feature with the first text feature, and a specific process for obtaining the fused video feature is the same as the process described in the above steps S11 to S13, and is specifically described as follows:
and S21, determining the matching degree between each description character segment and the target video segment according to the target video features and the first text features, and obtaining a plurality of target matching degrees.
For the above step S21, first, cross-modal attention information between the target video feature and the first text feature is determined; the cross-modal attention information is used for representing the matching degree between each description character segment and each image position in each video frame in the target video segment; and then, calculating the matching degree between each description character segment and the target video segment according to the cross-modal attention information to obtain a plurality of target matching degrees.
Step S22, determining language feature information of the descriptive text according to the target matching degrees and the first text features; the language characteristic information is used for describing action characteristics of the target video clip.
And for step S22, carrying out weighted summation on the target matching degrees and the characteristic information of each description character segment to obtain the language characteristic information.
And S23, fusing the language feature information and the target video feature to obtain the fused video feature.
Aiming at step S23, filtering the target video features according to the language feature information to obtain video features matched with the language feature information; and combining the determined matched video features with the target video features, for example, carrying out summation operation to obtain the fusion video features.
According to the description, the language feature information can be understood as the feature associated with the action feature of the target video segment in the description text, so that the language feature information associated with the action feature of the video in the description text can be rapidly and effectively captured by means of carrying out weighted summation on a plurality of target matching degrees and feature information of each description character segment to obtain the language feature information. After the language feature information and the target video feature are fused to obtain a fused video feature, the video feature which is matched with the video action feature described by the language feature information in the target video feature can be filtered. When the image to be processed is segmented according to the fusion video features and the fusion image features, the action features in the target video segments can be extracted according to the fusion video features, the appearance attribute features in the image to be processed can be extracted according to the fusion image features, and when the action features and the appearance attribute features are fused to obtain an image segmentation result, the target to be segmented can be accurately positioned, and then the image segmentation result containing the complete mask of the target to be segmented is obtained, so that the accuracy of image segmentation is improved.
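The video branch can reuse the same helpers sketched above on the target video features, with each spatio-temporal position treated as an image position; flattening to (T·H·W, C) is an assumption:

```python
def fuse_video_with_text(v_t, l):
    """v_t: (C, T, H, W) target video features; l: (S, C) first text features.
    Flatten the spatio-temporal positions, compute the matching degrees, and fuse
    exactly as in the image branch; here the language feature information captures
    the action described in the text rather than the appearance attributes."""
    c, t, h, w_px = v_t.shape
    v_flat = v_t.reshape(c, t * h * w_px).t()     # (T*H*W, C)
    w = target_matching_degrees(v_flat, l)        # matching degrees with the target video segment
    f = fuse_image_with_text(v_flat, l, w)        # same fusion as the spatial branch
    return f.t().reshape(c, t, h, w_px)           # fused video feature
```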
In an alternative embodiment, the target image feature includes a plurality of levels of image features obtained by processing the image to be processed by a plurality of network layers of the first neural network, in which case, the target image feature is fused with the first text feature to obtain a fused image feature, including the following procedures:
and fusing the image features of each level in the image features of the multiple levels with the first text features to obtain the fused image features.
In the embodiment of the disclosure, the first neural network may be a 2D convolutional neural network (Inception-V3); the features of the image to be processed are then extracted by the 2D convolutional neural network, so as to obtain a plurality of image features of sequentially decreasing scales, where each scale corresponds to one level.
At this time, the image features of each scale and the first text feature may be fused to obtain a fused image feature corresponding to the image feature of each scale, and the fusion process may be described as the following process:
from the image feature and the first text feature of each scale, an attention value (i.e., a degree of matching) between each descriptive character segment and the image feature of each scale is determined, resulting in a plurality of attention values A1. Then, language feature information describing the text is determined according to the plurality of attention values A1 and the first text feature, and the language feature information and the image feature of each scale are fused to obtain fused image features corresponding to the image features of each scale.
For example, the matching degree between each description character segment and each image position in the image feature of the i-th scale can be calculated by the formula A_i = V_S^i ⊗ L, where V_S^i represents the image feature of the i-th scale in the target image features. For the matching degrees between each description character segment and the image positions of each scale, a summation and normalization processing can be performed to obtain the matching degree w_i between each description character segment and the image feature of the i-th scale. Then, for the image feature of the i-th scale, the matching degrees w_i and the first text feature can be weighted and summed according to the formula l_S^i = Σ w_i·L, which yields the language feature information l_S^i describing the appearance attribute features of the image feature of the i-th scale. After the language feature information l_S^i is obtained, l_S^i can be multiplied bit-wise with the image feature of the i-th scale, and the result of this calculation is summed with the image feature of the i-th scale, so as to obtain the fused image feature F_S^i of the i-th scale.
In the disclosed embodiments, the larger the scale the more blurred the image, i.e., the lower the resolution of the image; the smaller the scale the sharper the image, i.e. the higher the resolution of the image. By means of hierarchical processing of the image to be processed, image features with different resolutions can be obtained, for example, features of targets contained in the image to be processed can be obtained, features of each pixel point in the image to be processed can be obtained, and the target image features and the first text features are fused layer by layer in a hierarchical manner, so that more comprehensive image features can be obtained, and accuracy of image segmentation is further improved.
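The level-by-level fusion can be sketched as a loop over the multi-scale image features, reusing the helpers from the earlier sketches; flattening each level to (H_i·W_i, C) and a channel dimension shared with the text features are assumptions:

```python
def fuse_all_image_levels(image_features_per_level, l):
    """image_features_per_level: list of (C, H_i, W_i) tensors produced by the 2D
    backbone, one per level; l: (S, C) first text features. Each level is fused
    with the text exactly as in the single-scale case."""
    fused = []
    for v in image_features_per_level:
        c, h, w_px = v.shape
        v_flat = v.reshape(c, h * w_px).t()         # flatten image positions: (H_i*W_i, C)
        w = target_matching_degrees(v_flat, l)      # per-level target matching degrees
        f = fuse_image_with_text(v_flat, l, w)      # per-level fused image feature
        fused.append(f.t().reshape(c, h, w_px))
    return fused
```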
In an optional implementation manner, the target video feature comprises a plurality of levels of video features obtained by processing the target video segment by a plurality of network layers of the second neural network; in this case, fusing the target video feature with the first text feature to obtain a fused video feature, including the following steps:
and fusing the video features of each level in the video features of the multiple levels with the first text features to obtain the fused video features.
In an alternative embodiment, the second neural network may be a 3D convolutional neural network (for example, I3D); in this case, the features of the target video segment may be extracted by the 3D convolutional neural network, so as to obtain a plurality of video features of sequentially decreasing scales, where each scale corresponds to one level.
At this time, the video feature of each scale and the first text feature may be fused to obtain a fused video feature corresponding to the video feature of each scale, where in this case, the above-mentioned fusion process may be described as the following process:
a degree of matching (i.e., an attention value) between each descriptive character segment and each scale video feature is determined based on the target video feature and the first text feature, resulting in a plurality of target degrees of matching. And then, according to the multiple target matching degrees and the first text features, language feature information of the descriptive text is determined, and the language feature information and the video features of each scale are fused to obtain fused video features corresponding to the video features of each scale.
In the disclosed embodiments, the larger the scale the more blurred the image, i.e., the lower the resolution of the image; the smaller the scale the sharper the image, i.e. the higher the resolution of the image. The feature extraction is carried out on the target video segment in a hierarchical mode, and the target video feature and the first text feature are fused layer by layer, so that more comprehensive video features can be obtained, and the accuracy of image segmentation is further improved.
In the embodiment of the disclosure, after the target image feature and the target video feature are respectively fused with the first text feature according to the above-described process to obtain the fused image feature and the fused video feature, the image to be processed may be segmented according to the fused image feature and the fused video feature, so as to obtain an image segmentation result matched with the description text.
In an alternative embodiment, where the fused image features comprise multi-level fused image features and the fused video features comprise multi-level fused video features, the above steps may be described as the following procedure:
step S1071, determining the fusion image features and the fusion video features corresponding to the same level in the multi-level fusion image features and the multi-level fusion video features to obtain a plurality of fusion feature groups.
It can be understood that the above-mentioned multi-level fusion image features include a plurality of levels of fusion image features obtained by fusing the image features of each level obtained by processing the image to be processed through a plurality of network layers of the first neural network and the first text features; the multi-level fusion video features comprise a plurality of levels of fusion image features obtained by fusing video features of each level obtained by processing the target video segment through a plurality of network layers of the second neural network and the first text features.
In the embodiment of the disclosure, the number of the layers corresponding to the multi-layer fusion image features and the multi-layer fusion video features is the same, and the resolution of the features corresponding to the fusion image features and the fusion video features under the same layer is the same.
Based on the above, the fusion image features and the fusion video features corresponding to the same level can be determined in the multi-level fusion image features and the multi-level fusion video features, so as to obtain a plurality of fusion feature groups.
For example, the multiple levels are L1 to L5, at this time, the fused image features and the fused video features of the level L1 may be determined as a fused feature group, and the processing procedure for the levels L2 to L5 is the same as the processing procedure for the level L1, which is not described in detail herein.
Step S1072, fusing the fusion features of each fusion feature group with the second text features to obtain a target fusion result of each level; the second text feature is used for characterizing all descriptive character segments in the descriptive text.
In the embodiment of the disclosure, after determining the multiple fusion feature groups, feature information of each description character segment included in the first text feature may be averaged to obtain a second text feature for characterizing all description character segments in the description text.
And then fusing the fusion features in each fusion feature group with the second text features to obtain a target fusion result of each level.
And step S1073, dividing the image to be processed according to the target fusion result of each of the multiple layers to obtain the image division result.
After the target fusion result of each level is obtained, the target fusion results can be up-sampled level by level, starting from the lowest-resolution level and proceeding to the highest-resolution level, so as to obtain a target sampling result; the image to be processed is then segmented according to the target sampling result to obtain the image segmentation result.
After the target sampling result is obtained, the target sampling result may be subjected to convolution processing through a preset convolution neural network to obtain an image segmentation result of the image to be processed, for example, an image segmentation result as shown in fig. 2 may be obtained.
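A minimal sketch of this decoding path, assuming the per-level target fusion results are ordered from the lowest to the highest resolution and merged by addition after up-sampling (the addition-based merge is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_segmentation(target_fusion_results, out_conv: nn.Conv2d) -> torch.Tensor:
    """target_fusion_results: per-level target fusion results ordered from the
    lowest-resolution level to the highest, each of shape (B, C, H_i, W_i).
    Starting from the deepest level, each result is up-sampled to the size of the
    next level (mirroring the Up2x layers) and merged with it; a final convolution
    on the target sampling result produces the segmentation mask logits."""
    x = target_fusion_results[0]
    for feat in target_fusion_results[1:]:
        x = F.interpolate(x, size=feat.shape[-2:], mode="bilinear", align_corners=False)
        x = x + feat
    return out_conv(x)                      # (B, 1, H, W) segmentation logits

# Example usage (shapes are illustrative):
# out_conv = nn.Conv2d(256, 1, kernel_size=1)
# mask_logits = decode_segmentation([r5, r4, r3, r2, r1], out_conv)
```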
According to the description, the target fusion result with more comprehensive characteristics can be obtained by fusing the fusion image characteristics, the fusion video characteristics and the second text characteristics in a hierarchical manner, so that the image segmentation result with the complete segmentation mask of the target to be segmented is obtained.
In the embodiment of the present disclosure, the process of fusing the fused image feature and the fused video feature corresponding to each hierarchy may be described as the following process:
(1) And respectively determining text features matched with the fused image features and the fused video features according to the first text features to obtain a third text feature matched with the fused image features and a fourth text feature matched with the fused video features.
Specifically, the feature information of each description character segment included in the first text feature may be averaged to obtain the target feature average value. And then, respectively determining the fusion image characteristics and the full-connection layers corresponding to the fusion video characteristics to obtain a first full-connection layer and a second full-connection layer.
The first full-connection layer and the second full-connection layer are full-connection layers with different parameters. Because the target image features are spatial features, the target video features are temporal features, and the spatial features and the temporal features are features of two modes, corresponding full-connection layers are required to be respectively arranged for the image to be processed and the target video segment, namely: a first fully-connected layer and a second fully-connected layer.
And then, sequentially processing the target feature mean value through the first full-connection layer and the second full-connection layer to obtain a third text feature matched with the fused image feature and a fourth text feature matched with the fused video feature.
Specifically, the above process can be described as the following formula:
g_S = Linear_S(l); g_T = Linear_T(l), where g_S represents the third text feature, g_T represents the fourth text feature, l represents the target feature mean, Linear_S(·) represents the first fully connected layer, and Linear_T(·) represents the second fully connected layer.
(2) Performing a bit-wise (element-wise) multiplication operation on the fused image feature and the third text feature to obtain a first operation result; and performing a bit-wise multiplication operation on the fused video feature and the fourth text feature to obtain a second operation result.
It should be noted that, in the embodiment of the present disclosure, after the third text feature and the fourth text feature are calculated, normalization processing may further be performed on the third text feature and the fourth text feature.
In the embodiment of the disclosure, after the normalized third text feature and the normalized fourth text feature are obtained, a bit-wise multiplication operation may be performed on the fused image feature of the i-th level and the normalized third text feature to obtain the first operation result, and a bit-wise multiplication operation may be performed on the fused video feature of the i-th level and the normalized fourth text feature to obtain the second operation result.
(3) Performing a summation operation on the first operation result and the second operation result, and determining the image segmentation result according to the result of the summation operation.
In the embodiment of the present disclosure, the first operation result and the second operation result may be summed to obtain a summation result, and the summation result may then be subjected to a convolution calculation to obtain the image segmentation result.
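The gating-and-summation step above can be sketched as follows. The sigmoid used to normalize the text features is an assumption (the exact normalization formula is not reproduced here), and all tensor shapes are placeholders.

```python
import torch
import torch.nn as nn

B, C, H, W = 1, 512, 20, 20
fused_img_i = torch.randn(B, C, H, W)     # fused image feature of the i-th level (placeholder)
fused_vid_i = torch.randn(B, C, H, W)     # fused video feature of the i-th level (placeholder)
g_s = torch.randn(B, C)                   # third text feature (see previous sketch)
g_t = torch.randn(B, C)                   # fourth text feature

# Normalize the text features; sigmoid is an assumed stand-in for the normalization formula.
g_s_hat = torch.sigmoid(g_s)[:, :, None, None]
g_t_hat = torch.sigmoid(g_t)[:, :, None, None]

first_result = fused_img_i * g_s_hat      # bit-wise (element-wise) multiplication, spatial branch
second_result = fused_vid_i * g_t_hat     # bit-wise multiplication, temporal branch
summed = first_result + second_result     # summation of the two operation results

head = nn.Conv2d(C, 1, kernel_size=1)     # convolution producing the segmentation logits
mask_logits = head(summed)
```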
As can be seen from the foregoing description, in the embodiment of the present disclosure, since the target video features and the target image features are feature data of different modalities, when the target video features (or the target image features) are fused with the first text feature, the first text feature needs to be converted into text features of correspondingly different structures. This processing manner can improve the accuracy of the image segmentation result, so that an image segmentation result containing a complete mask of the object to be segmented is obtained.
The above process is described below in conjunction with fig. 3, which shows a flow chart of an image segmentation method. As shown in fig. 3, the framework includes a language encoder, a spatial visual encoder, a temporal visual encoder, and a multi-modal decoder. The language encoder may be implemented as a gated recurrent unit (GRU); the spatial visual encoder includes a 2D convolutional neural network (Inception-V3) and a language information integration module; the temporal visual encoder includes a 3D convolutional neural network and a language information integration module; and the multi-modal decoder includes an up-sampling layer Up2x and a data selector. The language information integration module in the spatial visual encoder and the language information integration module in the temporal visual encoder may have the same structure or different structures.
The language encoder is configured to acquire descriptive text corresponding to the image to be processed and extract first text features corresponding to the descriptive text.
In the embodiment of the present disclosure, the vector sequence of the description character segments of the description text may be processed by the gated recurrent unit (GRU) in the language encoder to obtain a coding sequence L of all description character segments, where the coding sequence includes the feature information of the description character segments. The gated recurrent unit adopted here may be replaced by another recurrent neural network, which is not specifically limited in this disclosure.
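A minimal sketch of such a language encoder, assuming word-level tokenisation and placeholder embedding and hidden sizes:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 300, 256   # placeholder sizes
embedding = nn.Embedding(vocab_size, embed_dim)
gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 12))  # ids of 12 description character segments (words)
L, _ = gru(embedding(tokens))                   # coding sequence L: (1, 12, hidden_dim), one vector per segment
l = L.mean(dim=1)                               # sentence-level feature characterizing all segments
```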
The spatial visual encoder is configured to acquire the image to be processed, extract the target image features corresponding to the image to be processed, and fuse the target image features with the first text features to obtain the fused image features.
The data input into the spatial visual encoder are a 3-channel RGB picture (i.e., the above-mentioned image to be processed) and the coding sequence L of all description character segments. The spatial visual encoder extracts hierarchical target image features using a 2D convolutional neural network (Inception-V3), and a language information integration module is inserted at each level of the 2D convolutional neural network to fuse the language features (i.e., the first text features) with the target image features, resulting in the hierarchical fused image features shown in fig. 3. The Inception-V3 network employed here may be replaced with any other 2D convolutional neural network, which is not specifically limited in this disclosure.
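The following sketch illustrates the shape of such a spatial visual encoder. The toy three-stage backbone stands in for Inception-V3, and the mean-pool-and-gate integration module is a simplification of the language information integration module; both are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LanguageIntegration(nn.Module):
    """Simplified stand-in for the language information integration module:
    mean-pools the word features and uses them to gate the visual features."""
    def __init__(self, vis_dim, text_dim):
        super().__init__()
        self.gate = nn.Linear(text_dim, vis_dim)

    def forward(self, vis, words):                    # vis: (B,C,H,W), words: (B,N,D)
        g = torch.sigmoid(self.gate(words.mean(dim=1)))[:, :, None, None]
        return vis * g + vis                          # matched features merged with the original features

class SpatialVisualEncoder(nn.Module):
    """Toy 2D backbone (stand-in for Inception-V3) with one integration per stage."""
    def __init__(self, text_dim, widths=(64, 128, 256)):
        super().__init__()
        ins = (3,) + widths[:-1]
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(i, o, 3, stride=2, padding=1), nn.ReLU())
            for i, o in zip(ins, widths))
        self.integrate = nn.ModuleList(LanguageIntegration(o, text_dim) for o in widths)

    def forward(self, image, words):
        feats, x = [], image
        for stage, integ in zip(self.stages, self.integrate):
            x = stage(x)
            feats.append(integ(x, words))             # fused image feature of this level
        return feats                                   # ordered from high to low resolution

image = torch.randn(1, 3, 224, 224)
words = torch.randn(1, 12, 256)                        # coding sequence L (placeholder)
fused_image_feats = SpatialVisualEncoder(text_dim=256)(image, words)
```

A real implementation would instead tap the hierarchical feature maps of a pretrained Inception-V3 (for example via forward hooks) and use the attention-based language information integration described in this disclosure.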
The temporal visual encoder is configured to acquire the target video segment containing the image to be processed, extract the target video features corresponding to the target video segment, and fuse the target video features with the first text features to obtain the fused video features.
The data input into the temporal visual encoder are a target video segment containing the image to be processed, for example, the 3-channel RGB picture of the image to be processed together with 8 surrounding frames, and the coding sequence L of all description character segments. The temporal visual encoder extracts hierarchical target video features using a 3D convolutional neural network (I3D), and a language information integration module is inserted at each level of the convolutional neural network to fuse the language features (i.e., the first text features) with the target video features, resulting in the hierarchical fused video features shown in fig. 3. The I3D network employed here may be replaced with any other 3D convolutional neural network, which is not specifically limited in this disclosure.
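Analogously, a temporal visual encoder can be sketched with 3D convolutions standing in for I3D. The per-level language integration (as in the spatial sketch) is omitted here for brevity, and the time-axis mean pooling used to align the clip features with the frame to be segmented is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class TemporalVisualEncoder(nn.Module):
    """Toy 3D backbone (stand-in for I3D); time is mean-pooled per level so that the
    resulting maps align spatially with the frame to be segmented."""
    def __init__(self, widths=(64, 128, 256)):
        super().__init__()
        ins = (3,) + widths[:-1]
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv3d(i, o, 3, stride=(1, 2, 2), padding=1), nn.ReLU())
            for i, o in zip(ins, widths))

    def forward(self, clip):                 # clip: (B, 3, T, H, W), e.g. T = 8 surrounding frames
        feats, x = [], clip
        for stage in self.stages:
            x = stage(x)
            feats.append(x.mean(dim=2))      # collapse the time axis -> (B, C, H_i, W_i)
        return feats

clip = torch.randn(1, 3, 8, 112, 112)        # image to be processed plus surrounding frames
video_feats = TemporalVisualEncoder()(clip)  # per-level features; language integration omitted here
```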
The multi-modal decoder is configured to determine, according to the fused image features and the fused video features, an image segmentation result in the image to be processed that matches the description text.
The data input into the multi-modal decoder are the fused image features and fused video features of each level and the feature l used for characterizing all description character segments in the description text. The multi-modal decoder adopts a layer-by-layer up-sampling decoding mode, and the size of the feature map is gradually restored until it is consistent with the size of the input original image.
The data selector in the multi-modal decoder is used to select the features to be calculated from the outputs of the spatial visual encoder and the temporal visual encoder. For example, when the bit-wise multiplication operation is performed on the fused image feature and the third text feature to obtain the first operation result, the fused image feature output by the spatial visual encoder and the first text feature are selected and processed to obtain the first operation result; when the bit-wise multiplication operation is performed on the fused video feature and the fourth text feature to obtain the second operation result, the fused video feature output by the temporal visual encoder and the first text feature are selected and processed to obtain the second operation result.
According to the embodiment of the present disclosure, language information integration can be performed adaptively: the components of the language related to actions and those related to appearance are automatically extracted by an attention mechanism, so that the effective information in the language is captured more effectively to guide result prediction and improve retrieval accuracy.
The embodiment of the present disclosure can also segment the image by combining the spatial information of the image to be processed with the temporal information of the target video segment, thereby obtaining an accurate segmentation result with better performance than modeling either kind of information alone. The embodiment of the present disclosure has a lower size requirement on the input video, greatly reduces the overall amount of computation, has strong extensibility, and supports rich application scenarios, for example, surveillance video processing scenarios and video editing tool scenarios.
Scene one: surveillance video processing scenario.
The user can input, in advance, a description text of the person or vehicle to be tracked; the monitoring device then acquires the captured surveillance video and, according to the description text, retrieves and tracks the person or vehicle in the surveillance video, thereby obtaining a segmentation mask of the person or vehicle to be tracked. Using this algorithm reduces labor cost and speeds up retrieval: once the feature description of the person or vehicle is input, the person or vehicle is automatically located in the video and its trajectory is tracked.
Scene two: video editing tool scenario.
The user edits a particular object in the video, e.g., removes the object from the video, pastes a decoration on the object, etc. At this time, the user may input a description of the object, then obtain a complete segmentation mask of the object using the method provided in the present embodiment, and then perform a corresponding operation.
It will be appreciated by those skilled in the art that, in the methods of the foregoing specific embodiments, the written order of the steps does not imply a strict execution order; the actual execution order should be determined by the functions and possible internal logic of the steps.
Based on the same inventive concept, the embodiments of the present disclosure further provide an image segmentation apparatus corresponding to the image segmentation method, and since the principle of the apparatus in the embodiments of the present disclosure for solving the problem is similar to that of the image segmentation method described in the embodiments of the present disclosure, the implementation of the apparatus may refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 4, a schematic diagram of an image segmentation apparatus according to an embodiment of the present disclosure is shown. The apparatus includes: an acquisition unit 41, an extraction unit 42, a fusion unit 43, and a determining unit 44; wherein:
An obtaining unit 41, configured to obtain a target video segment including an image to be processed, and a description text corresponding to the image to be processed;
an extracting unit 42, configured to extract a target image feature corresponding to the image to be processed, a target video feature corresponding to the target video segment, and a first text feature corresponding to the descriptive text, respectively;
a fusion unit 43, configured to fuse the target image feature and the target video feature with the first text feature, respectively, to obtain a fused image feature and a fused video feature;
and the determining unit 44 is configured to segment the image to be processed according to the fused image feature and the fused video feature, so as to obtain an image segmentation result that matches the description text.
As can be seen from the foregoing description, in the embodiments of the present disclosure, by determining the image segmentation result of the image to be processed by combining the target image feature of the image to be processed and the target video feature of the target video segment including the image to be processed, the image segmentation of the image to be processed by combining the spatial information and the time sequence information of the video can be implemented.
In a possible embodiment, the fusion unit 43 is further configured to: under the condition that the description text contains a plurality of description character segments, determining the matching degree between each description character segment and the image to be processed according to the target image characteristics and the first text characteristics to obtain a plurality of target matching degrees; determining language feature information of the descriptive text according to the target matching degrees and the first text features; the language characteristic information is used for describing appearance attribute characteristics of the image to be processed; and fusing the language characteristic information and the target image characteristic to obtain the fused image characteristic.
In a possible embodiment, the fusion unit 43 is further configured to: determining cross-modal attention information between the target image feature and the first text feature; the cross-modal attention information is used for representing the matching degree between each description character segment and each image position in the image to be processed; and calculating the matching degree between each description character segment and the image to be processed according to the cross-modal attention information to obtain a plurality of target matching degrees.
In a possible embodiment, the fusion unit 43 is further configured to: the first text feature includes: and under the condition that the characteristic information of each description character segment in the plurality of character segments of the description text, carrying out weighted summation on the plurality of target matching degrees and the characteristic information of each description character segment to obtain the language characteristic information.
In a possible embodiment, the fusion unit 43 is further configured to: filtering the target image features according to the language feature information to obtain image features matched with the language feature information; and merging the determined matched image features and the target image features to obtain the fusion image features.
In a possible embodiment, the fusion unit 43 is further configured to: the target image features comprise a plurality of levels of image features obtained by processing the image to be processed through a plurality of network layers of a first neural network; under the condition that the target video feature comprises a plurality of levels of video features obtained by processing the target video segment by a plurality of network layers of a second neural network, fusing the image feature of each level in the plurality of levels of image features with the first text feature to obtain the fused image feature; and fusing the video features of each level in the video features of the multiple levels with the first text features to obtain the fused video features.
In a possible implementation, the determining unit 44 is further configured to: determining fusion image features and fusion video features corresponding to the same level in the multi-level fusion image features and the multi-level fusion video features to obtain a plurality of fusion feature groups; the multi-level fusion image features comprise image features of various levels obtained by processing the image to be processed through a plurality of network layers of a first neural network and fusion image features of a plurality of levels obtained by fusing the first text features; the multi-level fusion video features comprise a plurality of levels of fusion image features obtained by fusing video features of each level obtained by processing the target video segment through a plurality of network layers of a second neural network and the first text features; fusing the fusion features of each fusion feature group with the second text features to obtain a target fusion result of each level; the second text feature is used for characterizing all the description character segments in the description text; and dividing the image to be processed according to the target fusion result of each level in the multiple levels to obtain the image division result.
In a possible implementation, the determining unit 44 is further configured to: performing up-sampling processing on the target fusion result of each level to obtain a target sampling result; and dividing the image to be processed through the target sampling result to obtain the image dividing result.
In a possible implementation, the determining unit 44 is further configured to: respectively determining text features matched with the fused image features and the fused video features according to the first text features to obtain third text features matched with the fused image features and fourth text features matched with the fused video features; performing bit multiplication operation on the fusion image features and the third text features to obtain a first operation result; performing bit multiplication operation on the fusion video feature and the fourth text feature to obtain a second operation result; and summing the first operation result and the second operation result, and determining the image segmentation result according to the summation operation result.
In a possible implementation, the determining unit 44 is further configured to: calculating the average value of the feature information of each description character segment contained in the first text feature to obtain a target feature average value; respectively determining the fusion image characteristics and the full-connection layers corresponding to the fusion video characteristics to obtain a first full-connection layer and a second full-connection layer; and processing the target feature mean value through the first full-connection layer and the second full-connection layer in sequence to obtain the third text feature matched with the fusion image feature and the fourth text feature matched with the fusion video feature.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
Corresponding to the image segmentation method in fig. 1, an embodiment of the present disclosure further provides a computer device 500. Fig. 5 shows a schematic structural diagram of the computer device 500 provided in the embodiment of the present disclosure, which includes:
a processor 51, a memory 52, and a bus 53. The memory 52 is used to store execution instructions and includes an internal memory 521 and an external storage 522; the internal memory 521 is used to temporarily store operation data in the processor 51 and data exchanged with the external storage 522 such as a hard disk, and the processor 51 exchanges data with the external storage 522 through the internal memory 521. When the computer device 500 runs, the processor 51 and the memory 52 communicate with each other through the bus 53, so that the processor 51 executes the following instructions:
acquiring a target video segment containing an image to be processed and a description text corresponding to the image to be processed; respectively extracting target image characteristics corresponding to the image to be processed, target video characteristics corresponding to the target video segment and first text characteristics corresponding to the description text; respectively fusing the target image features and the target video features with the first text features to obtain fused image features and fused video features; and dividing the image to be processed according to the fusion image features and the fusion video features to obtain an image division result matched with the description text.
The disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the image segmentation method described in the method embodiments above. Wherein the storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiments of the present disclosure further provide a computer program product, where the computer program product carries a program code, where instructions included in the program code may be used to perform the steps of the image segmentation method described in the foregoing method embodiments, and specifically reference may be made to the foregoing method embodiments, which are not described herein.
Wherein the above-mentioned computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or a part of the technical solution, or in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, which are used to illustrate the technical solutions of the present disclosure rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the technical field may still, within the technical scope disclosed in the present disclosure, modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features; such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. An image segmentation method, comprising:
acquiring a target video segment containing an image to be processed and a description text corresponding to the image to be processed; the image to be processed is a designated frame image in the target video segment;
Respectively extracting target image characteristics corresponding to the image to be processed, target video characteristics corresponding to the target video segment and first text characteristics corresponding to the description text; the first text feature is used for representing feature information of each phrase in the descriptive text;
respectively fusing the target image features and the target video features with the first text features to obtain fused image features and fused video features; wherein,
the fusing the target image feature with the first text feature includes:
acquiring image features matched with first language feature information in the target image features, and combining the matched image features with the target image features;
the fusing the target video feature with the first text feature includes:
acquiring video features matched with second language feature information in the target video features, and combining the matched video features with the target video features;
the first language characteristic information comprises information describing appearance attribute characteristics of the image to be processed in the description text; the second language characteristic information comprises information describing action characteristics of the target video clip in the description text;
And segmenting the image to be processed according to the fusion image features and the fusion video features to obtain an image segmentation result matched with the description text.
2. The method of claim 1, wherein the descriptive text comprises a plurality of descriptive character segments; the method further comprises the steps of:
determining the first language characteristic information according to the following steps:
according to the target image characteristics and the first text characteristics, determining the matching degree between each description character segment and the image to be processed, and obtaining a plurality of first target matching degrees;
determining the first language characteristic information of the descriptive text according to the plurality of first target matching degrees and the first text characteristics;
determining the second language characteristic information according to the following steps:
according to the target video features and the first text features, determining the matching degree between each description character segment and the target video segment, and obtaining a plurality of second target matching degrees;
and determining the second language characteristic information of the descriptive text according to the plurality of second target matching degrees and the first text characteristics.
3. The method of claim 2, wherein determining a degree of matching between each descriptive character segment and the image to be processed based on the target image feature and the first text feature results in a plurality of first target degrees of matching, comprising:
Determining cross-modal attention information between the target image feature and the first text feature; the cross-modal attention information is used for representing the matching degree between each description character segment and each image position in the image to be processed;
and calculating the matching degree between each description character segment and the image to be processed according to the cross-modal attention information to obtain a plurality of first target matching degrees.
4. A method according to claim 3, wherein the first text feature comprises: feature information of each description character segment in a plurality of character segments of the description text;
the determining the first language feature information of the descriptive text according to the plurality of first target matching degrees and the first text features comprises the following steps:
and carrying out weighted summation on the plurality of first target matching degrees and the characteristic information of each description character segment to obtain the first language characteristic information.
5. The method of claim 4, wherein the obtaining the image feature of the target image feature that matches the first language feature information comprises:
and filtering the target image features according to the first language feature information to obtain image features matched with the first language feature information.
6. The method according to any one of claims 1 to 5, wherein the target image features comprise a plurality of levels of image features obtained by processing the image to be processed by a plurality of network layers of a first neural network; the target video features comprise a plurality of levels of video features obtained by processing the target video segments by a plurality of network layers of a second neural network;
the fusing the target image feature and the target video feature with the first text feature to obtain a fused image feature and a fused video feature, including:
fusing the image features of each level in the image features of the multiple levels with the first text features to obtain the fused image features;
and fusing the video features of each level in the video features of the multiple levels with the first text features to obtain the fused video features.
7. The method according to claim 6, wherein the segmenting the image to be processed according to the fused image feature and the fused video feature to obtain the image segmentation result matched with the description text comprises:
Determining fusion image features and fusion video features corresponding to the same level in the multi-level fusion image features and the multi-level fusion video features to obtain a plurality of fusion feature groups; the multi-level fusion image features comprise image features of various levels obtained by processing the image to be processed through a plurality of network layers of a first neural network and fusion image features of a plurality of levels obtained by fusing the first text features; the multi-level fusion video features comprise a plurality of levels of fusion video features obtained by fusing the video features of each level obtained by processing the target video segment through a plurality of network layers of a second neural network and the first text features;
fusing the fusion features of each fusion feature group with the second text features to obtain a target fusion result of each level; the second text feature is used for characterizing all the description character segments in the description text;
and segmenting the image to be processed according to the target fusion result of each of the multiple levels to obtain the image segmentation result.
8. The method according to claim 7, wherein the segmenting the image to be processed according to the target fusion result of each of the multiple levels to obtain the image segmentation result comprises:
Performing up-sampling processing on the target fusion result of each level to obtain a target sampling result;
and segmenting the image to be processed according to the target sampling result to obtain the image segmentation result.
9. The method according to any one of claims 1 to 5, wherein the segmenting the image to be processed according to the fused image features and the fused video features to obtain an image segmentation result matched with the descriptive text comprises:
respectively determining text features matched with the fused image features and the fused video features according to the first text features to obtain third text features matched with the fused image features and fourth text features matched with the fused video features;
performing bit multiplication operation on the fusion image features and the third text features to obtain a first operation result; performing bit multiplication operation on the fusion video feature and the fourth text feature to obtain a second operation result;
and summing the first operation result and the second operation result, and determining the image segmentation result according to the summation operation result.
10. The method of claim 9, wherein determining text features that match the fused image feature and the fused video feature from the first text feature, respectively, results in a third text feature that matches the fused image feature and a fourth text feature that matches the fused video feature, comprising:
calculating the average value of the feature information of each description character segment contained in the first text feature to obtain a target feature average value;
respectively determining the fusion image characteristics and the full-connection layers corresponding to the fusion video characteristics to obtain a first full-connection layer and a second full-connection layer;
and processing the target feature mean value through the first full-connection layer and the second full-connection layer in sequence to obtain the third text feature matched with the fusion image feature and the fourth text feature matched with the fusion video feature.
11. An image segmentation apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a target video segment containing an image to be processed and a description text corresponding to the image to be processed; the image to be processed is a designated frame image in the target video segment;
The extraction unit is used for respectively extracting target image features corresponding to the image to be processed, target video features corresponding to the target video segments and first text features corresponding to the description text; the first text feature is used for representing feature information of each phrase in the descriptive text;
the fusion unit is used for respectively fusing the target image features and the target video features with the first text features to obtain fused image features and fused video features; wherein,
the fusing the target image feature with the first text feature includes:
acquiring image features matched with first language feature information in the target image features, and combining the matched image features with the target image features;
the fusing the target video feature with the first text feature includes:
acquiring video features matched with second language feature information in the target video features, and combining the matched video features with the target video features;
the first language characteristic information comprises information describing appearance attribute characteristics of the image to be processed in the description text; the second language characteristic information comprises information describing action characteristics of the target video clip in the description text;
And the determining unit is used for segmenting the image to be processed according to the fusion image features and the fusion video features to obtain an image segmentation result matched with the description text.
12. A computer device, comprising: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating over the bus when the electronic device is running, said machine readable instructions when executed by said processor performing the steps of the image segmentation method according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the image segmentation method according to any one of claims 1 to 10.
CN202110294414.XA 2021-03-19 2021-03-19 Image segmentation method, device, computer equipment and storage medium Active CN112818955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110294414.XA CN112818955B (en) 2021-03-19 2021-03-19 Image segmentation method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112818955A CN112818955A (en) 2021-05-18
CN112818955B true CN112818955B (en) 2023-09-15

Family

ID=75863437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110294414.XA Active CN112818955B (en) 2021-03-19 2021-03-19 Image segmentation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112818955B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494297B (en) * 2022-01-28 2022-12-06 杭州电子科技大学 Adaptive video target segmentation method for processing multiple priori knowledge
CN115761222B (en) * 2022-09-27 2023-11-03 阿里巴巴(中国)有限公司 Image segmentation method, remote sensing image segmentation method and device
CN116091984B (en) * 2023-04-12 2023-07-18 中国科学院深圳先进技术研究院 Video object segmentation method, device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20060100646A (en) * 2005-03-17 2006-09-21 주식회사 코난테크놀로지 Method and system for searching the position of an image thing
CN109145712A (en) * 2018-06-28 2019-01-04 南京邮电大学 A kind of short-sighted frequency emotion identification method of the GIF of fusing text information and system
CN111209970A (en) * 2020-01-08 2020-05-29 Oppo(重庆)智能科技有限公司 Video classification method and device, storage medium and server
CN111581437A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Video retrieval method and device
CN112184738A (en) * 2020-10-30 2021-01-05 北京有竹居网络技术有限公司 Image segmentation method, device, equipment and storage medium
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140328570A1 (en) * 2013-01-09 2014-11-06 Sri International Identifying, describing, and sharing salient events in images and videos

Also Published As

Publication number Publication date
CN112818955A (en) 2021-05-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant