CN117523200A - Image segmentation method and device for application interface, electronic equipment and storage medium - Google Patents

Image segmentation method and device for application interface, electronic equipment and storage medium

Info

Publication number
CN117523200A
Authority
CN
China
Prior art keywords
target
image segmentation
image
text
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311500101.0A
Other languages
Chinese (zh)
Inventor
李士新
邱雨
帅朝春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311500101.0A priority Critical patent/CN117523200A/en
Publication of CN117523200A publication Critical patent/CN117523200A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image segmentation method and device for an application interface, an electronic device, and a storage medium, relating to the technical field of image processing. The method comprises the following steps: an electronic device obtains a running interface screenshot of a target application, the target application being an application associated with a target task; at least some of the entity objects contained in the running interface screenshot are extracted using a pre-trained image segmentation model to obtain an image segmentation result, and the operation corresponding to the target task is executed according to the image segmentation result, where the image segmentation model is obtained in advance by image segmentation training on running interface screenshot samples from a plurality of applications. In this way, the running interface screenshot of the target application is segmented by an image segmentation model trained specifically for application running interfaces, so a more accurate image segmentation result can be obtained, and the operation corresponding to the target task, executed on the basis of this accurate result, is performed better and more accurately.

Description

Image segmentation method and device for application interface, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image segmentation method and apparatus for an application interface, an electronic device, and a storage medium.
Background
Image segmentation is the technique and process of dividing an image into several regions with distinct properties and extracting the objects of interest. It is a key step from image processing to image analysis. Existing image segmentation methods fall mainly into the following categories: threshold-based methods, region-based methods, edge-based methods, methods based on a specific theory, and the like.
However, in the related art, the accuracy and effect of image segmentation on the running interface of the application are still poor, so that the effect of downstream application using the image segmentation result is affected.
Disclosure of Invention
The application provides an image segmentation method and device of an application interface, electronic equipment and a storage medium, so as to improve the accuracy of image segmentation of an application running interface.
In a first aspect, an embodiment of the present application provides an image segmentation method of an application interface, applied to an electronic device, where the method includes: acquiring an operation interface screenshot of a target application, wherein the target application is an application associated with a target task; and extracting at least part of entity objects contained in the operation interface screenshot by using a pre-trained image segmentation model to obtain an image segmentation result so as to execute the operation corresponding to the target task according to the image segmentation result, wherein the image segmentation model is obtained by performing image segmentation training by using a plurality of applied operation interface screenshot samples in advance.
In a second aspect, an embodiment of the present application provides an image segmentation apparatus for an application interface, which is applied to an electronic device, and the apparatus includes: the screenshot acquisition module is used for acquiring an operation interface screenshot of a target application, wherein the target application is an application associated with a target task; the image segmentation module is used for extracting at least part of entity objects contained in the operation interface screenshot by utilizing a pre-trained image segmentation model to obtain an image segmentation result so as to execute the operation corresponding to the target task according to the image segmentation result, and the image segmentation model is obtained by carrying out image segmentation training by utilizing a plurality of applied operation interface screenshot samples in advance.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having program code stored therein, the program code being callable by a processor to perform the method described above.
In the solution provided by the application, the electronic device acquires a running interface screenshot of a target application, the target application being an application associated with a target task; at least some of the entity objects contained in the running interface screenshot are extracted using a pre-trained image segmentation model to obtain an image segmentation result, and the operation corresponding to the target task is executed according to the image segmentation result, where the image segmentation model is obtained in advance by image segmentation training on running interface screenshot samples from a plurality of applications. In this way, an image segmentation model trained specifically for application running interfaces is used to segment the running interface screenshot of the target application, so that at least some of the entity objects in the screenshot can be segmented more accurately and a more accurate image segmentation result is obtained; further, the operation corresponding to the target task, executed based on this image segmentation result, can be performed better and more accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart illustrating an image segmentation method of an application interface according to an embodiment of the present application.
Fig. 2 is a flow chart illustrating an image segmentation method of an application interface according to another embodiment of the present application.
Fig. 3 is a flow chart of a training method of an image segmentation model according to an embodiment of the present application.
Fig. 4 shows a schematic diagram of a model architecture of an image segmentation model according to an embodiment of the present application.
Fig. 5 is a schematic flow chart of verification and light-weight processing of an image segmentation model according to an embodiment of the present application.
Fig. 6 is a block diagram of an image segmentation apparatus for an application interface according to an embodiment of the present application.
Fig. 7 is a block diagram of an electronic device for performing an image segmentation method of an application interface according to an embodiment of the present application.
Fig. 8 is a storage unit for storing or carrying program code for implementing an image segmentation method of an application interface according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of the present application.
It should be noted that some of the processes described in the specification, claims, and drawings above include a plurality of operations appearing in a specific order; these operations may be performed out of the order in which they appear herein, or in parallel. Sequence numbers such as S110 and S120 are merely used to distinguish different operations and do not by themselves represent any execution order. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. The terms "first", "second", and the like in the description, the claims, and the above figures are used to distinguish similar objects, not necessarily to describe a particular sequential or chronological order. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments of the present application described herein may be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises", "comprising", and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or sub-modules is not necessarily limited to those steps or sub-modules expressly listed, but may include other steps or sub-modules that are not expressly listed or are inherent to such process, method, article, or apparatus.
In the related art, image segmentation schemes are basically domain-specific models: artificial intelligence models exist for fields such as medicine and autonomous driving, but they perform well only when segmenting images from their own domain and perform poorly on images from other fields. This is because image segmentation requires labeling classes pixel by pixel, so annotation is very costly and a dataset cannot contain too many classes (typical image segmentation datasets have tens of classes, and even large-scale ones only hundreds, which is far from comparable to the thousands of classes in ImageNet). In the image segmentation scenario for the running interface of an application, no domain-specific model exists to solve this problem.
In order to solve the above problems, the inventor proposes an image segmentation method, an image segmentation device, an electronic device and a storage medium of an application interface. The image segmentation method of the application interface provided in the embodiment of the present application is described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of an image segmentation method of an application interface according to an embodiment of the present application, which is applied to an electronic device. The image segmentation method of the application interface provided in the embodiment of the present application will be described in detail below with reference to fig. 1. The image segmentation method of the application interface may include the steps of:
Step S110: and acquiring a running interface screenshot of a target application, wherein the target application is an application associated with a target task.
In this embodiment, the electronic device includes, but is not limited to, devices with data processing and display functions such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart watch, a smart band, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, and an MP4 (Moving Picture Experts Group Audio Layer IV) player. Target applications include, but are not limited to, navigation applications, ride-hailing applications, payment applications, video applications, shopping applications, instant messaging applications, gaming applications, weather applications, memo applications, news applications, antivirus applications, and browser applications. The running interface screenshot can be understood as an image obtained by taking a screenshot of the display interface while the target application runs in the foreground.
Optionally, the target task may be a user behavior analysis task, an instant message pushing task, a simple screen recognition task, or the like, which is not limited in this embodiment.
In some embodiments, when the target task is a user behavior analysis task or an instant message push task, the target task may be preset to be associated with a plurality of preset applications; of course, the preset applications associated with the two different tasks may or may not be the same, which is not limited in this embodiment. On this basis, if the electronic device detects that the application running in the foreground is any one of the plurality of preset applications, it determines that this application is the target application, and captures the running interface of the target application once every target duration to obtain running interface screenshots. That is, because executing the operation corresponding to the target task requires running interface screenshots of the target application, whenever any preset application is detected running in the foreground, its running interface can be captured every target duration. In this way, a plurality of running interface screenshots can be provided for the user behavior analysis task, and real-time running interface screenshots can be provided for the instant message push task. The target duration may be a preset value, for example 1 second or 2 seconds, which is not limited in this embodiment.
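The periodic-capture condition described in this embodiment can be sketched as follows (a minimal illustration in Python; the preset application names, the function name, and the 2-second interval are assumptions for the example, not part of the patent):

```python
# Hypothetical sketch of the periodic-screenshot decision logic.
PRESET_APPS = {"shopping_app", "navigation_app"}  # assumed preset applications
TARGET_INTERVAL_S = 2.0  # the "target duration" between screenshots, e.g. 2 seconds

def should_capture(foreground_app: str, last_capture_ts: float, now: float) -> bool:
    """Capture only when a preset application is in the foreground and the
    target duration has elapsed since the previous screenshot."""
    return (foreground_app in PRESET_APPS
            and (now - last_capture_ts) >= TARGET_INTERVAL_S)
```

In a real system this check would run inside the platform's foreground-app monitoring callback; here it is reduced to a pure function so the timing logic is easy to verify.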
In other embodiments, the target task is a screen recognition task, the target application is the application running in the foreground, and the running interface of that application is captured as the running interface screenshot in response to a screen capture operation input for the target task. In this way, when the user sees a running interface of interest in the foreground application, the user can, according to his or her own screen recognition needs, input a screen capture operation on the electronic device to trigger it to capture the running interface of the foreground application as the running interface screenshot.
Step S120: and extracting at least part of entity objects contained in the operation interface screenshot by using a pre-trained image segmentation model to obtain an image segmentation result so as to execute the operation corresponding to the target task according to the image segmentation result, wherein the image segmentation model is obtained by performing image segmentation training by using a plurality of applied operation interface screenshot samples in advance.
In this embodiment, the image segmentation model is obtained in advance by image segmentation training on running interface screenshot samples from a plurality of applications; that is, its segmentation performance on application running interface screenshots has been trained in advance, so it segments such screenshots well and can achieve accurate image segmentation.
It can be understood that, because the computing resources of the electronic device are limited, the training process of the image segmentation model may be performed on a server; after the image segmentation model converges, the server compresses (i.e., lightweights) the converged model and then deploys the compressed model on the electronic device. This avoids performing model training, which requires a large amount of computing resources, directly on the electronic device: the training is carried out on a more efficient server, and the converged model is compressed before deployment, avoiding problems such as excessive storage-space occupation and power consumption on the electronic device caused by an oversized model.
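The patent does not specify how the "lightweighting" is done; one common possibility is post-training quantization, which maps floating-point weights to low-bit integers plus a scale factor. The sketch below is purely illustrative of that idea, not the patent's actual compression method:

```python
def quantize_weights(weights, num_bits=8):
    """Minimal post-training quantization sketch (illustrative only):
    map float weights to signed integers plus a shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0  # fall back to 1.0 for all-zero weights
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the quantized representation."""
    return [q * scale for q in quantized]
```

An 8-bit representation stores each weight in one byte instead of four, roughly quartering the deployed model size at the cost of a small approximation error.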
Further, the electronic device may extract at least a portion of the physical objects included in the running interface screenshot by using a pre-trained image segmentation model deployed on the electronic device to obtain an image segmentation result. In other words, all entity objects contained in the running interface screenshot can be extracted, or only specified entity objects in the running interface screenshot can be extracted, and the specified entity objects can be extracted by the image segmentation model in a targeted manner according to segmentation prompt text input by a user. Wherein the entity objects include, but are not limited to: the present embodiment is not limited to the objects such as animals, fruits, automobiles, plants, rivers, lakes, seas, and household appliances.
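The choice between extracting all entity objects and extracting only those named in a user's segmentation prompt text can be illustrated as a hypothetical post-processing step (the result format with a `label` field is an assumption for the example, not the patent's interface):

```python
def select_entities(segments, prompt_text=None):
    """Return all segmented entity objects, or, when a segmentation prompt
    text is given, only the objects whose label matches it (sketch)."""
    if prompt_text is None:
        return segments  # extract all entity objects in the screenshot
    return [seg for seg in segments if seg["label"] == prompt_text]
```

A real implementation would condition the segmentation model itself on the prompt text; this filter merely makes the two behaviors concrete.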
In some embodiments, the target task is a user behavior analysis task, and after step S120, a plurality of historical image segmentation results may be obtained, where the historical image segmentation results are extracted from running interface screenshots of the target application captured multiple times within a target history duration; behavior analysis processing is then performed for the user of the electronic device according to the plurality of historical image segmentation results to obtain the user's behavior analysis result. The target history duration may be a preset value, for example 7 days, 30 days, or 60 days. It should be noted that the target history duration is far greater than the target duration between screenshots, so that a sufficient number of historical image segmentation results can be obtained and a more accurate behavior analysis result derived.
Optionally, the multiple running interface screenshots captured by the electronic device within the target history duration may be segmented in a single batch by the image segmentation model at the end, yielding the plurality of historical image segmentation results. This avoids problems such as excessive power consumption and reduced standby time that would arise if the electronic device ran the image segmentation model immediately after every screenshot. That is, the electronic device may store each captured running interface screenshot, stamping it with its capture time, so that more accurate behavior analysis can later be performed from the stored screenshots within the target history duration.
Optionally, the electronic device may instead run the image segmentation model immediately after each running interface screenshot is captured, extracting at least some of the entity objects from the real-time screenshot to obtain the corresponding image segmentation result. Each image segmentation result is stored, carrying the timestamp at which its screenshot was captured, and the screenshot itself is then deleted. On this basis, when the electronic device later performs the user behavior analysis task, it can directly obtain the plurality of image segmentation results within the target history duration and perform the analysis based on them. Since only the image segmentation result of each screenshot is stored, the pressure on the electronic device's storage space is greatly reduced, avoiding problems such as stuttering or foreground applications crashing due to insufficient memory caused by storing too many screenshots.
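The storage strategy just described — keep only timestamped segmentation results and discard the raw screenshots — can be sketched as a small store (a hypothetical interface; the patent does not prescribe a data structure):

```python
class SegmentationResultStore:
    """Keeps only the timestamped segmentation result per screenshot, so the
    raw screenshot can be deleted after segmentation (illustrative sketch)."""

    def __init__(self):
        self._results = []  # list of (timestamp, segmentation result)

    def add(self, timestamp, result):
        """Record one segmentation result stamped with its capture time."""
        self._results.append((timestamp, result))

    def within(self, now, history_window_s):
        """Return the results whose timestamps fall inside the target
        history duration, for use by the behavior analysis task."""
        return [r for t, r in self._results if now - t <= history_window_s]
```

The `within` query is what the later behavior analysis step would call instead of re-reading stored screenshots.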
In this manner, the user behavior analysis task may be further divided into finer subtasks, such as a user purchase behavior analysis task, a user travel behavior analysis task, and a user song-listening behavior analysis task; obviously, different subtasks may be preset with different associated preset applications. For example, for the user purchase behavior analysis task, a plurality of different shopping applications may be preset as the plurality of preset applications, so that, from the user's running interface screenshots in these shopping applications, it can be learned which entity objects the user likes to search for or purchase, such as a cricket cap, casual wear, and sports pants; this realizes the behavior analysis processing of the user purchase behavior analysis task and yields an accurate behavior analysis result. Further, after the user behavior analysis task is completed, accurate user profiling can be performed according to one or more behavior analysis results, obtaining an accurate user profile.
Alternatively, when the user behavior analysis task includes a plurality of different subtasks, corresponding user behavior analysis processing may be performed for each subtask at the same time. Of course, the corresponding user analysis processing can be sequentially performed on each subtask according to the sequence of the processing priority levels preset by the user for a plurality of different subtasks from high to low, so that the use pressure of the computing resources of the electronic device can be reduced, and the problems of interruption or failure of the user behavior analysis processing and the like caused by insufficient computing resources are avoided.
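The patent does not specify the behavior analysis algorithm; one minimal reading is a frequency count of entity labels across the historical segmentation results, which surfaces the objects the user encounters most often (the label-list result format is an assumption for the example):

```python
from collections import Counter

def analyze_behavior(historical_results):
    """Aggregate entity labels across historical image segmentation results
    and return them ordered from most to least frequent (sketch only)."""
    counts = Counter(label
                     for result in historical_results
                     for label in result)
    return [label for label, _ in counts.most_common()]
```

A production analysis would likely weight by recency and subtask, but a frequency ranking already yields a usable preference signal for profiling.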
In still other embodiments, the target task is an instant message push task, and after step S120, a prompt message matching the image segmentation result is output. That is, the matched prompt information can be timely output for the real-time image segmentation result of the operation interface screenshot intercepted in real time.
Optionally, the prompt information may be an object category of the entity object represented by the image segmentation result, that is, output which categories of entity objects are included in the running interface screenshot, for example, drinks, bread, and apples.
Optionally, the prompt information may be output by the target application, for example, the image segmentation result of the running interface screenshot acquired for the target application includes the Beijing office, and correspondingly, the target application may further acquire ticket purchasing information and route information of the Beijing office as the prompt information; further, the prompt message may be output in a target output form, including, but not limited to, a pop-up form, a short message form, or a mail form.
Optionally, the prompt information may also be output by other applications of the same type as the target application, where for convenience of description, the target application is described as a first target application, and other applications of the same type as the target application are described as a second target application. Wherein the second target application and the first target application may be regarded as applications of the same application type developed by different companies, i.e. belonging to competing applications. The electronic device, after identifying that the running interface screenshot of the first target application contains cat food of a certain brand by using the image segmentation model; the second target application can further acquire coupon information and/or minimum price information of cat foods of the same brand as prompt information; and outputting the prompt information in the form of output of the popup window. Therefore, if the user finds that cat food of the brand in the second target application is cheaper through the prompt information displayed by the popup window, the user can switch to the second target application to purchase, namely, the drainage effect on the user is achieved to a certain extent.
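The matching of recognized entity objects against a second application's offer information can be illustrated as a simple lookup (the offer table and exact-match rule are assumptions for the example; a real system would match brands and categories more loosely):

```python
def build_prompt(segment_labels, offers):
    """Return prompt messages for every recognized entity label that has a
    matching entry in the second application's offer table (sketch)."""
    return [offers[label] for label in segment_labels if label in offers]
```

The returned messages would then be shown via the pop-up, short message, or mail output forms described above.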
In this embodiment, an image segmentation model specially trained for application running interfaces is used to segment the running interface screenshot of the target application, so that at least some of the entity objects in the screenshot can be segmented more accurately, yielding a more accurate image segmentation result. Moreover, the operation corresponding to the target task executed based on the image segmentation result can be performed more effectively and accurately; for example, a user behavior analysis task based on the image segmentation result is more comprehensive and accurate; for another example, better-matched prompt information can be output in a timely manner based on the image segmentation result.
Referring to fig. 2, fig. 2 is a flowchart of an image segmentation method of an application interface according to another embodiment of the present application, which is applied to an electronic device. The image segmentation method of the application interface provided in the embodiment of the present application will be described in detail below with reference to fig. 2. The image segmentation method of the application interface may include the steps of:
step S210: and acquiring a running interface screenshot of a target application, wherein the target application is an application associated with a target task.
In this embodiment, the specific implementation of step S210 may refer to the content in the foregoing embodiment, and will not be described herein.
Step S220: and inputting the operation interface screenshot into an image feature extraction module in the image segmentation model to extract features so as to obtain target image features.
The image feature extraction module may be obtained by training in advance based on a Vision Transformer (ViT), or may be obtained by training based on another network structure usable for image feature extraction, which is not limited in this embodiment.
Because the image feature extraction module undergoes iterative training for image segmentation along with the image segmentation model, after the running interface screenshot of the target application is captured, the image feature extraction module can be used to extract image features from the running interface screenshot to obtain the target image features of the running interface screenshot.
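The patch-sequence idea behind a ViT-style image feature extraction module can be sketched as follows. This is a minimal numpy illustration, not the patent's implementation: the projection matrix here is random, whereas the real module's parameters are learned during the joint training described above, and all shapes and names are illustrative.

```python
import numpy as np

def extract_patch_features(image, patch_size=16, dim=64, seed=0):
    """Split an image into non-overlapping patches and linearly project
    each flattened patch, yielding a ViT-style feature sequence.
    The projection matrix here is random; in a trained module it is learned."""
    h, w, c = image.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((patch_size * patch_size * c, dim)) * 0.02
    feats = []
    for y in range(0, h - h % patch_size, patch_size):
        for x in range(0, w - w % patch_size, patch_size):
            patch = image[y:y + patch_size, x:x + patch_size].reshape(-1)
            feats.append(patch @ proj)
    return np.stack(feats)  # shape: (num_patches, dim)

screenshot = np.random.rand(224, 224, 3)   # a downscaled running interface screenshot
features = extract_patch_features(screenshot)
print(features.shape)  # (196, 64): 14 x 14 patches of 16 x 16 pixels
```

Each row of the output corresponds to one sub-image region of the screenshot, which is exactly the "image feature sequence" consumed by the mask prediction module later.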
Step S230: if the target task contains target prompt text information, acquiring preset prompt text characteristics matched with the target prompt text information from a plurality of preset prompt text characteristics, and taking the preset prompt text characteristics as target text prompt characteristics.
It can be understood that the target task may only require extracting a certain class, or several designated classes, of entity objects from the running interface screenshot. On this basis, the target task may include target prompt text information, which may be understood as the semantic description text of the entity objects that need to be segmented from the running interface screenshot. For example, if the target task is to search for apples in the running interface screenshot, the target prompt text information may be a sentence such as "search for apples in the image" or "extract the apples in the image", or may simply be the word "apples", which is not limited in this embodiment.
The plurality of preset prompt text features may be obtained by the server extracting, through a pre-trained text encoder, the text features of a plurality of preset prompt texts, where each preset prompt text corresponds to at least one entity object; that is, a preset prompt text may contain descriptive semantic text for one or more entity objects. Further, the server may send the plurality of preset prompt texts and the pre-extracted plurality of preset prompt text features to the electronic device for storage. The electronic device can then look up, according to the target prompt text information, the preset prompt text feature corresponding to the matched preset prompt text, and use it as the target text prompt feature. In this way, the text encoder trained in advance on the server side does not need to be compressed and deployed on the electronic device, which saves storage space on the electronic device; meanwhile, text features do not need to be extracted in real time, since the matched text feature is queried from the pre-extracted features and used as the target text prompt feature, which greatly improves the feature acquisition speed and thus the image segmentation efficiency.
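The on-device lookup can be pictured with a small sketch. The feature store, its keys, and the substring-matching rule below are all hypothetical simplifications of the "matching" the patent leaves unspecified; the cached vectors stand in for features pre-extracted by the server-side text encoder.

```python
import numpy as np

# Hypothetical store of pre-extracted prompt text features sent by the server;
# keys are preset prompt texts, values are their cached feature vectors.
preset_features = {
    "apple":  np.array([0.9, 0.1, 0.0]),
    "banana": np.array([0.1, 0.8, 0.1]),
    "drink":  np.array([0.0, 0.2, 0.9]),
}

def lookup_prompt_feature(prompt_text, store):
    """Return the cached feature whose preset prompt text matches the
    target prompt text information, avoiding on-device text encoding."""
    for preset_text, feature in store.items():
        if preset_text in prompt_text.lower():
            return feature
    return None  # no match: fall back to an on-device text feature extractor

feat = lookup_prompt_feature("extract apples in the image", preset_features)
print(feat)  # the cached feature for "apple"
```

Returning `None` on a miss corresponds to the fallback path described in the next paragraph, where the on-device text feature extraction module is invoked instead.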
In other embodiments, to handle the case where the text features of the target prompt text information contained in the target task were not extracted in advance, the image segmentation model may further include a text feature extraction module; that is, the electronic device may also have a text feature extraction module deployed. The text feature extraction module may be obtained by compressing the text encoder trained in advance on the server side. On this basis, if no preset prompt text feature matching the target prompt text information exists among the plurality of preset prompt text features, the target prompt text information can be input into the text feature extraction module for feature extraction to obtain the target text prompt feature. In this way, when none of the plurality of preset prompt text features matches the target prompt text information, the text feature extraction module can be invoked in time to extract text features from the target prompt text information, avoiding problems such as subsequent image segmentation failing to proceed because no matched text feature can be queried.
In still other embodiments, the image segmentation model may further include a text feature extraction module, that is, the electronic device may also have a text feature extraction module deployed, and the electronic device may also have the aforementioned plurality of preset prompt text features stored in advance. On this basis, the preset prompt text feature matching the target prompt text information can be acquired from the plurality of preset prompt text features to obtain a first text feature; at the same time, text feature extraction is performed on the target prompt text information by the text feature extraction module to obtain a second text feature; further, feature fusion is performed on the first text feature and the second text feature to obtain the target text prompt feature. In this way, the accuracy of text feature extraction for the target prompt text information can be improved, facilitating more accurate subsequent image segmentation.
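The patent does not fix a fusion operator for the first and second text features; a weighted average followed by L2 normalization is one simple choice, sketched below. The `alpha` weight and the normalization step are assumptions for illustration only.

```python
import numpy as np

def fuse_text_features(cached_feat, extracted_feat, alpha=0.5):
    """Fuse the cached preset prompt text feature (first text feature) with
    the on-device extracted feature (second text feature) by weighted
    averaging, then L2-normalize. One simple choice; the embodiment does
    not specify the fusion operator."""
    fused = alpha * cached_feat + (1 - alpha) * extracted_feat
    return fused / np.linalg.norm(fused)

a = np.array([1.0, 0.0])   # first text feature (from the cache)
b = np.array([0.0, 1.0])   # second text feature (from the on-device module)
print(fuse_text_features(a, b))  # unit vector midway between the two
```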
Step S240: and inputting the target image features and the target text prompt features into a mask prediction module in an image segmentation model to perform mask prediction to obtain a position mask corresponding to the target prompt text information, wherein the position mask is used as a target position mask.
The mask prediction module may be pre-trained based on a Transformer decoder and several fully connected network layers. On this basis, the mask prediction module can perform cross-attention computation on the target image features and the target text prompt feature, then compute mask probabilities from the cross-attention result using the fully connected layers, and finally aggregate the computed mask probabilities to generate the position mask corresponding to the target prompt text information as the target position mask. Specifically, the target image features form an image feature sequence, where each sub-image feature in the sequence corresponds to the image of a partial region of the running interface screenshot. The mask prediction module computes the feature similarity between each sub-image feature and the target text prompt feature to obtain the similarity corresponding to each sub-image feature; further, the similarity corresponding to each sub-image feature is normalized by the fully connected layers to obtain the mask probability corresponding to each sub-image feature. For pixel points whose mask probability is greater than a preset probability threshold, the position mask value is determined to be 1; for pixel points whose mask probability is less than or equal to the preset probability threshold, the position mask value is determined to be 0.
In popular terms, a simple classification task is performed for each pixel in the running interface screenshot, and a mask value of 0 or 1 is generated by performing label prediction on all pixels. For example, the mask value of pixel points in the image area corresponding to the target prompt text information in the running interface screenshot is 1, and the mask value of pixel points in other areas outside that image area is 0.
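The similarity-normalize-threshold pipeline above can be condensed into a few lines. This sketch replaces the learned fully connected layers with a plain cosine similarity and sigmoid, so it only illustrates the data flow, not the trained behavior; the scaling factor and threshold are assumptions.

```python
import numpy as np

def predict_mask(patch_feats, text_feat, threshold=0.5):
    """Per-patch mask prediction: cosine similarity between each sub-image
    feature and the text prompt feature, squashed to a probability, then
    thresholded to a 0/1 position-mask value. The learned fully connected
    normalization is stood in for by a sigmoid here."""
    sims = patch_feats @ text_feat / (
        np.linalg.norm(patch_feats, axis=1) * np.linalg.norm(text_feat) + 1e-8)
    probs = 1.0 / (1.0 + np.exp(-4.0 * sims))  # squash similarity into (0, 1)
    return (probs > threshold).astype(np.uint8)

patches = np.array([[0.9, 0.1], [-0.8, 0.2], [0.7, 0.3]])  # toy sub-image features
text = np.array([1.0, 0.0])                                 # toy prompt feature
print(predict_mask(patches, text))  # [1 0 1]: patches aligned with the prompt get 1
```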
Step S250: and if the target task does not contain the target prompt text information, inputting the target image characteristics into a mask prediction module in the image segmentation model to perform mask prediction to obtain a target position mask.
Optionally, the target task may not include target prompt text information. Because the mask prediction module in the image segmentation model has already learned, during training, sufficient image information of application interface screenshots together with the semantic information carried by prompt texts, even without target prompt text information the mask prediction module can directly perform image segmentation on all the entity objects in the running interface screenshot. That is, although the mask prediction module does not know which specific entity objects are included, it can still determine the target position masks based on its pre-trained image segmentation capability, only possibly lacking the semantic information of the entity objects. In other words, in this case the mask prediction module outputs the target position masks corresponding to all entities in the running interface screenshot. For example, if the running interface screenshot contains apples and bananas, but no target prompt text information indicates extracting and segmenting the apples or the bananas, the mask prediction module outputs both a target position mask for extracting the apples and a target position mask for extracting the bananas.
Step S260: and extracting an image area corresponding to the target position mask from the operation interface screenshot to serve as the image segmentation result, so as to execute the operation corresponding to the target task according to the image segmentation result.
The target position mask is used to select which pixels are allowed to be copied and which are not: if the mask value of a pixel is 1, the pixel is allowed to be copied; if the mask value is 0, it is not. On this basis, after the target position mask is acquired, an AND (&) operation is performed between the running interface screenshot and the target position mask, so that only the image region with mask value 1 is extracted as the image segmentation result. For example, the running interface screenshot contains both bananas and apples, but the target prompt text information indicates banana extraction; in this case, the mask value corresponding to the pixel points of the banana area in the computed target position mask is 1, and the mask value corresponding to the pixel points of other areas is 0. That is, only the pixel values of the pixels in the banana area are allowed to be copied, and the banana area image is thus taken as the above image segmentation result.
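The masking step itself is a one-liner with broadcasting; a minimal sketch, with a toy screenshot and a hand-placed mask standing in for real data:

```python
import numpy as np

def extract_masked_region(screenshot, mask):
    """Keep only pixels whose position-mask value is 1; masked-out pixels
    are zeroed, mirroring the AND operation between screenshot and mask."""
    return screenshot * mask[..., None]  # broadcast the mask over RGB channels

img = np.full((4, 4, 3), 200, dtype=np.uint8)  # toy running interface screenshot
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1                             # pretend the banana occupies the centre
segment = extract_masked_region(img, mask)
print(segment[1, 1], segment[0, 0])  # [200 200 200] [0 0 0]
```

Only the 2x2 centre survives; everything outside the mask is zeroed, which is what "not allowed to be copied" amounts to numerically.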
Alternatively, if the image segmentation result is not that of a specified entity object extracted based on target prompt text information, it contains the segmentation results of a number of unknown entity objects. The electronic device may further perform similarity matching between the image feature corresponding to each entity object in the image segmentation result and the plurality of preset prompt text features, and take the entity object category corresponding to the preset prompt text feature whose similarity exceeds a preset similarity threshold as the entity object category in the image segmentation result. In other words, when the image segmentation result does not carry a corresponding semantic prompt, the electronic device can use the stored plurality of preset prompt text features to determine which specific entity objects the image segmentation result contains.
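This zero-shot category lookup can be sketched as a nearest-match search over the cached text features. The feature vectors, category names, and threshold value are illustrative; only the cosine-similarity-above-threshold rule comes from the description above.

```python
import numpy as np

def classify_segment(obj_feat, preset_feats, sim_threshold=0.6):
    """Match a segmented object's image feature against the stored preset
    prompt text features; return the best category above the threshold,
    or None for an unknown entity object."""
    best_name, best_sim = None, sim_threshold
    for name, tf in preset_feats.items():
        sim = obj_feat @ tf / (np.linalg.norm(obj_feat) * np.linalg.norm(tf))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name

presets = {"apple": np.array([1.0, 0.0]), "banana": np.array([0.0, 1.0])}
print(classify_segment(np.array([0.9, 0.2]), presets))  # apple
```

Note this reuses the same cached text features that serve the prompt lookup, so no extra on-device model is needed for naming the segments.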
In this embodiment, firstly, the image features of the running interface screenshot can be extracted more accurately by the pre-trained image feature extraction module; secondly, the position masks of all entity objects can be generated more accurately by the pre-trained mask prediction module together with these accurate image features, achieving accurate segmentation of all entity objects in the running interface screenshot. When the target task contains target prompt text information, the plurality of preset prompt text features pre-generated by the server and sent to the electronic device are used to quickly query the target text prompt feature matching the target prompt text information; the pre-trained image feature extraction module, the pre-trained mask prediction module, and the target text prompt feature are then used to accurately predict the position mask of the entity object matching the target prompt text information in the running interface screenshot, achieving more targeted segmentation of the specified entity object. Evidently, this embodiment can accurately meet multiple image segmentation requirements — both the segmentation of all entity objects and the segmentation of the entity objects indicated by the target prompt text information — so that the operation corresponding to the target task executed based on the accurate image segmentation result performs better and more accurately; for example, downstream tasks such as user behavior analysis and instant message prompting perform better and more accurately.
Referring to fig. 3, fig. 3 is a flowchart of a training method of an image segmentation model according to an embodiment of the present application, which is applied to a server. The training of the image segmentation model provided in the embodiment of the present application will be described in detail with reference to fig. 3. The training method of the image segmentation model can comprise the following steps:
step S310: the method comprises the steps of obtaining a target sample set, wherein the target sample set comprises a plurality of operation interface screenshot samples, each operation interface screenshot sample carries a prompt text and a preset position mask, and the operation interface screenshot samples are obtained by screenshot of a plurality of applications in a foreground operation process.
In this embodiment, the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (Content Delivery Network, CDN) acceleration services, and basic cloud computing services such as big data and image processing platforms.
Because the image encoder and the text encoder in this embodiment are both pre-trained large models — for example, the image encoder may be a pre-trained ViT model and the text encoder a pre-trained BERT model — it is not necessary to obtain millions of running interface screenshots to construct the target sample set; running interface screenshots of roughly 100 applications, about 3,000 screenshots in total, are enough to meet the service effect requirement. This greatly reduces the workload of manually labeling the running interface screenshot samples.
In some embodiments, at intervals of a target duration, the running interfaces of applications running in the foreground may be captured to obtain a plurality of running interface screenshots, which serve as the plurality of running interface screenshot samples. Further, the user examines each running interface screenshot sample, manually selects entity objects of interest in it, and adds descriptive text for those entity objects — the aforementioned prompt text — to the sample. At the same time, each pixel point corresponding to the entity object needs to be marked in the running interface screenshot sample; in other words, the user manually circles the image area of the entity object in the sample, from which the preset position mask corresponding to the entity object can then be generated. The prompt text may be a sentence, for example "this is a shopping page including some drinks", or simply a word such as "drinks".
In other embodiments, the running interfaces of applications running in the foreground may be captured at intervals of a target duration to obtain a plurality of running interface screenshots. Considering that in practice a running interface screenshot typically has an image size of 2412×1080 pixels (px), in order to reduce the manual labeling workload and improve model training efficiency, the image sizes of the running interface screenshots can be reduced according to a preset reduction ratio, yielding a plurality of reduced running interface screenshots as the running interface screenshot samples. For example, reducing a running interface screenshot from 2412×1080 px to 336×336 px or 224×224 px — sizes that still contain enough image information — reduces the total number of pixels from 2,604,960 to 112,896 (336×336 px) or 50,176 (224×224 px), which greatly reduces the labeling workload and, during subsequent model training, reduces the computation of the mask probability for each pixel point, improving training efficiency.
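The reduction step can be sketched in a few lines; the patent fixes only the target sizes, not the resampling method, so the nearest-neighbour scheme below is one illustrative choice.

```python
import numpy as np

def downscale(image, target=(224, 224)):
    """Nearest-neighbour reduction of a full-size screenshot to a small
    training size; the embodiment only fixes target sizes such as
    336x336 or 224x224, not the resampling method."""
    h, w = image.shape[:2]
    th, tw = target
    ys = np.arange(th) * h // th   # source row index for each target row
    xs = np.arange(tw) * w // tw   # source column index for each target column
    return image[ys][:, xs]

shot = np.zeros((2412, 1080, 3), dtype=np.uint8)  # a full-size screenshot
small = downscale(shot)
print(small.shape, shot.shape[0] * shot.shape[1], 224 * 224)
# (224, 224, 3) 2604960 50176
```

The printed pixel counts match the reduction quoted above: from 2,604,960 pixels down to 50,176.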
Step S320: and extracting the image characteristics of each operation interface screenshot sample through a pre-trained image encoder to obtain sample image characteristics corresponding to each operation interface screenshot sample.
In this embodiment, ViT may be selected as the pre-trained image encoder; compared with conventional convolutional neural network (Convolutional Neural Network, CNN) and recurrent neural network (Recurrent Neural Network, RNN) structures, ViT has the advantage of being able to understand the semantic information of an image to some extent, which facilitates subsequent practical use.
In other embodiments, an image encoder trained in advance for application function identification on running interface screenshots may be directly multiplexed as the above pre-trained image encoder. Because such a pre-trained image encoder already has partial image feature extraction capability for application running interface screenshots, only a small number of running interface screenshot samples are needed for training before the image encoder can accurately extract the image features required by this application.
Step S330: and extracting text characteristics of prompt texts carried by each operation interface screenshot sample through a pre-trained text encoder to obtain sample text characteristics corresponding to each operation interface screenshot sample.
In this embodiment, the pre-trained text encoder may be a pre-trained BERT model. Because both the BERT model and the ViT model are based on the Transformer network structure, the models learn semantics more conveniently, and alignment of image features and text features can be ensured to a certain extent. Therefore, the text features of the prompt text carried by each running interface screenshot sample can be better extracted.
In other embodiments, a text encoder trained in advance for application function identification on running interface screenshots may be directly multiplexed as the above pre-trained text encoder. Because such a text encoder already has partial text feature extraction capability for text information, only a small number of running interface screenshot samples are needed for training before the text encoder can accurately extract the text features required by this application.
In this embodiment, the training process of the text encoder and the image encoder used for application function identification on running interface screenshots may be as follows. Obtain a specified sample set, where the specified sample set includes a plurality of screenshot samples, each screenshot sample carries a function description text, and the screenshot samples are obtained by capturing a plurality of preset applications during foreground running. Further, extract the image features of each screenshot sample through a pre-trained image feature extraction network to obtain the sample image features corresponding to each screenshot sample; extract the text features of the function description text carried by each screenshot sample through a pre-trained text feature extraction network to obtain the sample text features corresponding to each screenshot sample; determine a target loss value according to the degree of difference between the sample image features and the sample text features corresponding to each screenshot sample; and iteratively update the network parameters of the image feature extraction network and the text feature extraction network according to the target loss value until a specified training condition is met, taking the updated image feature extraction network as the multiplexed image encoder and the updated text feature extraction network as the multiplexed text encoder.
The specified training condition may be that the target loss value is smaller than a preset value, that the target loss value no longer changes, or that the number of training iterations reaches a preset number, etc. It can be understood that after iterative training over multiple training periods on the pre-trained image feature extraction network and text feature extraction network according to the specified sample set, where each training period includes multiple iterations, the parameters of both networks are continuously optimized, so that the target loss value becomes smaller and smaller and finally settles at a fixed value or below the preset value, indicating that the image feature extraction network and the text feature extraction network have converged. Convergence may of course also be determined once the number of training iterations reaches the preset number. At that point, the updated image feature extraction network may be used as the image encoder and the updated text feature extraction network as the text encoder. The preset value and the preset number are set in advance, and their values can be adjusted for different application scenarios, which is not limited in this embodiment.
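The target loss that drives this alignment training can be sketched as a per-pair difference between image and text features. The cosine-distance form below is one concrete reading — the embodiment only requires some measure of the "degree of difference" between paired features — and the toy feature matrices are illustrative.

```python
import numpy as np

def alignment_loss(img_feats, txt_feats):
    """Target loss as the mean difference between each screenshot sample's
    image feature and its paired text feature (cosine distance here; the
    embodiment only requires some measure of difference degree)."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(img * txt, axis=1)))

aligned = alignment_loss(np.eye(3), np.eye(3))               # identical pairs
misaligned = alignment_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))
print(aligned, misaligned)  # 0.0 1.0
```

Minimizing this loss pushes each screenshot's image feature toward its function description's text feature, which is exactly the image-text alignment the converged encoders are expected to provide.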
Optionally, the running interface screenshots in the target sample set mentioned in this application can directly reuse the screenshot samples in the specified sample set, since both consist of running interface screenshots of applications on the electronic device during operation. Because the quality of that data set is already high enough to meet the requirement, there is no need to collect it again, which saves workload and greatly reduces the time cost of sample set construction. On this basis, the preset position mask and the prompt text can be manually annotated directly for each screenshot sample in the specified sample set.
Step S340: and inputting the sample image features and the sample text features corresponding to each operation interface screenshot sample to an initial mask predictor for mask prediction to obtain a prediction position mask corresponding to each operation interface screenshot sample.
In this embodiment, the initial mask predictor may be a relatively simple network structure, such as one containing only a Transformer decoder and a simple fully connected layer structure. In this way, the size of the initial mask predictor itself is reduced, facilitating model deployment on the electronic device side.
It can be appreciated that the specific implementation of mask prediction by the initial mask predictor, based on the sample image features and sample text features, is similar in principle to the content of the foregoing embodiments, and is not repeated here.
Step S350: and carrying out iterative updating on the pre-trained image encoder, the pre-trained text encoder and the initial mask predictor according to the difference degree between the predicted position mask corresponding to each operation interface screenshot sample and the preset position mask carried by each operation interface screenshot sample until a target training condition is met, so as to obtain an updated pre-trained image encoder as the target image encoder and an updated initial mask predictor as the target mask predictor.
Further, according to the degree of difference between the predicted position mask corresponding to each running interface screenshot sample and the preset position mask it carries — that is, between the predicted mask value and the preset mask value of each pixel point — a difference degree value corresponding to each pixel point is calculated, and a mask prediction loss value is determined based on the difference degree values of all pixel points. As shown in fig. 4, the model parameters of the pre-trained image encoder, the pre-trained text encoder, and the initial mask predictor are iteratively updated according to the mask prediction loss value until the target training condition is met, thereby obtaining the updated pre-trained image encoder as the target image encoder shown in fig. 5, the updated initial mask predictor as the target mask predictor shown in fig. 5, and the updated pre-trained text encoder as the target text encoder shown in fig. 5.
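A standard per-pixel binary cross-entropy is one natural reading of this mask prediction loss — the patent only requires a per-pixel difference degree aggregated over all pixels — sketched here with toy probability and ground-truth arrays:

```python
import numpy as np

def mask_prediction_loss(pred_probs, target_mask, eps=1e-7):
    """Per-pixel binary cross-entropy between predicted mask probabilities
    and the preset 0/1 position mask, averaged over all pixel points.
    One concrete choice of 'difference degree'; the embodiment does not
    fix the loss form."""
    p = np.clip(pred_probs, eps, 1 - eps)          # avoid log(0)
    per_pixel = -(target_mask * np.log(p) + (1 - target_mask) * np.log(1 - p))
    return float(per_pixel.mean())

pred = np.array([[0.9, 0.1], [0.8, 0.2]])   # predicted mask probabilities
gt   = np.array([[1,   0  ], [1,   0  ]])   # preset position mask
good = mask_prediction_loss(pred, gt)       # predictions agree with the mask
bad  = mask_prediction_loss(1 - pred, gt)   # predictions inverted
print(good < bad)  # True
```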
For the case where the number of pixels occupied by the entity to be segmented in the running interface screenshot sample is smaller than a preset pixel number, the weights used in computing the loss value can be adjusted; that is, when computing the mask prediction loss value, the reward for correct predictions is increased and the penalty for wrong predictions is reduced. In other words, when a prediction is correct, the weight is adjusted so that the computed mask prediction loss value becomes smaller. Otherwise, the reward for correct predictions is reduced and the penalty for wrong predictions is increased.
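One concrete reading of this reward/penalty adjustment is a class-weighted per-pixel loss that re-balances foreground against background when the object is small; the specific weights (4.0 and 0.5) and the `min_pixels` threshold below are assumptions for illustration, not values from the patent.

```python
import numpy as np

def weighted_mask_loss(pred_probs, target_mask, min_pixels=8, eps=1e-7):
    """Per-pixel binary cross-entropy in which, when the entity occupies
    fewer than a preset number of pixels, foreground pixels are up-weighted
    and background pixels down-weighted — one illustrative reading of the
    reward/penalty adjustment described above."""
    p = np.clip(pred_probs, eps, 1 - eps)
    bce = -(target_mask * np.log(p) + (1 - target_mask) * np.log(1 - p))
    if target_mask.sum() < min_pixels:              # small object: re-balance
        w = np.where(target_mask == 1, 4.0, 0.5)
    else:
        w = np.ones_like(bce)
    return float((w * bce).mean())

gt = np.zeros((4, 4)); gt[0, 0] = 1     # the entity covers 1 of 16 pixels
pred = np.full((4, 4), 0.3); pred[0, 0] = 0.2  # the tiny foreground is missed
weighted = weighted_mask_loss(pred, gt)                # small-object weighting on
uniform  = weighted_mask_loss(pred, gt, min_pixels=0)  # uniform weighting
print(weighted > uniform)  # True: missing the small object costs more
```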
Alternatively, the target training condition may be that the mask prediction loss value is smaller than a preset value, that the mask prediction loss value no longer changes, or that the number of training iterations reaches a preset number, etc. It can be understood that after iterative training over multiple training periods according to the target sample set, where each training period includes multiple iterations, the parameters of the pre-trained image encoder, the pre-trained text encoder, and the initial mask predictor are continuously optimized, so that the mask prediction loss value becomes smaller and smaller and finally settles at a fixed value or below the preset value, indicating that the pre-trained image encoder, the pre-trained text encoder, and the initial mask predictor have converged. Convergence may also be determined once the number of training iterations reaches the preset number; at that point, the updated pre-trained image encoder may be used as the target image encoder and the updated initial mask predictor as the target mask predictor. The preset value and the preset number are set in advance, and their values can be adjusted for different application scenarios, which is not limited in this embodiment.
In this embodiment, after the model training shown in fig. 4 is completed through steps S310 to S350, the target image encoder, the target text encoder, and the target mask predictor are obtained. Further, the trained target image encoder, target text encoder, and target mask predictor may first be deployed in a server; a test sample set is then acquired, and the accuracy of their mask prediction capability is verified on the test screenshot samples in the test sample set. If the verification accuracy is greater than a preset accuracy threshold, the target image encoder may undergo model compression to obtain a lighter image feature extraction module for deployment in the electronic device shown in fig. 5, and the target mask predictor may likewise undergo model compression to obtain a lighter mask prediction module for deployment in the electronic device shown in fig. 5. The image feature extraction module and the mask prediction module in fig. 5 then constitute the image segmentation model mentioned in the foregoing embodiments. In addition, the target text encoder in the server can be used in advance to extract text features from a plurality of preset prompt texts, obtaining a plurality of preset prompt text features that correspond one-to-one with the preset prompt texts, where each preset prompt text corresponds to at least one entity object. For the specific implementation of mask prediction and image segmentation on the running interface screenshot by the electronic device based on the deployed image feature extraction module, mask prediction module, and plurality of preset prompt text features shown in fig. 5, reference may be made to the content of the foregoing embodiments, which is not repeated here.
Specifically, the mask prediction capabilities of the target image encoder, the target text encoder, and the target mask predictor are tested and verified using the test screenshot samples in the test sample set. The test screenshot samples are likewise running interface screenshots of applications; some of them may be labeled with prompt texts while others are not. For the test screenshot samples labeled with prompt texts, the prediction accuracy of the position mask can be determined according to whether the predicted position mask output by the target mask predictor for the prompt text corresponds to the image area matching that prompt text; for the test screenshot samples without labeled prompt texts, the prediction accuracy of all position masks can be determined according to whether all position masks output by the target mask predictor match the image areas where all entity objects in the test screenshot sample are located. If the verification accuracy is greater than the preset accuracy threshold, the test verification is determined to have passed, and the verified target image encoder, target text encoder, and target mask predictor can be deployed in the server and/or the electronic device for use. Otherwise, the test verification is determined to have failed, and a further round of iterative training is required for the target image encoder, target text encoder, and target mask predictor; they cannot be deployed to the server and/or the electronic device for use until the test verification finally passes.
In some implementations, the test-verified target image encoder, target text encoder and target mask predictor may also be utilized to perform mask prediction on running interface screenshot samples. Automatic labeling of the running interface screenshot samples is thus realized by the target mask predictor, so that only manual fine-tuning and supplementation of the samples carrying automatically labeled position masks are needed, which shortens the manual labeling time, reduces the labor cost, and greatly reduces the labeling difficulty.
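A hypothetical sketch of this auto-labeling step: `predict_masks` stands in for the verified target mask predictor, and the `min_area` filter is an illustrative heuristic, not something the patent specifies:

```python
import numpy as np

def auto_label(screenshots, predict_masks, min_area=16):
    """Generate pseudo position-mask labels for unlabeled screenshot samples.
    Tiny spurious masks are dropped so an annotator only fine-tunes and
    supplements the labels instead of drawing them from scratch."""
    records = []
    for img in screenshots:
        masks = [m for m in predict_masks(img) if int(np.asarray(m).sum()) >= min_area]
        records.append({"image": img, "pseudo_masks": masks})
    return records
```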
In other embodiments, as the computing power of electronic devices increases, deployment optimization support for models becomes more sophisticated. In addition to deploying the lighter image feature extraction module and the lighter mask prediction module shown in fig. 5 in the electronic device, the target text encoder in fig. 5 may also be subjected to model compression processing, and the compressed target text encoder may be deployed in the electronic device as well (not shown in fig. 5). Likewise, for the specific implementation of performing mask prediction and image segmentation on the running interface screenshot by the electronic device using the image feature extraction module, the text feature extraction module and the mask prediction module, reference may be made to the content of the foregoing embodiments, which is not described herein again.
In some embodiments, an image encoder used for application function recognition on running interface screenshots is already deployed in the electronic device. In order to reduce memory overhead in the electronic device, this deployed image encoder can be directly reused as the image feature extraction module in the present application. In this manner, only the target mask predictor needs to undergo the lightweighting process before deployment, so the workload of the lightweighting and deployment stage is greatly reduced.
In some embodiments, after the model is lightweighted and deployed on the electronic device, the electronic device may generate a new training sample set from new application interface screenshots and use it to fine-tune and update the model parameters of the lightweight image segmentation model in the electronic device. The model parameters obtained by fine-tuning are then fed back to the server side for federated learning with the server side, realizing synchronous updating of the image segmentation models of the electronic device and the server. In this way, the running interface screenshots at the electronic device side do not need to be sent to the server, which protects the privacy and security of the user of the electronic device, avoids the need for the server to collect and label data again when updating the model, and better facilitates model optimization at the server side.
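The device-to-server synchronization can be sketched as a FedAvg-style weighted average of client parameter deltas. The sample-count weighting is a common convention and an assumption here, as the patent does not fix the aggregation rule:

```python
import numpy as np

def fedavg_update(server_weights, client_deltas, client_sizes):
    """Each device sends only its fine-tuned parameter delta (never raw
    screenshots); the server applies a sample-count-weighted average of the
    deltas so device-side and server-side models stay synchronized."""
    total = float(sum(client_sizes))
    avg_delta = sum((n / total) * d for n, d in zip(client_sizes, client_deltas))
    return server_weights + avg_delta
```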
In this embodiment, through the training scheme of efficiently fine-tuning the large image segmentation model with a small amount of sample data, selectively lightweighting the trained model, and deploying the lightweight image feature extraction module and mask prediction module in the electronic device, the problems caused by deploying a large model, such as occupying excessive storage space of the electronic device, are greatly reduced while the mask prediction precision is guaranteed; that is, the additional storage space overhead and power consumption overhead of the electronic device are saved, while the accuracy of image segmentation at the electronic device side is also ensured.
Referring to fig. 8, a block diagram of an image segmentation apparatus 400 of an application interface according to an embodiment of the present application is shown; the apparatus is applied to an electronic device. The apparatus 400 may include: a screenshot acquisition module 410 and an image segmentation module 420.
The screenshot acquisition module 410 is configured to acquire a running interface screenshot of a target application, where the target application is an application associated with a target task.
The image segmentation module 420 is configured to extract at least part of the entity objects contained in the running interface screenshot by using a pre-trained image segmentation model to obtain an image segmentation result, so as to execute an operation corresponding to the target task according to the image segmentation result, where the image segmentation model is obtained in advance by performing image segmentation training with running interface screenshot samples of a plurality of applications.
In some embodiments, the target task is associated with a plurality of preset applications, and the screenshot acquisition module 410 may be specifically configured to: if it is detected that the application running in the foreground is any one of the plurality of preset applications, determine that the application running in the foreground is the target application; and capture a running interface of the target application at intervals of a target duration to obtain the running interface screenshot.
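A schematic sketch of this trigger-and-capture logic; the package names in `PRESET_APPS` are purely hypothetical:

```python
PRESET_APPS = {"com.example.shop", "com.example.video"}  # hypothetical package names

def select_target_app(foreground_app, preset_apps=PRESET_APPS):
    """The application running in the foreground becomes the target application
    only when it is one of the preset applications associated with the task."""
    return foreground_app if foreground_app in preset_apps else None

def capture_schedule(start, target_duration, end):
    """Timestamps at which the running interface of the target application is
    captured: one screenshot at every interval of the target duration."""
    times, t = [], start
    while t <= end:
        times.append(t)
        t += target_duration
    return times
```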
In this manner, the target task is a user behavior analysis task, and the image segmentation apparatus 400 of the application interface may further include: a behavior analysis module. The behavior analysis module may be configured to, after at least part of the entity objects contained in the running interface screenshot are extracted by using the pre-trained image segmentation model to obtain an image segmentation result, obtain a plurality of historical image segmentation results, where the plurality of historical image segmentation results are extracted from running interfaces of the target application captured multiple times within a target historical duration; and perform behavior analysis processing on the user of the electronic device according to the plurality of historical image segmentation results to obtain a behavior analysis result of the user.
In this manner, the target task is an instant message pushing task, and the image segmentation apparatus 400 of the application interface may further include: a prompting module. The prompting module may be specifically configured to, after at least part of the entity objects contained in the running interface screenshot are extracted by using the pre-trained image segmentation model to obtain an image segmentation result, output prompt information matched with the image segmentation result.
In other embodiments, the target application is the application running in the foreground, and the screenshot acquisition module 410 may be configured to intercept a running interface of the foreground application as the running interface screenshot in response to a screenshot operation input for the target task.
In some embodiments, the image segmentation model includes an image feature extraction module and a mask prediction module, and the image segmentation module 420 may include a first feature extraction unit, a mask prediction unit, and an image segmentation unit. The first feature extraction unit can be used for inputting the operation interface screenshot into the image feature extraction module to perform feature extraction so as to obtain target image features; the mask prediction unit may be configured to input the target image feature to the mask prediction module to perform mask prediction, so as to obtain a target location mask. The image segmentation unit may be configured to extract, from the running interface screenshot, an image area corresponding to the target location mask as the image segmentation result.
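The final step of this pipeline, cutting the image area covered by the target position mask out of the screenshot, might look like the following NumPy sketch (zeroing non-mask pixels and cropping to the mask's bounding box is an illustrative choice):

```python
import numpy as np

def extract_region(screenshot: np.ndarray, position_mask: np.ndarray) -> np.ndarray:
    """Extract the image area corresponding to the target position mask as the
    image segmentation result: pixels outside the mask are zeroed and the
    result is cropped to the mask's bounding box."""
    ys, xs = np.nonzero(position_mask)
    if screenshot.ndim == 3:  # H x W x C color screenshot
        region = screenshot * position_mask[..., None]
    else:                     # H x W grayscale
        region = screenshot * position_mask
    return region[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```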
In this manner, the image feature extraction module is obtained by compressing the target image encoder, and the mask prediction module is obtained by compressing the target mask predictor. The image segmentation apparatus 400 of the application interface may further include: a model training module. The model training module may be configured to: acquire a target sample set, where the target sample set includes a plurality of running interface screenshot samples, each running interface screenshot sample carries a prompt text and a preset position mask, and the plurality of running interface screenshot samples are obtained by capturing screenshots of a plurality of applications while each runs in the foreground; extract image features of each running interface screenshot sample through a pre-trained image encoder to obtain sample image features corresponding to each running interface screenshot sample; extract text features of the prompt text carried by each running interface screenshot sample through a pre-trained text encoder to obtain sample text features corresponding to each running interface screenshot sample; input the sample image features and sample text features corresponding to each running interface screenshot sample into an initial mask predictor for mask prediction to obtain a predicted position mask corresponding to each running interface screenshot sample; and iteratively update the pre-trained image encoder, the pre-trained text encoder and the initial mask predictor according to the degree of difference between the predicted position mask corresponding to each running interface screenshot sample and the preset position mask carried by that sample until a target training condition is met, so as to obtain the updated pre-trained image encoder as the target image encoder and the updated initial mask predictor as the target mask predictor.
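To make the training loop concrete, the toy sketch below replaces the large encoders with fixed per-pixel features and trains a single logistic "mask predictor" weight vector by gradient descent, using binary cross-entropy as the degree of difference. Everything about the architecture here is a stand-in; only the iterate-until-target-condition structure mirrors the patent's scheme:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mask_head(feats, gt_mask, lr=1.0, max_steps=500, target_loss=1e-2):
    """feats: (H, W, D) fused sample image/text features (stand-ins for the
    encoder outputs); gt_mask: (H, W) preset position mask in {0, 1}.
    A per-pixel logistic head W is iteratively updated until the difference
    degree (binary cross-entropy) meets the target training condition."""
    h, w, d = feats.shape
    W = np.zeros(d)
    for _ in range(max_steps):
        pred = sigmoid(feats @ W)                      # predicted position mask
        loss = -np.mean(gt_mask * np.log(pred + 1e-9)
                        + (1 - gt_mask) * np.log(1 - pred + 1e-9))
        if loss < target_loss:                         # target training condition met
            break
        grad = (feats * (pred - gt_mask)[..., None]).mean(axis=(0, 1))
        W -= lr * grad                                 # iterative update
    return W, loss
```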
In some embodiments, the target task includes target prompt text information, and the image segmentation apparatus 400 of the application interface may further include: and a text feature acquisition module. The text feature obtaining module may be configured to obtain, before the target image feature is input to the mask prediction module to perform mask prediction, a preset prompt text feature matching with the target prompt text information from a plurality of preset prompt text features, as a target text prompt feature. The mask prediction unit may be configured to input the target image feature and the target text prompt feature to the mask prediction module to perform mask prediction, so as to obtain a position mask corresponding to the target prompt text information, and use the position mask as the target position mask.
In this manner, the image segmentation model further includes a text feature extraction module, and the text feature acquisition module may be further configured to input the target prompt text information into the text feature extraction module to perform feature extraction if a preset prompt text feature matched with the target prompt text information does not exist in the plurality of preset prompt text features, so as to obtain the target text prompt feature.
In some embodiments, the image segmentation apparatus 400 of the application interface may further include: a text feature receiving module and a text feature storage module. The text feature receiving module may be configured to, before the preset prompt text feature matched with the target prompt text information is obtained from the plurality of preset prompt text features as the target text prompt feature, receive the plurality of preset prompt text features sent by a server, where the plurality of preset prompt text features are obtained by the server by extracting text features from a plurality of preset prompt texts using a pre-trained text encoder, and each preset prompt text corresponds to at least one entity object. The text feature storage module may be configured to store the plurality of preset prompt text features.
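The receive-store-lookup behavior of these two modules can be sketched as a small cache. The cosine-similarity fallback and the 0.9 threshold are hypothetical additions, since the patent only requires matching target prompt text information against the stored features:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

class PromptFeatureCache:
    """Stores the preset prompt text features received from the server and
    resolves target prompt text information against them."""
    def __init__(self):
        self.features = {}  # preset prompt text -> feature vector

    def store(self, prompt_text: str, feature: np.ndarray):
        self.features[prompt_text] = feature

    def lookup(self, target_text: str, target_feature=None, threshold=0.9):
        if target_text in self.features:       # exact match on the prompt text
            return self.features[target_text]
        if target_feature is not None and self.features:
            # hypothetical fallback: nearest cached feature by cosine similarity
            best = max(self.features.values(), key=lambda f: cosine(f, target_feature))
            if cosine(best, target_feature) >= threshold:
                return best
        return None  # caller falls back to the on-device text feature extraction module
```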
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided herein, the modules may be coupled to each other electrically, mechanically or in other forms.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
In summary, in this embodiment, an image segmentation model trained specifically on running interfaces of applications is used to segment the running interface screenshot of the target application, so that at least part of the entity objects in the screenshot can be segmented more accurately and a more accurate image segmentation result is obtained. Moreover, the operation corresponding to the target task executed based on the image segmentation result can be performed better and more accurately; for example, behavior analysis of the user based on the image segmentation result is more comprehensive and accurate, and, as another example, better-matched prompt information can be output in time based on the image segmentation result. In addition, through the training scheme of efficiently fine-tuning the large image segmentation model with a small amount of sample data, selectively lightweighting the trained model, and deploying the lightweight image feature extraction module and mask prediction module in the electronic device, the problems caused by deploying a large model, such as occupying excessive storage space of the electronic device, are greatly reduced while the mask prediction precision is guaranteed; that is, the additional storage space overhead and power consumption overhead of the electronic device are saved, while the accuracy of image segmentation at the electronic device side is also ensured.
An electronic device provided in the present application will be described with reference to fig. 7.
Referring to fig. 7, fig. 7 shows a block diagram of an electronic device 500 according to an embodiment of the present application; the methods according to the embodiments of the present application may be performed by the electronic device 500. The electronic device may be an electronic terminal with data processing capabilities, including but not limited to a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart watch, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and the like.
The electronic device 500 in embodiments of the present application may include one or more of the following components: a processor 501, a memory 502, and one or more application programs, wherein the one or more application programs may be stored in the memory 502 and configured to be executed by the one or more processors 501, the one or more program(s) configured to perform the method as described in the foregoing method embodiments.
The processor 501 may include one or more processing cores. The processor 501 uses various interfaces and lines to connect various parts of the overall electronic device 500, and performs various functions of the electronic device 500 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 502 and invoking data stored in the memory 502. Alternatively, the processor 501 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA) and programmable logic array (Programmable Logic Array, PLA). The processor 501 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is responsible for rendering and drawing display content; the modem is used to handle wireless communication. It will be appreciated that the modem may alternatively not be integrated into the processor 501 and may instead be implemented solely by a communication chip.
The Memory 502 may include a random access Memory (Random Access Memory, RAM) or a Read-Only Memory (Read-Only Memory). Memory 502 may be used to store instructions, programs, code sets, or instruction sets. The memory 502 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described below, etc. The storage data area may also store data created by the electronic device 500 in use (such as the various correspondences described above), and so forth.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided herein, the mutual coupling or direct coupling or communication connection of the modules shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses or modules, and may be in electrical, mechanical or other forms.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
Referring to fig. 8, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable medium 600 has stored therein program code which can be invoked by a processor to perform the methods described in the method embodiments described above.
The computer readable storage medium 600 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 600 comprises a non-transitory computer readable medium (non-transitory computer-readable storage medium). The computer readable storage medium 600 has storage space for program code 610 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 610 may be compressed, for example, in a suitable form.
In some embodiments, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the electronic device to perform the steps of the method embodiments described above.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. An image segmentation method of an application interface, applied to an electronic device, the method comprising:
acquiring an operation interface screenshot of a target application, wherein the target application is an application associated with a target task;
extracting at least part of entity objects contained in the operation interface screenshot by using a pre-trained image segmentation model to obtain an image segmentation result, so as to execute an operation corresponding to the target task according to the image segmentation result, wherein the image segmentation model is obtained in advance by performing image segmentation training with operation interface screenshot samples of a plurality of applications.
2. The method of claim 1, wherein the target task is associated with a plurality of preset applications, and the obtaining a running interface screenshot of the target application comprises:
if the application running in the foreground is detected to be any preset application in the plurality of preset applications, determining that the application running in the foreground is the target application;
and capturing an operation interface of the target application at intervals of a target duration to obtain the operation interface screenshot.
3. The method of claim 2, wherein the target task is a user behavior analysis task;
after extracting at least part of entity objects contained in the operation interface screenshot by using the pre-trained image segmentation model to obtain an image segmentation result, the method further comprises:
obtaining a plurality of historical image segmentation results, wherein the plurality of historical image segmentation results are extracted according to operation interfaces of the target application captured multiple times within a target historical duration;
and performing behavior analysis processing on the user of the electronic equipment according to the plurality of historical image segmentation results to obtain behavior analysis results of the user.
4. The method of claim 2, wherein the target task is an instant messaging push task;
after extracting at least part of entity objects contained in the operation interface screenshot by using the pre-trained image segmentation model to obtain an image segmentation result, the method further comprises:
and outputting prompt information matched with the image segmentation result.
5. The method of claim 1, wherein the target application is a foreground running application, and the obtaining a running interface screenshot of the target application comprises:
and intercepting an operation interface of the application operated in the foreground as the operation interface screenshot in response to the screen capturing operation input for the target task.
6. The method according to any one of claims 1-5, wherein the image segmentation model includes an image feature extraction module and a mask prediction module, and the extracting at least part of the physical objects included in the running interface screenshot by using the pre-trained image segmentation model to obtain an image segmentation result includes:
inputting the operation interface screenshot into the image feature extraction module to perform feature extraction to obtain target image features;
inputting the target image characteristics to the mask prediction module for mask prediction to obtain a target position mask;
and extracting an image area corresponding to the target position mask from the operation interface screenshot to be used as the image segmentation result.
7. The method according to claim 6, wherein the image feature extraction module is obtained by compressing a target image encoder, and the mask prediction module is obtained by compressing a target mask predictor;
the training process of the target image encoder and the target mask predictor comprises the following steps:
acquiring a target sample set, wherein the target sample set comprises a plurality of operation interface screenshot samples, each operation interface screenshot sample carries a prompt text and a preset position mask, and the plurality of operation interface screenshot samples are obtained by screenshot of a plurality of applications in the operation process of a foreground application;
extracting image characteristics of each operation interface screenshot sample through a pre-trained image encoder to obtain sample image characteristics corresponding to each operation interface screenshot sample;
extracting text characteristics of prompt texts carried by each operation interface screenshot sample through a pre-trained text encoder to obtain sample text characteristics corresponding to each operation interface screenshot sample;
inputting sample image features and sample text features corresponding to each operation interface screenshot sample to an initial mask predictor for mask prediction to obtain a prediction position mask corresponding to each operation interface screenshot sample;
and carrying out iterative updating on the pre-trained image encoder, the pre-trained text encoder and the initial mask predictor according to the difference degree between the predicted position mask corresponding to each operation interface screenshot sample and the preset position mask carried by each operation interface screenshot sample until a target training condition is met, so as to obtain an updated pre-trained image encoder as the target image encoder and an updated initial mask predictor as the target mask predictor.
8. The method of claim 6, wherein the target task includes target prompt text information, and wherein before the inputting the target image feature to the mask prediction module for mask prediction, the method further comprises:
acquiring preset prompt text features matched with the target prompt text information from a plurality of preset prompt text features, and taking the preset prompt text features as target text prompt features;
the step of inputting the target image features to the mask prediction module for mask prediction to obtain a target position mask includes:
and inputting the target image features and the target text prompt features into the mask prediction module to perform mask prediction to obtain a position mask corresponding to the target prompt text information, wherein the position mask is used as the target position mask.
9. The method of claim 8, wherein the image segmentation model further comprises a text feature extraction module, the method further comprising:
if the preset prompt text features matched with the target prompt text information do not exist in the preset prompt text features, the target prompt text information is input into the text feature extraction module to perform feature extraction, and the target text prompt features are obtained.
10. The method of claim 8, wherein, before the obtaining, from the plurality of preset alert text features, a preset alert text feature that matches the target alert text information as a target text alert feature, the method further comprises:
receiving a plurality of preset prompt text features sent by a server, wherein the preset prompt text features are obtained by the server by extracting text features from a plurality of preset prompt texts using a pre-trained text encoder, and each preset prompt text corresponds to at least one entity object;
and storing the plurality of preset prompt text features.
11. An image segmentation apparatus for an application interface, the apparatus being applied to an electronic device, the apparatus comprising:
the screenshot acquisition module is used for acquiring an operation interface screenshot of a target application, wherein the target application is an application associated with a target task;
the image segmentation module is used for extracting at least part of entity objects contained in the operation interface screenshot by utilizing a pre-trained image segmentation model to obtain an image segmentation result, so as to execute the operation corresponding to the target task according to the image segmentation result, wherein the image segmentation model is obtained in advance by performing image segmentation training with operation interface screenshot samples of a plurality of applications.
12. An electronic device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-10.
13. A computer readable storage medium having stored therein program code which is callable by a processor to perform the method according to any one of claims 1 to 10.
CN202311500101.0A 2023-11-10 2023-11-10 Image segmentation method and device for application interface, electronic equipment and storage medium Pending CN117523200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311500101.0A CN117523200A (en) 2023-11-10 2023-11-10 Image segmentation method and device for application interface, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117523200A true CN117523200A (en) 2024-02-06

Family

ID=89754526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311500101.0A Pending CN117523200A (en) 2023-11-10 2023-11-10 Image segmentation method and device for application interface, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117523200A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination