CN114220014A - Method, device, equipment and medium for determining saliency target detection model


Info

Publication number
CN114220014A
CN114220014A
Authority
CN
China
Prior art keywords
network
preset
module
search
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111564891.XA
Other languages
Chinese (zh)
Inventor
焦少慧
刘志昂
刘姜江
程明明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202111564891.XA
Publication of CN114220014A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present disclosure disclose a method, an apparatus, a device and a medium for determining a saliency target detection model. The method includes the following steps: constructing an image detection network model with a U-shaped search structure, the image detection network model including an encoding network, a decoding network, a pooling module connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network; and, taking the salient object in the detected image as the search target, performing search training on the image detection network model based on sample data to determine the saliency target detection model. With the technical solutions of the embodiments of the present disclosure, the saliency target detection model can be searched out and determined automatically without manual participation, and the accuracy and efficiency of saliency target detection can be effectively ensured.

Description

Method, device, equipment and medium for determining saliency target detection model
Technical Field
The disclosed embodiments relate to computer technologies, and in particular, to a method, an apparatus, a device, and a medium for determining a saliency target detection model.
Background
With the rapid development of computer technology, neural network models can be used to detect salient objects in images. Existing neural network models for salient object detection are usually designed manually based on expert experience. Such manual design is time-consuming and labor-intensive, and the resulting network models contain considerable computational redundancy, so the accuracy and efficiency of salient object detection cannot be effectively guaranteed.
Disclosure of Invention
The embodiment of the disclosure provides a method, a device, equipment and a medium for determining a saliency target detection model, so that the saliency target detection model can be automatically searched and determined without manual participation, and the accuracy and the efficiency of saliency target detection can be effectively ensured.
In a first aspect, an embodiment of the present disclosure provides a method for determining a saliency target detection model, including:
constructing an image detection network model with a U-shaped search structure, wherein the image detection network model comprises: an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network;
and taking the salient object in the detected image as a search target, carrying out search training on the image detection network model based on sample data, and determining the salient object detection model.
In a second aspect, an embodiment of the present disclosure further provides a device for determining a saliency target detection model, including:
the image detection network model construction module is used for constructing an image detection network model with a U-shaped search structure, and the image detection network model comprises: an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network;
and the salient target detection model determining module is used for performing search training on the image detection network model based on sample data by taking a salient target in a detected image as a search target to determine a salient target detection model.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of determining a saliency target detection model as provided by any embodiment of the present disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for determining a saliency target detection model as provided in any of the embodiments of the present disclosure.
The embodiments of the present disclosure construct an image detection network model with a U-shaped search structure, where the image detection network model includes: an encoding network, a decoding network, a pooling module connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network. By taking the salient object in the detected image as the direct search target and performing search training on the image detection network model based on sample data, the saliency target detection model can be searched out and determined automatically without manual participation, redundant computation can be removed during search training, and the detection accuracy and detection efficiency of the searched saliency target detection model are effectively guaranteed.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
Fig. 1 is a flowchart of a method for determining a saliency target detection model according to an embodiment of the present disclosure;
FIG. 2 is a network structure example of an image detection network model according to an embodiment of the present disclosure;
fig. 3(a) is a network structure example of another image detection network model according to an embodiment of the present disclosure;
fig. 3(b) is a network structure example of another image detection network model according to an embodiment of the present disclosure;
fig. 4 is a flowchart of a method for determining a saliency target detection model according to a second embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a determination apparatus of a saliency target detection model provided by a third embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Example one
Fig. 1 is a flowchart of a method for determining a salient object detection model according to an embodiment of the present disclosure, which is applicable to a case where a network model for detecting a salient object in an image is determined. The method may be performed by a salient object detection model determining apparatus, which may be implemented by software and/or hardware, integrated in an electronic device. As shown in fig. 1, the method specifically includes the following steps:
s110, constructing an image detection network model with a U-shaped search structure, wherein the image detection network model comprises: the device comprises an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network.
The image detection network model may be a U-shaped search space constructed for the saliency target detection task. The image detection network model focuses on the position information of the salient objects in the image rather than their category information. The encoding network may refer to a searchable network structure for feature extraction in the image detection network model, which can down-sample image features to obtain low-resolution feature maps. The decoding network may refer to a searchable network structure opposite to the encoding network, i.e., it can up-sample low-resolution feature maps to obtain high-resolution feature maps. The pooling module may be a searchable network structure that pools the feature maps output by the encoding network. Illustratively, the pooling module may be an LPPM (Lightweight Pyramid Pooling Module), which further enlarges the overall receptive field of the network and effectively collects global information, thereby improving the utilization efficiency of global information. The first intermediate network may refer to a searchable network structure that aggregates feature information from the encoding network into the decoding network in order to restore feature details. The second intermediate network may refer to a searchable network structure that introduces the feature information output by the pooling module into the decoding network, so as to ensure the guidance of global information during feature up-sampling and prevent the global information from fading.
Specifically, fig. 2 shows an example of the network structure of the image detection network model. As shown in fig. 2, the constructed image detection network model with the U-shaped search structure may include: a searchable encoding network (i.e., the Bottom-Up path), a searchable decoding network (i.e., the Top-Down path), and a pooling module connecting these two searchable paths, which together form the U-shaped search space; between the two searchable paths there are, in addition, a first intermediate network for restoring details and a second intermediate network (i.e., the Global Guidance path) for guiding the global information output by the pooling module into the Top-Down path.
Illustratively, fig. 3(a) shows an example of yet another image detection network model. As shown in fig. 3(a), the encoding network may include: a stem convolution module and a preset number of down-sampling modules; the decoding network may include: the device comprises a target detection module and a preset number of up-sampling modules; the first intermediate network may include: a preset number of first short connection modules; the second intermediate network may include: a preset number of second short connection modules; the down-sampling module, the up-sampling module, the first short connection module and the second short connection module are all in one-to-one correspondence.
The stem convolution module may be a standard 3×3 convolution layer with 3 input channels. The preset number n may be set according to service requirements and the actual scenario; for example, n may be set to 4 to trade off detection accuracy against detection efficiency. The encoding network may then comprise four down-sampling modules arranged in sequence, with down-sampling rates of 2, 4, 8 and 16 respectively. To keep the spatial resolution of the deep feature maps high, this embodiment does not use a down-sampling module with a large down-sampling rate (e.g., 32). The convolution stride s of the first network layer of each down-sampling module may be set to a value greater than or equal to 2 to achieve spatial down-sampling. The up-sampling modules correspond one-to-one to the down-sampling modules, with the same sampling rates but in the opposite index order. The first up-sampling module takes the output of the pooling module as input, and the result is passed sequentially through the following up-sampling modules until the target detection module converts the feature map output by the last up-sampling module into a single-channel feature map, from which the final output is obtained through a sigmoid activation function. During the sampling process of each up-sampling module, the feature maps output by the corresponding down-sampling module are gradually aggregated through the corresponding first short connection module so as to restore details. Meanwhile, the global information output by the pooling module is guided to the corresponding up-sampling module through each second short connection module, ensuring the guidance of global information during feature up-sampling and preventing the global information from fading.
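The data flow described above can be summarized in code. The following is a minimal PyTorch-style sketch only (with the preset number n equal to 4); the submodules passed to the constructor are hypothetical stand-ins for the searchable modules of this embodiment, and the bilinear resizing used to align feature sizes is an assumption of this sketch rather than something fixed by the disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class USearchSkeleton(nn.Module):
    """Sketch of the U-shaped search structure's forward data flow.

    stem, down, pool, short1, short2, up and head are assumed placeholders
    for the stem convolution module, down-sampling modules, pooling module,
    first/second short connection modules, up-sampling modules and target
    detection module described in this embodiment.
    """
    def __init__(self, stem, down, pool, short1, short2, up, head):
        super().__init__()
        self.stem = stem                     # standard 3x3 conv, 3 input channels
        self.down = nn.ModuleList(down)      # Bottom-Up path (rates 2, 4, 8, 16)
        self.pool = pool                     # LPPM pooling module
        self.short1 = nn.ModuleList(short1)  # detail-recovery short connections
        self.short2 = nn.ModuleList(short2)  # global-guidance short connections
        self.up = nn.ModuleList(up)          # Top-Down path
        self.head = head                     # single-channel target detection conv

    def forward(self, x):
        f = self.stem(x)
        enc = [f]                            # enc[0]: stem output; enc[k]: k-th down module output
        for d in self.down:
            f = d(f)
            enc.append(f)
        g = self.pool(enc[-1])               # global information from the deepest features
        n = len(self.up)
        f = g                                # the first up module takes the pooled output
        for i in range(n):
            f = self.up[i](f)
            skip = self.short1[n - 1 - i](enc[n - 1 - i])   # restore details
            guide = self.short2[n - 1 - i](g)               # guide global information
            # assumed: resize so the three terms share one spatial size
            skip = F.interpolate(skip, size=f.shape[-2:], mode='bilinear', align_corners=False)
            guide = F.interpolate(guide, size=f.shape[-2:], mode='bilinear', align_corners=False)
            f = f + skip + guide
        return torch.sigmoid(self.head(f))   # final single-channel saliency map
```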
And S120, taking the saliency target in the detected image as a search target, carrying out search training on the image detection network model based on sample data, and determining the saliency target detection model.
Specifically, in this embodiment, Salient Object Detection (SOD) on the detected image may be taken as the direct search target: search training is performed on the image detection network model based on a preset search mode and sample data, and the network architecture and network weights are automatically searched and determined, so as to obtain a saliency target detection model with optimal performance. The preset search mode may be any preset neural network architecture search mode. For example, the preset search mode may be, but is not limited to, a DARTS (Differentiable Architecture Search) mode, i.e., a search mode that relaxes the search space to be continuous so that the optimal network structure can be solved by continuous optimization such as gradient descent. The architecture search process in this embodiment may be regarded as path-level pruning: an over-parameterized network containing all candidate paths, namely the image detection network model, is trained directly. In addition to the network weight parameters, network architecture parameters need to be introduced in order to identify redundant paths during training. Since the network architecture parameters do not directly participate in the computation graph, the real-valued weights can be updated by binarizing all the alternative paths. The network weight parameters and the network architecture parameters may be updated alternately on the training set and the validation set, respectively. When training ends, a compact optimized network architecture is obtained by pruning the redundant paths, thereby determining a saliency target detection model with optimal performance and effectively ensuring the detection accuracy and detection efficiency of the searched model.
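As an illustration of the path-level search just described, the sketch below models one mixed operation and the alternating update of weight and architecture parameters. This is a sketch under stated assumptions, not the patent's implementation: Gumbel-softmax sampling stands in for the binarized path update, and task_loss, flops_loss, weight_opt and arch_opt are hypothetical names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOperation(nn.Module):
    """One searchable mixed operation over j candidate paths."""
    def __init__(self, candidates):
        super().__init__()
        self.paths = nn.ModuleList(candidates)
        # real-valued architecture parameters, one per alternative path
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        if self.training:
            # hard one-hot gate: a single path is selected per step while the
            # soft probabilities still receive gradients (a stand-in for the
            # binarization of alternative paths described above)
            gate = F.gumbel_softmax(self.alpha, hard=True)
            return sum(g * p(x) for g, p in zip(gate, self.paths))
        # after search: prune redundant paths, keep the most probable one
        return self.paths[int(self.alpha.argmax())](x)

def search_step(model, weight_opt, arch_opt, train_batch, val_batch,
                task_loss, flops_loss):
    # alternating updates: network weights on the training set,
    # architecture parameters on the validation set
    x_tr, y_tr = train_batch
    weight_opt.zero_grad()
    task_loss(model(x_tr), y_tr).backward()
    weight_opt.step()

    x_val, y_val = val_batch
    arch_opt.zero_grad()
    (task_loss(model(x_val), y_val) + flops_loss(model)).backward()
    arch_opt.step()
```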
Exemplarily, after step S120, the method may further include: acquiring an image to be detected; and inputting the image to be detected into the saliency target detection model, and obtaining the saliency target in the image to be detected according to the output of the saliency target detection model.
The saliency target may refer to an object with the most prominent visual information in the image to be detected. Specifically, after the saliency target detection model is determined, saliency target detection can be directly performed on an image to be detected by using the saliency target detection model, and a saliency target area in the image to be detected is highlighted, so that a more accurate saliency target in the image to be detected can be obtained more quickly, the saliency target detection accuracy and detection efficiency are improved, and further, in large-scale video image processing calculation, calculation and analysis resources can be saved. The salient object detection mode in the embodiment can be applied to application scenes such as main object segmentation, background blurring and intelligent barrage generation.
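A minimal usage sketch of the detection step follows; the checkpoint file name, the 320×320 input size and the 0.5 threshold are illustrative assumptions, not values fixed by this disclosure:

```python
import torch
from PIL import Image
from torchvision import transforms

model = torch.load("searched_sod_model.pt", map_location="cpu")  # hypothetical checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((320, 320)),
    transforms.ToTensor(),
])

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    saliency = model(preprocess(image).unsqueeze(0))   # (1, 1, H, W), values in [0, 1]
mask = (saliency > 0.5).float()   # highlight the salient target region
```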
According to the technical solution of this embodiment of the present disclosure, an image detection network model with a U-shaped search structure is constructed, where the image detection network model includes: an encoding network, a decoding network, a pooling module connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network. By taking the salient object in the detected image as the direct search target and performing search training on the image detection network model based on sample data, the saliency target detection model can be searched out and determined automatically without manual participation, redundant computation can be removed during search training, and the detection accuracy and detection efficiency of the searched saliency target detection model are effectively guaranteed.
On the basis of the above technical solution, each down-sampling module includes a first operation number of down-sampling operations, and the alternative search space corresponding to a down-sampling operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a first preset number of preset convolution kernel sizes with a second preset number of preset expansion ratios. Each up-sampling module includes a second operation number of up-sampling operations, and the alternative search space corresponding to an up-sampling operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a third preset number of preset convolution kernel sizes with a fourth preset number of preset expansion ratios. Each first short connection module includes a third operation number of first short connection operations, and the alternative search space corresponding to a first short connection operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a fifth preset number of preset convolution kernel sizes with a sixth preset number of preset expansion ratios. Each second short connection module includes a fourth operation number of second short connection operations, and the alternative search space corresponding to a second short connection operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a seventh preset number of preset convolution kernel sizes with an eighth preset number of preset expansion ratios.
Specifically, fig. 3(b) shows an example of the image detection network model with the preset number n equal to 4. As shown in fig. 3(b), the first, third and fourth down-sampling modules in the encoding network may each include 4 repeated down-sampling operations, while the second down-sampling module may include 6 repeated down-sampling operations. Each down-sampling operation may be a Mixed Operation consisting of multiple alternative paths. The first, third and fourth up-sampling modules in the decoding network may each comprise 2 repeated up-sampling operations, while the second up-sampling module may include 3 repeated up-sampling operations. Each up-sampling operation may likewise be a Mixed Operation consisting of multiple alternative paths. Each first short connection module in the first intermediate network may include 1 first short connection operation, i.e., 1 Mixed Operation composed of multiple alternative paths, and each second short connection module in the second intermediate network may include 1 second short connection operation, i.e., 1 Mixed Operation composed of multiple alternative paths.
It should be noted that each mixing operation in the image detection network model is searchable and learnable and can be defined separately. To improve search efficiency, this embodiment may search only the down-sampling modules, the up-sampling modules, the first short connection modules and the second short connection modules, while the stem convolution module, the pooling module and the target detection module are not searched. For example, fig. 3(b) gives a network structure for the LPPM pooling module, i.e., the LPPM pooling module may consist of three pooling layers and one identity activation function layer.

The search space of the image detection network model in this embodiment may be as shown in table 1. All learnable mixing operations in the down-sampling modules and the up-sampling modules may share the same alternative search space: MBConv inverted-bottleneck convolution layers with any combination of kernel size k in the set {3, 5, 7} and expansion ratio e in the set {3, 6}. An MBConv inverted-bottleneck convolution layer may consist of a 1×1 expansion (dimension-raising) convolution, a depthwise convolution, a SENet operation that reweights the feature channels, and a 1×1 dimension-reducing convolution. The expansion ratio e may refer to the ratio of the number of output channels to the number of input channels of the first 1×1 expansion convolution in MBConv. The number of blocks n may refer to the number of mixing operations included in each module.

To reduce unnecessary complexity, this embodiment may narrow the alternative expansion ratios e of the learnable blocks in the first short connection modules and the second short connection modules down to 1 and 3. For the second short connection modules, an expansion ratio e equal to 0 indicates a zero operation, i.e., the corresponding learnable block is allowed to be skipped via a residual connection. It is noted that this embodiment does not allow skipping learnable blocks that contain a sampling operation (i.e., s > 1). To maintain the basic U-shaped search structure, skipping the learnable blocks in the first short connection modules is also not allowed. In this way, the image detection network model can adjust its depth and width by keeping or skipping blocks and by using larger or smaller MBConv layers.
TABLE 1 alternative search space information for various operations in an image detection network model
(Table 1 is reproduced as images in the original publication. Per the text above, the down-sampling and up-sampling operations search over MBConv layers with kernel size k in {3, 5, 7} and expansion ratio e in {3, 6}; the first and second short connection operations use e in {1, 3}, with e = 0 additionally allowed for second short connections.)
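As an illustration of the MBConv alternative paths listed in table 1, the sketch below builds one inverted-bottleneck candidate (1×1 expansion convolution, k×k depthwise convolution, SENet channel reweighting, 1×1 dimension-reducing convolution). The BatchNorm/ReLU6 placement and the squeeze ratio are assumptions of this sketch, not fixed by the disclosure:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """One MBConv inverted-bottleneck candidate path (sketch)."""
    def __init__(self, in_ch, out_ch, k=3, e=3, stride=1, se_ratio=0.25):
        super().__init__()
        mid = in_ch * e                                 # expansion ratio e
        self.expand = nn.Sequential(                    # 1x1 expansion conv
            nn.Conv2d(in_ch, mid, 1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        self.depthwise = nn.Sequential(                 # k x k depthwise conv
            nn.Conv2d(mid, mid, k, stride=stride, padding=k // 2,
                      groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True))
        squeezed = max(1, int(mid * se_ratio))
        self.se = nn.Sequential(                        # SENet channel reweighting
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, squeezed, 1), nn.ReLU(inplace=True),
            nn.Conv2d(squeezed, mid, 1), nn.Sigmoid())
        self.project = nn.Sequential(                   # 1x1 dimension-reducing conv
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        y = self.depthwise(self.expand(x))
        y = y * self.se(y)
        y = self.project(y)
        return x + y if self.use_residual else y
```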
On the basis of the above technical solutions, as shown in fig. 3(a) and 3(b), each down-sampling module may be configured to: down-sample the feature information output by the previous down-sampling module and determine the feature information output by the current down-sampling module. The first down-sampling module is configured to: down-sample the feature information output by the stem convolution module and determine the feature information output by the first down-sampling module. Accordingly, the pooling module is configured to: perform a pooling operation on the feature information output by the last down-sampling module and determine the feature information output by the pooling module.

Each first short connection module may be configured to: perform short-connection processing on the feature information output by the previous down-sampling module and determine the feature information output by the current first short connection module. The first short connection module is configured to: perform short-connection processing on the feature information output by the stem convolution module and determine the feature information output by the first short connection module. Each second short connection module may be configured to: perform short-connection processing on the feature information output by the pooling module and determine the feature information output by the current second short connection module.

Each up-sampling module may be configured to: up-sample the feature information output by the previous up-sampling module, add the up-sampled feature information, the feature information output by the first short connection module corresponding to the current up-sampling module and the feature information output by the second short connection module corresponding to the current up-sampling module, and determine the feature information output by the current up-sampling module. The first up-sampling module is configured to: up-sample the feature information output by the pooling module, and add the up-sampled feature information, the feature information output by the last first short connection module and the feature information output by the first second short connection module to determine the feature information output by the first up-sampling module. The target detection module is configured to: convert the feature information output by the last up-sampling module into a single-channel feature map and obtain the final output through a sigmoid activation function, thereby completing salient target detection.
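Written as a formula (the symbols here are introduced purely for illustration and do not appear in the original disclosure), the decoding data flow above can be summarized as:

F_i = Up_i(F_{i-1}) + S1_{n+1-i}(D_{n-i}) + S2_{n+1-i}(G), with F_0 := G and i = 1, ..., n,

where D_k denotes the output of the k-th down-sampling module (D_0 being the stem convolution output), S1 and S2 denote the first and second short connection modules, G denotes the pooling module output, and the final prediction is the sigmoid of the target detection module applied to F_n.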
Example two
Fig. 4 is a flowchart of a method for determining a salient object detection model according to the second embodiment of the present disclosure, and in this embodiment, further optimization is performed on the step "taking a salient object in a detected image as a search target, performing search training on an image detection network model based on sample data, and determining a salient object detection model" on the basis of the above embodiment. Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted.
Referring to fig. 4, the method for determining a saliency target detection model provided by this embodiment specifically includes the following steps:
s410, constructing an image detection network model with a U-shaped search structure.
And S420, searching and training the network architecture parameters in the image detection network model based on a preset search loss function and the first sample data, and obtaining a target network model after the search training is finished, wherein the preset search loss function is constructed based on the learned complexity importance of each mixed operation in the image detection network model.
Wherein the first sample data may include a first sample image for training the network architecture and a standard saliency target image corresponding to the first sample image. Specifically, the network architecture parameters in the image detection network model are searched and trained by using the first sample data based on the preset search loss function which is constructed in advance based on the complexity importance, and the searched target network model is obtained, so that the calculated amount in the searching process can be effectively reduced while the high-quality target extraction is realized.
Exemplarily, S420 may include: inputting the first sample data into an image detection network model, and searching and training network architecture parameters of the image detection network model based on a gradient descent mode; and stopping training when the preset search loss function reaches the minimum value, and obtaining the target network model after training.
It should be noted that, compared with taking the overall complexity as a single unified optimization target, this embodiment introduces complexity importance into the overall complexity calculation of the architecture search stage and recalibrates the importance of the complexity of each mixing operation, so that a network architecture with better performance can be searched out.
And S430, training the network weight parameters in the target network model based on a preset training loss function and second sample data to obtain a significance target detection model after training is finished.
The second sample data may include a second sample image used for training the network weights and a standard saliency target image corresponding to the second sample image. Illustratively, the preset training loss function may be, but is not limited to, a BCE (Binary Cross-Entropy) function, which improves the robustness of salient object detection by accumulating the per-pixel BCE loss over the image.
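A one-line sketch of this loss in PyTorch (taking reduction='sum' is one reasonable reading of "accumulating BCE loss per pixel"; averaging would also fit, and pred_saliency / gt_saliency are hypothetical names):

```python
import torch.nn as nn

# prediction and ground truth are single-channel maps in [0, 1], shape (N, 1, H, W)
bce = nn.BCELoss(reduction='sum')        # accumulate the per-pixel BCE terms
loss = bce(pred_saliency, gt_saliency)
```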
Exemplarily, S430 may include: inputting the second sample data into the target network model, training the network weight parameters of the target network model based on gradient descent, and stopping training when the preset training loss function reaches its minimum value, so as to obtain the trained saliency target detection model.
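A sketch of this weight-training phase is given below; target_net, train_loader, max_epochs and the SGD hyper-parameters are illustrative assumptions, and "reaching the minimum" is approximated here by stopping once the epoch loss stops decreasing:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss(reduction='sum')
opt = torch.optim.SGD(target_net.parameters(), lr=0.01, momentum=0.9)

best = float("inf")
for epoch in range(max_epochs):
    epoch_loss = 0.0
    for x, y in train_loader:
        opt.zero_grad()
        loss = bce(target_net(x), y)   # preset BCE training loss
        loss.backward()                # gradient-descent update of the weights
        opt.step()
        epoch_loss += loss.item()
    if epoch_loss >= best:             # loss no longer decreasing: stop training
        break
    best = epoch_loss
```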
According to the above technical solution, the preset search loss function is constructed in advance based on the learned complexity importance of each mixing operation in the image detection network model, and the network architecture parameters are search-trained based on this loss function and the first sample data to obtain the target network model after search training. By introducing complexity importance into the overall complexity calculation of the architecture search stage and recalibrating the importance of the complexity of each mixing operation, a network architecture with better performance can be searched out.
On the basis of the above technical solution, the process of constructing the preset search loss function in step S420 may include the following steps S421 to S424:
s421, in the searching and training process, obtaining a path weight and a path complexity corresponding to each alternative path in the alternative path set of each mixing operation which can be learned in the image detection network model, wherein the path weight is used for representing the probability of selecting the alternative path by the mixing operation.
Specifically, the over-parameterized network O of the image detection network model may be expressed as:

O = encoding network ∪ decoding network ∪ pooling module ∪ first intermediate network ∪ second intermediate network

Each mixing operation O_i ∈ O in the image detection network model is learnable. The i-th mixing operation O_i has a set of alternative paths P_i = {p_i^1, p_i^2, ..., p_i^j}, and each alternative path p_i^k corresponds to a path weight w_i^k that characterizes the probability of selecting that alternative path, where j denotes the number of alternative paths of the i-th mixing operation O_i. In this embodiment, the path complexity corresponding to each alternative path may be characterized by its number of floating-point operations FLOPs(p_i^k).
S422, determining the target path complexity corresponding to each mixing operation based on each path weight and each path complexity corresponding to each mixing operation.
Specifically, for each mixing operation, the expected value of each path complexity corresponding to each candidate path corresponding to the mixing operation may be used as the target path complexity corresponding to the mixing operation.
Exemplarily, S422 may include: multiplying each path weight by the corresponding path complexity for the current mixing operation, adding the multiplication results, and taking the obtained sum as the target path complexity corresponding to the current mixing operation. For example, the target path complexity E_FLOPs[O_i] corresponding to the mixing operation O_i may be determined based on the following equation:

E_FLOPs[O_i] = Σ_{k=1}^{j} w_i^k · FLOPs(p_i^k)
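In code, the expected complexity of one mixing operation can be sketched as follows; taking the path weights as a softmax over architecture parameters is an assumption of this sketch, as the disclosure only requires that the weights characterize selection probabilities:

```python
import torch
import torch.nn.functional as F

def expected_flops(alpha: torch.Tensor, path_flops: torch.Tensor) -> torch.Tensor:
    """E_FLOPs[O_i] = sum_k w_i^k * FLOPs(p_i^k)."""
    w = F.softmax(alpha, dim=0)        # path weights w_i^k
    return (w * path_flops).sum()      # expected (target) path complexity

# example: two equally weighted paths with 10 / 20 MFLOPs -> expectation 15
e = expected_flops(torch.zeros(2), torch.tensor([10.0, 20.0]))  # tensor(15.)
```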
and S423, determining a complexity weight corresponding to each mixing operation, wherein the complexity weight is used for representing the importance of the target path complexity of the mixing operation.
Specifically, in the prior art each mixing operation has the same complexity importance, that is, all complexity weights are equal. In this embodiment, by correcting the complexity importance of the paths to which each mixing operation belongs, the complexity allowed for important mixing operations can be increased and that of unimportant mixing operations reduced, so that a network architecture with better performance can be searched out. The penalty weight in the loss function may be used as the complexity weight; for example, a smaller complexity weight for a mixing operation indicates a smaller penalty weight, i.e., a higher importance of the target path complexity of that mixing operation.
Exemplarily, S423 may include: detecting whether the current mixing operation is the mixing operation in a preset important network in the image detection network model; if so, determining the complexity weight corresponding to the current mixing operation as a first preset numerical value; if not, determining that the complexity weight corresponding to the current mixing operation is a second preset numerical value; wherein the first preset value is smaller than the second preset value.
The preset important network may refer to a network module of high complexity importance in the image detection network model. For example, the preset important network may include, but is not limited to, the encoding network and the pooling module; that is, the mixing operations in the encoding network and the pooling module have greater complexity importance than those in the other network modules, so that the search tends towards network architectures in which the encoding network and the pooling module carry greater complexity. The sum of the first preset value and the second preset value may be set to 1, in which case the first preset value may be set to a value less than 0.5 and the second preset value to a value greater than 0.5. For example, the complexity weight β_i corresponding to the mixing operation O_i may be determined based on the following equation:

β_i = first preset value, if O_i belongs to the preset important network; β_i = second preset value, otherwise (with first preset value < second preset value).
s424, constructing a preset search loss function based on the target path complexity and the complexity weight corresponding to each mixing operation.
Specifically, the pre-set search loss function corresponding to the over-parameterized network O may be characterized by the FLOPs of all the mixing operations.
Exemplarily, S424 may include: multiplying the target path complexity corresponding to each mixing operation by its complexity weight, and adding the multiplication results over all mixing operations to construct the preset search loss function. For example, the constructed preset search loss function E_FLOPs[O] can be expressed as follows:

E_FLOPs[O] = Σ_i β_i · E_FLOPs[O_i]
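Combining the two previous steps, the complexity term of the preset search loss can be sketched as below. The 0.3/0.7 defaults and the op.alpha / op.path_flops attributes are assumptions carried over from the earlier sketches (the disclosure only requires that the first preset value be smaller than the second), and expected_flops is the function defined in the sketch above:

```python
def search_flops_loss(mixed_ops, important_ops, beta_small=0.3, beta_large=0.7):
    """E_FLOPs[O] = sum_i beta_i * E_FLOPs[O_i], with the smaller penalty
    weight assigned to mixing operations in the preset important network."""
    total = 0.0
    for op in mixed_ops:
        beta = beta_small if op in important_ops else beta_large
        total = total + beta * expected_flops(op.alpha, op.path_flops)
    return total
```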
according to the embodiment, a network architecture with better performance can be searched out by utilizing a simple complexity importance correction mode, so that the detection accuracy and the detection efficiency of the searched significant target detection model can be further improved.
The following is an embodiment of the determination apparatus for a significant object detection model provided in the embodiments of the present disclosure, which belongs to the same inventive concept as the determination method for a significant object detection model in the embodiments described above, and reference may be made to the determination method for a significant object detection model in the embodiments of the determination apparatus for a significant object detection model, which is not described in detail in the embodiments described above.
EXAMPLE III
Fig. 5 is a schematic structural diagram of a determination apparatus for a salient object detection model according to a third embodiment of the present disclosure, which is applicable to a case of determining a network model for detecting a salient object in an image. As shown in fig. 5, the apparatus specifically includes: an image detection network model construction module 510 and a salient object detection model determination module 520.
The image detection network model building module 510 is configured to build an image detection network model with a U-shaped search structure, where the image detection network model includes: the system comprises an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network positioned between the encoding network and the decoding network, and a second intermediate network positioned between the pooling module and the decoding network; and the salient object detection model determining module 520 is configured to perform search training on the image detection network model based on sample data by using a salient object in the detected image as a search object, and determine a salient object detection model.
According to the technical solution of this embodiment of the present disclosure, an image detection network model with a U-shaped search structure is constructed, where the image detection network model includes: an encoding network, a decoding network, a pooling module connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network. By taking the salient object in the detected image as the direct search target and performing search training on the image detection network model based on sample data, the saliency target detection model can be searched out and determined automatically without manual participation, redundant computation can be removed during search training, and the detection accuracy and detection efficiency of the searched saliency target detection model are effectively guaranteed.
On the basis of the above technical solution, the significant object detection model determining module 520 includes:
the network architecture searching unit is used for carrying out searching training on network architecture parameters in the image detection network model based on a preset searching loss function and first sample data to obtain a target network model after the searching training is finished, wherein the preset searching loss function is constructed based on the complexity importance of each learned mixed operation in the image detection network model;
and the network weight training unit is used for training the network weight parameters in the target network model based on a preset training loss function and the second sample data to obtain a significance target detection model after training is finished.
On the basis of the above technical solutions, the network architecture search unit is specifically configured to: inputting the first sample data into an image detection network model, and searching and training network architecture parameters of the image detection network model based on a gradient descent mode; and stopping training when the preset search loss function reaches the minimum value, and obtaining the target network model after training.
On the basis of the above technical solutions, the apparatus further includes: the preset search loss function building module comprises:
the candidate path information acquiring unit is used for acquiring a path weight and a path complexity corresponding to each candidate path in a candidate path set of each mixing operation which can be learned in the image detection network model in the search training process, wherein the path weight is used for representing the probability of selecting the candidate path by the mixing operation;
a target path complexity determining unit, configured to determine a target path complexity corresponding to each mixing operation based on each path weight and each path complexity corresponding to each mixing operation;
the complexity weight determining unit is used for determining a complexity weight corresponding to each mixing operation, and the complexity weight is used for representing the importance of the target path complexity of the mixing operation;
and the preset search loss function construction unit is used for constructing a preset search loss function based on the target path complexity and the complexity weight corresponding to each mixing operation.
On the basis of the above technical solutions, the target path complexity determining unit is specifically configured to: multiplying each path weight and the path complexity corresponding to the current mixing operation, adding the multiplication results corresponding to the current mixing operation, and taking the obtained addition result as the target path complexity corresponding to the current mixing operation.
On the basis of the above technical solutions, the complexity weight determining unit is specifically configured to: detecting whether the current mixing operation is the mixing operation in a preset important network in the image detection network model; if so, determining the complexity weight corresponding to the current mixing operation as a first preset numerical value; if not, determining that the complexity weight corresponding to the current mixing operation is a second preset numerical value; wherein the first preset value is smaller than the second preset value.
On the basis of the above technical solutions, a search loss function construction unit is preset, and specifically configured to: and multiplying the complexity of the target path corresponding to each mixing operation by the complexity weight, and adding the multiplication results corresponding to the mixing operations to construct a preset search loss function.
On the basis of the above technical solutions, the preset training loss function is a BCE (Binary Cross-Entropy) function.
On the basis of the above technical solutions, the encoding network includes: a stem convolution module and a preset number of down-sampling modules; the decoding network comprises: the device comprises a target detection module and a preset number of up-sampling modules; the first intermediate network comprises: a preset number of first short connection modules; the second intermediate network comprises: a preset number of second short connection modules; the down-sampling module, the up-sampling module, the first short connection module and the second short connection module are all in one-to-one correspondence.
On the basis of the above technical solutions, each down-sampling module includes a first operation number of down-sampling operations, and the alternative search space corresponding to a down-sampling operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a first preset number of preset convolution kernel sizes with a second preset number of preset expansion ratios;
each up-sampling module includes a second operation number of up-sampling operations, and the alternative search space corresponding to an up-sampling operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a third preset number of preset convolution kernel sizes with a fourth preset number of preset expansion ratios;
each first short connection module includes a third operation number of first short connection operations, and the alternative search space corresponding to a first short connection operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a fifth preset number of preset convolution kernel sizes with a sixth preset number of preset expansion ratios;
each second short connection module includes a fourth operation number of second short connection operations, and the alternative search space corresponding to a second short connection operation is the set of MBConv inverted-bottleneck convolution layers obtained by combining a seventh preset number of preset convolution kernel sizes with an eighth preset number of preset expansion ratios.
On the basis of the above technical solutions, each down-sampling module is configured to: down-sample the feature information output by the previous down-sampling module and determine the feature information output by the current down-sampling module;
each first short connection module is configured to: perform short-connection processing on the feature information output by the previous down-sampling module and determine the feature information output by the current first short connection module;
each second short connection module is configured to: perform short-connection processing on the feature information output by the pooling module and determine the feature information output by the current second short connection module;
each up-sampling module is configured to: up-sample the feature information output by the previous up-sampling module, add the up-sampled feature information, the feature information output by the first short connection module corresponding to the current up-sampling module and the feature information output by the second short connection module corresponding to the current up-sampling module, and determine the feature information output by the current up-sampling module.
On the basis of the technical schemes, the pooling module is a lightweight pyramid pooling module.
On the basis of the above technical solutions, the apparatus further includes:
a salient object detection module to: after determining a saliency target detection model, acquiring an image to be detected; and inputting the image to be detected into the saliency target detection model, and obtaining the saliency target in the image to be detected according to the output of the saliency target detection model.
The determination device for the saliency target detection model provided by the embodiment of the disclosure can execute the determination method for the saliency target detection model provided by any embodiment of the disclosure, and has functional modules and beneficial effects corresponding to the determination method for the saliency target detection model.
It should be noted that, in the embodiment of the apparatus for determining a saliency object detection model, the included units and modules are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example four
Referring now to FIG. 6, a block diagram of an electronic device 900 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 900 may include a processing apparatus (e.g., a central processing unit, a graphics processor, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 902 or a program loaded from a storage apparatus 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored. The processing apparatus 901, the ROM 902 and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication device 909 may allow the electronic apparatus 900 to perform wireless or wired communication with other apparatuses to exchange data. While fig. 6 illustrates an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing apparatus 901.
The electronic device provided by the embodiment of the present disclosure and the determination method of the saliency target detection model provided by the embodiment belong to the same inventive concept, and technical details that are not described in detail in the embodiment of the present disclosure may refer to the embodiment, and the embodiment of the present disclosure have the same beneficial effects.
EXAMPLE five
The disclosed embodiments provide a computer storage medium having a computer program stored thereon, which when executed by a processor, implements the method for determining a saliency target detection model provided by the above-described embodiments.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the server; or may exist separately and not be assembled into the server.
The computer readable medium carries one or more programs which, when executed by the server, cause the server to: constructing an image detection network model with a U-shaped search structure, wherein the image detection network model comprises: the system comprises an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network positioned between the encoding network and the decoding network, and a second intermediate network positioned between the pooling module and the decoding network; and taking the salient object in the detected image as a search target, carrying out search training on the image detection network model based on sample data, and determining the salient object detection model.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or any combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself; for example, an editable-content display unit may also be described as an "editing unit".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, [ example one ] there is provided a determination method of a saliency target detection model, comprising:
constructing an image detection network model with a U-shaped search structure, wherein the image detection network model comprises: an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network;
and taking the salient object in the detected image as a search target, carrying out search training on the image detection network model based on sample data, and determining the salient object detection model.
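For illustration only, a minimal sketch of such a U-shaped search structure is given below in PyTorch-style Python. The five sub-networks are passed in as placeholder modules; their internal operators, and the assumption that the encoding network returns one feature map per level, are illustrative and not prescribed by the disclosure.

```python
import torch.nn as nn

class ImageDetectionNetworkModel(nn.Module):
    """Sketch of the U-shaped search structure (all sub-modules are placeholders)."""

    def __init__(self, encoder, decoder, pooling, first_intermediate, second_intermediate):
        super().__init__()
        self.encoder = encoder                          # encoding network
        self.decoder = decoder                          # decoding network
        self.pooling = pooling                          # connects encoding and decoding networks
        self.first_intermediate = first_intermediate    # between encoding and decoding networks
        self.second_intermediate = second_intermediate  # between pooling module and decoding network

    def forward(self, x):
        enc_feats = self.encoder(x)                   # assumed: one feature map per level
        pooled = self.pooling(enc_feats[-1])          # global context from the deepest level
        skips = self.first_intermediate(enc_feats)    # short connections from the encoder
        contexts = self.second_intermediate(pooled)   # short connections from the pooling module
        return self.decoder(pooled, skips, contexts)  # saliency prediction
```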
According to one or more embodiments of the present disclosure, [ example two ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, the determining a salient object detection model by taking a salient object in the detected image as a search target and performing search training on the image detection network model based on sample data includes:
searching and training the network architecture parameters in the image detection network model based on a preset search loss function and first sample data to obtain a target network model after the search training is finished, wherein the preset search loss function is constructed based on the complexity of each learnable mixing operation in the image detection network model and the importance of that complexity;
training the network weight parameters in the target network model based on a preset training loss function and second sample data to obtain a significance target detection model after training is finished.
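A sketch of this two-stage procedure follows. The accessors `arch_parameters()` and `derive()` are hypothetical names for "the learnable architecture parameters" and "keeping the best path of each mixing operation"; adding the complexity-based search loss to the task loss during the search stage is likewise an assumption of this sketch, not a statement of the disclosure's exact formulation.

```python
import torch

def determine_saliency_model(supernet, search_loader, train_loader,
                             search_loss_fn, train_loss_fn, epochs=50):
    """Stage 1: search architecture parameters; stage 2: train network weights."""
    arch_opt = torch.optim.Adam(supernet.arch_parameters(), lr=3e-4)  # hypothetical accessor
    for images, masks in search_loader:                 # first sample data
        loss = train_loss_fn(supernet(images), masks) + search_loss_fn(supernet)
        arch_opt.zero_grad(); loss.backward(); arch_opt.step()

    target_net = supernet.derive()                      # hypothetical: keep the best paths
    weight_opt = torch.optim.Adam(target_net.parameters(), lr=1e-4)
    for _ in range(epochs):
        for images, masks in train_loader:              # second sample data
            loss = train_loss_fn(target_net(images), masks)
            weight_opt.zero_grad(); loss.backward(); weight_opt.step()
    return target_net                                   # the saliency target detection model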
According to one or more embodiments of the present disclosure, [ example three ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, the performing search training on the network architecture parameter in the image detection network model based on the preset search loss function and the first sample data to obtain the target network model after the search training is finished includes:
inputting first sample data into the image detection network model, and searching and training network architecture parameters of the image detection network model based on a gradient descent mode;
and stopping training when the preset search loss function reaches the minimum value, and obtaining the target network model after training.
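One way to realize "stopping when the preset search loss function reaches the minimum value" is plateau-based early stopping, sketched below under the same assumptions as above (in particular the hypothetical `arch_parameters()` accessor; the task-loss term is omitted for brevity):

```python
import torch

def search_until_minimum(supernet, loader, search_loss_fn, lr=3e-4, patience=3):
    """Gradient-descent search that stops once the search loss stops decreasing."""
    opt = torch.optim.Adam(supernet.arch_parameters(), lr=lr)  # hypothetical accessor
    best, stale = float("inf"), 0
    while stale < patience:              # treat `patience` flat epochs as the minimum
        epoch_loss = 0.0
        for images, _ in loader:         # first sample data
            supernet(images)             # forward pass (task-loss term omitted here)
            loss = search_loss_fn(supernet)   # complexity-based preset search loss
            opt.zero_grad(); loss.backward(); opt.step()
            epoch_loss += float(loss.detach())
        best, stale = (epoch_loss, 0) if epoch_loss < best else (best, stale + 1)
    return supernet
```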
According to one or more embodiments of the present disclosure, [ example four ] there is provided a method of determining a salient object detection model, further comprising:
optionally, the process of constructing the preset search loss function includes:
in the searching and training process, acquiring a path weight and a path complexity corresponding to each alternative path in the alternative path set of each learnable mixing operation in the image detection network model, wherein the path weight is used for representing the probability that the mixing operation selects the alternative path;
determining a target path complexity corresponding to each mixing operation based on each path weight and each path complexity corresponding to each mixing operation;
determining a complexity weight corresponding to each mixing operation, wherein the complexity weight is used for representing the importance of the target path complexity of the mixing operation;
and constructing the preset search loss function based on the target path complexity and the complexity weight corresponding to each mixing operation.
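In formula form, with notation assumed here for readability: writing \(p_{k,j}\) for the path weight (selection probability) of the \(j\)-th alternative path of the \(k\)-th learnable mixing operation, \(c_{k,j}\) for that path's complexity, \(C_k\) for the target path complexity, and \(w_k\) for the complexity weight, the construction above reads:

```latex
C_k = \sum_{j} p_{k,j}\, c_{k,j}, \qquad
\mathcal{L}_{\mathrm{search}} = \sum_{k} w_k\, C_k .
```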
According to one or more embodiments of the present disclosure, [ example five ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, the determining a target path complexity corresponding to each mixing operation based on each path weight and each path complexity corresponding to each mixing operation includes:
multiplying each path weight corresponding to the current mixing operation by the corresponding path complexity, adding the multiplication results corresponding to the current mixing operation, and taking the obtained sum as the target path complexity corresponding to the current mixing operation.
According to one or more embodiments of the present disclosure, [ example six ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, the determining the complexity weight corresponding to each mixing operation includes:
detecting whether the current mixing operation is the mixing operation in a preset important network in the image detection network model;
if so, determining the complexity weight corresponding to the current mixing operation as a first preset numerical value;
if not, determining that the complexity weight corresponding to the current mixing operation is a second preset numerical value;
wherein the first preset value is smaller than the second preset value.
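A minimal sketch of this selection is given below; the concrete values 0.1 and 1.0 are assumptions, since the disclosure only requires the first preset value to be smaller than the second:

```python
def complexity_weight(op_id, important_ops, first_preset=0.1, second_preset=1.0):
    """Return the first preset value for mixing operations inside the preset
    important network, and the (larger) second preset value otherwise."""
    return first_preset if op_id in important_ops else second_preset
```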
According to one or more embodiments of the present disclosure, [ example seven ] there is provided a method of determining a salient object detection model, further comprising:
optionally, the constructing the preset search loss function based on the target path complexity and the complexity weight corresponding to each mixing operation includes:
multiplying the target path complexity corresponding to each mixing operation by the complexity weight, and adding the multiplication results corresponding to each mixing operation to construct the preset search loss function.
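Combining this with the target path complexity of [example five] and the complexity weights of [example six], a sketch of the assembled preset search loss function could read as follows (reusing `complexity_weight` from the sketch above; the `mixed_ops` mapping is an assumed data layout):

```python
def preset_search_loss(mixed_ops, important_ops):
    """Sum over all mixing operations of complexity weight x target path complexity.
    `mixed_ops` is assumed to map op id -> (path_weights, path_complexities)."""
    loss = 0.0
    for op_id, (weights, complexities) in mixed_ops.items():
        target = sum(p * c for p, c in zip(weights, complexities))   # [example five]
        loss += complexity_weight(op_id, important_ops) * target     # [example seven]
    return loss
```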
According to one or more embodiments of the present disclosure, [ example eight ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, the preset training loss function is a BCE two-class (binary) cross-entropy function.
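In PyTorch terms (an illustration, not mandated by the disclosure) this corresponds to the standard binary cross-entropy loss; the tensor shapes below are illustrative:

```python
import torch

bce = torch.nn.BCEWithLogitsLoss()                    # two-class (binary) cross-entropy on logits
pred_logits = torch.randn(2, 1, 224, 224)             # stand-in model output
ground_truth = torch.randint(0, 2, (2, 1, 224, 224)).float()  # binary saliency mask
loss = bce(pred_logits, ground_truth)
```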
According to one or more embodiments of the present disclosure, [ example nine ] there is provided a method of determining a salient object detection model, further comprising:
optionally, the encoding network includes: a stem convolution module and a preset number of down-sampling modules;
the decoding network comprises: a target detection module and the preset number of up-sampling modules;
the first intermediate network comprises: the preset number of first short connection modules;
the second intermediate network comprises: the preset number of second short connection modules;
the down-sampling module, the up-sampling module, the first short connection module and the second short connection module are all in one-to-one correspondence.
According to one or more embodiments of the present disclosure, [ example ten ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, each of the down-sampling modules includes a first number of down-sampling operations, and the alternative search space corresponding to the down-sampling operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a first preset number of preset convolution kernel sizes with a second preset number of preset expansion ratios;
each of the up-sampling modules includes a second number of up-sampling operations, and the alternative search space corresponding to the up-sampling operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a third preset number of preset convolution kernel sizes with a fourth preset number of preset expansion ratios;
each of the first short connection modules includes a third number of first short connection operations, and the alternative search space corresponding to the first short connection operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a fifth preset number of preset convolution kernel sizes with a sixth preset number of preset expansion ratios;
each of the second short connection modules includes a fourth number of second short connection operations, and the alternative search space corresponding to the second short connection operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a seventh preset number of preset convolution kernel sizes with an eighth preset number of preset expansion ratios.
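The resulting alternative search spaces are simply Cartesian products; a sketch for one operation follows, where the kernel sizes (3, 5, 7) and expansion ratios (3, 6) are assumed example values, not the preset numbers of the disclosure:

```python
from itertools import product

kernel_sizes = (3, 5, 7)      # assumed preset convolution kernel sizes
expansion_ratios = (3, 6)     # assumed preset expansion ratios
# Every MBConv inverted-bottleneck layer obtainable from these presets:
candidate_space = [("mbconv", k, e) for k, e in product(kernel_sizes, expansion_ratios)]
# -> 6 candidates; ("mbconv", 3, 6) is an MBConv with a 3x3 kernel and expansion ratio 6
```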
According to one or more embodiments of the present disclosure, [ example eleven ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, each of the down-sampling modules is configured to: perform down-sampling on the feature information output by the preceding module, and determine the feature information output by the current down-sampling module;
each of the first short connection modules is configured to: perform short connection processing on the feature information output by the corresponding down-sampling module, and determine the feature information output by the current first short connection module;
each of the second short connection modules is configured to: perform short connection processing on the feature information output by the pooling module, and determine the feature information output by the current second short connection module;
each of the up-sampling modules is configured to: perform up-sampling on the feature information output by the previous up-sampling module, add the up-sampled feature information to the feature information output by the first short connection module corresponding to the current up-sampling module and to the feature information output by the second short connection module corresponding to the current up-sampling module, and determine the feature information output by the current up-sampling module.
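Read together, these configurations describe the data flow sketched below; the one-to-one module correspondence follows [example nine], while the matching of feature-map shapes across the additions is assumed:

```python
def u_shape_forward(x, stem, downs, pool, firsts, seconds, ups, head):
    """Data-flow sketch of the searched U-shape (module internals omitted)."""
    feat = stem(x)
    down_feats = []
    for down in downs:                       # encoding path: chained down-sampling
        feat = down(feat)
        down_feats.append(feat)
    pooled = pool(down_feats[-1])            # pooling module bridges the two paths
    y = pooled
    for up, first, second, skip in zip(ups, reversed(firsts), reversed(seconds),
                                       reversed(down_feats)):
        y = up(y) + first(skip) + second(pooled)   # decoding path with short connections
    return head(y)                           # target detection module
```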
According to one or more embodiments of the present disclosure, [ example twelve ] there is provided a method of determining a salient object detection model, further comprising:
optionally, the pooling module is a lightweight pyramid pooling module.
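The disclosure does not spell out the "lightweight" variant, so the sketch below assumes a standard pyramid pooling layout (the pooling bins and 1x1 projections are illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightPyramidPooling(nn.Module):
    """Assumed pyramid pooling layout: pool at several scales, project, upsample, concat."""

    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels // len(bins), kernel_size=1) for _ in bins)

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramids = [
            F.interpolate(conv(F.adaptive_avg_pool2d(x, b)), size=(h, w),
                          mode="bilinear", align_corners=False)
            for b, conv in zip(self.bins, self.convs)]
        return torch.cat([x, *pyramids], dim=1)   # original features plus multi-scale context
```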
According to one or more embodiments of the present disclosure, [ example thirteen ] there is provided a determination method of a saliency target detection model, further comprising:
optionally, after the determining the salient object detection model, the method further includes:
acquiring an image to be detected;
and inputting the image to be detected into the saliency target detection model, and obtaining the saliency target in the image to be detected according to the output of the saliency target detection model.
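A usage sketch of this inference step follows; the stand-in convolution merely keeps the snippet self-contained and is not the determined model, and the 0.5 threshold is an assumption:

```python
import torch

saliency_model = torch.nn.Conv2d(3, 1, 3, padding=1)   # stand-in for the determined model
saliency_model.eval()
with torch.no_grad():
    image = torch.rand(1, 3, 224, 224)                  # preprocessed image to be detected
    saliency_map = torch.sigmoid(saliency_model(image))
    salient_mask = (saliency_map > 0.5).float()         # thresholded salient target region
```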
According to one or more embodiments of the present disclosure, [ example fourteen ] there is provided a determination apparatus of a saliency target detection model, comprising:
the image detection network model construction module is used for constructing an image detection network model with a U-shaped search structure, and the image detection network model comprises: an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network;
and the salient target detection model determining module is used for performing search training on the image detection network model based on sample data by taking a salient target in a detected image as a search target to determine a salient target detection model.
The foregoing description is merely an explanation of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure involved herein is not limited to technical solutions formed by the particular combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (16)

1. A method for determining a salient object detection model, comprising:
constructing an image detection network model with a U-shaped search structure, wherein the image detection network model comprises: an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network;
and taking the salient object in the detected image as a search target, carrying out search training on the image detection network model based on sample data, and determining the salient object detection model.
2. The method according to claim 1, wherein the step of performing search training on the image detection network model based on sample data by taking a salient object in the detected image as a search target to determine a salient object detection model comprises:
searching and training the network architecture parameters in the image detection network model based on a preset search loss function and first sample data to obtain a target network model after the search training is finished, wherein the preset search loss function is constructed based on the complexity of each learnable mixing operation in the image detection network model and the importance of that complexity;
training the network weight parameters in the target network model based on a preset training loss function and second sample data to obtain a significance target detection model after training is finished.
3. The method according to claim 2, wherein the performing search training on the network architecture parameters in the image detection network model based on a preset search loss function and the first sample data to obtain a target network model after the search training is finished comprises:
inputting first sample data into the image detection network model, and searching and training network architecture parameters of the image detection network model based on a gradient descent mode;
and stopping training when the preset search loss function reaches the minimum value, and obtaining the target network model after training.
4. The method according to claim 2, wherein the construction process of the preset search loss function comprises:
in the searching and training process, acquiring a path weight and a path complexity corresponding to each alternative path in the alternative path set of each learnable mixing operation in the image detection network model, wherein the path weight is used for representing the probability that the mixing operation selects the alternative path;
determining a target path complexity corresponding to each mixing operation based on each path weight and each path complexity corresponding to each mixing operation;
determining a complexity weight corresponding to each mixing operation, wherein the complexity weight is used for representing the importance of the target path complexity of the mixing operation;
and constructing the preset search loss function based on the target path complexity and the complexity weight corresponding to each mixing operation.
5. The method of claim 4, wherein determining the target path complexity for each of the blending operations based on the respective path weight and the respective path complexity for each of the blending operations comprises:
multiplying each path weight corresponding to the current mixing operation by the corresponding path complexity, adding the multiplication results corresponding to the current mixing operation, and taking the obtained sum as the target path complexity corresponding to the current mixing operation.
6. The method of claim 4, wherein determining the complexity weight for each blending operation comprises:
detecting whether the current mixing operation is the mixing operation in a preset important network in the image detection network model;
if so, determining the complexity weight corresponding to the current mixing operation as a first preset numerical value;
if not, determining that the complexity weight corresponding to the current mixing operation is a second preset numerical value;
wherein the first preset value is smaller than the second preset value.
7. The method according to claim 4, wherein the constructing the preset search loss function based on the target path complexity and the complexity weight corresponding to each of the blending operations comprises:
multiplying the target path complexity corresponding to each mixing operation by the complexity weight, and adding the multiplication results corresponding to each mixing operation to construct the preset search loss function.
8. The method of claim 2, wherein the preset training loss function is a BCE two-class (binary) cross-entropy function.
9. The method of claim 1,
the encoding network includes: a stem convolution module and a preset number of down-sampling modules;
the decoding network comprises: a target detection module and the preset number of up-sampling modules;
the first intermediate network comprises: the preset number of first short connection modules;
the second intermediate network comprises: the preset number of second short connection modules;
the down-sampling module, the up-sampling module, the first short connection module and the second short connection module are all in one-to-one correspondence.
10. The method of claim 9,
each of the down-sampling modules includes a first number of down-sampling operations, and the alternative search space corresponding to the down-sampling operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a first preset number of preset convolution kernel sizes with a second preset number of preset expansion ratios;
each of the up-sampling modules includes a second number of up-sampling operations, and the alternative search space corresponding to the up-sampling operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a third preset number of preset convolution kernel sizes with a fourth preset number of preset expansion ratios;
each of the first short connection modules includes a third number of first short connection operations, and the alternative search space corresponding to the first short connection operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a fifth preset number of preset convolution kernel sizes with a sixth preset number of preset expansion ratios;
each of the second short connection modules includes a fourth number of second short connection operations, and the alternative search space corresponding to the second short connection operation is: the MBConv inverted-bottleneck convolution layers obtained by combining a seventh preset number of preset convolution kernel sizes with an eighth preset number of preset expansion ratios.
11. The method of claim 9,
each of the down-sampling modules is configured to: perform down-sampling on the feature information output by the preceding module, and determine the feature information output by the current down-sampling module;
each of the first short connection modules is configured to: perform short connection processing on the feature information output by the corresponding down-sampling module, and determine the feature information output by the current first short connection module;
each of the second short connection modules is configured to: perform short connection processing on the feature information output by the pooling module, and determine the feature information output by the current second short connection module;
each of the up-sampling modules is configured to: perform up-sampling on the feature information output by the previous up-sampling module, add the up-sampled feature information to the feature information output by the first short connection module corresponding to the current up-sampling module and to the feature information output by the second short connection module corresponding to the current up-sampling module, and determine the feature information output by the current up-sampling module.
12. The method of claim 1, wherein the pooling module is a lightweight pyramid pooling module.
13. The method according to any one of claims 1-11, further comprising, after said determining a salient object detection model:
acquiring an image to be detected;
and inputting the image to be detected into the saliency target detection model, and obtaining the saliency target in the image to be detected according to the output of the saliency target detection model.
14. An apparatus for determining a salient object detection model, comprising:
the image detection network model construction module is used for constructing an image detection network model with a U-shaped search structure, and the image detection network model comprises: an encoding network, a decoding network, a pooling module for connecting the encoding network and the decoding network, a first intermediate network between the encoding network and the decoding network, and a second intermediate network between the pooling module and the decoding network;
and the salient target detection model determining module is used for performing search training on the image detection network model based on sample data by taking a salient target in a detected image as a search target to determine a salient target detection model.
15. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for determining a salient object detection model according to any one of claims 1-13.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of determining a salient object detection model as claimed in any one of claims 1 to 13.
CN202111564891.XA 2021-12-20 2021-12-20 Method, device, equipment and medium for determining saliency target detection model Pending CN114220014A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111564891.XA CN114220014A (en) 2021-12-20 2021-12-20 Method, device, equipment and medium for determining saliency target detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111564891.XA CN114220014A (en) 2021-12-20 2021-12-20 Method, device, equipment and medium for determining saliency target detection model

Publications (1)

Publication Number Publication Date
CN114220014A true CN114220014A (en) 2022-03-22

Family

ID=80704543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111564891.XA Pending CN114220014A (en) 2021-12-20 2021-12-20 Method, device, equipment and medium for determining saliency target detection model

Country Status (1)

Country Link
CN (1) CN114220014A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294506A (en) * 2022-10-09 2022-11-04 深圳比特微电子科技有限公司 Video highlight detection method and device
CN115294506B (en) * 2022-10-09 2022-12-09 深圳比特微电子科技有限公司 Video highlight detection method and device

Similar Documents

Publication Publication Date Title
CN109902186B (en) Method and apparatus for generating neural network
CN112184738B (en) Image segmentation method, device, equipment and storage medium
CN108629414B (en) Deep hash learning method and device
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN114418834A (en) Character generation method and device, electronic equipment and storage medium
WO2022151876A1 (en) Testing control method and apparatus for application program, and electronic device and storage medium
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN114519667A (en) Image super-resolution reconstruction method and system
CN114332590B (en) Joint perception model training method, joint perception method, device, equipment and medium
CN114220014A (en) Method, device, equipment and medium for determining saliency target detection model
CN114721656A (en) Interface structure extraction method, device, medium and electronic equipment
CN114463769A (en) Form recognition method and device, readable medium and electronic equipment
CN116561240A (en) Electronic map processing method, related device and medium
CN114818746A (en) Text generation method and device, computer equipment and storage medium
CN115760607A (en) Image restoration method, device, readable medium and electronic equipment
CN114155545A (en) Form identification method and device, readable medium and electronic equipment
CN114187557A (en) Method, device, readable medium and electronic equipment for determining key frame
CN111626044B (en) Text generation method, text generation device, electronic equipment and computer readable storage medium
CN112712070A (en) Question judging method and device for bead calculation questions, electronic equipment and storage medium
CN113255812A (en) Video frame detection method and device and electronic equipment
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment
CN113780358A (en) Real-time hardware fitting detection method based on anchor-free network
CN111860518B (en) Method, apparatus, device and computer readable medium for segmenting an image
CN111582482B (en) Method, apparatus, device and medium for generating network model information
CN115761248B (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination