Disclosure of Invention
Embodiments of the present application provide an image processing method and apparatus.
In a first aspect, an embodiment of the present application provides an image processing method, including: acquiring an image containing a target, and performing scale transformation on the image to obtain a processed image of at least one scale; inputting the acquired image and the processed image into a convolutional neural network to obtain a feature map and a plurality of candidate frames indicating positions of targets, where each target corresponds to at least two candidate frames; determining, among the candidate frames in each image, candidate frames whose sizes are within a preset size range, where the size ranges of the candidate frames corresponding to images of different scales are different; and determining a region in the feature map corresponding to at least one of the candidate frames within the size range, acquiring features corresponding to the region, and inputting the features into a fully connected layer of the convolutional neural network.
In some embodiments, before determining the region in the feature map corresponding to at least one of the candidate frames within the size range, the method further includes: performing non-maximum suppression on the candidate frames within the preset size range to obtain the at least one candidate frame.
In some embodiments, performing scale transformation on the image includes: performing up-sampling and/or down-sampling on the image, where the size range of the candidate frame corresponding to an image obtained by down-sampling is greater than or equal to a first preset threshold, the size range of the candidate frame corresponding to an image obtained by up-sampling is less than or equal to a second preset threshold, and the first preset threshold is greater than the second preset threshold.
In some embodiments, the size range of the candidate frame corresponding to the acquired image is between a third preset threshold and a fourth preset threshold, where the third preset threshold is greater than the fourth preset threshold, the third preset threshold is greater than or equal to the first preset threshold, and the fourth preset threshold is less than or equal to the second preset threshold.
In some embodiments, in response to at least two images in the processed image having scales larger than that of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the at least two images is smaller than a first specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is smaller than a second specified threshold, and the first specified threshold is larger than the second specified threshold.
In some embodiments, in response to two or more images in the processed image having scales smaller than that of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the two or more images is greater than a third specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is greater than a fourth specified threshold, and the third specified threshold is greater than the fourth specified threshold.
In a second aspect, an embodiment of the present application provides an image processing apparatus, including: an acquisition unit configured to acquire an image containing a target and perform scale transformation on the image to obtain a processed image of at least one scale; an input unit configured to input the acquired image and the processed image into a convolutional neural network to obtain a feature map and a plurality of candidate frames indicating positions of targets, where each target corresponds to at least two candidate frames; a determination unit configured to determine, among the candidate frames in each image, candidate frames whose sizes are within a preset size range, where the size ranges of the candidate frames corresponding to images of different scales are different; and an area determination unit configured to determine a region in the feature map corresponding to at least one of the candidate frames within the size range, acquire features corresponding to the region, and input the features into a fully connected layer of the convolutional neural network.
In some embodiments, the apparatus further includes: a selection unit configured to perform non-maximum suppression on the candidate frames within the preset size range to obtain the at least one candidate frame.
In some embodiments, the acquisition unit is further configured to: perform up-sampling and/or down-sampling on the image, where the size range of the candidate frame corresponding to an image obtained by down-sampling is greater than or equal to a first preset threshold, the size range of the candidate frame corresponding to an image obtained by up-sampling is less than or equal to a second preset threshold, and the first preset threshold is greater than the second preset threshold.
In some embodiments, the size range of the candidate frame corresponding to the acquired image is between a third preset threshold and a fourth preset threshold, where the third preset threshold is greater than the fourth preset threshold, the third preset threshold is greater than or equal to the first preset threshold, and the fourth preset threshold is less than or equal to the second preset threshold.
In some embodiments, in response to at least two images in the processed image having scales larger than that of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the at least two images is smaller than a first specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is smaller than a second specified threshold, and the first specified threshold is larger than the second specified threshold.
In some embodiments, in response to two or more images in the processed image having scales smaller than that of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the two or more images is greater than a third specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is greater than a fourth specified threshold, and the third specified threshold is greater than the fourth specified threshold.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method according to any embodiment of the image processing method.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method according to any embodiment of the image processing method.
According to the image processing scheme provided by the embodiments of the present application, an image containing a target is first acquired and subjected to scale transformation to obtain a processed image of at least one scale. The acquired image and the processed image are then input into a convolutional neural network to obtain a feature map and a plurality of candidate frames indicating the positions of targets, where each target corresponds to at least two candidate frames. Next, candidate frames whose sizes are within a preset size range are determined among the candidate frames in each image, where the size ranges of the candidate frames corresponding to images of different scales are different. Finally, a region corresponding to at least one of the candidate frames within the size range is determined in the feature map, features corresponding to the region are acquired, and the features are input into a fully connected layer of the convolutional neural network. The method provided by the embodiments of the present application can determine candidate frames in different size ranges from images of different scales, so as to acquire richer features for targets of different sizes.
Detailed Description
The present application will be described in further detail with reference to the following drawings and embodiments. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of it. It should also be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings.
It should be noted that the embodiments in the present application and the features of the embodiments may be combined with each other in the absence of conflict. The present application will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the image processing method or image processing apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as an image processing application, a video application, a live application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.
Here, the terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, such as a background server providing support for the terminal devices 101, 102, 103. The background server may analyze and otherwise process received data such as images, and feed back a processing result (e.g., extracted features) to the terminal device.
It should be noted that the image processing method provided in the embodiment of the present application may be executed by the server 105 or the terminal devices 101, 102, and 103, and accordingly, the image processing apparatus may be disposed in the server 105 or the terminal devices 101, 102, and 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an image processing method according to the present application is shown. The image processing method comprises the following steps:
step 201, acquiring an image containing a target, and performing scale transformation on the image to obtain a processed image of at least one scale.
In this embodiment, an execution body of the image processing method (for example, the server or a terminal device shown in fig. 1) may acquire an image containing a target, and perform scale transformation on the acquired image to obtain a processed image of at least one scale. The target is an object of significance presented in the image, such as a tree or a house. The image may include the same object or various objects of different sizes and styles.
Here, the scale refers to the pixel dimensions of an image. For example, the scale of the acquired image may be 224 × 224, and the scale of the image obtained after the scale transformation may be 256 × 256. Specifically, the scale transformation may employ at least one of up-sampling and down-sampling.
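As a concrete illustration, the following is a minimal sketch of such a scale transformation, assuming PyTorch and an image already loaded as a (1, 3, H, W) float tensor; the function name and scale factors are illustrative choices, not taken from the application.

```python
# A minimal sketch of step 201's scale transformation: bilinear
# up-/down-sampling builds processed images at additional scales.
import torch
import torch.nn.functional as F

def build_pyramid(image, scale_factors=(0.5, 2.0)):
    """Return the acquired image plus one resampled image per factor;
    factors below 1 down-sample, factors above 1 up-sample."""
    pyramid = [image]
    for factor in scale_factors:
        pyramid.append(F.interpolate(image, scale_factor=factor,
                                     mode="bilinear", align_corners=False))
    return pyramid

image = torch.rand(1, 3, 224, 224)    # stand-in for the acquired image
for scaled in build_pyramid(image):
    print(tuple(scaled.shape))        # 224x224, then 112x112 and 448x448
```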
Step 202, inputting the acquired image and the processed image into a convolutional neural network to obtain a feature map and a plurality of candidate frames indicating positions of targets, wherein each target corresponds to at least two candidate frames.
In this embodiment, the execution body may input the acquired image into a convolutional neural network, and may also input the image obtained by the scale transformation into the convolutional neural network, so as to obtain a plurality of candidate frames (proposals) indicating the positions of the targets and a feature map. In practice, the execution body may determine the candidate frames in various ways. For example, where the convolutional neural network includes a Region Proposal Network (RPN), the candidate frames may be determined by the RPN; alternatively, Selective Search may be employed. The feature maps are obtained by the convolutional layers of the convolutional neural network, and different images yield different feature maps after convolution. A candidate frame here may be expressed as a position and a size. The position may be represented by the coordinates of a point of the candidate frame, such as its midpoint or its top-left vertex. The size may be expressed as an area, a perimeter, or a width and height, etc.
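The sketch below illustrates what these outputs might look like: a two-layer toy backbone stands in for the convolutional layers, and hard-coded boxes stand in for RPN or Selective Search proposals. Both are assumptions for illustration, not the application's actual network.

```python
# Toy stand-ins for the feature map and candidate frames of step 202;
# each frame is encoded as (x1, y1, x2, y2) in input-image coordinates.
import torch
import torch.nn as nn

backbone = nn.Sequential(                        # stride-4 feature extractor
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
)
image = torch.rand(1, 3, 224, 224)
feature_map = backbone(image)                    # shape (1, 32, 56, 56)

# Two candidate frames for the same target; position and size follow
# directly from the corner coordinates.
proposals = torch.tensor([[10., 10., 50., 60.],
                          [12.,  8., 48., 62.]])
widths = proposals[:, 2] - proposals[:, 0]
heights = proposals[:, 3] - proposals[:, 1]
areas = widths * heights                         # one way to express "size"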
Step 203, determining candidate frames with the size within a preset size range from the candidate frames in each image, wherein the size ranges of the candidate frames corresponding to the images with different scales are different.
In this embodiment, the execution body may determine, among the candidate frames of the respective images, the candidate frames whose sizes are within the preset size range. Because the size ranges of the candidate frames corresponding to images of different scales differ, the candidate frames determined for images of different scales differ in size. The candidate frames corresponding to an image are the candidate frames obtained by inputting that image into the convolutional neural network.
For example, the execution body may acquire an original image with a scale of 224 × 224 and down-sample it to obtain a small image with a scale of 112 × 112. Size ranges may be set in advance for the candidate frames corresponding to the original image and to the small image, respectively: for instance, smaller than 8 × 8 and larger than 8 × 8, or smaller than 9 × 9 and larger than 8 × 8, and so on.
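A minimal sketch of this per-scale filtering follows, reusing the example just given (frames smaller than 8 × 8 kept for the 224 × 224 original, frames larger than 8 × 8 kept for the 112 × 112 small image); the helper function and the sample frames are illustrative assumptions.

```python
# A minimal sketch of the size filter in step 203: each image keeps only
# the candidate frames whose area falls inside its own preset range.
import torch

def filter_by_area(frames, min_area=None, max_area=None):
    """Keep candidate frames whose pixel area falls inside the range."""
    areas = (frames[:, 2] - frames[:, 0]) * (frames[:, 3] - frames[:, 1])
    keep = torch.ones(len(frames), dtype=torch.bool)
    if min_area is not None:
        keep &= areas > min_area
    if max_area is not None:
        keep &= areas < max_area
    return frames[keep]

frames = torch.tensor([[0., 0., 6., 6.],          # 6 x 6: a small target
                       [0., 0., 30., 30.]])       # 30 x 30: a large target
kept_original = filter_by_area(frames, max_area=8 * 8)  # keeps the 6 x 6 frame
kept_small = filter_by_area(frames, min_area=8 * 8)     # keeps the 30 x 30 frame
```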
In some optional implementations of this embodiment, in response to at least two images in the processed image having scales larger than the scale of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the at least two images is smaller than a first specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is smaller than a second specified threshold, and the first specified threshold is larger than the second specified threshold.
In response to two or more images in the processed image having scales smaller than the scale of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the two or more images is greater than a third specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is greater than a fourth specified threshold, and the third specified threshold is greater than the fourth specified threshold.
In these alternative implementations, an image with a larger scale may correspond to a smaller size range of candidate frames, an image with a smaller scale may correspond to a larger size range, and the two size ranges may partially overlap.
For example, suppose the original image has a scale of 128 × 128, and up-sampling yields an A image with a scale of 224 × 224 and a B image with a scale of 256 × 256. The size range of the candidate frame corresponding to the A image may be smaller than 6 × 6 (where the two 6s are the numbers of pixels in width and height, respectively), and the size range of the candidate frame corresponding to the B image may be smaller than 5 × 5.
In these implementations, features of a target are easier to acquire from an image with a larger scale, which can reflect more details of the target, while targets in images with smaller scales better reflect their overall characteristics. Thus, smaller targets may be emphasized in larger-scale images and larger targets in smaller-scale images, so that features of targets of different sizes are captured more accurately.
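Restating the example above as data, a lookup from image scale to the maximum candidate-frame area for that scale might look as follows; the dictionary form is purely illustrative.

```python
# Per-scale upper bounds from the 128x128 example above: the larger the
# scale, the smaller the bound on candidate-frame area.
max_frame_area = {
    (224, 224): 6 * 6,   # A image: first specified threshold
    (256, 256): 5 * 5,   # B image: second specified threshold (smaller)
}
```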
Step 204, determining a region in the feature map corresponding to at least one of the candidate frames within the size range, acquiring the features of the region, and inputting the features into a fully connected layer of the convolutional neural network.
In this embodiment, the execution body may determine a region in the feature map corresponding to at least one of the candidate frames within the size range. Then, the features of the region are acquired, and the acquired features are input into a fully connected layer of the convolutional neural network for the subsequent processing of the network (for example, the output of the fully connected layer may be used for classification and regression), so as to obtain the final output of the convolutional neural network. When acquiring the features of the region, the execution body may determine and extract the local feature matrix corresponding to the region from the feature matrix corresponding to the feature map.
The feature maps corresponding to different images are different. When there are a plurality of candidate frames within the size range corresponding to an image, the respective regions corresponding to each candidate frame in the feature map may be determined.
Step 204 above may be implemented by a region-of-interest pooling layer (ROI Pooling Layer) in the convolutional neural network.
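A minimal sketch of this step follows, built on torchvision's ROI pooling operator; the 7 × 7 output size, the stride-4 spatial scale, and the dimensions of the fully connected layer are illustrative assumptions, not values given in the application.

```python
# A minimal sketch of step 204: pool the feature-map region under each
# kept candidate frame and feed the result to a fully connected layer.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

feature_map = torch.rand(1, 32, 56, 56)               # from a stride-4 backbone
kept_frames = [torch.tensor([[10., 10., 50., 60.]])]  # per-image frame list,
                                                      # in input-image coords
regions = roi_pool(feature_map, kept_frames,          # (num_frames, 32, 7, 7)
                   output_size=(7, 7), spatial_scale=1.0 / 4)
fc = nn.Linear(32 * 7 * 7, 1024)                      # the fully connected layer
features = fc(regions.flatten(start_dim=1))           # fed to later CNN stages
```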
In some optional implementations of this embodiment, before step 204, the method may further include:
and carrying out non-maximum suppression on the candidate frames within the preset size range to obtain the at least one candidate frame.
In these alternative implementations, the execution body may perform Non-Maximum Suppression (NMS) on the candidate frames within the preset size range to generate the at least one candidate frame. Then, the execution body may determine the region in the feature map corresponding to the generated at least one candidate frame. Non-maximum suppression screens the candidate frames so as to retain those closer to the position of the labeling frame used to annotate the target.
These implementations can remove less accurate candidate frames through non-maximum suppression, increasing the accuracy of the features acquired for the target.
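As a minimal sketch, the suppression step could be written with torchvision's NMS operator as below; the confidence scores (e.g., RPN objectness) and the 0.5 IoU threshold are assumptions for illustration.

```python
# A minimal sketch of the optional NMS step: overlapping candidate frames
# for the same target are reduced to the highest-scoring one.
import torch
from torchvision.ops import nms

frames = torch.tensor([[10., 10., 50., 60.],     # two overlapping candidates
                       [12.,  8., 48., 62.],     # for the same target
                       [100., 100., 140., 150.]])
scores = torch.tensor([0.9, 0.6, 0.8])
keep = nms(frames, scores, iou_threshold=0.5)    # indices of retained frames
filtered = frames[keep]                          # near-duplicates suppressed
```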
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the image processing method according to the present embodiment. In the application scenario of fig. 3, an execution body 301 may acquire an image 302 containing a target, and perform scale transformation on the image 302 to obtain a processed image 303 of at least one scale; input the acquired image and the processed image into a convolutional neural network to obtain a feature map 304 and a plurality of candidate frames 305 indicating positions of targets, where each target corresponds to at least two candidate frames; determine, among the candidate frames in each image, candidate frames 306 whose sizes are within a preset size range, where the size ranges of the candidate frames corresponding to images of different scales are different; and determine a region 307 in the feature map corresponding to at least one of the candidate frames within the size range, acquire features 308 corresponding to the region, and input the features into a fully connected layer of the convolutional neural network.
The method provided by the above embodiment of the present application can determine candidate frames in different size ranges from images of different scales, so as to obtain more abundant and accurate features for targets of various sizes.
With further reference to fig. 4, a flow 400 of yet another embodiment of an image processing method is shown. The flow 400 of the image processing method comprises the following steps:
step 401, acquiring an image including a target, and performing up-sampling and/or down-sampling on the image to obtain a processed image of at least one scale, where a size range of a candidate frame corresponding to the image obtained by the down-sampling is greater than or equal to a first preset threshold, a size range of a candidate frame corresponding to the image obtained by the up-sampling is less than or equal to a second preset threshold, and the first preset threshold is greater than the second preset threshold.
In this embodiment, an execution body on which the image processing method operates (for example, the server or a terminal device shown in fig. 1) may acquire an image containing a target, and perform up-sampling and down-sampling on the image to obtain processed images of at least two scales. Specifically, the values in the size range of the candidate frame corresponding to the up-sampled, larger-scale image are small, and the values in the size range of the candidate frame corresponding to the down-sampled, smaller-scale image are large.
In some optional implementation manners of this embodiment, a size range of the candidate frame corresponding to the acquired image is between a third preset threshold and a fourth preset threshold, where the third preset threshold is greater than the fourth preset threshold, the third preset threshold is greater than or equal to the first preset threshold, and the fourth preset threshold is less than or equal to the second preset threshold.
In these implementations, the size range of the candidate frame corresponding to the acquired original image lies in the middle. Targets of moderate size can therefore be determined from the original image, so that their features are acquired from the original image according to their sizes, and moderately sized targets are detected more accurately.
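The three-way threshold scheme of this flow can be summarized as below; the concrete pixel areas are assumptions chosen only to satisfy the stated ordering (the first threshold greater than the second, the third at least the first, the fourth at most the second).

```python
# A minimal sketch of the threshold scheme in steps 401 and 403: the
# up-sampled image keeps small frames, the down-sampled image keeps large
# frames, and the original image keeps the mid-sized frames in between.
SECOND_PRESET = 8 * 8     # up-sampled image: frames at most this area
FIRST_PRESET = 32 * 32    # down-sampled image: frames at least this area
FOURTH_PRESET = 8 * 8     # lower bound for the original image (<= second)
THIRD_PRESET = 32 * 32    # upper bound for the original image (>= first)

size_ranges = {                                 # (min_area, max_area) per image
    "upsampled":   (None, SECOND_PRESET),           # small targets
    "original":    (FOURTH_PRESET, THIRD_PRESET),   # mid-sized targets
    "downsampled": (FIRST_PRESET, None),            # large targets
}
```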
Step 402, inputting the acquired image and the processed image into a convolutional neural network to obtain a feature map and a plurality of candidate frames indicating the positions of the targets, wherein each target corresponds to at least two candidate frames.
In this embodiment, the execution body may input the acquired image into a convolutional neural network, and may also input the image obtained by the scale transformation into the convolutional neural network, to obtain a plurality of candidate frames indicating the positions of the targets and the feature map. In practice, the execution body may determine the candidate frames in various ways.
Step 403, determining candidate frames with sizes within a preset size range from the candidate frames in each image, wherein the size ranges of the candidate frames corresponding to the images with different scales are different.
In this embodiment, the execution body may determine, among the candidate frames of the respective images, the candidate frames whose sizes are within the preset size range. Because the size ranges of the candidate frames corresponding to images of different scales differ, the candidate frames determined for images of different scales differ in size. The candidate frames corresponding to an image are the candidate frames obtained by inputting that image into the convolutional neural network.
Step 404, determining a region in the feature map corresponding to at least one of the candidate frames within the size range, acquiring the features of the region, and inputting the features into a fully connected layer of the convolutional neural network.
In this embodiment, the execution body may determine a region in the feature map corresponding to at least one of the candidate frames within the size range. Then, the features of the region are acquired, and the acquired features are input into the fully connected layer of the convolutional neural network for the subsequent processing of the network, so as to obtain its final output. When acquiring the features of the region, the execution body may determine and extract, from the feature matrix corresponding to the feature map, the part of the matrix corresponding to the region.
In this embodiment, images of different scales are obtained through up-sampling and down-sampling, so that rich features can be acquired for targets of different sizes. Further, with candidate frames in at least three size ranges, this embodiment can acquire the features of targets of different sizes in the image more accurately.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an image processing apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the image processing apparatus 500 of the present embodiment includes: an acquisition unit 501, an input unit 502, a determination unit 503, and an area determination unit 504. The acquisition unit 501 is configured to acquire an image containing a target and perform scale transformation on the image to obtain a processed image of at least one scale; the input unit 502 is configured to input the acquired image and the processed image into a convolutional neural network to obtain a feature map and a plurality of candidate frames indicating positions of targets, where each target corresponds to at least two candidate frames; the determination unit 503 is configured to determine, among the candidate frames in each image, candidate frames whose sizes are within a preset size range, where the size ranges of the candidate frames corresponding to images of different scales are different; and the area determination unit 504 is configured to determine a region in the feature map corresponding to at least one of the candidate frames within the size range, acquire features corresponding to the region, and input the features into a fully connected layer of the convolutional neural network.
In some embodiments, the acquisition unit 501 may acquire an image containing a target, and perform scale transformation on the acquired image to obtain a processed image of at least one scale. The target is an object of significance presented in the image, such as a tree or a house.
In some embodiments, the input unit 502 may input the acquired image into the convolutional neural network, and may also input the image obtained by the scale transformation into the convolutional neural network, so as to obtain a plurality of candidate frames indicating the positions of the targets and a feature map. In practice, the input unit 502 may determine the candidate frames in various ways.
In some embodiments, the determination unit 503 may determine, among the candidate frames of the respective images, the candidate frames whose sizes are within the preset size range. Because the size ranges of the candidate frames corresponding to images of different scales differ, the candidate frames determined for images of different scales differ in size. The candidate frames corresponding to an image are the candidate frames obtained by inputting that image into the convolutional neural network.
In some embodiments, the area determination unit 504 may determine a region in the feature map corresponding to at least one of the candidate frames within the size range, then acquire the features of the region and input the acquired features into the fully connected layer of the convolutional neural network for the subsequent processing of the network, so as to obtain its final output.
In some optional implementations of this embodiment, the apparatus further includes: a selection unit configured to perform non-maximum suppression on the candidate frames within the preset size range to obtain the at least one candidate frame.
In some optional implementations of this embodiment, the acquisition unit is further configured to: perform up-sampling and/or down-sampling on the image, where the size range of the candidate frame corresponding to an image obtained by down-sampling is greater than or equal to a first preset threshold, the size range of the candidate frame corresponding to an image obtained by up-sampling is less than or equal to a second preset threshold, and the first preset threshold is greater than the second preset threshold.
In some optional implementation manners of this embodiment, a size range of the candidate frame corresponding to the acquired image is between a third preset threshold and a fourth preset threshold, where the third preset threshold is greater than the fourth preset threshold, the third preset threshold is greater than or equal to the first preset threshold, and the fourth preset threshold is less than or equal to the second preset threshold.
In some optional implementations of this embodiment, in response to at least two images in the processed image having scales larger than the scale of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the at least two images is smaller than a first specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is smaller than a second specified threshold, and the first specified threshold is larger than the second specified threshold.
In some optional implementations of this embodiment, in response to two or more images in the processed image having scales smaller than the scale of the acquired image, the size range of the candidate frame corresponding to the image with the smaller scale among the two or more images is greater than a third specified threshold, the size range of the candidate frame corresponding to the image with the larger scale is greater than a fourth specified threshold, and the third specified threshold is greater than the fourth specified threshold.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a central processing unit (CPU and/or GPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The central processing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output section 607 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as necessary, so that a computer program read out therefrom is installed in the storage section 608 as needed.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program embodied on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program, when executed by the central processing unit 601, performs the above-mentioned functions defined in the method of the present application.

It should be noted that the computer-readable medium of the present application may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, and the like, or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor including an acquisition unit, an input unit, a determination unit, and an area determination unit. The names of these units do not, in some cases, limit the units themselves; for example, the acquisition unit may also be described as "a unit that acquires an image containing a target and performs scale transformation on the image to obtain a processed image of at least one scale".
As another aspect, the present application also provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquire an image containing a target, and perform scale transformation on the image to obtain a processed image of at least one scale; input the acquired image and the processed image into a convolutional neural network to obtain a feature map and a plurality of candidate frames indicating positions of targets, where each target corresponds to at least two candidate frames; determine, among the candidate frames in each image, candidate frames whose sizes are within a preset size range, where the size ranges of the candidate frames corresponding to images of different scales are different; and determine a region in the feature map corresponding to at least one of the candidate frames within the size range, acquire features corresponding to the region, and input the features into a fully connected layer of the convolutional neural network.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.